Saturday, August 8, 2009

Real time search (particularly of the chans) just got feasible

After discussing it with my roommate, I've made some breakthroughs here:

The basic concept is real-time search for the *chans: 4chan, 7chan, 420chan, 711chan, 99chan - whatever ones float your boat. You archive roughly the front 4 pages of every one of their boards and drop threads once they pass the 10-15 minute stale mark, with some obvious tweaking for less active boards. Then you present a nice search interface onto it which returns links to and previews of the threads, and the user clicks the link to open a new tab or summat to the chan where the thread is taking place.
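A rough sketch of that rolling archive, to make the "drop the stale ones" part concrete. Everything named here (`Thread`, `RollingArchive`, the 15-minute default) is made up for illustration; the scraping itself is left out.

```python
from dataclasses import dataclass, field

# Assumption: 15 minutes, the upper end of the "10-15 minute stale mark".
STALE_AFTER_SECONDS = 15 * 60


@dataclass
class Thread:
    chan: str
    board: str
    url: str
    text: str
    last_seen: float  # Unix timestamp of the newest post we saw


@dataclass
class RollingArchive:
    stale_after: float = STALE_AFTER_SECONDS
    threads: dict = field(default_factory=dict)  # url -> Thread

    def upsert(self, thread: Thread) -> None:
        """Add a thread, or refresh it if we already hold it."""
        self.threads[thread.url] = thread

    def drop_stale(self, now: float) -> int:
        """Remove threads idle longer than stale_after; return how many went."""
        stale = [url for url, t in self.threads.items()
                 if now - t.last_seen > self.stale_after]
        for url in stale:
            del self.threads[url]
        return len(stale)
```

A slow board would just get its own `RollingArchive` with a bigger `stale_after`.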

Originally the notion was "scrape the data off the pages and hope the chans don't notice you're doing it or change the structure of their pages much" - this meant dealing with the various chan software packages out there and crafting scrapers to get data from all of them. It also meant needing to update (perhaps frantically) at slight changes. The scary thought was that they might start changing their pages maliciously, because you aren't really their friend. You're just someone hitting their servers kind of hard and not generating ad revenue.

The reason for all this? Think about it. You want a market? You got it. Around a million or two users who are predominantly male, single, 18-30, have a lot of time on their hands, and are probably aroused and impulsive over it. This should have marketing people salivating like mad - so much money to be made! Of course it's not a very "clean" or "safe" environment, but jesus.

The flipside for the people? What are people on chans for? To waste time? To not be doing anything? Maybe. What's likelier is that there are pieces of the chans that give them something they're looking for. A particular style of humor. A particular sexual fetish. A particular type of story - maybe creepy threads. They go to a chan they know and scour it - reading and looking, waiting for one of the less-than-a-dozen really interesting things (to them) to pop up, then they read and follow it, and are either satisfied or continue. If they use a chan up they move to another, and often just work back and forth.

That's a large amount of unfocused viewing with little gratification. The thrill of the hunt I guess but - let's not bother with that. It would be much nicer if you could capture from all streams and grep out what you like. So you go to one interface, type in "creepy thread", and then it shows you threads on 10 or 20 different chans in the past 15 minutes that have had the term "creepy thread" come up in them, and bam. You've just made a way, way better experience for the end user, and dramatically cut down on server load for the people running the chan.
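The "grep out what you like" step could look something like this. It is only a sketch: the thread dictionaries with `chan`, `url` and `text` keys are an assumed shape, and a plain case-insensitive substring match stands in for whatever real matching the service would use.

```python
def search_threads(threads, term, preview_chars=80):
    """Return (chan, url, preview) for every thread whose text contains term."""
    needle = term.lower()
    hits = []
    for thread in threads:
        pos = thread["text"].lower().find(needle)
        if pos == -1:
            continue  # term not in this thread
        # Cut a window of text around the match to use as the preview.
        start = max(0, pos - preview_chars // 2)
        preview = thread["text"][start:start + preview_chars]
        hits.append((thread["chan"], thread["url"], preview))
    return hits
```

The caller gets back one row per matching thread, which is all the interface needs to render a link and a preview.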

Why that cut-down on server load matters is the magic. This is as far as I'd gotten beforehand: the plan was page scraping and thread passing, and hoping the chan owners wouldn't notice. But why not make their deal sweeter too? The only source of revenue for a chan owner is advertisements. But ads are going to be ineffective unless they're broad, meaning they have to be blandly pornographic, and that's still ineffective. What this service offers is focus, and a reduction in server load. At optimal usage, you might have, say, half the server hits per month, but a 10-30% higher clickthrough rate because you can start targeting your ads. Someone comes to your chan because they searched "creepy thread"? Well, show them horror movie or scary story ads. They're way likelier to bite on that than a skimpily clad woman advertising "for-pay porn" on a site where half the users are present to get porn free.

The providers of the interface can target and present ads as well - maybe just unobtrusive Google ads, whatever. The very fact that the user is showing their interests is enough to allow some level of targeting, meaning a higher clickthrough rate, and better revenue. It also ensures lower costs and a faster, easier experience for the end user. Literally, everyone wins.

Given this, the new idea is to work with the chan runners and get them to run some software for us, that will do the scraping on the server side and send an efficient package something like once a minute to our servers for inclusion in the search. This would make the job easier for us, and less painful for them.
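One guess at what that once-a-minute package might be: only the threads that changed since the last push, as gzip-compressed JSON. The field names and the `build_package` / `read_package` helpers are invented here; nothing about the real format is decided.

```python
import gzip
import json


def build_package(chan, threads, since):
    """Chan side: bundle threads updated after `since` into compressed JSON."""
    changed = [t for t in threads if t["last_seen"] > since]
    body = {"chan": chan, "since": since, "threads": changed}
    return gzip.compress(json.dumps(body).encode("utf-8"))


def read_package(payload):
    """Search side: unpack what build_package produced."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

Sending only the changes is what keeps it cheap for the chan: one small upload a minute instead of a scraper walking every board.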

Anyway, yeah. I think this should be done, and might work on it when I'm not actively engaged elsewhere. If you'd like to do it instead, go ahead - but this only really becomes effective if the chans play nice together on it. You'd need one service covering at least the big 4 or 5 for it to really be useful.


edit: there's the worry of "what if it's that stupid random unfocused time that allows you to run into new things and to spawn new things, which keeps the chans from becoming stale and dead" - and my response is that you can still go surf them regularly. Hopefully this'll clear out the leeches and the seekers, and leave mostly oldfags and trolls actually sticking to a chan. It'll also likely improve content of specialized threads, because people with a vested interest can be guaranteed to find them.

And the interface for this is crucial - you make the interface suck, it won't work.

..okay I'm done

Thursday, August 6, 2009

Search for Programmers

inurl: "\.(avi|mpe?g|wmv|mp(3|4))$"

Wouldn't it be great if this were all you had to do to reliably find media files on the internet?

Some of you out there will think "But wyatt that doesn't look simple or reasonable at all your crazy!". And the first thing I'll do is scold you for not correctly writing "you're". Then I'll tell you about the magic that Regular Expressions possess, try to show you how simple and powerful they are, and perhaps rant a bit, as I'm about to.
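To show what that pattern actually does, here it is run over a handful of made-up URLs with Python's `re` module. The URLs and the `find_media` helper are illustrative only; no search engine takes this syntax, which is the whole complaint.

```python
import re

# The pattern from the top of the post: a dot, then one of the listed
# extensions, anchored to the end of the string.
MEDIA_URL = re.compile(r"\.(avi|mpe?g|wmv|mp(3|4))$")


def find_media(urls):
    """Keep only the URLs that end in one of the media extensions."""
    return [url for url in urls if MEDIA_URL.search(url)]


urls = [
    "http://example.invalid/clip.avi",
    "http://example.invalid/movie.mpeg",
    "http://example.invalid/video.mpg",
    "http://example.invalid/song.mp3",
    "http://example.invalid/page.html",
    "http://example.invalid/talk.mp4?start=10",
]
media = find_media(urls)
```

`mpe?g` covers both `.mpg` and `.mpeg`, and `mp(3|4)` covers `.mp3` and `.mp4`. The last URL is skipped because the `$` insists the extension be the very end of the string, so a query string after it spoils the match.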

Search is an imperfect science, no doubt.

This isn't a matter that Google (or Bing, or Yahoo, or Ask (man, where's Jeeves?)) will help with, because they all do something that the end user is quite happy about: they generalize and abstract your search, in the hopes of providing you more relevant (and numerous!) results, faster. This is why punctuation doesn't get noticed, why you need to specify +sptember to keep Google from checking for "September", and many other things. These applications exist to serve the general populace and give people who don't know how to properly spell a break when they search for "maek baby", or "christ born" (when they meant 'chris brown').

So what about the rest of us? Those of us who know what we're typing, have a bit of a programmatic bent, and know that if we could only feed Google the right pattern, we'd get precisely the material we want? Try as I might, I haven't been able to find anything which provides this sort of service.

That's sensible, I guess. Real text search is expensive in both space and time. At the scale of the entire internet, no one would want to take on that extra problem. Why should Google, which already rules the universe, bother? No one is hammering at their door demanding this. How could anyone else try? It would be prohibitively expensive for a newcomer, or even a large competitor (Microsoft), to do, and would provide relatively little gain*.

This presents a problem then: how can we make it occur? I'll tell you what I think: abuse the crowd.

Write the search database / server software to be as modular, self-contained, scalable, and tolerant of high latency as possible. Pass it out free, and evangelize the hell out of it to the people who care. The more people who run the servers, the better the search gets, and all at a low cost of some space and processing time from each user.

Of course there's the problem of the front-end and making the queries get passed out reasonably, ensuring the right level of redundancy (searches should always work, but you shouldn't be too redundant). On top of that there's maintaining reasonable coverage with a small network, how (if at all) this 'makes money', and then actually writing the thing. And maintaining it.
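One possible way to get "redundant, but not too redundant": give every slice of the index a fixed number of volunteer nodes, chosen by rendezvous (highest-random-weight) hashing. This is a technique I'm swapping in as an illustration, not a settled design; `replicas_for`, the shard names and the default of 3 copies are all assumptions.

```python
import hashlib


def replicas_for(shard, nodes, copies=3):
    """Pick `copies` nodes to hold a shard, using rendezvous hashing."""
    def weight(node):
        # Every (shard, node) pair gets a stable pseudo-random score.
        return hashlib.sha256(f"{shard}:{node}".encode("utf-8")).hexdigest()

    # The highest-scoring nodes hold the shard.
    return sorted(nodes, key=weight, reverse=True)[:copies]
```

The front-end can compute the same answer from nothing but the node list, so it knows where to send a query without a central directory. And when a volunteer who wasn't holding a given shard drops off the network, that shard's assignment doesn't move.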

It's a lot to do, but I want real text search (including complex regular expressions) on the web. I also want it in a billion other places - music sites/applications, code databases, booktext searches, forum searches, email. I'd especially like having it in Firefox's find command (this one probably exists already).

Anyway, that's the idea.

* This could have the unmeasurable and intangible benefit of providing a "we have _the most powerful search_" type of statement from whoever adopted it. The "way cool but only used by 20% of the users (or less)" type of feature.

And if you'd like to learn more about regular expressions... This is actually a pretty good site to get you started. Do I like that I used Google's flexibility, typing "regular expressions" into Firefox's URL bar, to get there immediately? I say yes.

Sunday, August 2, 2009

No Profit Allowed

"For Profit" should not exist

Everyone should be "not for profit", and operate on a model very similar to the way that nonprofits do now.

The government provides a funding grant, and you pay back, say, 115% of the starting funds, maybe 150%, 1000%, I don't know.

Then you run the business. Pay yourself, your employees, pay back the government (or if you wish, some alternate investor - I'm just using the government because I'm picturing this as a heavily socialized thing) - and make your product / provide your service. Once you're paid up, go on as you like.

But you have to use / donate 100% of your revenue. So you pay yourself and your employees, invest in growth/new ventures, and donate the rest to charities or somesuch. No one gets crazy wealthy, everyone still gets paid, the charity portion of the nonprofit sector (which provides no paid services or products) gets a boon of money, everyone wins?

I think the greed of the investor, wanting perpetual returns, is our problem. I likely just don't understand, that's all.