Thursday, August 6, 2009

Search for Programmers

inurl: "\.(avi|mpe?g|wmv|mp(3|4))$"

Wouldn't it be great if this is all you had to do to reliably find media files on the internet?

Some of you out there will think "But wyatt that doesn't look simple or reasonable at all your crazy!". And the first thing I'll do is scold you for not correctly writing "you're". Then I'll tell you about the magic that Regular Expressions possess, try to show you how simple and powerful they are, and perhaps rant a bit, as I'm about to.

Search is an imperfect science, no doubt.

This isn't a matter that Google (or Bing, or Yahoo (...bing), or Ask (man where's jeeves) help, because they all do something that the end user is quite happy about. They generalize and abstract your search, in the hopes of providing you more relevant (and numerous!) results, faster. This is why punctuation doesn't get noticed, why you need to specify +sptember to not have Google check for "September", and many other things. These applications exist to serve the general populace and give people who don't know how to properly spell a break when they search for "maek baby", or "christ born" (when they meant 'chris brown').

So what about the rest of us? Those of us who know what we're typing, have a bit of a programmatic bent, and know that if we could only feed google the right pattern, we'd get precisely the material we want? To the best of my ability, I haven't been able to find anything which provides this sort of service.

That's sensible, I guess. Real text search is space and time expensive. On the order of the entire internet, no one would want to take on that extra problem. Why should Google, which already rules the universe, bother? No one is hammering at their door demanding this. How could anyone else try? It would be prohibitively expensive for a newcomer, or even a large competitor (Microsoft) to do, and provide relatively little gain*.

This presents a problem then: how can we make it occur? I'll tell you what I think: abuse the crowd.

Write the search database / server software to be as modular, self-contained, and high-latency scalable as possible. Pass it out free, and evangelize the hell out of it to the people who care. The more people who run the servers, the better the search gets, and all at a low cost of some space and processing time from each user.

Of course there's the problem of the front-end and making the queries get passed out reasonably, ensuring the right level of redundancy (searches should always work, but you shouldn't be too redundant. On top of that there's maintaining reasonable coverage with a small network, how (if at all) this 'makes money', and then actually writing the thing. And maintaining it.

It's a lot to do, but I want real text search (including complex regular expressions) on the web. I also want it in a billion other places - music sites/applications, code databases, booktext searches, forum searches, email. I'd especially like having it in Firefox's find command (this one probably exists already).

Anyway, that's the idea.



* This could have the unmeasurable and intangible benefit of providing a "we have _the most powerful search_" type of statement from whoever adopted it. The "way cool but only used by 20% of the users (or less)" type of feature.

and if you'd like to learn more about regular expressions... This is actually a pretty good site to get you started. I like that I used google's flexibility, of typing "regular expressions" into Firefox's URL bar, to immediately get there? I say yes.