Tuesday, November 17, 2009

A Hard Problem

Probably not a "very hard" problem.

I've put some thought into the code style switcher - drawn some diagrams and such. The thing irking me at the moment is style switching in a revision-controlled system.

Inevitably, there are going to be conflicts. Lots of conflicts. You can do nice things like make whitespace not count, but the different styles mentioned earlier don't all become erased in an "all whitespace is the same" world.

So... either you would need better diffing tools, a different means of describing the formatting of code-data, or a revision system that handles conflicts differently.

The other option is to have everyone "throw away" their repositories overnight while the code is refactored to a different style, then come back and check out fresh copies. This could work in principle, but it's not really feasible. Just think about how much code Googlers and Microserfs would have to check out, working on Office or Google Docs. Think about how many little things they'd have going on as personal branches - all the half-finished code, things that would have to be meticulously gone over and backed up, or lost. This option is nigh impossible for other reasons too.

So: better diffing tools, or a different means of describing the formatting of code-data (maybe partially compiled semi-object files, which your editor interprets?). Oooh, let's zoom in on that.

You could build a symbol tree, run through a large glomp of the compilation process while pulling in a ton of data about the code itself, and store those partially-compiled files. Then your editor has specific stylesets which get applied to the code, and it pulls the code out of the object file by interpreting it. You want to change styles? Just pick one and click. The program does the best compile possible (it either reuses the old object file, compiles yours, or compiles as much of yours as is compilable and leaves chunks out), then reinterprets the code outward. Oh, that's fancy.
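To make the "store the structure once, render per-viewer" idea concrete, here's a minimal sketch in Python. The names (`Block`, `render`) and the toy two-style renderer are all invented for illustration - a real version would work from a proper symbol tree, as described above.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    header: str                                 # e.g. "if (x > 0)"
    body: list = field(default_factory=list)    # strings or nested Blocks

def render(node, style="knr", indent=0):
    """Render the stored structure in the viewer's chosen brace style."""
    pad = "    " * indent
    lines = []
    if style == "knr":                  # opening brace on the header line
        lines.append(f"{pad}{node.header} {{")
    else:                               # "allman": brace on its own line
        lines.append(f"{pad}{node.header}")
        lines.append(f"{pad}{{")
    for item in node.body:
        if isinstance(item, Block):
            lines.extend(render(item, style, indent + 1).split("\n"))
        else:
            lines.append(f"{'    ' * (indent + 1)}{item}")
    lines.append(f"{pad}}}")
    return "\n".join(lines)

tree = Block("if (x > 0)", ["x--;"])
print(render(tree, "knr"))      # brace stays on the header line
print(render(tree, "allman"))   # brace drops to its own line
```

The point is that the stored form never changes - only the rendering does - so the revision-control conflicts above never arise.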

I'm taking compilers next semester; this is my new Awesome.

Wednesday, November 4, 2009

code style switcher

It can't be hard to write a linter that will go through code and make

if(){
}


into

if()
{
}


or (this is messy, I don't want to fight with blogspot's syntax today, I'm sure you get it)

if()
  {
  }


and all the combinatorial groupings thereof

Well, maybe it can be hard, but I'm going to do it. I'm sick of reading people's code that doesn't conform to any agreed-upon style, and it shouldn't be a human's job to fix it.
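For the easy cases it really isn't hard. Here's a naive sketch in Python - one regex that moves a trailing opening brace onto its own line. (Deliberately simplistic: it will happily mangle braces inside strings or comments, which is exactly where a real linter earns its keep.)

```python
import re

def knr_to_allman(src):
    # Move a trailing "{" onto its own line, preserving the line's
    # indentation. Naive: ignores strings, comments, and initializers.
    return re.sub(r"(?m)^(\s*)(\S.*?)\s*\{\s*$", r"\1\2\n\1{", src)

print(knr_to_allman("if (x) {\n    y();\n}\n"))
```

Going the other direction (Allman to K&R) needs to join lines instead, and a full tool would want a real parse rather than regexes.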

Also, control over which styles apply to while, switch, if, methods, and classes would be useful.

Probably exists.


OOOH and convert spaces to tabs or vice versa, with control of how many spaces a tabstop is!
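The tabs/spaces half is mostly done for you by Python's standard library; a sketch (the function names are mine, but `str.expandtabs` is real stdlib):

```python
def tabs_to_spaces(text, tabstop=4):
    # expandtabs is column-aware: a tab advances to the next tabstop,
    # it isn't just a 1-for-N character swap.
    return "\n".join(line.expandtabs(tabstop) for line in text.split("\n"))

def leading_spaces_to_tabs(text, tabstop=4):
    # Only touch leading indentation, so spaces inside strings stay put.
    out = []
    for line in text.split("\n"):
        stripped = line.lstrip(" ")
        n = len(line) - len(stripped)
        out.append("\t" * (n // tabstop) + " " * (n % tabstop) + stripped)
    return "\n".join(out)
```

The `tabstop` parameter is exactly the "how many spaces a tabstop is" control mentioned above.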

Generally it just prettifies your code! I'm sure this exists. But it would be so fun to make! #isgoingtodoitanyway

Monday, November 2, 2009

bash on windows update

I looked into it this morning, and many people have tried this out. I went with the last of those three links because it spoke of Unix utilities and had a simple "download, extract, set PATH" install process, which is one I grok.

Unfortunately, it didn't exactly pan out. I posted at TechSutram's blog (where I got the instructions from):

I just followed these instructions - I'm quite happy to have some additional unix utils on my command line (cat, less, rm) - but I could not get bash itself to work.

It will run fine, but reports "bash: warning: could not find /tmp, please create!" - I tried creating a "tmp" directory within C:\Users\myusername (where I pointed my HOME environment variable), and also within C:\bash, but this did not stop the error.

Not so bad though, right? Just a warning. Well, actually it's worse: immediately after opening bash, a command like "ls" will display files in the good old fashioned style, then dump this to me: "[sig] C:\bash\bash.exe 1000 (0) call_handler: couldn't get context of main thread, error 998", and hang. After a CTRL+C, you end up with "[sig] bash 1000 (28) call_handler: couldn't get context of main thread, error 998", and now the terminal is entirely nonresponsive.

I'm running 64-bit Win7 Ultimate on an Intel Core 2 with about 4 GB of RAM; the rest is unimportant. I'm guessing this is a 64-bit issue or just a "Win7 is different" issue, which is causing some DLL call to fail or return differently and muck things up.

And I'm happy with the situation to some extent. I've now got the ability to string some bash-fu together on my command line, and I'm closer to having UNIX available on this crazy Windows stuff, but I'm not quite there.

I'll keep looking, and if I find something that's not 2+ years old or broken, I'll report back.

Sunday, November 1, 2009

some more

instead of AIs, use people playing videogames:

I read the headline "more US residents playing farmville than there are farmers in the US"

Why not have those decisions impact the real world? Risk and reward come into play, in a good way. We don't need heavy mental lifting to be done by machines which aren't yet well equipped, but we can advance the state of the art of robotic machinery and workers, and let swarms of game players control the machinery through a heavily abstracted system of interaction and reward. Offer a small stipend, which increases based on personal success, and you have some seriously cool potential. Not just farms - anything labour intensive that could be roboticised except for the difficult decisions and analyses occasionally required.

is there an optimal svn usage?


If there's a set of operations which should pretty much always be done to keep consistent backups and a well-maintained repository, that should be automated. This is quite likely to already have been done / not be as useful as it sounded in my head when I first thought of it.
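As a sketch of what that automation might look like - the individual commands are standard svn/svnadmin, but this particular "routine" is just my guess at a sensible default, not an established best practice:

```python
import subprocess

def daily_routine(repo_path, working_copy):
    """One guess at an 'always do this' svn routine: sync, inspect, back up."""
    return [
        ["svn", "update", working_copy],     # bring the checkout current
        ["svn", "status", working_copy],     # surface uncommitted local work
        ["svnadmin", "dump", repo_path],     # full repo backup (writes to
                                             # stdout; redirect to a file)
    ]

def run(commands, log=print):
    # Execute each step, stopping on the first failure.
    for cmd in commands:
        log(" ".join(cmd))
        subprocess.run(cmd, check=True)
```

Wrapping this in a cron job (or Scheduled Task) is the "automated" part.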

Why isn't everything under revision control and on/offline accessible? What the hell.

Also, all apps should be networked/backed up/offline-networked. Example: your Word documents are automatically backed up to your server, and when you turn on your work computer they're automatically beamed down, with a full revision history.

This is probably already done - but is it done well?

windows bash

Windows needs the Linux command line in a fashion that integrates fully into Windows. The only thing I really dislike about Windows is not having scp, ssh, grep, cat, vim, vimdiff, ps, forward slashes - that sort of thing - in my command line. I am aware that Cygwin effectively provides this, but it does not (in my experience) integrate into the actual Windows system. You are provided with a way to use Linux while you are booted into Windows. It's like a really shitty VM.

I want to be able to use my Windows system - its interfaces, files, utilities, and Windowsisms - but make it so that when I pop open a command line, I'm using what looks like UNIX. This shouldn't be difficult. I intend to write a simple shell / collection of Windows-ported utilities and, after using it a bit, make it available to other people. I really want this.

Could already be done - I'll probably glance around a bit first.


That's all for this week.

Friday, October 9, 2009

a few for the bin

I want fast regex search in PDF viewers.

The NAC on campus relies on the user-agent string, apparently. If you attempt a DNS resolution while your browser passes a legal non-Windows user agent, you get redirected to the Linux/Mac login and are granted access.

This suggests that the process can be fully scripted - you should be able to send some custom packets, fill in the username and password (if the user wishes), and then access one further page to auto-enable access.

I think you could write a simple script and either have it run whenever uog-wifi is connected to, or have it launch Firefox as well - just an alternate icon to use the first time you open Firefox after connecting while on campus.
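A sketch of what that script's core might look like. To be clear about assumptions: the portal URL, the form field names, and the spoofed user-agent string here are all guesses - I haven't inspected the real login page, so a working version would need to copy its actual form.

```python
import urllib.parse
import urllib.request

def build_login_request(portal_url, username, password):
    # Spoof a non-Windows user agent so the NAC sends us down the
    # Linux/Mac path; "username"/"password" field names are hypothetical.
    data = urllib.parse.urlencode(
        {"username": username, "password": password}).encode()
    return urllib.request.Request(
        portal_url,
        data=data,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

# To actually log in, something like:
#   urllib.request.urlopen(build_login_request(PORTAL_URL, user, pw))
# followed by fetching the "one further page" that enables access.
```

Hooking this to the network-connect event (or the Firefox launcher icon) is the remaining glue.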

A quick program that will create minimalist raw text documents preserving much of the style of .doc, .docx, .odf, etc. I bet this exists... I should find it.

Saturday, August 8, 2009

Real time search (particularly of the chans) just got feasible

After discussing it with my roommate, I've made some breakthroughs here:

The basic concept is real-time search for the *chans. 4chan, 7chan, 420chan, 711chan, 99chan - whatever ones float your boat. You archive roughly the front four pages of every one of their boards and drop threads as they pass the 10-15 minute stale mark, with some obvious tweaking for less active boards. Then you present a nice search interface which returns links to and previews of the threads, and the user clicks the link to open a new tab or summat to the chan where the thread is taking place.
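The archive-and-expire policy is easy to sketch. The numbers (front pages, a 15-minute staleness cutoff) come from the description above; everything else - the class and method names, the in-memory dict - is invented, and a real version would use a proper search index.

```python
import time

STALE_SECONDS = 15 * 60   # the 15-minute stale mark from above

class ThreadIndex:
    def __init__(self):
        self.threads = {}                 # url -> (thread text, last_seen)

    def ingest(self, url, text, now=None):
        # Re-ingesting a thread refreshes its last-seen timestamp.
        self.threads[url] = (text, time.time() if now is None else now)

    def expire(self, now=None):
        # Drop anything not seen within the staleness window.
        if now is None:
            now = time.time()
        self.threads = {u: (t, seen)
                        for u, (t, seen) in self.threads.items()
                        if now - seen < STALE_SECONDS}

    def search(self, term):
        # Naive substring match; a real version wants full-text indexing.
        return [u for u, (t, _) in self.threads.items() if term in t]
```

The scraper (or, per the later idea, the chan's own push software) calls `ingest` per thread; a periodic job calls `expire`; the front end calls `search`.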

Originally the notion was "scrape the data off the pages and hope the chans don't notice you're doing it or change the structure of their pages much" - this meant dealing with the various chan software packages out there and crafting scrapers to get data from all of them. It also meant needing to update (perhaps frantically) after every slight change. It was a scary thought that they might start changing things maliciously, because you aren't really their friend. You're just someone hitting their servers kind of hard and not generating ad revenue.

The reason for all this? Think about it. You want a market? You got it. Around a million or two users who are predominantly male, single, 18-30, have a lot of time on their hands, and are probably aroused and impulsive over it. This should have marketing people salivating like mad - so much money to be made! Of course it's not a very "clean" or "safe" environment, but jesus.

The flipside for the people? What are people on chans for? To waste time? To not be doing anything? Maybe. What's likelier is that there are pieces of the chans that give them something they're looking for. A particular style of humor. A particular sexual fetish. A particular type of story - maybe creepy threads. They go to a chan they know and scour it, reading and looking, waiting until one of the less-than-a-dozen things really interesting (to them) pops up; then they read and follow it, and are either satisfied or continue. If they use a chan up they move to another, and often just work back and forth.

That's a large amount of unfocused viewing with little gratification. The thrill of the hunt, I guess, but let's not bother with that. It would be much nicer if you could capture all the streams and grep out what you like. So you go to one interface, type in "creepy thread", and it shows you threads on 10 or 20 different chans where the term "creepy thread" has come up in the past 15 minutes, and bam. You've just made a way, way better experience for the end user, and dramatically cut down on server load for the people running the chan.

Why that cut-down on server load matters is the magic. This is as far as I'd gotten beforehand - the plan was page scraping and thread passing, hoping the chan owners wouldn't notice. But why not make their deal sweeter too? The only source of revenue for a chan owner is advertisements. But ads are going to be ineffective unless they're broad, meaning they have to be blandly pornographic, and that's still ineffective. What this service offers is focus, and a reduction in server load. At optimal usage, you might have, say, half the server hits per month, but 10-30% higher clickthrough rates, because you can start targeting your ads. Someone comes to your chan because they searched "creepy thread"? Show them horror movie or scary story ads. They're way likelier to bite on that than on a skimpily clad woman advertising for-pay porn on a site where half the users are there to get porn for free.

The providers of the interface can target and present ads as well - maybe just unobtrusive Google ads, whatever. The very fact that users are showing their interests allows some level of targeting, meaning a higher clickthrough rate and better revenue. It also means lower costs and a faster, easier experience for the end user. Literally everyone wins.

Given this, the new idea is to work with the chan runners and get them to run some software for us that does the scraping server-side and sends an efficient package, something like once a minute, to our servers for inclusion in the search. This would make the job easier for us and less painful for them.

Anyway, yeah. I think this should be done, and might work on it when I'm not actively engaged elsewhere. If you'd like to do it instead, go ahead - but this only really becomes effective if the chans play nice together on it. You'd need one service covering at least the big 4 or 5 for it to really be useful.

Wootsauce.

Edit: there's the worry of "what if it's that stupid random unfocused time that lets you run into new things and spawn new things, which keeps the chans from going stale and dead?" My response is that you can still go surf them regularly. Hopefully this'll clear out the leeches and the seekers, and leave mostly oldfags and trolls actually sticking to a chan. It'll also likely improve the content of specialized threads, because people with a vested interest can be guaranteed to find them.

And the interface for this is crucial - you make the interface suck, it won't work.

..okay I'm done

Thursday, August 6, 2009

Search for Programmers

inurl: "\.(avi|mpe?g|wmv|mp(3|4))$"

Wouldn't it be great if this were all you had to do to reliably find media files on the internet?

Some of you out there will think, "But wyatt, that doesn't look simple or reasonable at all, your crazy!". And the first thing I'll do is scold you for not correctly writing "you're". Then I'll tell you about the magic that Regular Expressions possess, try to show you how simple and powerful they are, and perhaps rant a bit, as I'm about to.
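Since I'm showing off regex magic anyway, here's the pattern from the top of the post doing its job in Python's `re` module (the sample URLs are made up, and I've added case-insensitivity so `.MPEG` matches too):

```python
import re

# The media-extension pattern from the post, made case-insensitive.
media = re.compile(r"\.(avi|mpe?g|wmv|mp(3|4))$", re.IGNORECASE)

urls = [
    "http://example.com/clip.MPEG",
    "http://example.com/song.mp3",
    "http://example.com/page.html",
]
hits = [u for u in urls if media.search(u)]
print(hits)  # the two media files; the HTML page is filtered out
```

That one line of pattern replaces a whole pile of "ends with .avi OR ends with .mpg OR..." checks, which is the simplicity-and-power point.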

Search is an imperfect science, no doubt.

This isn't a matter that Google (or Bing, or Yahoo (...Bing), or Ask (man, where's Jeeves?)) will help with, because they all do something the end user is quite happy about. They generalize and abstract your search, in the hope of providing more relevant (and numerous!) results, faster. This is why punctuation doesn't get noticed, why you need to specify +sptember to keep Google from checking for "September", and many other things. These applications exist to serve the general populace and give people who don't know how to properly spell a break when they search for "maek baby", or "christ born" (when they meant 'chris brown').

So what about the rest of us? Those of us who know what we're typing, have a bit of a programmatic bent, and know that if we could only feed Google the right pattern, we'd get precisely the material we want? Despite my best efforts, I haven't been able to find anything which provides this sort of service.

That's sensible, I guess. Real text search is space and time expensive. On the order of the entire internet, no one would want to take on that extra problem. Why should Google, which already rules the universe, bother? No one is hammering at their door demanding this. How could anyone else try? It would be prohibitively expensive for a newcomer, or even a large competitor (Microsoft) to do, and provide relatively little gain*.

This presents a problem then: how can we make it occur? I'll tell you what I think: abuse the crowd.

Write the search database / server software to be as modular, self-contained, and high-latency scalable as possible. Pass it out free, and evangelize the hell out of it to the people who care. The more people who run the servers, the better the search gets, and all at a low cost of some space and processing time from each user.

Of course there's the problem of the front-end and passing queries out reasonably, ensuring the right level of redundancy (searches should always work, but you shouldn't be too redundant). On top of that there's maintaining reasonable coverage with a small network, figuring out how (if at all) this 'makes money', and then actually writing the thing. And maintaining it.

It's a lot to do, but I want real text search (including complex regular expressions) on the web. I also want it in a billion other places - music sites/applications, code databases, booktext searches, forum searches, email. I'd especially like having it in Firefox's find command (this one probably exists already).

Anyway, that's the idea.



* This could have the unmeasurable and intangible benefit of providing a "we have _the most powerful search_" type of statement from whoever adopted it. The "way cool but only used by 20% of the users (or less)" type of feature.

And if you'd like to learn more about regular expressions... this is actually a pretty good site to get you started. Do I like that I used Google's flexibility - typing "regular expressions" into Firefox's URL bar - to get there immediately? I say yes.