Saturday, November 06, 2010

I want to make a map/reduce logistic regression machine in December. Who's on board?

Because of my work, logistic regression has become one of my favourite analytic tools. But I've now passed the point where everything looks like a nail (which it does), and I've reached the point where I want to make my own hammer (for all these nails that are piling up). So why not write it to work where the future of large-dataset analysis is probably going to happen: the Hadoop map/reduce world?

I figure I will have newfound time in December (I write the first CFA exam on December 4th), so I might as well try to do this. The first step will be to make a simple weighted linear regression machine. If I can implement one in Excel, surely it can't be that hard to implement one anywhere else. Then figuring out the actual algorithm will be a combination of digging into the R source code, using some common sense, and talking to a friend who has actually built one of these before.
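
To make that first step concrete, here is a minimal sketch of weighted linear regression laid out in a map/reduce shape. It is plain Python rather than Hadoop code, and every name in it (map_chunk, combine, solve, weighted_linear_regression) is made up for illustration. The assumption is that each mapper sees one chunk of rows and emits the partial sums X'WX and X'Wy, the reducer adds those partials together, and a single small linear solve at the end gives the coefficients.

```python
from functools import reduce


def map_chunk(rows, p):
    """Map step (hypothetical): partial X'WX and X'Wy for one chunk.

    rows is a list of (x, y, w) tuples, where x is a list of p features,
    y is the response and w is the observation weight.
    """
    n = p + 1  # one extra column for the intercept
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for x, y, w in rows:
        xi = [1.0] + list(x)
        for i in range(n):
            b[i] += w * xi[i] * y
            for j in range(n):
                A[i][j] += w * xi[i] * xi[j]
    return A, b


def combine(left, right):
    """Reduce step: add two partial results element by element."""
    A1, b1 = left
    A2, b2 = right
    A = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [u + v for u, v in zip(b1, b2)]
    return A, b


def solve(A, b):
    """Solve A * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = s / M[i][i]
    return beta


def weighted_linear_regression(chunks, p):
    """Run the map step on every chunk, reduce, then solve once."""
    A, b = reduce(combine, (map_chunk(chunk, p) for chunk in chunks))
    return solve(A, b)  # [intercept, slope_1, ..., slope_p]
```

The partial sums are only (p+1) by (p+1) no matter how many rows a chunk holds, which is why this splits across machines so easily. As far as I understand it, R's glm fits a logistic regression by iteratively reweighted least squares, so a weighted regression like this would be the piece that gets repeated with updated weights on each pass.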

But I'd love help if anyone's game. Even just to answer questions. Like, how do I actually set up a test Hadoop server? Or more importantly, is this a silly exercise?


tdunning said...

Take a look at the Mahout logistic regression code. It is a blazing fast stochastic gradient descent implementation that makes map-reduce implementations much less interesting by being *really* fast.

Steven H. Noble said...

Thanks Ted. I'll take a look.