Saturday, March 22, 2014

How often should you retrain your model?

One of my favourite modeling managers taught me that it wasn't his job to decide whether my judgement or his was right. You wouldn't go to him with an argument and have him stroke his chin, think a bunch, and then pronounce whether he agreed. If our judgements disagreed, we'd work out what data we could collect to determine whose opinion lined up more closely with our business.

I think about this when I see competing approaches to model training.

When I worked in credit risk, predictive models tended to be built with a life expectancy of 6 to 36 months. During the lifetime of a model, the score distributions and model fitness are closely watched. When problems are discovered, adjustments and realignments are made along the way rather than scrapping and retraining.

Many software engineers seem to have come to a different conclusion. Here the ultimate solution seems to be to build a tool that just keeps retraining monthly/weekly/daily. (Ok, I don't actually believe anyone is advocating daily. But the hyperbole makes the point.) Online training is regarded as being obviously superior to all other solutions.

The most common argument for rebuilding your model regularly is that you are worried about your model becoming stale. The world changes over time, so the factors that were predictive aren't going to predict as well as time goes on. This is certainly eventually true.

But consider what you're actually saying here. Say you are trying to predict 30-day attrition. Then your data has to be at least 30 days old to begin with. After all, how can you tell me whether a user from two days ago will churn within 30 days? In order to amass some volume, your observation period probably goes back another 15 days (and maybe as much as 100). Putting this together, what you're saying is that at 45 days the model is fresh, but at 75 days it is unacceptably stale? I'm skeptical that there's going to be a shift in society that was observable 45 days ago, not observable 75 days ago, and is still relevant now.

The other reason to keep rebuilding is that you're building up a larger observation set. If this is your argument then clearly you're not worried about staleness, because you're presumably making your observation period as wide as possible to capture as many observations as possible. But again, I'm not convinced you're going to get that many wins. Maybe you're starting very early, and the first month you retrain you double your observations. That's probably going to make some difference. But after that you're increasing by a third, then a quarter. The returns diminish pretty quickly.

But of course the big argument for retraining is: what's the harm?

Some reasons against auto-retraining:

  • You end up building rigid structures that you won't modify. By automating the process, every choice requires a bit more work than it would if you were just doing it by hand. If you have an idea for post-processing your model output and you're hand-building the model, you just write the post-processor and test the code. You probably wouldn't try this with automatic retraining, because now the post-processor has to deal with a moving model, or your framework simply has no place for it.

  • Your model interactions won't be consistent. You're going to end up with multiple models in the long run, because different data will arrive at different times and you need to make the right choices at the right time. You might even have models optimized for different targets. Every time you retrain a model you can check whether it improves at predicting its given target. But what do you do if model A says 80% and model B says 30%? You want to know that this cohort isn't constantly changing in personality. But every time you retrain you lose knowledge of how your models interact.

  • You'll lose out on gradual model improvements. Or, to put it a scarier way: you will be constantly running with mistakes.

  • You end up with a more average model. The quality of a model is, of course, just a sample from a distribution, and the observed performance on a validation set is a sample from a distribution that depends on that model quality. So what happens if you keep re-sampling? You end up with the expected outcome.

Convinced? You shouldn't be. No really... you really really shouldn't be.

That's because if you're in the position of guiding model-based strategies you should be pretty impervious to arguments. You should understand that the best-sounding argument is frequently wrong.

In the case of deciding how and when to retrain, which of these arguments makes sense for your situation depends a lot on… well... your situation. For advertising and spam, where your performance periods are short and the users change quickly, very few of my reasons against auto-retraining make sense. However, for modelling churn and fraud the situation is pretty different: the performance periods are longer and the behaviours change more slowly. How do you know what situation you're in?

That’s why you work with the data instead. Figure out what the data tells you.

You can build the composite model that you would have had if you had retrained every week, or every month, or kept a single model. Here's a simple task. Take your history and train a model for every week. Then evaluate every model on every future week, graphing the AUC on the y-axis. You basically end up with what looks like a cohort chart for model age, and the rate at which your model degrades becomes really clear.

This exercise of course isn't perfect. In reality, the data you would have collected had you been updating your model regularly wouldn't look exactly like the data you actually have. But if these issues are large enough to skew your data so much that they would change your conclusion, you have even bigger problems.

I was reminded in a recent conversation with a very skilled modeler of the old adage that the proof of the pudding is in the eating. Ultimately, you are going to be far happier if you incorporate these changes with champion/challenger strategies (really just A/B testing). When you're building code that generates models that then affect users, it is far harder to tell which work is actually adding value and which is only adding debt. It is far too easy to not understand the actual effects your users are experiencing.

I think it's reasonable to say that in the field of modeling you should be spending at least half your time measuring rather than building. At first this seems disappointing, because it means you will only be able to build half as many of the things you are excited about. But when I look back over the things I've worked on, the things I've measured are the things I'm most proud of. It is only the things I've measured that have allowed me to gain knowledge from my experience.

Friday, December 20, 2013

Successful data teams hustle

There seem to be a lot of ways to start a data team at a startup. One popular technique is to be an internal consultancy within the company. The rest of the company is supposed to come up with needs that require a data specialist, and you are supposed to prioritize and respond to those needs while building tooling to address them.

Unfortunately, this often ends up producing a team of data cops: a team more interested in enforcing how others should use its tools than in producing value with data.

I think there's a much more effective approach.

Instead, consider your new data team to be a startup inside the company: making and selling a product. Your product is the data; and intuition and judgement are your entrenched competition.

You're probably the sort of person to whom it's obvious that people should be using more data in their decisions. You probably shiver every time you hear someone say they are basing their decisions on their strongly held beliefs that have no evidence to support them. But you may not realize how un-obvious this is to everyone else.

In order to succeed as a data team, you're going to have to learn to operate like a successful startup.

That means that just like any founder, you're not just a developer. You're sales. You're support. You're the number one advocate.

And you're going to have to hustle.

Don't make people's lives harder. Don't fall into thinking that the rest of the company (your customers) is going to put in extra effort to deliver your data ready to be consumed. Don't start putting impositions on product development to make your life easier. Startups that make a product that puts demands on its users rarely survive. To put it simply: you work for them.

Make people's lives easier. Adopt work that existed before there was a data team, where that makes sense (e.g. take on log maintenance). This is what all other teams do when they form. A new design team is responsible for the landing page design even if it was originally designed by a developer. Startups that solve a previously unsolved problem rarely take off. Take on the schleps.

Anticipate opportunities for data to be the answer, and then have it ready. A friend of mine recently told a story about how, when he was in the t-shirt business, he'd respond to potential contracts by including custom shirts in the bid. When you see an opportunity for data to make your business better, build it; don't argue for it. Big pitches don't sell. Big pitches that don't even have a screenshot really don't sell. Having the product ready and pre-configured sells.

The word no isn't in your vocabulary anymore. When you have succeeded in gaining some interest, don't turn around and say "well, that's not actually what I'm building". Successful companies pivot in response to demand, and so do you. I'm not saying you have to be a GIGO machine that answers every question you are given. But every request is a lead, and every lead is gold.

Communication will make you or destroy you. What's worse than having bad data? Having to discover for yourself that the data is bad. You will make mistakes. But you also need to earn trust. People will learn to trust that you are providing good answers when you proactively and aggressively communicate where things have gone wrong.

Learn to take the blame. In general, learn how to provide customer service. For example, when someone has a data need that your tooling can't handle, instead of responding with "well, you can't really do that because that request is kind of unreasonable", try:
That's a totally reasonable request and I can understand why you'd want that. Embarrassingly, the tools we've set up don't actually support that yet. But let me come up with something that will solve your problem for now.

Try to remember you're not telling people to eat their vegetables. It's very easy to be seen as the doctor saying "if only you eat all your vegetables, you will eventually appreciate them". But you're not offering vegetables. You're offering pie. The pleasure of using data is almost immediate and it never gets old (just like pie). So while you are competing with an entrenched product (intuition), your competition doesn't have what you have. You have pie.

Tuesday, November 26, 2013

My hair on fire rule of metrics

I feel that since I talk about this rule a fair amount I should have it published somewhere. A hair-on-fire rule is one where, once you notice a violation, you don't wait for arguments weighing the pros and cons. You just put out the fire. I have one rule like this for metrics:
A metric for a time period can't change after it has been reported.
This doesn't mean you have to be able to report the metric immediately after the time period has ended. And it doesn't mean you can't fix errors later. But it means that the definition of the metric shouldn't be affected by future events.

Monday, May 27, 2013

Some podcasts you should listen to if you're involved in A/B testing

Statistics has been in the news recently, which has led to some really thoughtful content on the topic. I started compiling a list of people I thought would enjoy listening to these podcasts, and that list got pretty long, so I'll use this blog to broadcast instead. I'll resist giving commentary or critiques of the speakers' actual conclusions, except to say they are interesting.

First was Frakt on Medicaid and the Oregon Medicaid Study, on EconTalk, which is a great discussion of the statistical power of studies.

Second is Paul Bloom and Joseph Simmons on Bloggingheads.tv, which really illustrates that getting fake results from bad statistical practices isn't just a theoretical problem, and that you can demonstrate this with simulations.


And finally, back on EconTalk, is Jim Manzi on the Oregon Medicaid Study, Experimental Evidence, and Causality, which gets into some more subtle analysis flaws that can destroy the value of A/B testing. It really drives home the point that trying to harvest a lot of confidence out of any single experiment is a failing endeavour; that confidence is gained through an iterative process of many simple experiments, each one updating your priors.

I'll break my no-commentary promise a little here. One thing I find quite interesting is how Simmons and Manzi essentially come to the same conclusion about the problem of gaining knowledge from a single experiment while using modern data-mining techniques, but they offer different cures. Simmons recommends not allowing yourself to search over your data along lots of dimensions, as that will surely lead to false positives. Whereas Manzi seems to say you should never be too positive about the results of any single experiment, so iterate over a series of small experiments instead, each one informing the next. Perhaps this is a reflection of their industries (academic vs business), but then this too may be overfit. They both agree that we have to accept that we can't gain truths as quickly as we currently think we can.

Sunday, February 10, 2013

Piecewise linear trends in the browser

Somehow I never blogged about the Javascript implementation of l1tf released by my friend Avi Bryant and me. l1tf is a way to find piecewise linear trends in time-series data.

Monday, February 04, 2013

Simple cross domain tracking

I hear of some really complicated schemes from time to time for tracking users across multiple domains that belong to a single site. While I'm sure they mostly work, it seems like there's a simple way to do this that I assume many people are already using but that is probably too boring to comment on. So, let's be boring for a moment.

Let us say you own eggs.com, bacon.com, and coffee.ca. When a user visits eggs.com, he is assigned a unique tracking token in the eggs.com cookie (we'll call it [tracking-token-eggs]). At some point after that token is assigned, include it in the page requests to //bacon.com/tracking.gif?token=[tracking-token-eggs]&domain=eggs.com and //coffee.ca/tracking.gif?token=[tracking-token-eggs]&domain=eggs.com. (Create the same setup for visitors to bacon.com and coffee.ca.)

If the browser already has a token stored in the bacon.com or coffee.ca cookies, you will now have a request that includes both domains and both tokens: both domains are in the url, one token is in the url, and the other token is in the cookie of the request. The first domain is also in the referrer/referer. This works even if 3rd-party cookies are blocked (at least in the browsers I've tried). Now you can store this request in a database table or just a log file.

If you want to do something slightly more complicated that involves javascript, you can alter the technique to use iframes instead of gifs. Just don't try to create or store any new tokens in the iframe from the foreign domain, because that is where these techniques fail.

[Edit: I should add that this is a technique for when you have half a dozen domains or so. Not for hundreds of domains.]

Monday, January 28, 2013

On calculating Fibonacci numbers in C

A few months ago Evan Miller wrote an essay called The Mathematical Hacker. While it is an interesting post, he does make a mistake when he gives the "proper way" to calculate the Fibonacci numbers.

The essay claims that you shouldn't use the tail-recursive method you would learn in a CS class to compute the Fibonacci numbers because, as any mathematician knows, an exact analytical solution exists. His C example looks like:
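(A minimal sketch of what that amounts to: Binet's formula computed with pow. The function name is mine, and the essay's exact code may differ in detail.)

    #include <math.h>

    /* Binet's formula: F(n) = (phi^n - psi^n) / sqrt(5),
       with phi = (1 + sqrt(5))/2 and psi = (1 - sqrt(5))/2. */
    unsigned long long fib_analytic(unsigned int n) {
        double sqrt5 = sqrt(5.0);
        return (unsigned long long)round((pow(0.5 + 0.5 * sqrt5, (double)n)
                                        - pow(0.5 - 0.5 * sqrt5, (double)n)) / sqrt5);
    }
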
But there are actually a few more optimizations I picked up while studying linear recurrence sequences that I thought I'd share. The first drops the time almost by half:
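(Again a sketch with my own naming; the idea is to drop the second pow call entirely and let the rounding absorb it.)

    /* The psi^n / sqrt(5) term always has magnitude below 0.5, so rounding
       phi^n / sqrt(5) alone still lands on the right integer. */
    unsigned long long fib_analytic2(unsigned int n) {
        return (unsigned long long)round(pow(0.5 + 0.5 * sqrt(5.0), (double)n) / sqrt(5.0));
    }
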
The reason this works is that the dropped part of the expression (the pow(0.5 - 0.5 * sqrt(5.0), n) / sqrt(5.0) term) always has a magnitude less than 0.5, so you get the same result just from the rounding.

Note: I compiled these benchmarks with gcc fibonacci.c -O2 -o fibonacci

Using this benchmark I get 12006 ms vs 7123 ms, and the validation number matches at 0x6744.

But there's yet another optimization:
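(A sketch of the accumulator-passing version; the source on github may differ in the details.)

    /* Classic tail-recursive form; with -O2 gcc turns the tail call into a loop. */
    static unsigned long long fib_tail(unsigned int n,
                                       unsigned long long a,
                                       unsigned long long b) {
        return n == 0 ? a : fib_tail(n - 1, b, a + b);
    }

    unsigned long long fib_recursive(unsigned int n) {
        return fib_tail(n, 0ULL, 1ULL);
    }
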
That's right, we can do even better by using the tail-call recursion method dismissed in the essay. Now we get a time of 2937 ms.

The observant among you will notice that my benchmark just recalculates the first 40 Fibonacci numbers over and over again, summing them and taking the last 4 hex digits of the sum for validation. (It's not just for validation. We also do this because if you compile with -O2 and don't do anything with the output, gcc is smart enough to skip the whole computation. And we need -O2 so gcc will recognize the tail call.)
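
For concreteness, a harness along these lines (the iteration count here is made up, the real numbers come from the source linked at the end, and it assumes one of the fib functions above is in the same file):

    #include <stdio.h>

    #define ITERATIONS 100000000UL   /* illustrative count, not the original's */

    int main(void) {
        unsigned long long r = 0;
        unsigned long i;
        for (i = 0; i < ITERATIONS; i++) {
            /* Swap in whichever fib implementation is being timed. */
            r = (r + fib_analytic(i % 40)) % 0x10000;
        }
        printf("validation: 0x%llx\n", r);   /* last 4 hex digits of the running sum */
        return 0;
    }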

You could call foul on me right now. After all, the reason the analytic approach is slower is pow, and pow becomes relatively more efficient as the exponents grow (its cost stays flat while the recursion's grows with n).

Alright, fair enough. Let us run the test again, except we'll sum the first 90 Fibonacci numbers instead (not much point going much past 90, since F(93) is the largest Fibonacci number that fits in a 64-bit unsigned integer). So we update the code to r = (r + f(i % 90)) % 0x10000;

Now we get 7795 ms for the recursive solution and 12840 ms and 7640 ms for the analytical solutions. I ran the benchmark a few times and the faster analytical method consistently beats the recursion, but I think a 2% difference has to be within the gcc optimization margin of error.

But there's something else to notice. For the two analytic solutions the validation number is 0x2644, but for the recursive solution it is 0x2f9c. Two against one, right? Well, votes don't count in math and dictatorships.

What happened is that at the 71st Fibonacci number both analytical solutions lose precision. This is because C doesn't check what we're trying to do. It does what we tell it to do. And we told it to take a floating-point approximation of an irrational number, with only the precision a double has, and raise it to a power.

I do want to stop here a moment and say I'm not pointing out this error as a gotcha moment or as evidence that Evan Miller is poor at math. I think How Not To Run An A/B Test is an incredibly important essay and should be understood by anyone who is using A/B test results. Also if you are doing statistics on a Mac you probably should have bought Wizard by now.

However, I do think this mistake illustrates an important lesson. If we tell programmers that the more math (or should that be Math?) they use, the better programmers they are, we are setting up unnecessary disasters.

One reason is that virtually no programmer spends a majority of their time doing things that look like Math. Most spend 99.5% of their time on things that don't. If a programmer takes this message to heart, they are going to spend a lot of time feeling like they aren't a true programmer, which is silly and bad in its own right.

The other issue is that a focus on better programming looking like Math can be a major distraction, and it can lead to really silly time-wasting debates (e.g. https://github.com/twitter/bijection/issues/41).

But most dreadfully, if we tell programmers that they should give more weight to the more mathematical solutions, they will often not choose the best solution for their context. I've certainly not given the best solution for finding Fibonacci numbers in all contexts. Heck, I'd bet you could get better results on my own benchmark by using memoization (for the record, there's a memoization technique for both the recursive and the analytical solutions -- but it's easier with the recursion).
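
For what it's worth, one way that memoized variant could look (entirely hypothetical, not taken from the github source):

    /* Lazily fill a table of every Fibonacci number that fits in 64 bits,
       then answer all later calls with a lookup. */
    unsigned long long fib_memo(unsigned int n) {
        static unsigned long long table[94];   /* F(93) is the last one that fits */
        static int filled = 0;
        int i;
        if (!filled) {
            table[0] = 0;
            table[1] = 1;
            for (i = 2; i < 94; i++)
                table[i] = table[i - 1] + table[i - 2];
            filled = 1;
        }
        return n < 94 ? table[n] : 0;   /* anything larger overflows 64 bits anyway */
    }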

My solution wouldn't be that all programmers learn more Math. My solution would be that it is good to be part of a group where different people know different things, and that we should take advantage of this knowledge base by not being embarrassed to ask questions. I have a number of friends who send me emails from time to time that just have the subject "math question." And I send emails all the time with the subject "programming question," "statistics question," "writing question," or even "math question." I find it works really well for me.

So no, I don't think every programmer needs to be taught more Math. Except for linear algebra of course. Everyone should learn more linear algebra.

(You can download the full source for these examples from github.)