One of my favourite modelling managers taught me that it wasn't his job to determine whether my judgement or his was right. You couldn't go to him with an argument, have him stroke his chin, think a bunch, and then pronounce whether he agreed. If our judgements disagreed, we'd work out what data we could collect to determine whose opinion lined up closer to our business.
I think about this when I see competing approaches to model training.
When I worked in credit risk, predictive models tended to be built with a life expectancy of 6 to 36 months. During the model's lifetime, score distributions and model fitness are closely watched. When problems are discovered, adjustments and realignments are made along the way rather than scrapping and retraining.
Many software engineers seem to have come to a different conclusion. Here the ultimate solution seems to be to build a tool that just keeps retraining monthly/weekly/daily. (Ok, I don’t actually believe anyone is advocating daily. But the hyperbole makes the point.) Online training is regarded as being obviously superior to all other solutions.
The most common argument you might have for rebuilding your model regularly is that you are worried about your model becoming stale. The world changes over time so the factors that were predictive aren't going to predict as well as time goes on. This is certainly eventually true.
But consider what you're actually saying here. Say you are trying to predict 30-day attrition. Then your data has to be at least 30 days old to begin with. After all, how can you tell me whether a user from two days ago will churn within 30 days? In order to amass some volume, your observation period probably goes back another 15 days (and maybe as much as 100). Putting this together, what you're saying is that at 45 days the model is fresh, but at 75 days it is unacceptably stale? I'm skeptical that there's going to be a shift in society that was observable 45 days ago, not observable 75 days ago, and is still relevant now.
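To make that arithmetic concrete, here's a toy sketch of the timeline. The 30-day label window and 15-day observation span are the illustrative numbers from the example above, not recommendations:

```python
# Toy timeline arithmetic for the 30-day attrition example.
# Both numbers are the illustrative ones from the text.
LABEL_WINDOW_DAYS = 30      # must wait 30 days to know whether a user churned
OBSERVATION_SPAN_DAYS = 15  # extra span needed to amass enough labeled users

# At training time, the youngest labeled example is already 30 days old
# and the oldest is 45 days old -- so a "fresh" model is built on data
# that is 30-45 days stale before it serves a single prediction.
min_age_at_training = LABEL_WINDOW_DAYS
max_age_at_training = LABEL_WINDOW_DAYS + OBSERVATION_SPAN_DAYS
print(min_age_at_training, max_age_at_training)  # 30 45
```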
The other reason to keep rebuilding is that you're building up a larger observation set. If this is your argument, clearly you're not worried about staleness, because you're probably making your observation period as wide as possible to capture as many observations as possible. But again, I'm not convinced you're going to get that many wins. Maybe you're starting very early and the first month you retrain you double your observations. That's probably going to make some difference. But after that you're increasing by a third, then a quarter. These returns diminish pretty quickly.
But of course the big argument to retrain is what's the harm?
Some reasons against auto-retraining:
- You end up building rigid structures that you won't modify. Automating the process means every choice costs a little more work than just doing it once. If you have an idea for post-processing your model's output and you're hand-building the model, you just write the post-processor and test the code. With automatic retraining you probably wouldn't try it, because the post-processor has to cope with a moving model, or your framework simply has no place for it.
- Your model interactions won't be consistent. In the long run you're going to end up with multiple models, because different data arrives at different times and you need to make the right choices at the right time. You might even have models optimized for different targets. Every time you retrain a model you can check whether it improves on predicting its own target. But what do you do if model A says 80% and model B says 30%? You want to know that this cohort isn't constantly changing in personality. But every time you retrain, you lose knowledge of model interactions.
- You'll lose out on gradual model improvements. Or let me put this a scarier way: you will be constantly running with mistakes.
- You end up with a more average model. The quality of a model is of course just a sample from a distribution, and the observed performance on a validation set is a sample from a distribution that depends on the model's quality. So what happens if you keep re-sampling? You end up with the expected outcome.
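The last point can be illustrated with a toy simulation. Assume, purely for illustration, that each training run yields a model whose true quality is an independent draw (say, AUC distributed as Normal(0.75, 0.02)): keeping one vetted model is one draw you got to inspect, while perpetual retraining averages out to the mean.

```python
# Toy simulation of the "more average model" effect. Assumption (for
# illustration only): each training run yields a model whose true
# quality is an independent draw from Normal(0.75, 0.02).
import random

random.seed(0)

def train_once():
    # Hypothetical stand-in for a full training run.
    return random.gauss(0.75, 0.02)

# Keep-and-maintain: train a handful of candidates once, keep the best.
kept = max(train_once() for _ in range(5))

# Auto-retrain: accept a fresh draw every period.
periods = 10_000
avg_retrained = sum(train_once() for _ in range(periods)) / periods

print(f"kept model quality: {kept:.3f}")
print(f"mean auto-retrained quality: {avg_retrained:.3f}")  # close to the 0.75 mean
```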
Convinced? You shouldn't be. No really... you really really shouldn't be.
That's because if you're in the position where you are guiding model based strategies you should be pretty impervious to arguments. You should understand that the best sounding argument is frequently wrong.
In the case of deciding how and when to retrain, which of these arguments makes sense for your situation depends a lot on… well... your situation. In the case of advertising and spam, where your performance periods are short and the users change quickly, very few of my reasons against auto-retraining make sense. However, for modelling churn and fraud the situation is pretty different: the performance periods are longer and the behaviours change more slowly. How do you know what situation you’re in?
That’s why you work with the data instead. Figure out what the data tells you.
You can build the composite model that you would have had if you had retrained every week, or every month, or kept a single model. Here's a simple task. Take your history and train a model for every week. Then evaluate every model on every future week, graphing the model's AUC on the y-axis. You basically end up with what looks like a cohort chart for model age. The rate at which your model degrades will become really clear.
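Here's a sketch of that exercise. The synthetic drifting data, the hand-rolled logistic regression, and the rank-based AUC are all stand-ins for your own history and tooling; only the train-per-week, score-every-later-week structure is the point:

```python
# Sketch of the model-age exercise: train one model per historical week,
# evaluate it on every later week, and look at AUC as a function of
# model age. The synthetic drifting data is a stand-in for real history.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)

def make_week(week):
    """Stand-in weekly batch whose true relationship drifts over time."""
    X = rng.normal(size=(500, 4))
    drift = 0.05 * week
    logits = X @ np.array([1.0 - drift, 0.5, -0.5, drift])
    y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

def fit_logreg(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression (no intercept)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def auc_score(y, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

weeks = [make_week(w) for w in range(20)]

# auc[(i, j)] = AUC of the model trained on week i, scored on week j > i.
auc = {}
for i, (X_train, y_train) in enumerate(weeks):
    w = fit_logreg(X_train, y_train)
    for j in range(i + 1, len(weeks)):
        X_eval, y_eval = weeks[j]
        auc[(i, j)] = auc_score(y_eval, X_eval @ w)

# Group by model age (j - i); plotting these gives the cohort-style chart.
by_age = defaultdict(list)
for (i, j), a in auc.items():
    by_age[j - i].append(a)
for age in sorted(by_age):
    print(f"age {age:2d} weeks: mean AUC {np.mean(by_age[age]):.3f}")
```

Plot the same numbers with model age on the x-axis and AUC on the y-axis, one line per training week, and you get the cohort chart described above.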
This exercise of course isn't perfect. In reality, the data you would have collected had you updated your model regularly wouldn't look like the data you currently have. But if these effects are large enough to skew your results and change your conclusion, you have even bigger problems.
I was reminded in a recent conversation with a very skilled modeller of the old adage: the proof of the pudding is in the eating. Ultimately, you are going to be far happier if you roll out these changes with champion/challenger strategies (really just A/B testing). When you're building code that generates models that then affect users, it is far harder to tell which work is actually adding value and which is only adding debt. It is far too easy to misunderstand the actual effects your users are experiencing.
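A minimal sketch of the assignment side of a champion/challenger setup, as one possible design: each user is deterministically hashed into an arm so they always see the same model, and the 10% challenger share is an arbitrary illustrative choice.

```python
# Deterministic champion/challenger assignment (really just A/B testing).
# The arm names and the 10% challenger share are illustrative choices.
import hashlib

def assign_model(user_id: str, challenger_share: float = 0.10) -> str:
    """Route a user to 'champion' or 'challenger', stably per user."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"

# Log which arm served each user, then compare the arms on the same
# downstream metric (e.g. realized churn) before promoting the challenger.
print(assign_model("user-123"))
```

Hashing the user id, rather than randomizing per request, is what makes the comparison clean: a user's experience comes from exactly one model for the life of the test.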
I think it’s reasonable to say that in the field of modelling you should be spending at least half your time measuring rather than building. At first this seems disappointing, because it means you will only be able to build half as many of the things you are excited about. But when I look back over the things I’ve worked on, the things I’ve measured are the things I’m most proud of. It is only the things I’ve measured that have allowed me to gain knowledge from my experience.