If you have had a conversation with me about statistics you've probably gathered that I am a little hostile to the frequent use of p-values. This is partly because it is very easy to confuse their meanings. And I'm very happy that the Bayesians out there are fighting this battle and trying to get people to accept something better. But this isn't my real issue.
And there's the issue that many people aren't reporting adjusted p-values when using data mining techniques, which makes their results appear a lot more significant than they are (as brilliantly illustrated by xkcd). This is a major problem. But it's also not the problem I'm going to talk about here.
My issue is that even in the simplest of A/B tests that you may be running, your p-value is probably much less meaningful than your stats class taught you.
Consider that you are working for Blinko Laboratory supplies. You are trying to perfect a new automatic diagnosis device that detects the presence of a particular bacterium in blood samples. Right now you are testing a series of code patches to the detection algorithm to see if you can improve your accuracy. So you start off the device with 1,000 positive samples and it misses 156 of them. Pretty high failure rate, right? But that's why you're trying to improve the situation.
So you put in your first code patch, which uses a fancy new elliptic curve technique for pattern detection (are elliptic curve techniques still the hotness, or am I dating myself here?). You put through another 1,000 samples and now the device only misses 104. Sweet! I mean, you're still too high to release to the market, but this result has a p-value of 0.0003 (using a quick normal approximation and the sample means to compute the variance). You must have done something. So you apply the next patch. And the error count goes up to 150. Shoot, that was a bad patch. The p-value of 0.001 suggests this swing wasn't random noise. This patch must have caused harm. Well, you undo that last patch and your error count drops to 141.
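For the curious, that back-of-the-envelope calculation looks roughly like this in R (a sketch only, using the normal approximation and the sample proportions for the variance, as described above):

    # Two-proportion z-test, normal approximation, numbers from the story above
    n  <- 1000
    p1 <- 156 / n                       # miss rate before the patch
    p2 <- 104 / n                       # miss rate after the patch
    se <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z  <- (p1 - p2) / se
    pnorm(-abs(z))                      # one-sided p-value, roughly 0.0003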
Wait, what? The statistics are suggesting that it is far more likely that your last patch never got undone. So you start debugging and you notice that the device has been disconnected from the network the whole time. But how is that possible?! Surely something changed between your first test and your second test, and again between your second test and your third test. You have the p-values to prove it.
These p-values must guarantee you something, right?
But here's what you didn't know. Every 200 trials the machine performs a quick self-cleaning and scrubs its lens. But this process has a 20% chance of leaving a streak on the lens. When there's no streak your device is able to detect 90% of the infections, but when there is a streak it only detects 60% of them (the numbers used were generated using this distribution). So while you thought the proper n to use for your calculation was 1,000, there was also this other n of 5 hiding in the machine. Not that 5 is the proper n to use either; you actually have a completely different distribution. (Is this a convolution? I can never get that definition straight.)
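If you want to see how much extra spread that hidden n of 5 introduces, here is a quick simulation sketch in R of the streak mechanism just described (the 20%, 90%, and 60% figures are the ones from the story; everything else is made up for illustration):

    # Each run of 1,000 samples is really 5 blocks of 200, and each block
    # independently has a 20% chance of being run with a streaked lens.
    simulate_misses <- function() {
      streaked  <- rbinom(5, 1, 0.2)                 # streak after each cleaning?
      miss_rate <- ifelse(streaked == 1, 0.4, 0.1)   # 60% vs 90% detection
      sum(rbinom(5, 200, miss_rate))                 # total misses out of 1,000
    }
    misses <- replicate(10000, simulate_misses())
    quantile(misses, c(0.05, 0.5, 0.95))             # much wider than Binomial(1000, 0.16)

Run it a few times and you'll see that miss counts like 104, 141, 150, and 156 all sit comfortably inside one and the same distribution, no code patches required.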
Sure, there are simple techniques that could have caught this. But how often are they actually done? I know I can check whether my errors are autoregressive in serial testing. How often do I do it? How many simple techniques would I need to make myself impervious?
Clearly what I'm pointing out here is nothing new. Some people call this systemic noise. Others frame it as the problem of assuming independence. This isn't exactly a Black Swan I'm talking about here, because it isn't a low-incidence, high-impact event. It's more like a medium-incidence, medium-impact event.
But think how many hidden low n's are affecting the results in your business. Maybe you are in the credit card business and you want to forecast how many proposals you will receive in a month. How many TV ad buys do you think are made in Toronto by credit attorneys? Under 20? What would the impact be on your Toronto proposals if there were two more?
Or you are trying to predict traffic to your website. How many blog posts are written on your site each month? 10? What if there was one less? (And surely you know that the number of blog posts written about your site isn't Poisson.)
This isn't to say you shouldn't have any understanding of the naive p-values implied by your results. You should be able to recognize when a result falls within your statistical noise. But please don't spend a huge amount of time trying to sharpen the precision of your p-value and your standard error. If the extra work you are doing doesn't increase the precision of your p-value by at least an order of magnitude, there are probably better ways you could be spending your time. Lower your error; don't sharpen it.
For example, consider Nate Silver's approach. Just before an election he has predicted the results for, he writes about creative scenarios that would lead to his predictions being radically wrong. He's not computing probabilities for these events; he's just noting them as possibilities and considering their impacts. He's spending less time computing and more time thinking. Thinking about the impacts of events that are completely outside your model is something most of us spend far too little time doing.
Thursday, May 19, 2011
3 comments:
Hey Steven,
I feel like you were talking to me. That noise is exactly what I am afraid of in my study. How can I reduce the noise? I got highly significant p-values (<0.001), but when I look at the actual numbers, I know for a fact that the difference between groups is not clinically significant. I don't know how to explain that.
Thanks Hadil. Sorry I've taken so long to respond.
I was going to see if I could answer with a gamut of statistical recipes for you to run through that would make you safe, but then I realized that I would just be repeating the same mistake that I was complaining about. Plus there are people who did take a lot of stats classes, which I didn't (I just like spouting off), who could do a far better job.
Instead I would just say keep in mind what a p-value is really saying. It's saying that if the assumptions of your model are correct, then the chance of seeing an effect of this size, when there is no effect in the general population, is very small. So if one of the assumptions of the model is that your observations are independent, then you should test that assumption. Particularly if you are uncomfortable with the confidence that the p-value seems to be giving you. How you do this may involve more creativity than rigor.
Personally, I'm always a big fan of segmentation. Do you have extra variables on your observations that you aren't currently using? Like the day of the week, the time of day, or the height of the subject. If so, break your observations into different groups by way of these variables and see if you can't make your effect disappear for all but a couple of groups. This is a data mining technique, though, so you'll want to see pretty sizable differences in effect between groups to be sure you found something. When you look at the same observations across many dimensions, eventually you'll find some outliers just by chance.
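As a rough sketch of what I mean in R (the data frame and its columns here are hypothetical, just to show the shape of the check):

    # One row per observation: the outcome, the control/test assignment, and an
    # extra variable you weren't using (weekday here, purely for illustration)
    d <- data.frame(outcome = rnorm(700),
                    group   = rep(c("control", "test"), length.out = 700),
                    weekday = rep(weekdays(Sys.Date() + 0:6), each = 100))
    # Mean outcome per group within each segment: does the effect show up in most
    # segments, or does it live entirely inside one or two of them?
    aggregate(outcome ~ group + weekday, data = d, FUN = mean)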
Alternatively, if your observations are orderable in some way, like by time or location, you can see if your dependent variable is autoregressive at all in your control group or in your experimental group. R has an ARIMA function for exactly this (you'll want to play with different parameters, but you probably want to see something from a model of the form (n,0,0) or (0,0,n) where n is small compared to your number of observations). Or you can just do a linear regression of the series against itself with the first observation removed.
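Something along these lines, where y stands in for your ordered observations (hypothetical data here, just to show the calls):

    # Is the series autoregressive? `y` is a stand-in for your outcome in observation order
    y <- rnorm(200)
    arima(y, order = c(1, 0, 0))        # an AR(1) fit; compare the ar1 coefficient to its s.e.
    # Or the quick-and-dirty version: regress the series on its lagged self
    summary(lm(y[-1] ~ y[-length(y)]))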
And of course the gold standard: repeat the whole experiment over again and see if the effect is about the same size as it was before. Even if the effect grows, that can be a warning sign that your p-value is misleading you.
I hope that's of some help. In the stuff I tend to look at, most effects aren't small. Small effects are usually either big effects happening to a small group within a larger group, or two big effects on two big groups that are mostly cancelling each other out. But then that's probably just observational bias on my part. I'm probably forgetting about the small effects because they aren't as fun.
Please share if you end up finding any great tools that helped you with your concerns.