If you have had a conversation with me about statistics, you've probably gathered that I am a little hostile toward the frequent use of p-values. This is partly because it is very easy to confuse their meanings. And I'm very happy that the Bayesians are out there fighting this battle and trying to get people to accept something better. But this isn't my real issue.
And there's the issue that many people aren't reporting adjusted p-values when using data mining techniques, which makes their results appear a lot more significant than they are (as brilliantly illustrated by xkcd). This is a major problem. But it's also not the problem I'm going to talk about here.
My issue is that even in the simplest of A/B tests that you may be running, your p-value is probably much less meaningful than your stats class taught you.
Suppose you are working for Blinko Laboratory Supplies. You are trying to perfect a new automatic diagnosis device that detects the presence of a particular bacterium in blood samples. Right now you are testing a series of code patches to the detection algorithm to see if you can improve your accuracy. So you start off the device with 1,000 positive samples and it misses 156 of them. Pretty high failure rate, right? But that's why you're trying to improve the situation.
So you put in your first code patch, which uses a fancy new elliptic curve technique for pattern detection (are elliptic curve techniques still the hotness, or am I dating myself here?). You put through another 1,000 samples and now the device only misses 104. Sweet! I mean, you're still too high to release to the market, but this result has a p-value of 0.0003 (using a quick normal approximation and the sample means to compute variance). You must have done something right. So you apply the next patch. And the error count goes up to 150. Shoot, that was a bad patch. The p-value of 0.001 suggests this swing wasn't random noise. This patch must have caused harm. Well, you undo that last patch and your error count drops to 141.
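For the curious, those back-of-the-envelope p-values can be reproduced with a quick normal approximation on the difference of miss rates, using the sample means to estimate the variance. This is a minimal sketch; the function name and the one-sided choice are my assumptions, not something from the original calculation:

```python
from math import sqrt, erfc

def one_sided_p(misses_a, misses_b, n=1000):
    """One-sided p-value for a change in miss rate between two
    batches of n trials, via a normal approximation with variance
    estimated from the sample proportions."""
    pa, pb = misses_a / n, misses_b / n
    se = sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
    z = abs(pa - pb) / se
    # Upper tail of the standard normal: 1 - Phi(z)
    return 0.5 * erfc(z / sqrt(2))

print(one_sided_p(156, 104))  # roughly 0.0003
print(one_sided_p(104, 150))  # roughly 0.001
```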
Wait, what? The statistics are suggesting that it is far more likely that your last patch never got undone. So you start debugging and you notice that the device has been disconnected from the network the whole time. But how is that possible?! Surely something changed between your first test and your second test. As well as between your second test and your third test. You have the p-values to prove it.
These p-values must guarantee you something, right?
But here's what you didn't know. Every 200 trials the machine performs a quick self-cleaning and scrubs its lens. But this process has a 20% chance of leaving a streak on the lens. When there's no streak, your device is able to detect 90% of the infections, but when there is a streak it only detects 60% of them (the numbers used were generated using this distribution). So while you thought the proper n to use for your calculation was 1,000, there was also this other n of 5 hiding in the machine. Not that 5 is the proper n to use either; you actually have a completely different distribution. (Is this a convolution? I can never get that definition straight.)
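Here's a minimal sketch of that hidden process, assuming the cleaning cycle and streak numbers above (the function and parameter names are mine):

```python
import random

def run_batch(n=1000, clean_every=200, streak_p=0.2,
              clean_rate=0.9, streak_rate=0.6, seed=None):
    """Simulate one batch of n positive samples through the device.
    Every clean_every trials the lens is scrubbed, leaving a streak
    with probability streak_p; a streak drops the detection rate
    from clean_rate to streak_rate until the next cleaning."""
    rng = random.Random(seed)
    misses = 0
    rate = clean_rate
    for i in range(n):
        if i % clean_every == 0:  # self-cleaning cycle
            rate = streak_rate if rng.random() < streak_p else clean_rate
        if rng.random() >= rate:  # infection not detected
            misses += 1
    return misses
```

Run this a few times and you'll see batch-to-batch swings of the size in the story, with no code patches involved: the expected miss rate is 0.8 × 10% + 0.2 × 40% = 16%, but any single batch depends heavily on how many of its five segments got a streak.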
Sure, there are simple techniques that could have caught this. But how often are they actually applied? I know I can check whether my errors are autocorrelated in serial testing. How often do I do it? How many simple techniques do I need before I'm impervious?
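One such simple check, sketched here, is the lag-1 autocorrelation of miss counts over consecutive blocks of trials; a hidden process like the streaky lens pushes it well above zero, while independent errors keep it near zero. The function and the idea of blocking the trials are my illustration, not a prescribed procedure:

```python
def lag1_autocorr(xs):
    """Lag-1 sample autocorrelation of a sequence of block-level
    miss counts. Values well above zero hint at hidden serial
    structure (e.g. a streak persisting across several blocks)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[i] - mean) * (xs[i + 1] - mean)
               for i in range(n - 1)) / var

# Alternating counts: negatively correlated
print(lag1_autocorr([1, 2, 1, 2, 1, 2, 1, 2]))   # -0.875
# Counts that jump to a new level and stay there: positively correlated
print(lag1_autocorr([1, 1, 1, 1, 5, 5, 5, 5]))   # 0.625
```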
Clearly what I'm pointing out here is nothing new. Some people call this systemic noise. Or it's the issue of assuming independence. This isn't exactly a Black Swan I'm talking about, because it isn't a low-incidence, high-impact event. It's more like a medium-incidence, medium-impact event.
But think about how many hidden low n's are affecting the results in your business. Maybe you are in the credit card business and you want to forecast how many proposals you will receive in a month. How many TV ad buys do you think are made in Toronto by credit attorneys? Under 20? What would the impact be on your Toronto proposals if there were two more?
Or you are trying to predict traffic to your website. How many blog posts are written on your site each month? 10? What if there was one less? (And surely you know that the number of blog posts written about your site isn't Poisson.)
This isn't to say you shouldn't have any understanding of the naive p-values implied by your results. You should be able to recognize when a result falls within your statistical noise. But please don't spend a huge amount of time trying to sharpen the precision of your p-value and your standard error. If the extra work you are doing doesn't improve the precision of your p-value by at least an order of magnitude, there are probably better ways you could be spending your time. Lower your error; don't sharpen it.
For example, consider Nate Silver's approach. Just before an election he has predicted, he writes about creative scenarios that would lead to his predictions being radically wrong. He's not computing probabilities for these events, just noting them as possibilities and considering their impacts. He's spending less time computing and more time thinking. Thinking about the impact of events that are completely outside your model is something most of us spend far too little time doing.