A mathematician at risk: Once I was a mathematician, then I wrote PHP, now I analyze risk

A LLR for significance of interaction of two variables upon an effect (or how to compare A/B test effects in different segments)
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
I should warn you, this is a blog post about a formula. And a formula that's really a work in progress. But I think it's a useful formula nonetheless. Before I reveal it, let me give a motivating example of why you might find it interesting.<br />
<br />
Consider an online store that decides to increase conversions with an extra incentive: with every purchase they are going to include company stickers. The problem is they only have a fixed inventory of stickers, so they can't give them to everyone. So how do they decide whom to offer the stickers to? When do you say "if you buy today we'll throw in a sticker!"?<br />
<br />
So the company enlists two data scientists to solve the problem. And, as data scientists are wont to do, they take a look at the funnel and build a conversion model. After cleaning the data and understanding the results they have a segmentation that breaks visitors up by conversion rate. So they schedule a meeting to present their results.<br />
<br />
Scientist 1: After combing through the data we've figured out your best and worst converters.<br />
Scientist 2: Visitors from Texas who are on the site before 9am are your highest converters. And visitors from western Canada who visit after 3pm are your lowest.<br />
Scientist 1: So when can we begin offering these stickers to the Texans?<br />
Scientist 2: You mean the Canadians? The Texans are already converting high. It's the Canadians that have the low conversion rate we have to raise.<br />
Scientist 1: You want to give an incentive to our least receptive audience? Clearly the Canadians don't like us. We're not going to be able to make them like us with just some stickers. At least we know the Texans like us and the non-converters just need a little bit more of a nudge.<br />
Scientist 2: But if we give an extra promotion to the Texans we're just going to be cannibalizing our own sales. We know we'll be giving out a ton of our limited stickers to visitors who were already going to convert.<br />
<br />
And this is when the data scientists look around and realize everyone else has already left the meeting. The problem is they didn't think about the actual problem they wanted to solve. What they needed to find out is the incremental impact of offering stickers to particular visitors. Do they gain more conversions on the margin by offering stickers to the after-3pm western Canadians or to the before-9am Texans? Or is that even the right divide? Perhaps the best use of the stickers is to offer them to Firefox users referred by organic search. What they want to model is the incremental impact of offering a sticker to any given visitor.<br />
<br />
Unsurprisingly, this will require an A/B test. What's trickier is how to handle the results. Usually when you ask whether two segments are different (Chrome users vs non-Chrome users) you want to find out if they convert at different rates. But remember, our data scientists already answered that question and it didn't help them solve the problem they were given. What they want to know is whether there's an interaction between their segmentation variable and their assignment variable with respect to conversion.<br />
<br />
You can imagine this is interesting in general anytime you have an A/B test. You may know that treatment B converted better than treatment A. But it would be nice to know whether there was a particular cohort that especially preferred B. Perhaps there's even a small cohort that preferred treatment A.<br />
<br />
The simplest model you can look at is just stating the global results without segmentation: the group with treatment A converted at 20% and the group with treatment B converted at 25%, so treatment B increases conversion by 5 percentage points. The second simplest model is to take a single split of the population and determine whether the two subpopulations differ, with statistical significance, with respect to the property you care about. Say Segment 1 had conversion rates of 15% and 20% for A and B respectively while Segment 2 had conversion rates of 30% and 40%. Did B have a bigger effect over A in Segment 1 or Segment 2? (The answer will vary depending on what we mean by "effect".)<br />
<br />
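To make that parenthetical concrete, here's a quick Python sketch (my own illustration, not part of the calculator linked later) of three common ways to measure "effect" on those example rates. Note that they don't all agree: relative lift calls the two segments a tie, while absolute lift and the odds ratio both favour Segment 2.
<pre>
# Three common ways to quantify "B's effect over A" for the example rates above.
rates = {"Segment 1": (0.15, 0.20), "Segment 2": (0.30, 0.40)}  # (rate under A, rate under B)

for name, (a, b) in rates.items():
    absolute_lift = b - a                       # percentage-point difference
    relative_lift = b / a - 1                   # proportional increase in the rate
    odds_ratio = (b / (1 - b)) / (a / (1 - a))  # multiplicative increase in the odds
    print(name, round(absolute_lift, 3), round(relative_lift, 3), round(odds_ratio, 3))

# Segment 1: 0.05  0.333  1.417
# Segment 2: 0.10  0.333  1.556
</pre>
<br />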
Consider what data you need to collect. For each of the two segments you need the number of conversions (henceforth referred to as <i>target</i>) for each treatment, as well as the number of non-conversions (<i>non-target</i>) for each treatment. This might look something like:<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid #ccc; font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="100"></col><col width="100"></col><col width="100"></col><col width="100"></col><col width="100"></col><col width="116"></col></colgroup><tbody>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A/B"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">A/B</td><td data-sheets-value="[null,2,"segment"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">segment</td><td data-sheets-value="[null,2,"target"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">target</td><td data-sheets-value="[null,2,"non-target"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">non-target</td><td data-sheets-value="[null,2,"total"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">total</td><td data-sheets-value="[null,2,"odds of conversion"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">target:non-target odds</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">A</td><td data-sheets-value="[null,3,null,1]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,45]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">45</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,455]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">455</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,500]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">500</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.0989010989010989]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.10</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"B"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">B</td><td data-sheets-value="[null,3,null,1]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,54]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">54</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,446]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">446</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,500]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">500</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1210762331838565]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.12</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">A</td><td data-sheets-value="[null,3,null,2]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">2</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,130]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">130</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,870]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">870</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.14942528735632185]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.15</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"B"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">B</td><td data-sheets-value="[null,3,null,2]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">2</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,138]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">138</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,862]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">862</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.16009280742459397]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.16</td></tr>
</tbody></table>
</div>
<div style="text-align: left;">
<br />
From this you may make some charts to try to tell what is going on:</div>
<br />
<iframe frameborder="0" height="322" scrolling="no" seamless="" src="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubchart?oid=1488501709&format=interactive" width="500"></iframe>
Well, segment 2 certainly has more targets than segment 1 in both A and B. But it also has more non-targets. And did B go up more in segment 2? Maybe it's better to look at some percentages.<br />
<iframe frameborder="0" height="339" scrolling="no" seamless="" src="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubchart?oid=438872824&format=interactive" width="495"></iframe><br />
Ok, now we can tell segment 2 definitely has a higher target rate. And treatment B has a higher target rate in both segments. But did treatment B have more marginal impact in segment 1 or segment 2? For that we look at the ratio of the target odds between B and A in each segment.<br />
<iframe frameborder="0" height="283" scrolling="no" seamless="" src="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubchart?oid=1349240198&format=interactive" width="553"></iframe><br />
Great, we can see treatment B has more of an impact on target rate over treatment A in Segment 1 than in Segment 2. In Segment 1, the odds of conversion increased by roughly 20% when going from treatment A to treatment B, but increased by only about 7% in Segment 2.<br />
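For concreteness, here is a minimal Python sketch of that calculation on the counts from the table above (the unrounded ratios come out to roughly 1.22 and 1.07, consistent with the rounded figures in the chart):
<pre>
# B-over-A odds ratio within each segment, using the counts from the first table.
counts = {  # (segment, treatment) -> (target, non_target)
    (1, "A"): (45, 455), (1, "B"): (54, 446),
    (2, "A"): (130, 870), (2, "B"): (138, 862),
}

def odds(target, non_target):
    return target / non_target

for seg in (1, 2):
    ratio = odds(*counts[(seg, "B")]) / odds(*counts[(seg, "A")])
    print(f"segment {seg}: B/A odds ratio = {ratio:.3f}")

# segment 1: B/A odds ratio = 1.224  (odds up ~22%)
# segment 2: B/A odds ratio = 1.071  (odds up ~7%)
</pre>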
<br />
But there are so many moving parts. We started with 8 numbers (target/non-target counts for each segment and treatment) and kept dividing them in various ways. Each of those 8 numbers is just a sample with error bounds. How do we know if we have enough volume? Can we state any conclusion with any confidence at this point?<br />
<br />
And that's where the promised <a href="http://snoble.github.io/sage-llr-for-ab-signicance-testing/formula.svg" target="_blank">formula</a> comes in. It's not pretty to look at, so I'm going to make you click through if you want to see it. Technically, it is the maximal log likelihood of a given segment when an <i>impact</i> or <i>effect</i> is provided. In this case the <i>impact</i> is the difference between the log odds observed under treatment B and the log odds under treatment A. Given this you can compute a log likelihood ratio (LLR) comparing the assumption that both segments have the same <i>impact</i> against the assumption that they have distinct <i>impacts</i>.<br />
<br />
That's quite a mouthful but for those interested I've provided the <a href="http://snoble.github.io/sage-llr-for-ab-signicance-testing/" target="_blank">derivation</a> using <i>sagemath</i> (which is only a handful of lines given that <i>sagemath</i> does most of the heavy lifting for us).<br />
<br />
What's actually useful is that, using this derivation, I've written a simple <a href="https://trinket.io/library/trinkets/e44e4c9cb6" target="_blank">calculator</a> to compute these LLR values. Feel free to take and modify this code.<br />
<br />
For the above example the calculator gives us an LLR of 0.118, which is quite small, so we would say there is no detectable difference in the strength of the impact of treatment B over A in segment 2 compared to segment 1. That is to say, we cannot say with confidence that there is an interaction between the assignment variable and the segmentation variable upon our target.<br />
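If you'd rather not click through to the calculator, here is a rough, self-contained sketch of the same kind of computation: the LLR is the gap between the maximized log likelihood when each segment gets its own log-odds effect and the maximized log likelihood when both segments are forced to share one effect. This sketch uses scipy for the constrained fit; it may not match the calculator's normalization exactly, so treat it as illustrative rather than a drop-in replacement.
<pre>
import numpy as np
from scipy.optimize import minimize

# Counts from the first example table: (segment, treatment) -> (target, non-target).
counts = {
    (1, "A"): (45, 455), (1, "B"): (54, 446),
    (2, "A"): (130, 870), (2, "B"): (138, 862),
}

def cell_ll(t, n, p):
    """Binomial log-likelihood of t targets and n non-targets at target rate p."""
    return t * np.log(p) + n * np.log(1 - p)

# Unconstrained maximum: every cell is free to use its own observed target rate.
ll_distinct = sum(cell_ll(t, n, t / (t + n)) for t, n in counts.values())

def neg_ll_shared(params):
    # params = (a1, a2, delta): a1/a2 are the log odds of the target under
    # treatment A in segments 1 and 2; delta is a single shared shift in the
    # log odds when moving from A to B, i.e. the "impact" is forced to be equal.
    a1, a2, delta = params
    ll = 0.0
    for (seg, grp), (t, n) in counts.items():
        logit = (a1 if seg == 1 else a2) + (delta if grp == "B" else 0.0)
        p = 1.0 / (1.0 + np.exp(-logit))
        ll += cell_ll(t, n, p)
    return -ll

res = minimize(neg_ll_shared, x0=np.zeros(3), method="Nelder-Mead")
ll_shared = -res.fun

print("LLR (distinct impacts vs shared impact):", ll_distinct - ll_shared)
</pre>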
<br />
It's interesting to note that the target rates of our two segments are significantly different: the target rate goes from 10% in segment 1 to 13.4% in segment 2. If we were to run a simple t-test between our two segments with respect to target we would find they are different with statistical significance. But remember, we're not interested in separating the high converters from the low converters.<br />
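As a quick check of that claim, here is a sketch using a chi-square test on the segment-by-outcome contingency table (a standard substitute for a t-test when the outcome is binary, using scipy as an assumed dependency); the p-value is well below 0.05:
<pre>
from scipy.stats import chi2_contingency

# Collapse the first table over treatment: rows are segments, columns are target / non-target.
table = [[45 + 54, 455 + 446],    # segment 1:  99 targets,  901 non-targets (~9.9%)
         [130 + 138, 870 + 862]]  # segment 2: 268 targets, 1732 non-targets (~13.4%)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # p is well under 0.05: the segments differ
</pre>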
<br />
But what if we tweak the numbers a bit and increase the volume:<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid #ccc; font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="100"></col><col width="100"></col><col width="100"></col><col width="100"></col><col width="100"></col><col width="116"></col></colgroup><tbody>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A/B"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">A/B</td><td data-sheets-value="[null,2,"segment"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">segment</td><td data-sheets-value="[null,2,"target"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">target</td><td data-sheets-value="[null,2,"non-target"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">non-target</td><td data-sheets-value="[null,2,"total"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">total</td><td data-sheets-value="[null,2,"odds of conversion"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">target:non-target odds</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">A</td><td data-sheets-value="[null,3,null,1]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,455]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">455</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,4545]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">4,545</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,5000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">5,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1001100110011001]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.10</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"B"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">B</td><td data-sheets-value="[null,3,null,1]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,652]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">652</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,4348]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">4,348</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,5000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">5,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1499540018399264]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.15</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">A</td><td data-sheets-value="[null,3,null,2]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">2</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1304]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1,304</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,8696]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">8,696</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,10000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">10,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1499540018399264]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.15</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"B"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">B</td><td data-sheets-value="[null,3,null,2]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">2</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1228]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1,228</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,8772]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">8,772</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,10000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">10,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.13999088007295943]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.14</td></tr>
</tbody></table>
<br />
<div>
Now we get an LLR of 19.07. This is starting to get high enough to be convincing that there is an interaction effect, despite the graphs looking very similar:<br />
<iframe frameborder="0" height="322" scrolling="no" seamless="" src="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubchart?oid=735638973&format=interactive" width="500"></iframe>
<iframe frameborder="0" height="312" scrolling="no" seamless="" src="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubchart?oid=170447633&format=interactive" width="495"></iframe>
<iframe frameborder="0" height="283" scrolling="no" seamless="" src="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubchart?oid=1743256965&format=interactive" width="553"></iframe><br />
The difference is that there's more volume and a slightly bigger difference in effect. But the actual conversion rates of the two segments got closer: 11% and 12.6%.<br />
<br />
However, this isn't just a measurement of volume. We can tweak the numbers again with similar volumes and end up with an LLR of 0.048.<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid #ccc; font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="100"></col><col width="100"></col><col width="100"></col><col width="100"></col><col width="100"></col><col width="116"></col></colgroup><tbody>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A/B"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">A/B</td><td data-sheets-value="[null,2,"segment"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">segment</td><td data-sheets-value="[null,2,"target"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">target</td><td data-sheets-value="[null,2,"non-target"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">non-target</td><td data-sheets-value="[null,2,"total"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">total</td><td data-sheets-value="[null,2,"odds of conversion"]" style="font-weight: bold; padding: 2px 3px 2px 3px; vertical-align: bottom;">target:non-target odds</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">A</td><td data-sheets-value="[null,3,null,1]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,455]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">455</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,4545]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">4,545</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,5000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">5,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1001100110011001]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.10</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"B"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">B</td><td data-sheets-value="[null,3,null,1]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,652]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">652</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,4348]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">4,348</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,5000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">5,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1499540018399264]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.15</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"A"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">A</td><td data-sheets-value="[null,3,null,2]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">2</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1304]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1,304</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,8696]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">8,696</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,10000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">10,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.1499540018399264]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.15</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"B"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">B</td><td data-sheets-value="[null,3,null,2]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">2</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1803]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1,803</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,8197]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">8,197</td><td data-sheets-formula="=R[0]C[-2]+R[0]C[-1]" data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,10000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">10,000</td><td data-sheets-formula="=R[0]C[-3]/R[0]C[-2]" data-sheets-numberformat="[null,2,"0.00",1]" data-sheets-value="[null,3,null,0.21995852141027206]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">0.22</td></tr>
</tbody></table>
</div>
<div>
<br /></div>
<div>
The LLR dropped because now the two segments' B-to-A odds ratios are nearly the same. So while we have lots of volume, and again the segment target rates are very different, the impact of B over A is the same in both segments: it increases the odds by roughly 50% in each. So the segments are not different by this measure even though the segment target rates are the most different we've seen so far: about 11% and 15.5%. (You can find all the data above in a <a href="https://docs.google.com/spreadsheets/d/1xCstNTTP49q6ifnbMWEZaU2Wko7N5JYViCOHQsF8l4w/pubhtml" target="_blank">Google sheet</a>.)<br />
<br />
None of this is novel in the world of statistics. All of this can be done by looking at the significance of the interaction coefficient in a two-feature logistic regression (as shown in the sketch below). But using the simple code in this calculator you can automate the hunt for the segmentations you care about. Or you can take the formula, and a decision tree modeler like <a href="https://github.com/stripe/brushfire" target="_blank">brushfire</a>, and build a tree based on the segments where the impact is highest or lowest. Using such a model our data scientists may be able to solve the problem they were actually asked to solve and give away those stickers.<br />
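For reference, here is a sketch of that regression view of the problem using statsmodels (an assumed dependency, not the calculator's code): fit a binomial GLM with treatment, segment, and their interaction, and read the significance off the interaction term.
<pre>
import pandas as pd
import statsmodels.api as sm

# One row per (treatment, segment) cell from the first table, with binomial counts.
df = pd.DataFrame([
    {"is_b": 0, "is_seg2": 0, "successes": 45,  "failures": 455},
    {"is_b": 1, "is_seg2": 0, "successes": 54,  "failures": 446},
    {"is_b": 0, "is_seg2": 1, "successes": 130, "failures": 870},
    {"is_b": 1, "is_seg2": 1, "successes": 138, "failures": 862},
])

# Design matrix: intercept, treatment, segment, and the treatment x segment interaction.
X = sm.add_constant(df[["is_b", "is_seg2"]].assign(interaction=df.is_b * df.is_seg2))
y = df[["successes", "failures"]]  # binomial endog given as (successes, failures)

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())  # the p-value on `interaction` is the significance of the interaction
</pre>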
<br />
Try the calculator below and let me know if you get any surprising results.</div>
<div>
<br />
<iframe allowfullscreen="" frameborder="0" height="600" marginheight="0" marginwidth="0" src="https://trinket.io/embed/python/e44e4c9cb6" width="100%"></iframe><br />
<br /></div>
How often should you retrain your model?
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">One of my favourite modeling managers taught me that it wasn't his job to determine whether my judgement or his judgement was right. You didn't go to him with an argument and have him stroke his chin, think a bunch, and then say whether he agreed or not. If our judgements disagreed, we'd work out what data we could collect that would determine whose opinion lined up closer to our business.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b id="docs-internal-guid-54ddfd85-ec65-92b7-1fa9-0555b2a109c3" style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I think about this when I see competing approaches to model training.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">When I worked in credit risk, predictive models tended to be trained and built with a life expectancy of 6 to 36 months. During the lifetime of the model, score distributions and model fitness are closely watched. When problems are discovered, adjustments and realignments are made along the way rather than scrapping and retraining.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Many software engineers seem to have come to a different conclusion. Here the ultimate solution seems to be to build a tool that just keeps retraining monthly/weekly/daily. (Ok, I don’t actually believe anyone is advocating daily. But the hyperbole makes the point.) Online training is regarded as obviously superior to all other solutions.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The most common argument you might have for rebuilding your model regularly is that you are worried about your model becoming stale. The world changes over time so the factors that were predictive aren't going to predict as well as time goes on. This is certainly eventually true.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">But consider what you're actually saying here. Say you are trying to predict 30 day attrition. Then your data has to be at least 30 days old to begin with. After all, how can you tell me whether a user from two days ago will churn within 30 days? In order to amass some volume your observation period probably goes back another 15 days (and maybe as much as 100). Putting this together, what you're saying is that at 45 days the model is fresh, but at 75 days it is unacceptably stale? I'm skeptical that there's going to be a shift in society that was observable 45 days ago, not observable 75 days ago, and is still relevant now.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The other reason to keep rebuilding is you're building up a larger observation set. If this is your argument clearly you're not worried about staleness because you're probably making your observation period as wide as possible to capture as many observations as possible. But again, I'm not convinced you're going to get that many wins. Maybe you're starting very early and the first month you retrain you double your observations. That's probably going to make some difference. But after that you're increasing by a third, then a quarter. These returns seem to be diminishing pretty quickly.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">But of course the big argument to retrain is </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">what's the harm?</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Some reasons against auto-retraining:</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">You end up building rigid structures that you won't modify. That is, by automating the process you have to do a little more work for each choice than if you were just doing it once. If you have an idea for post-processing your model output and you're just writing the code for your model, you only need to write the actual post-processor and test the code. You probably wouldn't try this if you're doing automatic training, because you'd need a post-processor that can deal with a moving model. Or you might have a framework where this post-processor just doesn't fit.</span></div>
</li>
</ul>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Your model interactions won't be consistent. You're going to end up with multiple models in the long run, because different data arrives at different times and you need to make the right choices at the right time. You might even have models optimized for different targets. Every time you retrain a model you can check whether it improves on predicting the given target. But what do you do if model A says 80% and model B says 30%? You want to know that this cohort isn't constantly changing in personality. But every time you retrain you lose knowledge of the model interactions.</span></div>
</li>
</ul>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">You'll lose out on gradual model improvements. Or let me put this a scarier way: you will be constantly running with mistakes.</span></div>
</li>
</ul>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li dir="ltr" style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">You end up with a more average model. The quality of a model of course is just a sampling from a distribution. And the observed performance on a validation set is a sampling of a distribution that's dependent on the model quality. So what happens if you keep re-sampling? You end up with the expected outcome.</span></div>
</li>
</ul>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Convinced? You shouldn't be. No really... you really really shouldn't be.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">That's because if you're in the position where you are guiding model based strategies you should be pretty impervious to arguments. You should understand that the best sounding argument is frequently wrong. </span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In the case of deciding how and when to retrain, which of these arguments makes sense for your situation depends a lot on… well... your situation. In the case of advertising and spam, where performance periods are short and users change quickly, very few of my reasons against auto-retraining make sense. However, for modelling churn and fraud the situation is pretty different: the performance periods are longer and the behaviours change more slowly. How do you know what situation you’re in?</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">That’s why you work with the data instead. Figure out what the data tells you.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">You can build the composite model that you would have had if you had retrained every week, or every month, or kept a single model. Here's a simple task: take your history and train a model for every week. Then evaluate every model on every future week, graphing the AUC of the model on the y-axis. You basically end up with what looks like a cohort chart for model age, and the rate at which your model degrades becomes really clear. A rough sketch of this exercise follows below.</span></div>
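<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
Here is a rough sketch of that exercise. It assumes a pandas DataFrame with a week identifier, numeric feature columns, and a binary label, and uses scikit-learn's logistic regression as a stand-in for whatever model you actually train; the column names are placeholders, not anything from your data.</div>
<pre>
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def model_age_aucs(df, feature_cols, label_col="churned", time_col="week"):
    """Train one model per week and score it on every later week.

    Assumes one row per observation with a week identifier, numeric feature
    columns, and a binary label (placeholder names). Returns a week-by-week
    grid of AUCs: each row is one trained model, so plotting a row against
    model age shows how quickly that model degrades.
    """
    weeks = sorted(df[time_col].unique())
    aucs = pd.DataFrame(index=weeks, columns=weeks, dtype=float)
    for train_week in weeks:
        train = df[df[time_col] == train_week]
        model = LogisticRegression(max_iter=1000).fit(train[feature_cols], train[label_col])
        for eval_week in weeks:
            if eval_week > train_week:  # only score weeks the model has never seen
                test = df[df[time_col] == eval_week]
                scores = model.predict_proba(test[feature_cols])[:, 1]
                aucs.loc[train_week, eval_week] = roc_auc_score(test[label_col], scores)
    return aucs
</pre>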
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This exercise of course isn't perfect. In reality, the data you would have collected had you been updating your model regularly wouldn't look exactly like the data collected under the model you currently have. But if these issues are large enough to skew your data to the point of changing your conclusion, you have even bigger problems.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<b style="font-weight: normal;"><br /><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></b></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I was reminded in a recent conversation with a very skilled modeler of the old adage </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">the proof of the pudding is in the eating</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. Ultimately, you are going to be far happier if you incorporate these changes with champion/challenger strategies (really just A/B testing). When it comes to building code that’s generating models that is then affecting users it is far harder to tell what work is actually adding value and what work is only adding debt. It is far too easy to not understand the actual effects your users are experiencing.</span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<span style="font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;"></span></div>
<div dir="ltr" style="line-height: 1.15; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I think it’s reasonable to say that in the field of modeling you should be spending </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: italic; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">at least</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> half your time measuring rather than building. At first this seems disappointing because it means you will only be able to build half as many of the things you are excited about. But when I look back over the things I’ve worked on, it is the things I’ve measured that I’m most proud of. It is only the things I’ve measured that have allowed me to gain knowledge from my experience.</span></div>
Successful data teams hustle

There seem to be a lot of ways to start a data team at a startup. One popular technique is for the team to be an internal consultancy within the company: the rest of the company is supposed to come up with needs that require a data specialist, and you are supposed to prioritize and respond to those needs while building tooling to meet them.<br />
<br />
Unfortunately, this often ends up producing a team of data cops: a team more interested in enforcing how others should use its tools than in producing value with data.<br />
<br />
I think there's a much more effective approach.<br />
<br />
Instead, consider your new data team to be a startup inside the company: making and selling a product. Your product is the data; and intuition and judgement are your entrenched competition.<br />
<br />
You're probably the sort of person to whom it's obvious that people should be using more data in their decisions. You probably shiver every time you hear someone say they are basing their decisions on their strongly held beliefs that have no evidence to support them. But you may not realize how un-obvious this is to everyone else.<br />
<br />
In order to succeed as a data team, you're going to have to learn to operate like a successful startup.<br />
<br />
That means that just like any founder, you're not just a developer. You're sales. You're support. You're the number one advocate.<br />
<br />
And you're going to have to hustle.<br />
<br />
<b>Don't make people's lives harder.</b> Don't be confused in thinking that the rest of the company (your customers) are going to put in extra effort to deliver your data ready to be consumed. Don't try to start putting impositions on product development to make your life easier. Startups that make a product that puts demands on its users rarely survive. To put it simply, you work for them.<br />
<br />
<b>Make people's lives easier. </b>Adopt work that existed before there was a data team, where that makes sense (eg: take on log maintenance). This is what all other teams do when they form. A new design team is responsible for the landing page design even if it was originally designed by a developer. Startups that solve a previously unsolved problem rarely take off. Take on the <a href="http://www.paulgraham.com/schlep.html" target="_blank">schleps</a>.<br />
<br />
<b>Anticipate opportunities for data to be the answer and then have it ready.</b> A <a href="https://twitter.com/hfizzle" target="_blank">friend</a> of mine recently told a <a href="http://www.youtube.com/watch?v=cKSwpQpwZuY" target="_blank">story</a> about how, when he was in the t-shirt business, he'd respond to potential contracts with custom shirts included in the bid. When you see an opportunity for data to make your business better, build it; don't argue for it. Big pitches don't sell. Big pitches that don't even have a screenshot really don't sell. Having the product ready and pre-configured sells.<br />
<br />
<b>The word <i>no</i> isn't in your vocabulary anymore.</b> When you have succeeded in gaining some interest, don't turn around and tell them "well, that's not actually what I'm building". Successful companies pivot in response to demand; and so do you. I'm not saying you have to be a <a href="http://en.wikipedia.org/wiki/Garbage_in,_garbage_out">GIGO</a> machine that answers every question you are given. But every request is a lead; and every lead is gold.<br />
<br />
<b>Communication will make you or destroy you.</b> What's worse than having bad data? Having to discover for yourself that the data is bad. You will make mistakes. But you also need to earn trust. People will learn to trust that you are providing good answers when you proactively and aggressively communicate where things have gone wrong.<br />
<br />
<b>Learn to <a href="http://www.joelonsoftware.com/articles/customerservice.html">take the blame</a>.</b> In general, learn how to provide customer service. For example, when someone has a data need that your tooling can't handle, instead of responding with "well, you can't really do that because that request is kind of unreasonable", try<br />
<blockquote class="tr_bq">
That's a totally reasonable request and I can understand why you'd want that. Embarrassingly, the tools we've set up don't actually support that yet. But let me come up with something that will solve your problem for now.</blockquote>
<br />
<b>Try to remember you're not telling people to eat their vegetables.</b> It's very easy to be seen as the doctor saying "if only you were to eat all your vegetables you will eventually appreciate them". But you're not offering vegetables. You're offering pie. The pleasures of using data is almost immediate and it never gets old (just like pie). So while you are competing with an entrenched product (<i>intuition</i>) your competition doesn't have what you have. You have pie.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com2tag:blogger.com,1999:blog-32413693.post-60467326312727992802013-11-26T14:46:00.000-08:002013-11-26T14:46:17.809-08:00My hair on fire rule of metricsI feel since I talk about this rule a fair amount I should have it published somewhere. A hair on fire rule is one which when noticed you don't wait for arguments to weigh the pros and cons. You just put out the fire. I have one rule like this for metrics.
<blockquote>A metric for a time period can't change after it has been reported.</blockquote>
This doesn't mean you have to be able to report the metric immediately after the time period has ended. And it doesn't mean you can't fix errors later. But it means that the definition of the metric shouldn't be affected by future events.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-27095761856672265792013-05-27T16:36:00.001-07:002013-05-27T16:40:07.935-07:00Some podcasts you should listen to if you're involved in A/B testingStatistics has been in the news recently, which has led to some really thoughtful content on the topic. I started compiling a list of people who I thought would enjoy listening to these podcasts and that list got pretty long, so I'll use this blog to broadcast instead. I'll resist giving commentary or critiques on the actual conclusions of the speakers except to say they are interesting.<br />
<br />
First was <a href="http://www.econtalk.org/archives/2013/05/frakt_on_medica.html" target="_blank">Frakt on Medicaid and the Oregon Medicaid Study</a> on EconTalk which is a great discussion of the statistical power of studies.<br />
<br />
Second is <a href="http://bloggingheads.tv/videos/18167" target="_blank">Paul Bloom and Joseph Simmons on Bloggingheads.tv</a> which really illustrates how getting fake results from bad statistical practices isn't just a theoretical problem and how you can demonstrate this with simulations.<br />
<br />
<embed allowscriptaccess="always" flashvars="diavlogid=18167&file=http://bloggingheads.tv/playlist.php/18167/00:00/49:46&config=http://static.bloggingheads.tv/ramon/_live/files/2012/offsite_config.xml&topics=false" height="288" id="bhtv18167" name="bhtv18167" src="http://static.bloggingheads.tv/ramon/_live/players/player_v5.2-licensed.swf" type="application/x-shockwave-flash" width="380"></embed>
<br />
And finally, back on EconTalk, is <a href="http://www.econtalk.org/archives/2013/05/jim_manzi_on_th.html" target="_blank">Jim Manzi on the Oregon Medicaid Study, Experimental Evidence, and Causality</a> which gets into some more subtle analysis flaws that can destroy the value of A/B testing and really drives home the point that it is a failing endeavour to try to harvest a lot of confidence out of any single experiment. That confidence is gained through an iterative process that comes out of a lot of simple experiments that are constantly updating your priors.<br />
<br />
I'll break my no commentary promise a little here. One thing I find quite interesting is how Simmons and Manzi essentially come to the same conclusion on the problem of gaining knowledge from a single experiment while using modern data mining techniques; but they offer different cures. Simmons recommends not allowing yourself to search your data over lots of dimensions, as that will surely lead to false positives. Whereas Manzi seems to say you should never be too positive about the results of any single experiment. So iterate over a series of small experiments instead, each one informing the next. Perhaps this is a reflection of their industries (academic vs business) but then this too may be overfit. They both agree that we have to accept that we can't gain truths as quickly as we currently think we can.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-59265442813260999862013-02-10T08:13:00.002-08:002013-02-10T08:13:38.695-08:00Piece wise linear trends in the browserSomehow I never blogged about the Javascript implementation of <a href="http://avibryant.github.com/l1tf/">l1tf </a>released by my friend <a href="http://twitter.com/avibryant">Avi Bryant</a> and me. l1tf is a way to find piecewise linear trends in time series data.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-75198543632361429702013-02-04T10:11:00.000-08:002013-02-04T10:35:52.222-08:00Simple cross domain trackingI hear of some really complicated schemes from time to time to track users across multiple domains that belong to a single site. While I'm sure they mostly work, it seems like there's a simple way to do this that I assume many people are already using but is probably too boring to comment on. So, let's be boring for a moment.<br />
<br />
Let us say you own eggs.com, bacon.com, and coffee.ca. When a user visits eggs.com he is assigned a unique tracking token in the eggs.com cookie (we'll call it <code>[tracking-token-eggs]</code>). At some point after that token is assigned, include it in the page requests to <code>//bacon.com/tracking.gif?token=[tracking-token-eggs]&domain=eggs.com</code>, and <code>//coffee.ca/tracking.gif?token=[tracking-token-eggs]&domain=eggs.com</code>. (Create the same setup for visitors to bacon.com and coffee.ca).<br />
<br />
If the browser already has a token stored in the bacon.com or coffee.ca cookies you will now have a request that includes both domains and both tokens; both domains are in the url, one token is in the url and the other token is in the cookie of the request. The first domain is also in the referrer/referer. This works even if 3rd party cookies are blocked (at least in the browsers I've tried). Now you can store this request in a database table or just a log file.<br />
<br />
If you want to do something slightly more complicated that involves javascript, you can alter the technique to use iframes instead of gifs. Just don't try to create or store any new tokens in the iframe from the foreign domain, because that is where these techniques break down.<br />
<br />
[<b>Edit:</b> I should add that this is a technique for when you have half a dozen domains or so. Not for hundreds of domains.]stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-39060243956294443732013-01-28T07:57:00.000-08:002013-01-28T07:57:43.858-08:00On calculating Fibonacci numbers in CA few months ago Evan Miller wrote an essay called <a href="http://www.evanmiller.org/mathematical-hacker.html" target="_blank">The Mathematical Hacker</a>. While it is an interesting post, he does make a mistake when he gives the "proper way" to calculate the Fibonacci numbers.
<br />
<br />
The essay claims that you shouldn't use the tail-recursive method you would learn in a CS class to compute the Fibonacci numbers because, as any mathematician knows, an exact analytical solution exists. His C example looks like:
<br />
<script src="https://gist.github.com/55c3cf3fb570f24c5d2f.js"></script>
But there are actually a few more optimizations I picked up while studying linear recurrence sequences that I thought I'd share. The first drops the time almost by half:<br />
<script src="https://gist.github.com/bf713d392d8b63e23d9b.js"></script>
The reason this works is that the part being dropped (the <code>(0.5 - 0.5 × sqrt(5.0))</code> term raised to the nth power and divided by <code>sqrt(5.0)</code>) always has a magnitude less than 0.5, so you get the same result just by rounding what remains.<br />
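In code, that optimization amounts to something like this (again my own sketch rather than the exact gist):<br />
<pre><code>
#include <math.h>

/* Same idea, but the psi term is dropped entirely: its magnitude never
   reaches 0.5, so rounding phi^n / sqrt(5) lands on the same integer.
   (Double precision still gives out eventually, as we'll see below.) */
unsigned long long fib_rounded(unsigned int n) {
    double sqrt5 = sqrt(5.0);
    return (unsigned long long)llround(pow(0.5 + 0.5 * sqrt5, n) / sqrt5);
}
</code></pre>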
<br />
Note I benchmarked these with:
<script src="https://gist.github.com/5664bc68da438e04083f.js"></script>
using <code> gcc fibonacci.c -O2 -o fibonacci</code><br />
<br />
Using these benchmarks I get 12006 ms vs 7123 ms. And the validation number matches in both cases: 0x6744.<br />
<br />
But there's yet another optimization:
<br />
<script src="https://gist.github.com/4f05c1dcee5952d0bcd8.js"></script>
That's right, we can do even better by using the tail-call recursion method dismissed in the essay. Now we get a time of 2937 ms.<br />
<br />
The observant among you will notice that what my benchmark does is just recalculate the first 40 Fibonacci numbers over and over again while summing them and taking the last 4 hex digits of the sum for validation. (It's not just for validation. We also do this because if you compile with -O2 and you don't do anything with the output, gcc is smart enough to skip the whole computation. And we need -O2 so gcc will recognize the tail call.)<br />
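Putting that together, here is a stripped-down sketch of the tail-recursive version plus a benchmark in the shape just described (my own reconstruction; the iteration count is a guess, so the absolute times won't line up with the numbers quoted above):<br />
<pre><code>
#include <stdio.h>

/* Tail-recursive Fibonacci: fib(n, 0, 1) returns the nth number.
   With -O2, gcc turns the self-call into a plain loop. */
static unsigned long long fib(unsigned int n, unsigned long long a,
                              unsigned long long b) {
    return n == 0 ? a : fib(n - 1, b, a + b);
}

int main(void) {
    unsigned long long r = 0;
    /* Recompute the first 40 Fibonacci numbers over and over, folding them
       into a 4-hex-digit checksum so -O2 can't skip the work. */
    for (long i = 0; i < 400000000L; i++)
        r = (r + fib(i % 40, 0, 1)) % 0x10000;
    printf("validation: 0x%llx\n", r);
    return 0;
}
</code></pre>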
<br />
You could call foul on me right now. After all, the reason the analytic approach is slower is the cost of pow, and that cost doesn't grow with the exponent the way the recursion's does.<br />
<br />
Alright, fair enough. Let us run the test again except we'll sum the first 90 Fibonacci numbers instead (not much point going much past 90 since the Fibonacci numbers stop fitting in an unsigned long int shortly after the 93rd). So we update the code to <code>r = (r + f(i % 90)) % 0x10000;</code><br />
<br />
Now we get 7795 ms for the recursive solution and 12840 ms and 7640 ms for the analytical solutions. I ran the benchmark a few times and the recursive method is consistently faster, but I think a 2% edge has to be within the gcc optimization margin of error.<br />
<br />
But there's something else to notice. For the two analytic solutions the validation number is 0x2644 but for the recursive solution it is 0x2f9c. Two against one right? Well, votes don't count in math and dictatorships.<br />
<br />
What happened is that at the 71st Fibonacci number both analytical solutions lost precision. This is because C doesn't check what we're trying to do. It does what we tell it to do. And we told it to take a floating-point approximation of an irrational number, with only the precision a double has, and raise it to a power.<br />
<br class="Apple-interchange-newline" />
I do want to stop here a moment and say I'm not pointing out this error as a gotcha moment or as evidence that Evan Miller is poor at math. I think <a href="http://www.evanmiller.org/how-not-to-run-an-ab-test.html" target="_blank">How Not To Run An A/B Test</a> is an incredibly important essay and should be understood by anyone who is using A/B test results. Also if you are doing statistics on a Mac you probably should have bought <a href="http://wizard.evanmiller.org/" target="_blank">Wizard</a> by now.<br />
<br />
However, I do think this mistake illustrates an important lesson. If we tell programmers that the more math (or should that be Math?) they use the better programmers they are, we are setting up unnecessary disasters.<br />
<br />
One reason is that virtually no programmer spends a majority of their time doing things that look like Math. Most spend 99.5% of their time doing things that don't look like Math. If a programmer takes this message to heart then they are going to spend a lot of time feeling like they aren't a true programmer, which is silly and bad for its own sake.<br />
<br />
The other issue is that a focus on better programming looking like Math can be a major distraction. And it can lead to really silly time-wasting debates (eg <a href="https://github.com/twitter/bijection/issues/41">https://github.com/twitter/bijection/issues/41</a>).<br />
<br />
But most dreadfully, if we tell programmers that they should give more weight to the more mathematical solutions, they will often not choose the best solution for their context. I've certainly not given the best solution for finding Fibonacci numbers for all contexts. Heck, I'd bet you could get better results for my own benchmark by using memoization (for the record there's a memoization technique for both the recursion and the analytical solution -- but it's easier with the recursion solution).<br />
<br />
My solution wouldn't be that all programmers learn more Math. My solution would be that it is good to be part of a group where different people know different things. And we should take advantage of this knowledge base by not being embarrassed to ask questions. I have a number of friends who send me emails from time to time that just have the subject "math question." And all the time I send emails with the subject "programming question," "statistics question," "writing question," or even "math question." I find it works really well for me.<br />
<br />
So no, I don't think every programmer needs to be taught more Math. Except for linear algebra of course. Everyone should learn more linear algebra.<br />
<br />
(You can download the <a href="https://github.com/snoble/benchmarking_fibonacci">full source</a> for these examples from github.)stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com8tag:blogger.com,1999:blog-32413693.post-11368849491561559202012-11-24T22:07:00.002-08:002012-11-26T10:33:20.487-08:00How to get random lines out of a file or piped streamSeveral months ago <a href="https://twitter.com/honkfestival">Aaron Olson</a>, <a href="https://github.com/camilo">Camilo Lopez</a>, and I were sitting around drinking beers (after 4pm on a Friday at the Shopify office; that's what you do after making <a href="http://www.shopify.com/">ecommerce software</a>). And we were griping how there wasn't an easy way to get a random sample of lines out of a log file or a stream for testing purposes. Sure, there's <a href="http://en.wikipedia.org/wiki/Head_(Unix)">head</a> and there's <a href="http://en.wikipedia.org/wiki/Tail_(Unix)">tail</a>. But we wanted random lines and we didn't want to have to think about it.<br />
<br />
So we made a solution: <a href="https://github.com/camilo/dimsum">dimsum</a>. The usage is in the readme. You install it with <a href="http://rubygems.org/gems/dimsum">gem</a> (so ruby is required). And then just use it like head or tail. Submit issues to github.<br />
<br />
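For the curious, the PS below is the whole trick: reservoir sampling keeps a fixed number of uniformly random lines without ever knowing how long the stream is. A bare-bones sketch of the idea (in C for brevity; dimsum itself is Ruby and this is not its actual code):<br />
<pre><code>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define K 10           /* number of random lines to keep */
#define MAXLEN 4096    /* longest line we bother with */

/* Reservoir sampling: after line number n has been read, every line seen
   so far has the same K/n chance of sitting in the reservoir. */
int main(void) {
    static char reservoir[K][MAXLEN];
    char line[MAXLEN];
    long n = 0;

    srand((unsigned)time(NULL));
    while (fgets(line, sizeof line, stdin)) {
        n++;
        if (n <= K) {
            strcpy(reservoir[n - 1], line);
        } else {
            long j = rand() % n;   /* close enough to uniform for a sketch */
            if (j < K)
                strcpy(reservoir[j], line);
        }
    }
    long kept = n < K ? n : K;
    for (long i = 0; i < kept; i++)
        fputs(reservoir[i], stdout);
    return 0;
}
</code></pre>
Usage would be something like <code>cat some.log | ./reservoir</code>, which is essentially what dimsum wraps up for you.<br />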
PS dimsum uses reservoir sampling so you can pipe right to it.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com4tag:blogger.com,1999:blog-32413693.post-46470613137000311842012-05-03T08:19:00.000-07:002012-05-03T08:19:24.565-07:00Analyst, measure thyselfThe other day I had a nice example of my work failing that I thought I would share. At <a href="http://www.shopify.com/">Shopify</a> I have several models that predict all sorts of future behaviours of sellers, purchasers, signups, etc. Earlier this year there was a change in the signup population that dramatically reduced the effectiveness of one of these models.<br />
<br />
What's interesting is I didn't discover this myself. One of our VPs did.<br />
<br />
It is certainly never fun to have a co-worker discover that some of your work is inadequate. It is even less fun when that co-worker is a VP. But what was cool was that he found out by looking at a self-updating report; a report that I wrote.<br />
<br />
What I had made was a web page that broke signups from a month ago into five groups based on the model. The groups were ordered by what the model at the time thought were the chances that members of each group would convert into paying customers. For each group I listed the rate at which it actually did convert. In an ideal world the bottom group would have converted at close to 0% and the top group would have converted at close to 100%. Let's just say that when this VP looked at this page the conversion rate of the bottom group was substantially higher than 0% and the conversion rate of the top group was substantially lower than 100%.<br />
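That report is simple enough to be worth sketching. Something along these lines (the data here is invented and this is C for brevity, not the actual report code) is really all it takes:<br />
<pre><code>
#include <stdio.h>
#include <stdlib.h>

/* One row per signup from a month ago: the model's predicted probability
   of converting and what actually happened (1 = converted). */
typedef struct { double predicted; int converted; } Signup;

static int by_prediction(const void *a, const void *b) {
    double d = ((const Signup *)a)->predicted - ((const Signup *)b)->predicted;
    return (d > 0) - (d < 0);
}

/* Sort by predicted probability, cut into five equal groups, and report
   the rate at which each group actually converted. */
static void quality_report(Signup *s, int n) {
    qsort(s, n, sizeof *s, by_prediction);
    for (int g = 0; g < 5; g++) {
        int lo = g * n / 5, hi = (g + 1) * n / 5, hits = 0;
        for (int i = lo; i < hi; i++)
            hits += s[i].converted;
        printf("group %d: %d/%d converted (%.1f%%)\n",
               g + 1, hits, hi - lo, 100.0 * hits / (hi - lo));
    }
}

int main(void) {
    Signup demo[10] = {
        {0.05, 0}, {0.10, 0}, {0.20, 0}, {0.30, 1}, {0.40, 0},
        {0.55, 1}, {0.65, 0}, {0.75, 1}, {0.85, 1}, {0.95, 1}
    };
    quality_report(demo, 10);
    return 0;
}
</code></pre>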
<br />
I'd love to say that my initial response was to accept that my model was no longer performing. Instead I tried to explain why this report was misleading. I insisted that the situation was complicated and you couldn't just look at a simple table like this and understand what was going on. But it didn't matter because the evidence was too convincing and it showed that the situation was pretty simple. The evidence said the model must be rebuilt and the evidence had come from me.<br />
<br />
Most analysts are perfectly comfortable with the idea that the best way to know if a marketing campaign is succeeding, or if a design change is making customers happier, is to measure the results and then try to prove the opposite. And yet we often fail to hold our own work up to the same scrutiny.<br />
<br />
With that in mind I recommend your modeling projects include these steps:<br />
<br />
<ul>
<li><b>Have quality tests for all the models you create.</b> The tests need to be simple and require minimal explanation. It should be easy for any other analyst to replicate your test just by looking at how you present your test report. When you are done with a model you need to be able to say "this is what I have done, and here is my proof."</li>
<li><b>Have fit tests for all the techniques you used. </b>For example, if you are using a curve-fitting technique to predict the future, show how that technique would predict the present using past data. You also need to show that your chosen model type is a good choice. Even if you've used SVM models to predict these sorts of results in the past you still need to show that SVMs work with this data.</li>
<li><b>Write your tests before you start. </b>This way you can't later avoid making the tests that expose your weak spots. Also this helps you know when you are done. If your test results haven't improved in the last few days then your work has stopped accomplishing anything. Plus it is motivating to be able to see your test numbers improve as you work. I'm pretty sure test driven modeling has been practiced longer than test driven development.</li>
<li><b>Even your tests should have tests.</b> They need to reconcile with your accounting numbers. If your test reports revenue from customers the revenue total should reconcile with your reported revenue numbers.</li>
<li><b>Your tests need to be able to alert someone if they start failing.</b> I dropped the ball here last time. </li>
<li><b>Make your tests public.</b> This takes away your option in the future to decide that this test doesn't apply. More importantly, the users of your model are going to have a much better understanding of what it does if they can see its past behaviour. Even better would be to make your tests public from day one.</li>
</ul>
<br />
Note that these tests are not a substitute for model validation. There will probably be overlap in the types of tests; but you still need to have a separate validation holdout.<br />
<br />
Finally consider this: if the defense of your model is a description of the techniques you used then you're doing it wrong. The defense of your model is presenting its performance. If your model is useful and it performs well then people will ask you about your techniques. You don't show that a new glass technology is unbreakable by describing the molding process. You show it by giving a man a baseball bat and letting him try to break it. And if he fails, give him a bigger bat.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-43881917114902752032012-04-28T07:50:00.003-07:002012-04-28T07:50:53.487-07:00Me on TVI've been bad at keeping this blog up to date so this is a little old. However, a few months ago I was on TVO's The Agenda talking about the work of a data scientist on an episode called "<a href="http://ww3.tvo.org/video/174232/big-bad-data">Big Bad Data</a>".<br />
<br />
You can skip forward to about 9:30 if you want to see just me. But the first interview, with <a href="http://andrewmcafee.org/blog/">Andrew McAfee</a>, is quite good so you should watch it too. In fact, you should watch it instead.<br />
<br />
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,47,0" height="270" id="flashObj" width="480"><param name="movie" value="http://c.brightcove.com/services/viewer/federated_f9?isVid=1&isUI=1" />
<param name="bgcolor" value="#FFFFFF" />
<param name="flashVars" value="videoId=1505515558001&playerID=1253025976001&playerKey=AQ~~,AAAABDk7A3E~,xYAUE9lVY9_brapKCzkbqstpY8k7QvJH&domain=embed&dynamicStreaming=true" />
<param name="base" value="http://admin.brightcove.com" />
<param name="seamlesstabbing" value="false" />
<param name="allowFullScreen" value="true" />
<param name="swLiveConnect" value="true" />
<param name="allowScriptAccess" value="always" />
<embed src="http://c.brightcove.com/services/viewer/federated_f9?isVid=1&isUI=1" bgcolor="#FFFFFF" flashVars="videoId=1505515558001&playerID=1253025976001&playerKey=AQ~~,AAAABDk7A3E~,xYAUE9lVY9_brapKCzkbqstpY8k7QvJH&domain=embed&dynamicStreaming=true" base="http://admin.brightcove.com" name="flashObj" width="480" height="270" seamlesstabbing="false" type="application/x-shockwave-flash" allowFullScreen="true" allowScriptAccess="always" swLiveConnect="true" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed></object>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-13259929367416632512012-04-27T22:01:00.001-07:002012-04-27T22:04:19.612-07:00A scary thought about marketing attribution modelsI was reading through Google's <a href="http://services.google.com/fh/files/misc/marketing_attribution_whitepaper.pdf">white paper</a> on marketing attribution and a scary thought occurred to me which I tweeted. I'll self referentially quote my tweet here.<br />
<blockquote class="twitter-tweet">
If your model has low sensitivity to subjective decisions then your decisions don't matter. High sensitivity then your model doesn't matter.<br />
— Steven H. Noble (@snoble) <a data-datetime="2012-04-28T03:47:11+00:00" href="https://twitter.com/snoble/status/196083110885068800">April 28, 2012</a></blockquote>
<script charset="utf-8" src="//platform.twitter.com/widgets.js">
</script>
<br />
So what do I mean by this? Well, the paper talks about the different attribution models various companies use for click tracking that include first click, last click, time decay, linear, etc.<br />
<br />
That's a lot of choices for a model without a clear way of deciding what's right. How to decide?<br />
<br />
Well, the first step would seem to be to do some quick mock-up models comparing the results of using the different types. If you're lucky then the results will be pretty close to each other. That is, your model will be insensitive to your choice. In that case it doesn't matter which type you choose, so choose whichever one is the least amount of work and move on to the next project. If someone wants to argue over the choice let them win, because the decision is literally not worth the time of the argument.<br />
<br />
But what if you are unlucky? What if under a linear model organic unbranded search gets the attribution for 20,000 signups but under first click it gets the attribution for 12,000 signups? Then you're in serious trouble. Because while it seems like you are deciding between linear and first click you are actually deciding on how many signups to attribute to organic unbranded search. You might as well get rid of the guise of a model and just write down numbers for how many attributions you feel you should give to each channel. After all that is what is happening now. You are not modeling; you are deciding.<br />
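To make that sensitivity check concrete, here is a toy sketch (the channels and paths are invented, and none of it comes from the white paper) that credits each converting visitor's path under first-click, last-click, and linear rules. If the per-channel totals diverge, you are in the unlucky case:<br />
<pre><code>
#include <stdio.h>

#define NCHAN 3
#define MAXTOUCH 8

static const char *channels[NCHAN] = { "organic", "paid", "email" };

/* Each converting visitor's path is a sequence of channel indexes,
   terminated by -1. Invented data for illustration only. */
static const int paths[][MAXTOUCH] = {
    { 0, 1, 2, -1 },   /* organic -> paid -> email */
    { 1, 1, 0, -1 },   /* paid -> paid -> organic  */
    { 0, -1 },         /* organic only             */
};

int main(void) {
    double first[NCHAN] = {0}, last[NCHAN] = {0}, linear[NCHAN] = {0};
    int npaths = sizeof paths / sizeof paths[0];

    for (int p = 0; p < npaths; p++) {
        int len = 0;
        while (len < MAXTOUCH && paths[p][len] >= 0) len++;
        first[paths[p][0]] += 1.0;        /* all credit to the first touch */
        last[paths[p][len - 1]] += 1.0;   /* all credit to the last touch  */
        for (int t = 0; t < len; t++)     /* credit split evenly           */
            linear[paths[p][t]] += 1.0 / len;
    }
    printf("%-10s %8s %8s %8s\n", "channel", "first", "last", "linear");
    for (int c = 0; c < NCHAN; c++)
        printf("%-10s %8.2f %8.2f %8.2f\n",
               channels[c], first[c], last[c], linear[c]);
    return 0;
}
</code></pre>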
<br />
Fortunately, there actually is a bit of an end run around this problem. And that is to test which model actually lines up with your data. That is, to make the decision non-subjective again.<br />
<br />
I have a few options on how to do this which I will talk about in a future post. But the point is this: if you can't test the results of your model against your data (in a simple way) then you are probably better off not making a model.<br />
<br />
Or to put it more bluntly: if you don't have a test then what you've built isn't actually a model. Not one that matters at least.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com1tag:blogger.com,1999:blog-32413693.post-88340036265304212422011-12-10T11:48:00.001-08:002011-12-10T11:52:51.268-08:00Churn rate definitionThe other day I wrote a post on the Shopify blog on the <a href="http://www.shopify.com/technology/4018382-defining-churn-rate-no-really-this-actually-requires-an-entire-blog-post">definition of churn rate</a>. I won't copy and paste it here but check it out if you're interested.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-8842520118645478932011-05-19T19:47:00.000-07:002013-01-26T16:49:47.041-08:00Your n is probably a lot smaller than you thinkIf you have had a conversation with me about statistics you've probably gathered that I am a little hostile toward the frequent use of <i>p</i>-values. This is partly because it is very easy to confuse their meanings. And I'm very happy that the <a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/">Bayesianists</a> are out <a href="http://www.sciencebasedmedicine.org/">there</a> fighting this battle and trying to get people to accept something better. But this isn't my real issue.<br />
<br />
And there's the issue that many people aren't reporting adjusted <i>p</i>-values when using data mining techniques, which makes their results appear a lot more significant than they are (as brilliantly illustrated by <a href="http://xkcd.com/882/">xkcd</a>). This is a major problem. But it's also not the problem I'm going to talk about here.<br />
<br />
My issue is that even in the simplest of A/B tests that you may be running, your <i>p</i>-value is probably much less meaningful than your stats class taught you.<br />
<br />
<a href="http://www.blogger.com/blogger.g?blogID=32413693" id="anecdote"></a>Consider that you are working for Blinko Laboratory supplies. You are trying to perfect a new automatic diagnosis device that detects the presence of a particular bacteria in blood samples. Right now you are testing a series of code patches to the detection algorithm to see if you can improve your accuracy. So you start off the device with 1,000 positive samples and it misses 156 of them. Pretty high failure right but that's why you're trying to improve the situation.<br />
<br />
So you put in your first code patch that uses a fancy new elliptical curve technique for pattern detection (are elliptical curve techniques still the hotness; or am I dating myself here?). You put through another 1,000 samples and now the device only misses 104. Sweet! I mean, the error rate is still too high to release to the market but this result has a <i>p</i>-value of 0.0003 (using a quick normal approximation and the sample means to compute variance). You must have done something. So you apply the next patch. And the error count goes up to 150. Shoot, that was a bad patch. The <i>p</i>-value of 0.001 suggests this swing wasn't random noise. This patch must have caused harm. Well, you undo that last patch and your error count drops to 141. <br />
<br />
Wait, what? The statistics are suggesting that it is far more likely that your last patch never got undone. So you start debugging and you notice that the device has been disconnected from the network the whole time. But how is that possible?! Surely something changed between your first test and your second test. As well as between your second test and your third test. You have the <i>p</i>-values to prove it.<br />
<br />
These <i>p</i>-values must guarantee you something, right?<br />
<br />
But here's what you didn't know. Every 200 trials the machine performs a quick self-cleaning and scrubs its lens. But this process has a 20% chance of leaving a streak on the lens. When there's no streak your device is able to detect 90% of the infections, but when there is a streak it only detects 60% of the infections (the numbers used above were generated using this distribution). So while you thought the proper <i>n</i> to use for your calculation was 1,000 there was also this other <i>n</i> of 5 hiding in the machine. Not to say 5 is the proper <i>n</i> to use either; you actually have a completely different distribution. (Is this a convolution? I can never get that definition straight.)<br />
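A quick way to convince yourself how much damage that hidden <i>n</i> does is to simulate the process just described (a sketch; the run count and the use of <code>rand()</code> are my own choices):<br />
<pre><code>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TRIALS 1000   /* samples per test run */
#define BATCH   200   /* trials between lens cleanings */

static double uniform(void) { return rand() / (RAND_MAX + 1.0); }

/* One 1,000-sample run: every 200 trials the lens is scrubbed and has a
   20% chance of being left streaked; detection is 90% when clean and 60%
   when streaked. Returns the number of missed infections. */
static int run_test(void) {
    int misses = 0, streaked = 0;
    for (int i = 0; i < TRIALS; i++) {
        if (i % BATCH == 0)
            streaked = uniform() < 0.20;
        double detect = streaked ? 0.60 : 0.90;
        if (uniform() >= detect)
            misses++;
    }
    return misses;
}

int main(void) {
    srand((unsigned)time(NULL));
    /* Nothing changes between runs, yet the miss counts bounce around far
       more than a binomial with n = 1,000 would suggest. */
    for (int r = 0; r < 10; r++)
        printf("run %d: %d misses out of %d\n", r + 1, run_test(), TRIALS);
    return 0;
}
</code></pre>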
<br />
Sure, there are simple techniques that could have caught this. But how often are these actually done? I know I <i>can</i> check if my errors are auto-regressive in serial testing. How often do I do it? How many simple techniques do I need to make me impervious?<br />
<br />
Clearly what I'm pointing out here is nothing new. Some people call this systemic noise. Or this is the issue of assuming independence. This isn't exactly a <a href="http://en.wikipedia.org/wiki/The_Black_Swan_(Taleb_book)">Black Swan</a> I'm talking about here because this isn't a low-incidence, high-impact event. This is more like a medium-incidence, medium-impact event.<br />
<br />
But think how many hidden low <i>n</i>'s are affecting the results in your business. Maybe you are in the credit card business and you want to forecast how many proposals you will receive in a month. How many TV ad buys do you think are made in Toronto by credit attorneys? Under 20? What would the impact be on your Toronto proposals if there were two more?<br />
<br />
Or you are trying to predict traffic to your website. How many blog posts are written on your site each month? 10? What if there was one less? (And surely you know that the number of blog posts written about your site isn't Poisson.)<br />
<br />
This isn't to say you shouldn't have any understanding of the naive <i>p</i>-values implied by your results. You should be able to recognize when a result falls within your statistical noise. But please don't spend a huge amount of time trying to sharpen the precision of your <i>p</i>-value and your standard error. If the extra work you are doing doesn't increase the precision of your <i>p</i>-value by at least an order of magnitude there are probably better ways you could be spending your time. Lower your error; don't sharpen it.<br />
<br />
For example consider Nate Silver's <a href="http://fivethirtyeight.blogs.nytimes.com/2010/11/01/5-reasons-democrats-could-beat-the-polls-and-hold-the-house/">approach</a>. Just before an election that he has predicted the results for he writes about creative scenarios that would lead to his predictions being radically wrong. He's not computing probabilities for these events; just noting them as possibilities and considering their impacts. He's spending less time computing and more time thinking. Thinking about the impacts of events that are completely outside your model is something most of us spend far too little time doing.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com3tag:blogger.com,1999:blog-32413693.post-16113129840017575222011-02-09T16:17:00.000-08:002011-02-09T17:44:54.515-08:00Is bandwidth the new corn?With the recent push back by the Canadian Conservative party against the CRTC's choice to allow usage based billing for internet I can't help but to wonder if we are seeing the beginning of another corn in the economy.<div><br /></div><div>It is pretty well recognized that while corn does grow pretty well the reason why it is the go to ingredient in so many products that we buy is because of its regulated availability and subsidization. Because it is subsidized marginal profits don't fall off (marginal costs should go up eventually but subsidies compensate) so you don't get a point where you stop producing. So we can't ever get to equilibrium demand. Prices don't rise to the point where buyers start considering alternatives. Alternatives that would be more efficient if not for the subsidies. Subsidies we all eat.</div><div><br /></div><div>Aren't we slowly moving in this direction for on demand video? There are two extremes for non-physical delivery for on demand video. One is the deliverer has a separate stream of the same video for each watcher delivered at the time of demand. If three neighbours all want to watch the same show at the same time it takes up three times the bandwidth than if one person wanted to watch it.</div><div><br /></div><div>The other extreme is the video is delivered to everyone at the same time via broadcast and stored at the household to be played at time of demand. In this case storage is the alternative to bandwidth.</div><div><br /></div><div>Which is more efficient? Well clearly that depends on the popularity of the video. If it is very popular the broadcast and store method (or rather the PVR method) is more efficient. If the consumer were to feel the cost of the resources he is consuming the market should move towards this efficiency. But without some form of usage billing the consumer sees no benefit to the PVR method over the Hulu/Netflix/web streaming method. PVRs are almost certainly more popular than streaming TV shows at the moment. But there is no extra cost to streaming House rather than recording and playing House. </div><div><br /></div><div>So where does this get you? This gets you to the world of corn. A world where the costs are externalized. Or another parallel, our streets with respect to traffic and gasoline. On our streets our roads, our gas, and now even our cars are all subsidized with an aim towards universal access. Because of this the consumer (ie commuter) doesn't feel the real costs of his decisions. The cost of traffic is externalized but the benefits of convenience and independence are fully realized. 
So we get busy roads instead of an exodus to mass transit<sup>1</sup>.</div><div><br /></div><div>Yes, ISPs can put down more last mile cable to increase bandwidth to give user's this free choice. And corn producers can always produce more corn. And we can always build more roads. Of course this never sounds like a stupid answer when we take the first step down the road of externalizing cost. I'm sorry. I meant to say universalizing access.</div><div><br /></div><div><sup>1</sup><span class="Apple-style-span" >Note that mass transit is also subsidized, and almost certainly more heavily subsidized than driving since they also receive the subsidization of roads and gas. But the point is that driving is subsidized but the rewards are not diminished.</span></div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-20917401468382078942010-12-03T11:06:00.000-08:002010-12-03T20:40:52.041-08:00How Groupon is nailing it (because you suck at evaluating prices)It's easy to look at Groupon's success and shrug with "who knew people wanted coupons?" But that misses a big part of the magic. Groupon has figured out a way to present their coupons that manages to double their perceived value, and not in a way that misleads anyone.<div><br /></div><div>To understand how they do this you first have to acknowledge that you suck evaluating prices. So do I. So do experts. You can read about all the reasons why you suck at prices in <a href="http://danariely.com/">Dan Ariely</a>'s <i>Predictably Irrational</i> or <a href="http://home.williampoundstone.net/">William Poundsone</a>'s <i>Priceless</i>. They'll talk about a range of reasons like anchoring and default choice and comparable items. But what Groupon takes advantage of is that when it comes to price we're all about ratios.</div><div><br /></div><div>The classic example of this is consider you are shopping for a pair of jeans. You find a pair you like at the beginning of the day for $80 but you decide to keep looking because you've just started shopping and you may find a better price. At the end of the day you're in another store, across town and they have the exact same pair of jeans for $130. You're now 30minutes away from the original store, and you're tired, and there's traffic. Would you go back for the better deal? Most people would say they would.</div><div><br /></div><div>But lets say instead you are shopping for a laptop. Now the last store you are in is selling it for $1,550 and the original store is selling it for $1,500. Now do you go back? Most people would say they wouldn't. But the weird thing is the $50 savings has nothing to do with the product. On the one hand the deal is:</div><div><br /></div><div style="text-align: center;">1. Get the thing you want</div><div style="text-align: center;">2. Pay out the lower price</div><div style="text-align: center;">3. Spend 30minutes in traffic</div><div style="text-align: center;"><br /></div><div style="text-align: left;">and in the other case the deal is:</div><div style="text-align: left;"><br /></div><div style="text-align: center;">1. Get the thing you want</div><div style="text-align: center;">2. Pay out the lower price</div><div style="text-align: center;">3. Pay an extra $50.</div><div style="text-align: center;"><br /></div><div style="text-align: left;">Really in both cases the question is would you spend 30minutes in traffic to save $50. 
(The response would be different still if it were framed as spend 30minutes in traffic to be paid $50. But that's neither here nor there.) But we instead fall into the trap of thinking about ratio savings.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">So how does this apply to Groupon? Let's consider the deal I was offered today. Get a $125 coupon for Indochino.com for $50. That's great, right? $125 value for $50; that's 60% off. Pretty good. But you haven't actually got anything yet. So what can you get for $125 from Indochino? Well their specialty is custom suits that start at $299. So $125 is about 40% off. Still pretty good.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">But really, your entire savings is never more than $75. Which means your savings off that suit is only around 25%. By splitting up the purchase of buying a suit from Indochino into buying a coupon from Groupon and then paying the rest at Indochino, Groupon has managed to leverage the savings rate from 25% up to 60%.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">So you would figure if they are using leverage to make the first half of the transaction feel like a good deal then the second half of the transaction must feel like a terrible deal. Not quite. Remember that 40% number we arrived at? When we are actually buying the suit, which is a second transaction that doesn't occur on Groupon, that $50 we paid is out of mind. It's sunk cost. And we ignore sunk costs. (Actually sometimes we're really good at ignoring sunk costs, and sometimes we're terrible at it. For example, even if Indochino were to suddenly jack up their prices we would have a hard time not spending that coupon because it cost us $50 which we are resistant to ignoring.)</div><div style="text-align: left;"><br /></div><div style="text-align: left;">So the result is after the second transaction you've saved 40% thanks to Groupon. Feels pretty good. Feels like brand loyalty. Maybe you'll buy another coupon.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">So Groupon has managed to transform 25% savings into a 40% savings <i>and</i> a 60% savings? Did I start out by saying they doubled the perceived value? </div><div style="text-align: left;"><br /></div><div style="text-align: center;">(40%+60%)/25% = 4X</div><div style="text-align: center;"><br /></div><div style="text-align: left;">Turns out they actually quadrupled the perceived value. Pretty clever.</div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com3tag:blogger.com,1999:blog-32413693.post-23372460309389105372010-11-06T12:21:00.001-07:002010-11-06T12:49:26.314-07:00I want to make a map/reduce logistic regression machine in December. Who's on board?Because of my work, logistic regression has become one of my favourite analytic tools. But now I've crossed the point where everything looks like a nail (which it does) and I'm at a point where I want to make my own hammer (for all these nails that are piling up). So why not write this to work in the space where the future of large dataset analysis is probably going to happen: the Hadoop map/reduce world.<div><br /></div><div>I figure I will have new found time in December (I write the first CFA exam on December 4th) so I might as well try to do this. First step will be to make a simple weighted linear regression machine. 
If I can implement one in Excel surely it can't be so hard to implement one anywhere else. Then figuring out the actual algorithm will be a combination of digging into the R source code, using some common sense, and talking to a friend who has actually built one of these before.</div><div><br /></div><div>But I'd love help if anyone's game. Even just to answer questions. Like, how do I actually set up a test Hadoop server? Or more importantly, is this a silly exercise?</div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com2tag:blogger.com,1999:blog-32413693.post-67031167307124657042010-11-04T22:05:00.000-07:002010-11-04T22:28:41.695-07:00I use SAS and Maple the same way... and nothing elseToday I was thinking about how I use SAS: write some code that creates some stuff, select and run it, write some more code that looks at that stuff (usually in a separate tab), select and run it, then write some code that builds more stuff off of the old stuff, and repeat.<div><br /></div><div>At times it's awkward and ungraceful. But it's also super handy that I don't have to know ahead of time that some code is going to take a few minutes to run so I had better write some code that stores the output somewhere. And it's also almost exactly how I used Maple. And how I understand one uses Mathematica (which I've used all of once, but I think I would love it if I used it more).</div><div><br /></div><div>But I use no other language this way. And I've had opportunity. There were several years before I ever touched SAS where I had mostly stopped using Maple when I wrote a lot of JavaScript and PHP. And sometimes some Python and maybe some Ruby. And Java keeps popping up. Oh, and R. But this pattern of use has never been a consideration for me for anything other than SAS or Maple.</div><div><br /></div><div>Is there a way to use any of these languages this way? Is there tool out there that I'm missing?</div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com2tag:blogger.com,1999:blog-32413693.post-64272874336670689362010-09-10T15:26:00.000-07:002010-09-10T15:41:17.552-07:00If credit is so tight then why do I keep getting offered more credit?About a week ago I posted as a <a href="http://www.theatlantic.com/business/archive/2010/08/do-house-prices-still-have-farther-to-fall/62267/#comment-73238446">comment </a>on the Atlantic's business blog. I thought it may be of interest to those who read here as well (the both of you).<br /><br /><blockquote>Let me suggest what I think is happening in terms of the apparent contradiction in lenders being both too tight and at the same time offering too much money. The credit industry has gotten very good at classifying you into a risk category to determine chances that they will have to write off the money lent into that category.<br /><br />As you may imagine, if the lender sees an expected profit on the amount of risk of lending to people in that risk category they will do it. The problem is it is not well solved how to model how much the risk of default changes as you extend someone's debt obligations (known as modeling your credit capacity). 
This is the sort of tool you need to determine someone's limit.<br /><br />This hasn't been that much of an issue because a lender usually limits how much they will lend out based on their own available exposure, and they just spread it around to as many borrowers as possible.<br /><br />But when modeled risk suddenly shrinks the number of profitable borrowers this constraint is no longer sufficient (even if the bank's acceptable exposure is also shrinking). And beyond that, their models are showing they can make up some of their lost profits by lending more into still profitable categories.<br /><br />There actually do exist a few tools in modeling credit capacity but at the moment they are either crude (like using a flat income to total debt service ratio ceilings) or at the moment unproven. There certainly are people trying to solve this problem, and some institutions may even have working solutions. But it is a problem that not all institutions have solved yet.</blockquote>I should add that in Canada there is the additional issue that credit card companies now need permission from the card holder to extend the card holder's credit limit. This means that risk to exposure formulas will need to be modified as new data comes in on how accepted limit increases are used. Until that is sorted out card companies will feel like they have all this extra safe exposure potential from unaccepted limit increases that they desperately want to use up.stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-29560442256112836322010-07-21T18:22:00.000-07:002010-11-24T17:22:50.247-08:00Android Acer liquid e from Rogers review: recommend don't buy<div><b>Update:</b> <i>The missing apps I was looking for on the Acer Liquid E now seem to be available in the market place. I'm not sure how this happened but it certainly means I'm less critical of the phone.</i></div><div><br /></div>This is by no means a full review and I am by no means a consumer electronics reviewer. That said I have to strongly suggest that anyone considering purchasing the acer liquid e from rogers to buy another phone from another provider instead (the obvious suggestion would be an iPhone 4 from <a href="http://www.bell.ca/shopping/PrsShpWls_Landing.page">bell</a>).<div><br /></div><div>What this may explain for some is why they can't find some popular apps, like the TD app and drop7, in the android marketplace on their phone.</div><div><br /></div><div>The liquid e itself is a pretty straight forward android 2.1 phone, which basically makes it a commodity; ie replaceable.</div><div><br /></div><div>The problem lies in the total stone walling I've received in my attempt to get a very simple matter addressed. </div><div><br /></div><div>Because of the open nature of the phone, and the hack-ability of the hardware, the android market place has given developers the option to make their apps protected. This means only phones with OS images that have been registered with the market place can install the app. This aids in preventing some piracy and lowering the chances that the app is used on a compromised phone (important for, say, a banking app). It is a sensible policy. 
Both TD and drop7 have taken advantage of this option.</div><div><br /></div><div>My issue is that the acer liquid e appears not to be registered with the market place.</div><div><br /></div><div>Rogers' response (via @rogershelps and @RogersKeith)</div><div><blockquote></blockquote><blockquote>It's possible that the app wasn't designed for all Android devices. Best to check with TD.</blockquote>And they have not responded to further follow up.</div><div><br /></div><div>Google has a support forum set up that theoretically has staff responding to <a href="http://www.google.com/support/forum/p/Android+Market/thread?tid=59da8e43446b9d5e&hl=en">problems</a>; but you end up just getting users talking to each other about the problem and a canned response from an employee. I appreciate customer service is hard and it takes staffing. But best I can tell, 30-40 questions come in a day. Would you need anymore than a staff of 20-30 to be able to deal with this? To support an OS you are hoping takes off? How many staff do you think Microsoft or Apple have on their main support channels? I'm guessing a lot more than 30.</div><div><br /></div><div>And finally <a href="http://www.acer.com/">acer</a>'s response.</div><div><blockquote>With your permission may I go ahead and arrange a callback for you from the specialist team Pay for support. They would contact you and give more details on the charges and support boundaries.</blockquote>So acer's response is they are willing to look into the problem if I pay them first. Again, I appreciate there is a place for a pay for support services. But I'm not really looking for support to something custom or unique. I'm simply asking acer to do what is a basic step when releasing an android phone. Something that google really should enforce by use of the android copyright. Something that rogers should pull together as the customer liaison. (The sad part is I might just pay them.)</div><div><br /></div><div></div><div>If any one of these actors had stepped up to their reasonable obligation this problem would never have made it to market. But this is what happens when everyone does a half assed job. Really, go buy an iPhone from Bell.</div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com5tag:blogger.com,1999:blog-32413693.post-69485254087923979052010-02-06T18:49:00.000-08:002010-02-06T19:05:10.394-08:00This is how advertising should work<div style="text-align: left;">My new landlord sent me an email today to let me know the dimensions of my windows in the apartment I'm moving into. But what do I know about drapes, or blinds, or whatever one uses to cover windows.</div><div><br /></div><div>But what's that at the top of my webmail inbox?</div><div style="text-align: center;"><span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "><span class="Apple-style-span" style="-webkit-text-decorations-in-effect: underline; "><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjssUQZi8EzCy0LCgKX9unPDDwkqLW3Xdgx9ttNHGjAnL1xtacXOVMMbUFkfYsuNPDjS8ySHFzFhXyK_MAPJPXwlErZzbzz9fX4329IXTf6jKIAKAe1SI3wyERUYri7gFxEqBU/s400/Untitled.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5435329253143139986" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 25px; " /></span></span></div><div>Now I could treat this as an invasion of privacy or being over exposed to advertisements. 
But honestly, this is an ad that makes my life easier. </div><div><br /></div><div>I haven't looked at their product/prices yet so in no way should you take this as an endorsement. However, point one in their favour is that they are paying for me to receive information. Clearly, my receiving this info is mutually beneficial yet they are incurring all of the costs. That's the right attitude as far as I'm concerned.</div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-23129283572213420152010-01-27T14:28:00.001-08:002010-01-27T15:19:50.737-08:00Eating CrowIf I'm going to write a prediction post it's only fair that I write a follow up to rate myself.<div><br /></div><div><ol><li><span><b>The product will be too expensive. </b>Obviously I was very wrong here. There were a lot of leaks that the price point would be a lot lower than people expected and I clearly ignored them. But let's not pretend the unit is $500 either. 16GBs is clearly not enough storage to use the iPad in any reasonable way. And personally I'm resistant to paying for another monthly data bill. But we aren't talking about apple hifi or iPhone gen 1 prices here.</span></li><li><span><b>Most developers will be locked out, except perhaps in a ridiculously restrictive way.</b> Honestly, my expectation was that the only way you would be able to get applications onto the device was through the app store. But I over stated because I didn't expect iPhone apps to be compatible (and I assumed the API would be released at WWDC). I figured even if there were app store applications available immediately it would seem pretty pathetic because I was expecting a device that looked more like a mac than an over-sized iPod. All that said, I'm still going to give myself half points for this one, rather than no points. Just because we've gotten used to the idea that you have to get permission to put an application on your own device doesn't mean it's not ridiculously overly restrictive.</span></li><li><span><b>Content channels will be incredibly locked down. </b>Again, I was expecting a more general/mac like device instead of an over-sized iPod. If this had been a mac tablet then it would have seemed absurd that it couldn't play divx. As a large iPod we've been conditioned to expect it not to play divx. But I think I really have to give myself a zero on this prediction simply because apple went with a standard and used epub. However, here are some things to consider. My guess is apple will treat epub like aac: you will be able to install non-drm'd epubs from anyone but only apple will be allowed to offer drm'd epubs. Where does this put amazon? I'm sure the kindle app will still be available but there will be advantages to using the native iPad book reader. Amazon's content deals won't let them sell drm free books and apple won't let amazon sell drm'd books into the iPad book reading app. So the only major seller of content of books for the iPad will still be apple.<br /></span></li><li><span><b>There will be some incredibly obvious feature that is inexplicably missing. </b>Really this is a silly prediction. Of course features are inexplicably missing. Because the explanation is available to those who are working in the real world who are dealing with timelines and pricing restrictions. So yeah, there is no gps in the non-3g version, no camera, no multi-tasking, and no flash. And there are probably many more. 
So I get full points but how could I not.<br /></span></li></ol><div>So why were my predictions so wrong? Am I just an idiot? I think all my predictions fairly cited apple's history for first gen products. Perhaps what I missed is that maybe this isn't a first gen product. This really is just a big iPod touch. Book store is new and so are these iWork apps but really they are just riffs on current businesses. And we haven't seen how they will pan out yet.</div></div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-8905777943377436592010-01-25T18:43:00.000-08:002010-01-26T15:26:11.872-08:00Careful overextending that model<div>In my current work I spend a lot of time working with models that try to predict the likelihood of a customer's future action. From time to time we have a need to model something new quickly and it is very tempting in those cases to take a different but similar model and see if it can simply be re-aligned to fit the new circumstance. Sometimes this works fine but it is interesting how often this fails dramatically.</div><div><br /></div><div>But I have to be careful about writing about specific examples from work. So let's consider a hypothetical example instead to see the danger of assuming a model extends.</div><div><br /></div><div>How about consider a software company who has started a pilot program of giving potential clients six week trials of their product. Now they would like to know as soon as possible what sort of customer is most likely to convert so they can focus their sales force most effectively. Rather than wait six weeks for results they decided to try to observe who will convert in the first week (ie a target of one week conversion). That way after one week, plus development time, they will have a model that they can start using. While this model won't have much accuracy in predicting the probability of converting in six, surely it will rank order correctly with a model that used a full observation period. After all, if a customer that doesn't convert in week one has similar characteristics as someone who does then it seems fair to conclude they just need a little bit more of a push to become more like their cohorts.</div><div><br /></div><div>However, this model fails disastrously. Those who were supposed to be the most likely to convert ended up being the least likely. So the model is revisited to see what could have gone wrong. It is discovered that the biggest drivers of the model were</div><div><ul><li>lots of use of the software</li><li>lots of interaction with the sales department (asking questions about price, etc)</li><li>indicates on questionnaire that the software is to be used on an urgent project</li><li>asks a lot of questions about the features that are locked by the trial.</li></ul><div>And that's when the analyst hypothesizes his mistake. Instead of targeting for the desire to convert, he suspects that he's targeted for the need to make a decision quickly. So those who were rated highly by the model and didn't convert in the first week, didn't convert because they had decided in that week that they would never convert. </div><div><br /></div><div>To test this theory the analyst checks whether those who did not convert, but were predicted to, had bought the competitor's software (the trial included some spyware so he can check for such things). And there is the answer. 
<div><br /></div><div>Even though we aren't wearing lab coats, it is a good idea to keep in mind that we are still doing science. The scientific method makes a pretty good guide.</div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com1tag:blogger.com,1999:blog-32413693.post-72805794138287361502010-01-24T19:03:00.001-08:002010-01-24T19:06:07.837-08:00Implement US health care reform at a state level<div>Slate <a href="http://www.slate.com/id/2242352/">asks</a> how the Democrats can still implement healthcare reform. This was my response:</div><div><br /></div><hr /><div><i>Consider the major points of the senate bill: state-level exchanges, an individual mandate, no refusal based on pre-existing conditions, and subsidies paid for by a Cadillac insurance tax. Assuming the tax and subsidy are distributed equally, the benefits for any one state implementing this bill are not affected by how many other states implement this bill.</i></div><div><i><br /></i></div><div><i>And since health insurance companies are not allowed, and would not be allowed, to operate in one state and insure someone in another, there is no issue of a loss in cost savings due to smaller pools being insured.</i></div><div><i><br /></i></div><div><i>So if there is no advantage to implementing this nationwide, instead of in just the states that can pass it, why not just pass the bill at a state level in the states where it can pass? A domestic coalition of the willing.</i></div><div><i><br /></i></div><div><i>There is a non-zero-sum game in game theory called stag hunt which has two defining characteristics: a single defector is able to lower the payoff of the co-operators, and the single defector lowers his own payoff by defecting (unless there is another defector). It is this situation that justifies modern democracy, where the majority is able to enforce the co-operation of defectors (whose defection wouldn’t even be in the defector’s own self-interest).</i></div><div><i><br /></i></div><div><i>But this isn't stag hunt. A single defector, or many defectors, has almost no effect on those that co-operate. So implementing at a state level may even be more democratic in this case.
After all, shouldn’t individual choice be preserved where possible?</i></div><div><i><br /></i></div><div><i>There may be two migration issues to worry about in this strategy: those with pre-existing conditions moving into states that implement this bill, and healthcare providers moving out. However, there are lots of other ways that entitlements vary from state to state, so if migration isn’t already a strategy for those seeking extra entitlements there’s no reason to think this entitlement would be any different. And we’ve been told that this bill won’t hurt healthcare providers, so there should be no reason for them to leave.</i></div><div><i><br /></i></div><div><i>Finally, consider our healthcare in Canada. While we have broad laws at a federal level requiring single-payer and transferable health insurance, healthcare itself is actually implemented at the provincial level. This is because (a) the provinces were considered too varied to manage all of healthcare from Ottawa and (b) it was thought that trying to manage the healthcare of 30 million people in a single institution is a task that only a crazy person would attempt. 300 million is a bit more.</i></div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0tag:blogger.com,1999:blog-32413693.post-33239132503877984142010-01-20T19:38:00.000-08:002010-01-20T20:32:35.379-08:00My pessimistic/realistic apple tablet predictionsSo I wanted my next blog post to be a little less controversial, but I think I'm about to kick the biggest hornets' nest yet: apple fan-boys. But the next apple announcement is around the corner and I don't have much longer to be a wet blanket.<div><br /></div><div>Note that I'm basing these predictions completely on apple's track record with first gen devices. This isn't what I want to see happen, just the pattern I'm used to from apple. Also, be warned: I'm a bit rant-y.</div><div><ol><li><b>The product will be too expensive.</b> What did the iPod, iPhone, apple speaker and appleTV all have in common on day one? They looked fun, but they were way too expensive to consider actually buying under normal circumstances. In each case they were multiple hundreds of dollars too expensive. Now, they come down in price eventually (assuming the product isn't discontinued first). And they can get away with the high price because there is a cluster of consumers at the high end of the demand curve sufficient to buy all of their initial supply, which is usually pretty small. But I don't think it makes sense to expect a sane price.</li><li><b>Most developers will be locked out, except perhaps in a ridiculously restrictive way.</b> I've heard various predictions on how developers should be excited because they will have a whole new way to make a ton of money. But apple has never let developers make money on their products on day one. And when you are allowed, it has to be exactly on their terms. When the iPhone came out it was just assumed that anyone who was buying one would jailbreak it, because apple had locked it down so much. This is still true with the appleTV. This is the company that made an iPhone with a recessed headphone jack so the vast majority of third-party headphones wouldn't work with it. And the company that disallowed the iPhone <a href="http://rssplayer.blogspot.com/">podcaster app</a>, allowed it, then re-disallowed it, put it in a three-month penalty box, then re-allowed it again.*</li><li><b>Content channels will be incredibly locked down.</b> In fact all content will go through apple.
This one seems obvious to me. The appleTV is really the only set-top box left that can't stream netflix in the US. Neither the iPhone nor the appleTV has an approved way to play divx files. There may be a crack in this lockout similar to podcasts on the iPod. But if you want to charge for your content you are going to need to go through apple to get onto their device. And then you'll have to wait (and as I understand it, wait and wait and wait) to get paid by apple. So I expect predictions that this will be the perfect universal content device to be very, very wrong. </li><li><b>There will be some incredibly obvious feature that is inexplicably missing.</b> The appleTV doesn't have a tv tuner, and there's no approved way to add one. The iPhone has had bluetooth from day one but no way to use an external bluetooth keyboard. The iPhone took multiple generations to get cut and paste. The apple mouse still doesn't have a simple second button. (<i>No, that doesn't count. I said a "simple" second button.</i>)</li></ol><div>So might these predictions be wrong? Sure, and I hope they are. But if they are wrong it's because apple has decided to alter their behaviour. There are those who expect this to be the perfect universal device that can be crafted into whatever they need a tablet for. I have no idea why they think such a device could ever be a first gen apple product.</div><p><span class="Apple-style-span" style="font-size: small;">*Ok, I need to rant about this a little more. Let me be clear: apple hates developers. I used to think they were generous because they gave away xcode. But then I realized that they aren't giving away xcode, they are exclusively bundling the only IDE that can develop for their platforms with their computers. So as a developer, you are allowed to develop for the iPhone, but to do so you need to buy one of their computers, only develop on that computer, and then seek approval for what you've done to get it into the store. There, I'm done ranting.</span></p></div>stevenhttp://www.blogger.com/profile/16796611979115715515noreply@blogger.com0