Wednesday, September 22, 2021

An LLR for significance of interaction of two variables upon an effect (or how to compare A/B test effects in different segments)

I should warn you, this is a blog post about a formula. And a formula that's really a work in progress. But I think a useful formula nonetheless. Before I reveal it, let me give a motivating example of why you might find this formula interesting.

Consider an online store that decides to increase conversions with an extra incentive: with every purchase they are going to include company stickers. The problem is they only have a fixed inventory of stickers, so they can't give them to everyone. So how do they decide whom to offer the stickers to? When do you say "if you buy today we'll throw in a sticker!"?

So the company enlists two data scientists to solve the problem. And, as data scientists are wont to do, they take a look at the funnel and make a conversion model. After cleaning the data and understanding the results, they have a segmentation that breaks visitors up by conversion rate. So they schedule a meeting to present their results.

Scientist 1: After combing through the data we've figured out your best and worst converters.
Scientist 2: Visitors from Texas who are on the site before 9am are your highest converters. And visitors from western Canada who visit after 3pm are your lowest.
Scientist 1: So when can we begin offering these stickers to the Texans?
Scientist 2: You mean the Canadians? The Texans are already converting high. It's the Canadians that have the low conversion rate we have to raise.
Scientist 1: You want to give an incentive to our least receptive audience? Clearly the Canadians don't like us. We're not going to be able to make them like us with just some stickers. At least we know the Texans like us and the non-converters just need a little bit more of a nudge.
Scientist 2: But if we give an extra promotion to the Texans we're just going to be cannibalizing our own sales. We know we'll be giving out a ton of our limited stickers to visitors who were already going to convert.

And this is when the data scientists look around and realize everyone else has already left the meeting. The problem is they didn't think about the actual problem they wanted to solve. What they needed to find out is the incremental impact of offering stickers to which visitors. Do they gain more conversions on the margin by offering stickers to those after 3pm Western Canadians or to the before 9am Texans? Or is that even the right divide? Perhaps the best sting for the sticker is to offer them to Firefox users referred by organic search. What they want to model is the incremental impact of offering a sticker to any given visitor.

Unsurprisingly, this will require an A/B test. What's more tricky is how to handle the results. Usually when you compare whether two segments are different (Chrome users vs non-Chrome users) you want to find out if the different segments convert at different rates. But remember, our data scientists solved that, and it didn't help them solve the problem they were given. What they want to find out is whether there's an interaction between their segmentation variable and their assignment variable when applied to conversion.

You can imagine this is interesting in general anytime you have an A/B test. You may know that treatment B converted better than treatment A. But it would be nice to know whether there was a particular cohort who particularly preferred B. Perhaps there's a small cohort who preferred treatment A.

The simplest model that you can ever look at is just stating the global results without segmentation: the group with treatment A converted at 20% and the group with treatment B converted at 25%, so treatment B increases conversion by 5 percentage points. The second simplest model is to take a single split of the population and determine if the two subpopulations are different with statistical significance with respect to the property you care about. Segment 1 had conversion rates 15% and 20% for A and B respectively while Segment 2 had conversion rates 30% and 40%. Did B have a bigger effect over A in Segment 1 or Segment 2? (This will vary depending on what we mean by "effect".)
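To make that parenthetical concrete, here is a small Python sketch that applies three common definitions of "effect" to the rates just quoted. The function name `effects` and the choice of these three definitions are illustrative assumptions, not part of the calculator.

```python
def effects(rate_a, rate_b):
    """Three common ways to express the effect of B over A on a conversion rate."""
    odds_a = rate_a / (1 - rate_a)
    odds_b = rate_b / (1 - rate_b)
    return {
        "absolute_lift": rate_b - rate_a,      # difference in rates
        "relative_lift": rate_b / rate_a - 1,  # proportional change in rate
        "odds_ratio": odds_b / odds_a,         # ratio of target:non-target odds
    }

segment_1 = effects(0.15, 0.20)
segment_2 = effects(0.30, 0.40)
for name in segment_1:
    print(name, round(segment_1[name], 4), round(segment_2[name], 4))
```

By absolute lift Segment 2 moves more (10 points vs 5), by relative lift the two segments tie (both about 33%), and by odds ratio Segment 2 is ahead again (about 1.56 vs 1.42). The rest of this post uses the odds-based definition.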

Consider what data you need to collect. For each of the two segments you need the number of conversions (henceforth referred to as target) for each treatment, as well as the number of non-conversions (non-target) for each treatment. This might look something like:

A/B | segment | target | non-target | total | target:non-target odds

From this you may make some charts to try to tell what is going on:

Well, segment 2 certainly has more targets than segment 1 in both A and B. But it also has more non-targets. And did B go up more in segment 2? Maybe it's better to look at some percentages.

Ok, now we can tell segment 2 definitely has a higher target rate. And treatment B has a higher target rate in both segments. But did treatment B have more marginal impact in segment 1 or segment 2? For that we look at the ratio of the target odds between B and A in each segment.

Great, we can see treatment B has more of an impact on target rate over treatment A in Segment 1 than in Segment 2. In Segment 1, the odds of conversion increased 20% when going from treatment A to treatment B, but they only increased 7% in Segment 2.

But there are so many moving parts. We started with 8 numbers (target/non-target, segment, treatment) and kept dividing in various ways. Each of those 8 numbers is just a sample with error bounds. How do we know if we have enough volume? Can we state any conclusion with any confidence at this point?

And that's where the promised formula comes in. It's not pretty to look at so I'm going to make you click through if you want to see it. What this technically is, is the maximal log likelihood of a given segment, where an impact or effect is provided. In this case the impact is the difference of the ln odds observed in treatment B and the ln odds in treatment A. Given this you can come up with a log likelihood ratio (LLR) where you compare the assumption that both segments have the same impact vs having distinct impacts.
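The comparison described above can also be sketched numerically. The Python below is not the closed-form formula from the derivation; it is a minimal stand-in that finds the same maxima by search, assuming every count is positive and the log odds stay within the search bounds. All function names are hypothetical.

```python
import math

def sigmoid(x):
    # numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def log1pexp(x):
    # numerically stable log(1 + e^x)
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def loglik(base, impact, seg):
    """Binomial log likelihood (up to a constant) for one segment.
    seg = (target_a, non_target_a, target_b, non_target_b); base is the
    log odds under A and base + impact is the log odds under B."""
    ta, na, tb, nb = seg
    return (ta * base - (ta + na) * log1pexp(base)
            + tb * (base + impact) - (tb + nb) * log1pexp(base + impact))

def max_loglik_given_impact(impact, seg):
    """Maximal log likelihood of a segment when the impact is fixed."""
    ta, na, tb, nb = seg
    lo, hi = -30.0, 30.0
    for _ in range(100):  # bisection on the derivative, which decreases in base
        mid = (lo + hi) / 2
        grad = (ta + tb) - (ta + na) * sigmoid(mid) - (tb + nb) * sigmoid(mid + impact)
        if grad > 0:
            lo = mid
        else:
            hi = mid
    return loglik((lo + hi) / 2, impact, seg)

def interaction_llr(seg1, seg2):
    """Log likelihood with distinct impacts minus log likelihood with one shared impact."""
    distinct = 0.0
    for ta, na, tb, nb in (seg1, seg2):
        base = math.log(ta / na)
        distinct += loglik(base, math.log(tb / nb) - base, (ta, na, tb, nb))
    lo, hi = -10.0, 10.0
    for _ in range(100):  # ternary search: the profile likelihood is concave in the impact
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        f1 = max_loglik_given_impact(m1, seg1) + max_loglik_given_impact(m1, seg2)
        f2 = max_loglik_given_impact(m2, seg1) + max_loglik_given_impact(m2, seg2)
        if f1 < f2:
            lo = m1
        else:
            hi = m2
    best = (lo + hi) / 2
    shared = max_loglik_given_impact(best, seg1) + max_loglik_given_impact(best, seg2)
    return distinct - shared

# Hypothetical counts, not the data from this post:
# (target_a, non_target_a, target_b, non_target_b) per segment
print(interaction_llr((100, 900, 200, 800), (200, 800, 200, 800)))
```

Twice this value is the usual likelihood ratio statistic, which is approximately chi-squared with one degree of freedom when the two segments really do share one impact. The calculator may use a different scaling, so treat the numbers from this sketch as comparable only to each other.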

That's quite a mouthful but for those interested I've provided the derivation using sagemath (which is only a handful of lines given that sagemath does most of the heavy lifting for us).

What's actually useful is that, using this derivation, I've written a simple calculator to calculate these LLR values. Feel free to take and modify this code.

For the above example the calculator gives us an LLR of 0.118, which is quite small, so we have no evidence of a difference in the strength of the impact of treatment B over A between segment 1 and segment 2. That is to say, we cannot say with confidence that there is an interaction between the assignment variable and the segmentation variable upon our target.

It's interesting to note that there is a significantly different target rate for our two segments. The target rate goes from 10% to 13.4%. If we were to run a simple t-test between our two segments with respect to target we would find they are different with statistical significance. But remember, we're not interested in separating the high converters from the low converters.

But what if we tweak the numbers a bit and increase the volume:

A/B | segment | target | non-target | total | target:non-target odds

Now we get an LLR of 19.07. This is starting to get high enough where it's starting to be convincing there is an effect, despite the graphs looking very similar:

The difference is there's more volume and a slightly larger effect. But the actual conversion rates of the two segments got closer, at 11% and 12.6%.

However, this isn't just a measurement of volume. We can tweak the numbers again with similar volumes and end up with an LLR of 0.048.
A/B | segment | target | non-target | total | target:non-target odds

The LLR dropped because now the ratio of odds is very close. So while we have lots of volume, and again the target ratio is very different, the impact of B over A is the same in both segments. It increases the odds by 50% in both segments. So the segments are not different by this measure even though the segment target rate is the most different we've seen so far: 13% and 18%. (You can find all the data above in a Google sheet.)

None of this is that novel in the world of statistics. All of this can be done by looking at the significance of the interaction coefficient in an ANOVA table for a two-feature logistic regression. But using the simple code in this calculator you can automate the hunt for segmentations that you care about. Or you can take the formula, and a decision tree modeler like brushfire, and build a tree based on segments where the impact is highest or lowest. Using this model our data scientists may be able to solve the actual problem they were asked to solve and be able to give away these stickers.
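For comparison, when both features are binary the interaction coefficient of that logistic regression can be computed by hand: in the saturated model its estimate is the difference between the two segments' log odds ratios, and its standard error is the square root of the sum of the reciprocals of the eight counts. A minimal Python sketch, assuming all counts are positive (the function name and the example counts are illustrative):

```python
import math

def interaction_wald(seg1, seg2):
    """Wald z and two-sided p-value for the segment x treatment interaction.
    Each segment is a tuple (target_a, non_target_a, target_b, non_target_b)."""
    def log_odds_ratio(ta, na, tb, nb):
        return math.log((tb / nb) / (ta / na))
    estimate = log_odds_ratio(*seg2) - log_odds_ratio(*seg1)
    std_err = math.sqrt(sum(1.0 / count for count in seg1 + seg2))
    z = estimate / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts, not the data from this post
z, p_value = interaction_wald((100, 900, 200, 800), (200, 800, 200, 800))
print(z, p_value)
```

This is a Wald test rather than a likelihood ratio test, so it will not match an LLR exactly, but with reasonable volume the two usually agree on whether an interaction is there.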

Try the calculator below and let me know if you get any surprising results.

Saturday, March 22, 2014

How often should you retrain your model?

One of my favourite modeling managers taught me that it wasn't his job to determine if my judgement or his judgement was right. You wouldn't go to him with an argument, and he'd stroke his chin, think a bunch, then say whether he'd agree or not. If our judgements disagreed we'd come up with what data we could collect that would determine whose opinion lined up closer to our business.

I think about this when I see competing approaches to model training.

When I worked in credit risk, predictive models tended to be trained and built with a life expectancy of 6 to 36 months. During the lifetime of the model, distributions of score and model fitness are closely watched. When problems are discovered, adjustments and realignments are made along the way rather than scrapping and retraining.

Many software engineers seem to have come to a different conclusion. Here the ultimate solution seems to be to build a tool that just keeps retraining monthly/weekly/daily. (Ok, I don’t actually believe anyone is advocating daily. But the hyperbole makes the point.) Online training is regarded as being obviously superior to all other solutions.

The most common argument you might have for rebuilding your model regularly is that you are worried about your model becoming stale. The world changes over time so the factors that were predictive aren't going to predict as well as time goes on. This is certainly eventually true.

But consider what you're actually saying here. Say you are trying to predict 30 day attrition. Then your data has to be at least 30 days old to begin with. After all, how can you tell me if a user from two days ago will churn in 30 days? In order to amass some volume your observation period probably goes back another 15 days (and maybe as much as 100). Putting this together, what you're saying is that at 45 days the model is fresh, but at 75 it is unacceptably stale? I'm skeptical that there's going to be a shift in society that was observable 45 days ago, not observable 75 days ago, and is still relevant now.

The other reason to keep rebuilding is you're building up a larger observation set. If this is your argument, clearly you're not worried about staleness because you're probably making your observation period as wide as possible to capture as many observations as possible. But again, I'm not convinced you're going to get that many wins. Maybe you're starting very early and the first month you retrain you double your observations. That's probably going to make some difference. But after that you're increasing by a half, then a third, then a quarter. These returns seem to be diminishing pretty quickly.

But of course the big argument to retrain is what's the harm?

Some reasons against auto-retraining:

  • You end up building rigid structures that you won't modify. That is, by automating the process you have to do a little bit more work for each choice than by just doing it. If you have a thought for a post processing of your model output and you're just writing the code for your model, you just need to write the actual post processor and test the code. You probably wouldn't try this if you're doing automatic training because you need a post processor that's dealing with a moving model. Or you might have a framework where this post processor just doesn't fit.

  • Your model interactions won't be consistent. You're going to end up with multiple models in the long run because different data will arrive at different times and you need to make the right choices at the right time. You might even have models optimized for different targets. Every time you retrain your model you can check if it improves on predicting the given target. But what do you do if model A says 80% and model B says 30%? You want to know that this cohort isn't constantly changing in personality. But every time you retrain you lose knowledge on model interactions.

  • You'll lose out on gradual model improvements. Or let me put this a scarier way: you will be constantly running with mistakes.

  • You end up with a more average model. The quality of a model of course is just a sampling from a distribution. And the observed performance on a validation set is a sampling of a distribution that's dependent on the model quality. So what happens if you keep re-sampling? You end up with the expected outcome.

Convinced? You shouldn't be. No really... you really really shouldn't be.

That's because if you're in the position where you are guiding model based strategies you should be pretty impervious to arguments. You should understand that the best sounding argument is frequently wrong.

In the case of deciding how and when to retrain, which of these arguments makes sense for your situation depends a lot on… well... your situation. In the case of advertising and spam, where your performance periods are short and the users change quickly, very few of my reasons against auto-retraining make sense. However, for modelling churn and fraud, the situation is pretty different. The performance periods are longer and the behaviours change slower. How do you know what situation you’re in?

That’s why you work with the data instead. Figure out what the data tells you.

You can build the composite model that you would have had if you retrained every week, or every month, or kept the single model. Here's a simple task. Take your history and train a model for every week. Then evaluate every model for every future week graphing the AUC of the model on the y-axis. You basically end up with what looks like a cohort chart for model age. It'll become really clear the rate at which your model is degrading.
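The task above can be sketched in a few lines of Python. This is a minimal outline under assumptions: `train` and `predict` are hypothetical callables standing in for whatever modeling code you already have, each week's data is assumed to contain both outcomes, and the toy data at the bottom exists only to show the shape of the result.

```python
def auc(scores, labels):
    """Chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def model_age_grid(weekly_data, train, predict):
    """weekly_data: list of (features, labels), one entry per week, oldest first.
    Returns {(training_week, age_in_weeks): AUC on the week that many weeks later}."""
    grid = {}
    for i, (features, labels) in enumerate(weekly_data):
        model = train(features, labels)
        for j in range(i + 1, len(weekly_data)):
            future_features, future_labels = weekly_data[j]
            scores = [predict(model, x) for x in future_features]
            grid[(i, j - i)] = auc(scores, future_labels)
    return grid

# Toy stand-ins (hypothetical): the "model" is just the direction of one feature.
def train(features, labels):
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    return 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0

def predict(model, x):
    return model * x

weeks = [([1, 2, 3, 4], [0, 0, 1, 1]),
         ([1, 2, 3, 4], [0, 0, 1, 1]),
         ([1, 2, 3, 4], [1, 1, 0, 0])]  # the relationship flips in the last week
grid = model_age_grid(weeks, train, predict)
```

Plotting the grid with age on the x-axis and one line per training week gives the cohort-style chart; averaging over training weeks at each age gives a single degradation curve.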

This exercise of course isn't perfect. In reality the data you would have collected had you updated your model regularly wouldn't look like the data you currently have. But if these issues are large enough to skew your data so much that it would change your conclusion, you have even bigger problems.

I was reminded in a recent conversation with a very skilled modeler of the old adage that the proof of the pudding is in the eating. Ultimately, you are going to be far happier if you incorporate these changes with champion/challenger strategies (really just A/B testing). When it comes to building code that’s generating models that are then affecting users, it is far harder to tell what work is actually adding value and what work is only adding debt. It is far too easy to not understand the actual effects your users are experiencing.

I think it’s reasonable to say that in the field of modeling you should be spending at least half your time measuring rather than building. At first this seems disappointing because this means you will only be able to build half as many things that you are excited about. But when I look back over the things I’ve worked on, it is the things I’ve measured that I’m most proud of. It is only the things I’ve measured that have allowed me to gain knowledge from my experience.

Friday, December 20, 2013

Successful data teams hustle

There seem to be a lot of ways to start a data team at a startup. One popular technique is for you to be an internal consultancy within the company. The rest of the company is supposed to come up with needs that require a data specialist and you are supposed to prioritize and respond to those needs while building tooling to solve those needs.

Unfortunately, this often ends up producing a team of data cops. A team more interested in enforcing how others should use their tools rather than producing value with data.

I think there's a much more effective approach.

Instead, consider your new data team to be a startup inside the company: making and selling a product. Your product is the data; and intuition and judgement are your entrenched competition.

You're probably the sort of person to whom it's obvious that people should be using more data in their decisions. You probably shiver every time you hear someone say they are basing their decisions on their strongly held beliefs that have no evidence to support them. But you may not realize how un-obvious this is to everyone else.

In order to succeed as a data team, you're going to have to learn to operate like a successful startup.

That means that just like any founder, you're not just a developer. You're sales. You're support. You're the number one advocate.

And you're going to have to hustle.

Don't make people's lives harder. Don't be confused in thinking that the rest of the company (your customers) are going to put in extra effort to deliver your data ready to be consumed. Don't try to start putting impositions on product development to make your life easier. Startups that make a product that puts demands on its users rarely survive. To put it simply, you work for them.

Make people's lives easier. Adopt work that existed before there was a data team, where that makes sense (eg: take on log maintenance). This is what all other teams do when they form. A new design team is responsible for the landing page design even if it was originally designed by a developer. Startups that solve a previously unsolved problem rarely take off. Take on the schleps.

Anticipate opportunities for data to be the answer and then have it ready. A friend of mine recently told a story how when he was in the t-shirt business he'd respond to potential contracts with custom shirts in the bid. When you see an opportunity for data to make your business better build it, don't argue for it. Big pitches don't sell. Big pitches that don't even have a screen shot really don't sell. Having the product ready and pre-configured sells.

The word no isn't in your vocabulary anymore. When you have succeeded in gaining some interest, don't turn around and tell them "well, that's not actually what I'm building". Successful companies pivot in response to demand; and so do you. I'm not saying you have to be a GIGO machine that answers every question you are given. But every request is a lead; and every lead is gold.

Communication will make you or destroy you. What's worse than having bad data? Having to discover for yourself that the data is bad. You will make mistakes. But you also need to earn trust. People will learn to trust that you are providing good answers when you pro-actively and aggressively communicate where things have gone wrong.

Learn to take the blame. In general, learn how to provide customer service. For example, when someone has a data need that your tooling can't handle, instead of responding by saying "well, you can't really do that because that request is kind of unreasonable" try:
"That's a totally reasonable request and I can understand why you'd want that. Embarrassingly, the tools we've set up don't actually support that yet. But let me come up with something that will solve your problem for now."

Try to remember you're not telling people to eat their vegetables. It's very easy to be seen as the doctor saying "if only you were to eat all your vegetables you will eventually appreciate them". But you're not offering vegetables. You're offering pie. The pleasure of using data is almost immediate and it never gets old (just like pie). So while you are competing with an entrenched product (intuition), your competition doesn't have what you have. You have pie.

Tuesday, November 26, 2013

My hair on fire rule of metrics

I feel since I talk about this rule a fair amount I should have it published somewhere. A hair on fire rule is one which when noticed you don't wait for arguments to weigh the pros and cons. You just put out the fire. I have one rule like this for metrics.
A metric for a time period can't change after it has been reported.
This doesn't mean you have to be able to report the metric immediately after the time period has ended. And it doesn't mean you can't fix errors later. But it means that the definition of the metric shouldn't be affected by future events.

Monday, May 27, 2013

Some podcasts you should listen to if you're involved in A/B testing

Statistics has been in the news recently which has made for some really thoughtful content being made on the topic. I started compiling a list of people who I thought would enjoy listening to these podcasts on the topic and that list got pretty long so I'll use this blog to broadcast instead. I'll resist giving commentary or critiques on the actual conclusions of the speakers except to say they are interesting.

First was Frakt on Medicaid and the Oregon Medicaid Study on EconTalk which is a great discussion of the statistical power of studies.

Second is Paul Bloom and Joseph Simmons on which really illustrates how getting fake results from bad statistical practices isn't just a theoretical problem and how you can demonstrate this with simulations.

And finally, back on EconTalk, is Jim Manzi on the Oregon Medicaid Study, Experimental Evidence, and Causality which gets into some more subtle analysis flaws that can destroy the value of A/B testing and really drives home the point that it is a failing endeavour to try to harvest a lot of confidence out of any single experiment. That confidence is gained through an iterative process that comes out of a lot of simple experiments that are constantly updating your priors.

I'll break my no commentary promise a little here. One thing I find quite interesting is how Simmons and Manzi essentially come to the same conclusion on the problem of gaining knowledge from a single experiment while using modern data mining techniques; but they offer different cures. Simmons recommends not allowing yourself to search over your data over lots of dimensions as that will surely lead to false positives. Whereas Manzi seems to say you should never be too positive about the results of any single experiment. So iterate over a series of small experiments instead; each one informing the next. Perhaps this is a reflection of their industries (academic vs business) but then this too may be overfit. They both agree that we have to accept that we can't gain truths as quickly as we currently think we can.

Sunday, February 10, 2013

Piecewise linear trends in the browser

Somehow I never blogged about the JavaScript implementation of l1tf released by my friend Avi Bryant and myself. l1tf is a way to find piecewise linear trends in time series data.

Monday, February 04, 2013

Simple cross domain tracking

I hear of some really complicated schemes from time to time to track users across multiple domains that belong to a single site. While I'm sure they mostly work it seems like there's a simple way to do this that I assume many people are already using but is probably too boring to comment on. So, let's be boring for a moment.

Let us say you own,, and When a user visits he is assigned a unique tracking token in the cookie (we'll call it [tracking-token-eggs]). At some point after that token is assigned, include it in the page requests to //[tracking-token-eggs]&, and //[tracking-token-eggs]& (Create the same setup for visitors to and

If the browser already has a token stored in the or cookies you will now have a request that includes both domains and both tokens; both domains are in the url, one token is in the url and the other token is in the cookie of the request. The first domain is also in the referrer/referer. This works even if 3rd party cookies are blocked (at least in the browsers I've tried). Now you can store this request in a database table or just a log file.
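As a rough illustration of what gets stored, here is a Python sketch that pulls both tokens out of one such logged request. Everything named here is an assumption for the sake of the example: the domains `eggs.example` and `bacon.example`, the query parameters `token` and `domain`, and the cookie name `tracking_token` are all hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlsplit

def link_tokens(request_url, cookie_header, cookie_name="tracking_token"):
    """Return the pair of tokens one tracking request ties together, or None.
    request_url: the logged URL of the gif request (hypothetical format).
    cookie_header: the raw Cookie header sent with that request."""
    parts = urlsplit(request_url)
    query = parse_qs(parts.query)
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if "token" not in query or cookie_name not in cookie:
        return None  # the browser has no token yet on the receiving domain
    return {
        "pixel_domain": parts.hostname,            # domain that received the request
        "pixel_token": cookie[cookie_name].value,  # its own token, from the cookie
        "origin_domain": query.get("domain", [None])[0],
        "origin_token": query["token"][0],         # the other domain's token, from the url
    }

linked = link_tokens(
    "//bacon.example/t.gif?token=abc&domain=eggs.example",
    "tracking_token=xyz",
)
```

Each non-empty result is one row for the table or log file, and joining rows on either token gives the set of tokens that belong to the same browser.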

If you want to do something slightly more complicated that involves javascript you can alter the technique to use iframes instead of gifs. Just don't try to create or store any new tokens in the iframe from the foreign domain because this is when techniques fail.

[Edit: I should add that this is a technique for when you have half a dozen domains or so. Not for hundreds of domains.]