Wednesday, September 22, 2021

An LLR for the significance of the interaction of two variables upon an effect (or how to compare A/B test effects in different segments)

I should warn you, this is a blog post about a formula. And a formula that's really a work in progress. But I think a useful formula nonetheless. Before I reveal it, let me give a motivating example of why you might find this formula interesting.

Consider an online store that decides to increase conversions with an extra incentive: with every purchase they are going to include company stickers. But the problem is they only have a fixed inventory of stickers, so they can't give them to everyone. So how do they decide whom to offer the stickers to? When do you say "if you buy today we'll throw in a sticker!"?

So the company enlists two data scientists to solve the problem. And, as data scientists are wont to do, they take a look at the funnel and make a conversion model. And after cleaning the data and understanding the results they have a segmentation that breaks visitors up by conversion rate. So they schedule a meeting to present their results.

Scientist 1: After combing through the data we've figured out your best and worst converters.
Scientist 2: Visitors from Texas who are on the site before 9am are your highest converters. And visitors from western Canada who visit after 3pm are your lowest.
Scientist 1: So when can we begin offering these stickers to the Texans?
Scientist 2: You mean the Canadians? The Texans are already converting high. It's the Canadians that have the low conversion rate we have to raise.
Scientist 1: You want to give an incentive to our least receptive audience? Clearly the Canadians don't like us. We're not going to be able to make them like us with just some stickers. At least we know the Texans like us and the non-converters just need a little bit more of a nudge.
Scientist 2: But if we give an extra promotion to the Texans we're just going to be cannibalizing our own sales. We know we'll be giving out a ton of our limited stickers to visitors who were already going to convert.

And this is when the data scientists look around and realize everyone else has already left the meeting. The problem is they didn't think about the actual problem they were asked to solve. What they needed to find out is the incremental impact of offering stickers to particular visitors. Do they gain more conversions on the margin by offering stickers to the after-3pm western Canadians or the before-9am Texans? Or is that even the right divide? Perhaps the best use of the stickers is to offer them to Firefox users referred by organic search. What they want to model is the incremental impact of offering a sticker to any given visitor.

Unsurprisingly, this will require an A/B test. What's trickier is how to handle the results. Usually when you compare whether two segments are different (Chrome users vs non-Chrome users) you want to find out if the segments convert at different rates. But remember, our data scientists already solved that, and it didn't help them solve the problem they were given. What they want to determine is whether there's an interaction between their segmentation variable and their assignment variable when applied to conversion.

You can imagine this is interesting in general any time you have an A/B test. You may know that treatment B converted better than treatment A. But it would be nice to know whether there was a particular cohort who particularly preferred B. Perhaps there's even a small cohort who preferred treatment A.

The simplest model you can look at is just the global result without segmentation: the group with treatment A converted at 20% and the group with treatment B converted at 25%, so treatment B increases conversion by 5 percentage points. The second simplest model is to take a single split of the population and determine whether the two subpopulations differ with statistical significance with respect to the property you care about. Say Segment 1 had conversion rates of 15% and 20% for A and B respectively, while Segment 2 had conversion rates of 30% and 40%. Did B have a bigger effect over A in Segment 1 or Segment 2? (The answer will vary depending on what we mean by "effect".)
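To see why the answer depends on the definition, here is a quick sketch in plain Python (using the hypothetical rates above) computing three common effect measures per segment:

def effects(p_a, p_b):
    # Three ways to quantify the effect of B over A for one segment.
    odds_a, odds_b = p_a / (1 - p_a), p_b / (1 - p_b)
    return {"difference (points)": p_b - p_a,
            "relative lift": p_b / p_a - 1,
            "odds ratio": odds_b / odds_a}

print(effects(0.15, 0.20))  # Segment 1: +5 points, +33% lift, odds ratio ~1.42
print(effects(0.30, 0.40))  # Segment 2: +10 points, +33% lift, odds ratio ~1.56

By raw difference and by odds ratio Segment 2 wins, but by relative lift the two segments tie exactly. Which segment "preferred" B depends on the definition you pick.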

Consider what data you need to collect. For each of the two segments you need the number of conversions (henceforth referred to as target) for each treatment, as well as the number of non-conversions (non-target) for each treatment. This might look something like:

A/B   segment   target   non-target   total    target:non-target odds
A     1         45       455          500      0.10
B     1         54       446          500      0.12
A     2         130      870          1,000    0.15
B     2         138      862          1,000    0.16

From this you may make some charts to try to tell what is going on:

Well, segment 2 certainly has more targets than segment 1 in both A and B. But it also has more non-targets. And did B go up more in segment 2? Maybe it's better to look at some percentages.

Ok, now we can tell segment 2 definitely has a higher target rate. And treatment B has a higher target rate in both segments. But did treatment B have more marginal impact in segment 1 or segment 2? For that we look at the ratio of the target odds between B and A in each segment.

Great, we can see treatment B has more of an impact on target rate over treatment A in Segment 1 than in Segment 2. In Segment 1, the odds of conversion increased 20% when going from treatment A to treatment B, but only increased 7% in Segment 2.
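In code, that last step is just a ratio of the odds column from the table:

# Odds (target / non-target) from the table above.
odds = {("A", 1): 45 / 455, ("B", 1): 54 / 446,
        ("A", 2): 130 / 870, ("B", 2): 138 / 862}

for seg in (1, 2):
    print(f"segment {seg}: odds ratio {odds[('B', seg)] / odds[('A', seg)]:.3f}")
# segment 1: odds ratio 1.224  (the ~20% above, from the rounded odds)
# segment 2: odds ratio 1.071  (~7%)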

But there are so many moving parts. We started with 8 numbers (target/non-target counts for each of two segments and two treatments) and kept dividing them in various ways. Each of those 8 numbers is just a sample with error bounds. How do we know if we have enough volume? Can we state any conclusion with any confidence at this point?

And that's where the promised formula comes in. It's not pretty to look at, so I'm going to make you click through if you want to see it. Technically, it is the maximal log likelihood of a given segment when an impact, or effect, is fixed; in this case the impact is the difference between the ln odds observed in treatment B and the ln odds observed in treatment A. Given this, you can compute a log likelihood ratio (LLR) comparing the assumption that both segments have the same impact against the assumption that they have distinct impacts.
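For those who want the shape of it without clicking through, here is my reconstruction from that description (the notation is mine, not necessarily the linked formula's). With $k_{t,s}$ targets out of $n_{t,s}$ visitors under treatment $t$ in segment $s$, a baseline log odds $\alpha_s$ per segment, an effect $\delta_s$, and $\sigma$ the logistic function:

$$\ell_s(\alpha_s, \delta_s) = \sum_{t \in \{A,B\}} \Big[ k_{t,s} \ln p_{t,s} + (n_{t,s} - k_{t,s}) \ln(1 - p_{t,s}) \Big], \qquad p_{A,s} = \sigma(\alpha_s), \; p_{B,s} = \sigma(\alpha_s + \delta_s)$$

$$\mathrm{LLR} = \max_{\alpha_1,\delta_1,\alpha_2,\delta_2} \big[\ell_1(\alpha_1,\delta_1) + \ell_2(\alpha_2,\delta_2)\big] \;-\; \max_{\alpha_1,\alpha_2,\delta} \big[\ell_1(\alpha_1,\delta) + \ell_2(\alpha_2,\delta)\big]$$

Here $\delta_s$ is exactly the difference of ln odds between B and A in segment $s$, and the second maximization forces both segments to share a single $\delta$.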

That's quite a mouthful, but for those interested I've provided the derivation using sagemath (which is only a handful of lines, given that sagemath does most of the heavy lifting for us).

What's actually useful is that, using this derivation, I've written a simple calculator to compute these LLR values. Feel free to take and modify this code.
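If you'd rather not click through, here is a rough stand-in for the same computation. This is my own sketch, assuming the binomial model reconstructed above and letting scipy do the constrained maximization; it is not the calculator's actual code.

import numpy as np
from scipy.optimize import minimize

def cell_ll(k, m, p):
    # Binomial log likelihood for k targets and m non-targets; the
    # binomial coefficient is dropped since it cancels in the ratio.
    return k * np.log(p) + m * np.log(1 - p)

def max_ll_free(data):
    # Distinct impacts: every cell sits at its own MLE p = k / (k + m).
    return sum(cell_ll(k, m, k / (k + m)) for seg in data for k, m in seg)

def max_ll_shared(data):
    # Same impact: one baseline log odds per segment plus a single
    # effect delta, on the log-odds scale, shared by all segments.
    def neg_ll(theta):
        *alphas, delta = theta
        total = 0.0
        for alpha, ((ka, ma), (kb, mb)) in zip(alphas, data):
            pa = 1.0 / (1.0 + np.exp(-alpha))
            pb = 1.0 / (1.0 + np.exp(-(alpha + delta)))
            total += cell_ll(ka, ma, pa) + cell_ll(kb, mb, pb)
        return -total
    res = minimize(neg_ll, np.zeros(len(data) + 1), method="Nelder-Mead")
    return -res.fun

def llr(data):
    # data[segment] = [(targets_A, non_targets_A), (targets_B, non_targets_B)]
    return max_ll_free(data) - max_ll_shared(data)

first_example = [[(45, 455), (54, 446)],
                 [(130, 870), (138, 862)]]
print(llr(first_example))  # roughly 0.12, in line with the value quoted below

Since the shared-impact model has exactly one fewer free parameter, twice the LLR can be read against a chi-square distribution with one degree of freedom if you want a conventional p-value (assuming the usual asymptotics apply).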

For the above example the calculator gives us an LLR of 0.118, which is quite small, so we would say there is no difference in the strength of the impact of treatment B over A in segment 2 compared to the same in segment 1. That is to say, we cannot say with confidence that there is an interaction between the assignment variable and the segmentation variable upon our target.

It's interesting to note that there is a significantly different target rate between our two segments. The target rate goes from 10% to 13.4%. If we were to run a simple t-test between our two segments with respect to target we would find they are different with statistical significance. But remember, we're not interested in separating the high converters from the low converters.
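For comparison, that segment-versus-segment check is a one-liner; here statsmodels' two-proportion z-test stands in for the t-test mentioned above:

from statsmodels.stats.proportion import proportions_ztest

# Pooled targets and totals per segment, from the first table above.
stat, pvalue = proportions_ztest(count=[45 + 54, 130 + 138], nobs=[1000, 2000])
print(pvalue)  # about 0.006: the segments' overall target rates really do differ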

But what if we tweak the numbers a bit and increase the volume:

A/B   segment   target   non-target   total    target:non-target odds
A     1         455      4,545        5,000    0.10
B     1         652      4,348        5,000    0.15
A     2         1,304    8,696        10,000   0.15
B     2         1,228    8,772        10,000   0.14

Now we get an LLR of 19.07. This is getting high enough to be convincing that there is an effect, despite the graphs looking very similar:

The difference is there's more volume and a slightly larger effect. But the actual conversion rates of the segments got closer, at 11% and 12.6%.

However, this isn't just a measurement of volume. We can tweak the numbers again with similar volumes and end up with an LLR of 0.048.
A/B   segment   target   non-target   total    target:non-target odds
A     1         455      4,545        5,000    0.10
B     1         652      4,348        5,000    0.15
A     2         1,304    8,696        10,000   0.15
B     2         1,803    8,197        10,000   0.22

The LLR dropped because now the ratios of odds are very close. So while we have lots of volume, and again the target rates are very different, the impact of B over A is the same in both segments: it increases the odds by roughly 50% in each. So the segments are not different by this measure even though the segment target rates are the most different we've seen so far: 11% and 15.5%. (You can find all the data above in a Google sheet.)
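Running the sketch llr function from earlier over both tweaked tables reproduces this pattern:

more_volume = [[(455, 4545), (652, 4348)],
               [(1304, 8696), (1228, 8772)]]
same_impact = [[(455, 4545), (652, 4348)],
               [(1304, 8696), (1803, 8197)]]
print(llr(more_volume))  # roughly 19: strong evidence of an interaction
print(llr(same_impact))  # roughly 0.05: no evidence, despite all the volume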

None of this is that novel in the world of statistics. All of it can be done by looking at the significance of the interaction coefficient in an ANOVA table for a two-feature logistic regression, as sketched below. But using the simple code in this calculator you can automate the hunt for the segmentations you care about. Or you can take the formula, and a decision tree modeler like brushfire, and build a tree that splits on segments where the impact is highest or lowest. Using this model our data scientists may be able to solve the actual problem they were asked to solve and give away those stickers.
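Here is what that regression route looks like as a sketch (using pandas and statsmodels; the column names are mine):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per (treatment, segment) cell, counts from the first example.
df = pd.DataFrame({"treatment": ["A", "B", "A", "B"],
                   "segment": [1, 1, 2, 2],
                   "target": [45, 54, 130, 138],
                   "non_target": [455, 446, 870, 862]})

# Binomial GLM (logistic regression) with an interaction term; the p-value
# on the treatment:segment row answers the same question as the LLR.
fit = smf.glm("target + non_target ~ treatment * C(segment)",
              data=df, family=sm.families.Binomial()).fit()
print(fit.summary())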

Try the calculator below and let me know if you get any surprising results.


