Sunday, February 10, 2013

Piece wise linear trends in the browser

Somehow I never blogged about the Javascript implementation of l1tf released by my friend Avi Bryant and myself. l1tf is a way to find piece wise linear trends in time series data.

Monday, February 04, 2013

Simple cross domain tracking

I hear of some really complicated schemes from time to time to track users across multiple domains that belong to a single site. While I'm sure they mostly work it seems like there's a simple way to do this that I assume many people are already using but is probably too boring to comment on. So, let's be boring for a moment.

Let us say you own eggs.com, bacon.com, and coffee.ca. When a user visits eggs.com he is assigned a unique tracking token in the eggs.com cookie (we'll call it [tracking-token-eggs]). At some point after that token is assigned, include it in the page requests to //bacon.com/tracking.gif?token=[tracking-token-eggs]&domain=eggs.com, and //coffee.ca/tracking.gif?token=[tracking-token-eggs]&domain=eggs.com. (Create the same setup for visitors to bacon.com and coffee.ca).

If the browser already has a token stored in the bacon.com or coffee.ca cookies you will now have a request that includes both domains and both tokens; both domains are in the url, one token is in the url and the other token is in the cookie of the request. The first domain is also in the referrer/referer. This works even if 3rd party cookies are blocked (at least in the browsers I've tried). Now you can store this request in a database table or just a log file.

If you want to do something slightly more complicated that involves javascript you can alter the technique to use iframes instead of gifs. Just don't try to create or store any new tokens in the iframe from the foreign domain because this is when techniques fail.

[Edit: I should add that this is a technique for when you have half a dozen domains or so. Not for hundreds of domains.]

Monday, January 28, 2013

On calculating Fibonacci numbers in C

A few months ago Evan Miller wrote an essay called The Mathematical Hacker. While an interesting post he does make a mistake when he gives the "proper way" to calculate the Fibonacci numbers.

The essay claims that you shouldn't use the tail-recursive method you would learn in a CS class to compute the Fibonacci numbers because, as any mathematician knows, an exact analytical solution exists. His C example looks like:
But there are actually a few more optimizations I picked up while studying linear recurrence sequences that I thought I'd share. The first drops the time almost by half:
The reason why this works is because the last part of the expression (the (0.5 - 0.5 × sqrt(5.0))/sqrt(5.0) part) has a magnitude less than 0.5 so you get the same result just by using the round.

Note I benchmarked these with: using gcc fibonacci.c -O2 -o fibonacci

Using these benchmarks I get 12006 ms vs 7123 ms. And the validation number matches as 0x6744

But there's yet another optimization:
That's right, we can do even better by using the tail-call recursion method dismissed in the essay. Now we get a time of 2937 ms.

For the observant of you you'll notice that what my benchmark does is just recalculates the first 40 Fibonacci numbers over and over again while summing them and taking the last 4 hex digits of the sum for validation. (It's not just for validation. We also do this because if you gcc with -O2 and you don't do anything with the output gcc is smart enough to skip the whole step. We need -O2 so gcc will recognize the tail-call.)

You could call foul on me right now. After all the reason why the analytic approach is slower is because of pow, and pow gets way more efficient with larger exponents.

Alright, fair enough. Let us run the test again except we'll sum the first 90 Fibonacci numbers instead (not much point going much past 90 since the 96th Fibonacci number is the largest to fit in an unsigned long int). So we update the code to r = (r + f(i % 90)) % 0x10000;

Now we get 7795 ms for the recursive solutions and 12840 ms and 7640 ms for the analytical solutions. I ran the benchmark a few times and the recursive method is consistently faster statistically but I think that 2% faster has to be within the gcc optimization margin of error.

But there's something else to notice. For the two analytic solutions the validation number is 0x2644 but for the recursive solution it is 0x2f9c. Two against one right? Well, votes don't count in math and dictatorships.

What happened is at the 71st Fibonacci number both analytical solutions lost precision. This is because C doesn't check what we're trying to do. It does what we tell it to do. And we told it to take a float approximation of an irrational number, with only the precision a float has, and take it to a power.

I do want to stop here a moment and say I'm not pointing out this error as a gotcha moment or as evidence that Evan Miller is poor at math. I think How Not To Run An A/B Test is an incredibly important essay and should be understood by anyone who is using A/B test results. Also if you are doing statistics on a Mac you probably should have bought Wizard by now.

However, I do think this mistake illustrates an important lesson. If we tell programmers that the more math (or should that be Math) they use the better programmer they are, we are setting up unnecessary disasters.

One reason is because virtually no programmer spends a majority of their time doing things that look like Math. Most spend 99.5% doing things that don't look like Math. If a programmer takes this message to heart then they are going to spend a lot of time feeling like they aren't a true programmer; which is silly and bad for its own sake.

The other issue is that a focus on better programming looking like Math can be a major distraction. And it can lead to really silly time wasting debates (eg https://github.com/twitter/bijection/issues/41).

But most dreadfully if we tell programmers that they should give more weight to the more mathematical solutions they will often not choose the best solution for their context. I've certainly not given the best solution for finding Fibonacci numbers for all contexts. Heck, I'd bet you could get better results for my own benchmark by using memoization (for the record there's a memoization technique for both recursion and the analytical solution -- but it's easier with the recursion solution).

My solution wouldn't be that all programmers learn more Math. My solution would be that it is good to be part of a group where different people know different things. And we should take advantage of this knowledge base by not being embarrassed to ask questions. I have a number of friends who send me emails from time to time that just has the subject "math question." And I all the time send emails with the subject "programming questions," "statistics question," "writing question," or even "math question." I find it works really well for me.

So no, I don't think every programmer needs to be taught more Math. Except for linear algebra of course. Everyone should learn more linear algebra.

(You can download the full source for these examples from github.)

Saturday, November 24, 2012

How to get random lines out of a file or piped stream

Several months ago Aaron Olson, Camilo Lopez, and I were sitting around drinking beers (after 4pm on a Friday at the Shopify office; that's what you do after making ecommerce software). And we were griping how there wasn't an easy way to get a random sample of lines out of a log file or a stream for testing purposes. Sure, there's head and there's tail. But we wanted random lines and we didn't want to have to think about it.

So we made a solution: dimsum. The usage is in the readme. You install it with gem (so ruby is required). And then just use it like head or tail. Submit issues to github.

PS dimsum uses reservoir sampling so you can pipe right to it.

Thursday, May 03, 2012

Analyst, measure thyself

The other day I had a nice example of my work failing that I thought I would share. At Shopify I have several models that predict all sorts of future behaviours of sellers, purchasers, signups, etc. Earlier this year there was a change in the signup population that dramatically reduced the effectiveness of one of these models.

What's interesting is I didn't discover this myself. One of our VPs did.

It is certainly never fun to have a co-worker discover that some of your work is inadequate. It is even less fun when that co-worker is a VP. But what was cool was that he found out by looking at a self-updating report; a report that I wrote.

What I had made was a web page that broke signups from a month ago up into five groups based on the model. The groups were ordered by what the model at the time thought were the chances that members of each group would convert into paying customers. For each group I listed the rate that each group actually did convert. In an ideal world the bottom group would have converted at close to 0% and the top group would have converted at close to 100%. Let's just say that when this VP looked at this page the conversion rate of the bottom group was substantially higher than 0% and the conversion rate of the top group was substantially lower than 100%.

I'd love to say that my initial response was to accept that my model was no longer performing. Instead I tried to explain why this report was misleading. I insisted that the situation was complicated and you couldn't just look at a simple table like this and understand what was going on. But it didn't matter because the evidence was too convincing and it showed that the situation was pretty simple. The evidence said the model must be rebuilt and the evidence had come from me.

Most analysts are perfectly comfortable with the idea that the best way to know if a marketing campaign is succeeding, or if a design change is making customers happier, is to measure the results and then try to prove to the opposite. And yet we often fail to hold our own work up to the same scrutiny.

With that in mind I recommend your modeling projects include these steps:

  • Have quality tests for all the models you create. The tests need to be simple and require minimal explanation. It should be easy for any other analyst to replicate your test just by looking at how you present your test report. When you are done a model you need to be able to say "this is what I have done, and here is my proof."
  • Have fit tests for all the techniques you used. For example if you are using a curve-fitting technique to predict the future, show how that technique would predict the present using past data. You also need to show that your selection model type is a good choice. Even if you've used SVM models to predict these sorts of results in the past you still need to show that SVMs work with this data.
  • Write your tests before you start. This way you can't later avoid making the tests that expose your weak spots. Also this helps you know when you are done. If your test results haven't improved in the last few days then your work has stopped accomplishing anything. Plus it is motivating to be able to see your test numbers improve as you work. I'm pretty sure test driven modeling has been practiced longer than test driven development.
  • Even your tests should have tests. They need to reconcile with your accounting numbers. If your test reports revenue from customers the revenue total should reconcile with your reported revenue numbers.
  • Your tests need to be able to alert someone if they start failing. I dropped the ball here last time. 
  • Make your tests public. This takes away your option in the future to decide that this test doesn't apply. More importantly, the users of your model are going to have a much better understanding of what it does if they can see its past behaviour. Even better would be to make your tests public from day one.

Note that these tests are not a substitution for model validation. There will probably be overlap in types of tests; but you still need to have a separate validation hold out.

Finally consider this: if the defense of your model is a description of the techniques you used then you're doing it wrong. The defense of your model is presenting its performance. If your model is useful and it performs well then people will ask you about your techniques. You don't show that a new glass technology is unbreakable by describing the molding process. You show it by giving a man a baseball bat and letting him try to break it. And if he fails, give him a bigger bat.

Saturday, April 28, 2012

Me on TV

I've been bad at keeping this blog up to date so this is a little old. However, a few months ago I was on TVO's The Agenda talking about the work of a data scientist on an episode called "Big Bad Data".

You can skip forward to about 9:30 if you want to see just me. But the first interview Andrew McAfee is quite good so you should watch it too. In fact, you should watch it instead.

Friday, April 27, 2012

A scary thought about marketing attribution models

I was reading through Google's white paper on marketing attribution and a scary thought occurred to me which I tweeted. I'll self referentially quote my tweet here.

So what do I mean by this? Well, the paper talks about the different attribution models various companies use for click tracking that include first click, last click, time decay, linear, etc.

That's a lot of choices for a model without a clear way of deciding what's right. How to decide?

Well, the first step would seem to be do some quick mock up models comparing the result of using the different types. If you're lucky then the results will be pretty close to each other. That is your model will be insensitive to your choice. In that case it doesn't matter which type you choose so choose whichever one is the least amount of work and move on to the next project. If someone wants to argue over the choice let them win because the decision is literally not worth the time of the argument.

But what if you are unlucky? What if under a linear model organic unbranded search gets the attribution for 20,000 signups but under first click it gets the attribution for 12,000 signups? Then you're in serious trouble. Because while it seems like you are deciding between linear and first click you are actually deciding on how many signups to attribute to organic unbranded search. You might as well get rid of the guise of a model and just write down numbers for how many attributions you feel you should give to each channel. After all that is what is happening now. You are not modeling; you are deciding.

Fortunately, there actually is a bit of an end run around this problem. And that is to test which model actually lines up with your data. That is to make the decision non-subjective again.

I have a few options on how to do this which I will talk about in a future post. But the point is this: if you can't test the results of your model against your data (in a simple way) then you are probably better off not making a model.

Or to put it more bluntly: if you don't have a test then what you've built isn't actually a model. Not one that matters at least.