Saturday, November 24, 2012

How to get random lines out of a file or piped stream

Several months ago Aaron Olson, Camilo Lopez, and I were sitting around drinking beers (after 4pm on a Friday at the Shopify office; that's what you do after making ecommerce software). We were griping about how there wasn't an easy way to get a random sample of lines out of a log file or a stream for testing purposes. Sure, there's head and there's tail. But we wanted random lines, and we didn't want to have to think about it.

So we made a solution: dimsum. The usage is in the readme. You install it with gem (so Ruby is required) and then just use it like head or tail. Submit issues on GitHub.

PS: dimsum uses reservoir sampling, so you can pipe right to it. Reservoir sampling only has to keep the sample itself in memory, not the whole stream.
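
For anyone who hasn't run into reservoir sampling before, here is a minimal sketch of the idea in Python. This is not dimsum's actual code (dimsum is a Ruby gem); the function name `reservoir_sample` and its parameters are made up for illustration. It shows the classic single-pass version: keep the first k lines, then let each later line replace a random slot with decreasing probability.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable.

    Makes a single pass and keeps only k items in memory, so it works on
    streams whose length is not known in advance.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Pick a slot in 0..i (inclusive). The new item lands in the
            # reservoir with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```

To try it on a pipe, something like `sys.stdout.writelines(reservoir_sample(sys.stdin, 10))` should do (after `import sys`). Note that the sampled lines do not come out in their original order.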

4 comments:

Jonathan said...

What's the difference from "shuf"?

steven said...

I must confess we didn't know about shuf before making this. That said, we've since looked at the source, and they permute the entire list in memory and then select the first few rows. The difference is that the implementation here is done with reservoir sampling (which could be added to shuf).

Rob De Almeida said...

You could also use awk for this:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

The example above prints ~1% of the lines.
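
For comparison with the reservoir approach, here is a rough Python equivalent of the awk one-liner above. The name `bernoulli_sample` is made up for illustration. Each non-empty line is kept independently with probability p, so the number of lines you get back varies from run to run instead of being a fixed count.

```python
import random


def bernoulli_sample(lines, p, rng=random):
    """Yield each non-empty line independently with probability p."""
    for line in lines:
        # Mirror awk's !/^$/ test: skip lines that are completely empty.
        if line.rstrip("\n") == "":
            continue
        # Mirror awk's rand() <= .01 test, with p in place of .01.
        if rng.random() <= p:
            yield line
```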

malcook said...

So, I posted to the coreutils mailing list to see who might take the bait: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079.html