Bloom Filters in Practice

January 22, 2016    statistics programming python probability

A filter in the sky

Today we’re going to talk about what a Bloom filter is and discuss some of the applications in data science. In a later post, we’ll build a simple implementation with the goal of learning more about how they work.

What is a Bloom Filter?

A Bloom filter is a probabilistic data structure. Let’s break that term down. Any time you hear the word “probabilistic” the first thing that should come to mind is “error.” That is, it sometimes has errors. When you hear “data structure” you should think about “space,” more specifically storage space or memory.

Bloom filters are designed to answer questions of set membership, that is, “is this item one of X?” Here are some simple questions you might be working with if you were considering them:

  • Have we seen this email address sign up to our site recently?
  • Is this a product the user has bought before?
  • Are the registrations we’re seeing from this IP address on a whitelist of IPs that we can trust?

Basically, it compresses simple set membership information into a smaller memory footprint at the cost of a bit of error.

By design, Bloom filters only implement two types of operations, add and contains. So, once you’ve added a member to it, you wouldn’t be able to remove it. Additionally, you wouldn’t be able to query for a list of elements in it.

So, why would you be okay with error in data in exchange for using less memory? We can answer that question (as well as better understand the use-case of set membership) with a few useful examples.

How are Bloom filters used?

It would be hard not to mention this Medium article on the subject because it clarifies how they apply to a data science task. At Medium, they have some version of a distributed data store. Now, distributing data helps you scale out information across servers, but it also has a chance of increasing your variance in expected response times (read: more requests might take longer to finish running). Most web companies focus on making the user experience as pleasant as possible, so they value response time. In this case, one particular request that was important was the set of articles the particular user had read before.

At risk of retelling the already well-told story, suffice it to say that they used a Bloom filter to prevent recommending to the user articles he or she had read before. This was a case where they used a data science model (the store recommendation engine) and they augmented it by including a component that could use compressed information (the Bloom filter) to prevent the user from needing to see the same article twice. Moreover, even though there’s potential for error in that filter, that error is negligible (in the sense that the user’s experience isn’t hinder by it). To summarize, because they could efficiently represent the set of “articles this user had read before” and since it had a defined error rate, they could improve their user experience.

As another example, think about buying items on Amazon has almost any item imaginable and probably also distributes their data for scale. They’re still able to tell you right on the product page whether you’ve bought an item before and when. I don’t have any insight into what’s going on behind the scenes but this is another perfect place for a Bloom filter by using one at the user level (holding the set of products someone has bought) or at the product level (holding a set of all the users who have bought the product). A negative match (which will be correct 100% of the time) means operationally you don’t need to perform that database lookup to see if someone bought this item. A positive match (which will probably be rare most of the time) will be the only time when you confirm and go see that transaction data.

Finally, I wanted to point out a use-case that I found by perusing various implementations of the tool. Bloom filters can also be used to track time-dependent information (or various forms of time series data). One thing you could do is store aggregate level information (like whether someone bought a particular product in a given time horizon like the last 30 or 90 days) in a Bloom filter. Then, based on that information, you can make modeling decisions like what sort of ads you show this person.

I hope this post helped you learn a little bit about Bloom filters. In a later post, I’ll go into some detail on how they’re implemented with a focus on pedagogy. In the meantime, follow me on WordPress or Twitter for updates.

Additional links to check out

  • A Python package from Moz on using Redis as a backend (they were using it to ensure they didn’t crawl the same websites multiple times)
  • Another Python implementation optimized for scaling
  • A very detailed article of several other probabilistic data structures
  • A great bitly post on the subject as well as their own implementation (which also supports removing set members)