0:02

So, before we embark on the analysis, what are we hoping to understand? Well, it seems intuitively clear that there is going to be some trade-off between the two resources of the bloom filter. One resource is space consumption; the other resource is essentially correctness. The more space we use, that is, the larger the number of bits, the fewer errors we'd hope to make. And as we compress the table more and more, reusing bits across more and more different objects, presumably the error rate is going to increase. So, the goal of the analysis we're about to do is to understand this trade-off precisely, at least at a qualitative level. Once we understand how the trade-off between these two resources works, we can ask: is there a sweet spot which gives us a useful data structure, with quite small space and a quite manageable error probability? So the way we're going to proceed with the

analysis will be familiar to those of you who watched the open addressing video about hash tables. To make the mathematical analysis tractable, I'm going to make a heuristic assumption, a strong assumption which is not really satisfied by the hash functions you would use in practice. We're going to use that assumption to derive a performance guarantee for bloom filters, but as with any implementation, you should check that your implementation is actually getting performance comparable to what the idealized analysis suggests. That said, if you use a good hash function and you have non-pathological data, the hope, and this is borne out in many empirical studies, is that you will see performance comparable to what this heuristic analysis suggests. So, what is the heuristic assumption? Well, it's going to be familiar again from my hashing discussions.

We're just going to assume that all the hashing is totally random. So, for each choice of a hash function hi and for each possible object x, the slot, that is, the position of the array which the hash function gives for that object, is uniformly random, and it's independent of all other outputs of all hash functions on all objects. So the setup, then, is we have n bits. We have a data set S which we have inserted into our bloom filter. Now, our eventual goal is to understand the error rate, the false positive probability. That is the chance that an object which we haven't inserted into the bloom filter looks as if it has been inserted into the bloom filter. But as a preliminary step, I want to ask about the population of 1s after we've inserted this data set S into the bloom filter. So, specifically, let's focus on a particular position of the array, and by symmetry it doesn't matter which one. And let's ask: what is the probability that a given bit, a given position of this array, has been set to 1 after we've inserted this

entire data set S? Alright, so this is a somewhat difficult quiz question, actually. The correct answer is the second answer: it's 1 minus the quantity (1 - 1/n) raised to the power k|S|, where k is the number of hash functions and |S| is the number of objects. That's the probability that, say, the first bit of the bloom filter has been set to 1 after the data set S has been inserted. Maybe the easiest way to see this is to first focus on the first answer. The first answer, I claim, is the probability that the first bit is 0 after the entire data set has been inserted. Then, of course, the probability that it's a 1 is just 1 minus that quantity, which is equal to the second answer. So we just need to understand why the first choice is the probability that the first bit equals 0. Well, it's initially 0; remember, bits are only set from 0 to 1. So we really need to analyze the probability that this first bit survives all of the darts that get thrown at the bloom filter over the course of this entire data set being inserted. So

there are |S| objects, and on each insertion, k darts are, in effect, thrown uniformly at random and independently of each other at the array, at the bloom filter. Any position a dart hits gets set to 1. Maybe it was 1 already, but if it was 0, it gets set to 1; if it's 1, then it stays 1. So, how is this first bit going to stay 0? It has to be missed by all of the darts. A given dart is uniformly likely to hit any of the n bits, so the probability that it lands on this particular bit is only 1/n. And the probability that it fortunately hits somewhere else? Well, that's 1 - 1/n. So you have a chance of surviving a single dart with probability 1 - 1/n. And there are k|S| darts being thrown: k for each of the |S| objects.

Right, k per object that gets inserted, so the overall probability of eluding all of the darts is (1 - 1/n) raised to the power k|S|, where k is the number of hash functions and |S| is the number of insertions. And the probability that the bit is 1 is 1 minus that quantity, which is the second option in the quiz. So, let's go ahead and resume our analysis using the answer to that quiz. What do we discover? The probability that a given bit is 1 is 1 - (1 - 1/n)^(k|S|), where n is the number of positions, k the number of hash functions, and |S| the number of insertions. That's a kind of messy quantity, so let's recall a simple estimation fact that we used once earlier. You saw this when we analyzed Karger's randomized contraction algorithm and the benefit of multiple repetitions of that algorithm. The trick here is to estimate a quantity of the form 1 + x (or 1 - x) by e^x (or e^(-x), as the case may be). So you take the function 1 + x, which goes through the points (-1, 0) and (0, 1).

And of course it's a straight line. Then you also look at the function e^x. Those two functions are going to kiss at the point (0, 1), and everywhere else e^x is going to be above 1 + x. So for any real value of x, we can always upper bound the quantity 1 + x by e^x. So let's apply this fact to the quantity here, (1 - 1/n)^(k|S|). We're going to take x to be -1/n, which gives us the estimate 1 - e^(-k|S|/n) for this probability (strictly speaking, since e^x is at least 1 + x, e^(-k|S|/n) is an upper bound on the probability that the bit is 0), okay? So that's taking x to be -1/n.
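As a sanity check on this step, here is a quick sketch (the parameter values are mine, chosen so that b = n/|S| = 8) comparing the exact probability, the e^x-based estimate, and a dart-throwing simulation under the totally-random-hashing assumption:

```python
import math
import random

n, s, k = 96, 12, 6      # n bits, |S| = 12 insertions, so b = n/|S| = 8

# Exact probability that a fixed bit is still 0 after all k*|S| darts.
p_zero_exact = (1 - 1 / n) ** (k * s)
# Upper-bounding (1 - 1/n) by e^(-1/n) gives e^(-k|S|/n) = e^(-k/b).
p_zero_bound = math.exp(-k * s / n)

# Monte Carlo: throw k*|S| uniform darts, count how often bit 0 is missed.
random.seed(0)
trials = 20000
survived = sum(
    all(random.randrange(n) != 0 for _ in range(k * s))
    for _ in range(trials)
)
p_zero_sim = survived / trials
print(p_zero_exact, p_zero_bound, p_zero_sim)
```

For these values all three numbers land around 0.47, and the exact value sits just below the exponential estimate, as the 1 + x ≤ e^x picture predicts.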

Let's simplify a little further by introducing some notation. I'm going to let b denote the number of bits that we're using per object. This is the quantity I was telling you to think about as 8 previously; it's the ratio of n, the total number of bits, to the cardinality of S. So this green expression becomes 1 - e^(-k/b), where b is the number of bits per object. And now we're already seeing the type of trade-off we were expecting. Remember, we're expecting that as we use more and more space, the error rate should go down, and that if you compress the table a lot, reusing bits for lots of different objects, that's when you'll start seeing a lot of false positives. So in this light blue expression, if you take the number of bits per object, the amount of space, little b, and let it grow very large, tending to infinity, then this exponent tends to zero, and e^(-0) is 1. So overall, this probability of a given bit being 1 is tending to zero. That is, the more bits you have, the more space you have, the smaller the fraction of 1s and the bigger the fraction of 0s. That should translate to a smaller false positive probability, as we will make precise on the next and final slide. So

let's rewrite the upshot from the last slide: the probability that a given bit is equal to 1 is approximately 1 - e^(-k/b), where k is the number of hash functions and b is the number of bits we're using per object. Now, this is not quite the quantity we care about. The quantity we care about is the false positive probability, where something looks like it's in the bloom filter even though it's never been inserted. So let's focus on some object x, some IP address, say, which has never been inserted into this bloom filter. For a given object x which is not in the data set, which has not been inserted into the bloom filter, what has to happen for us to have a successful lookup, a false positive, for this object? Well, each one of its k bits has to be set to 1. We already computed the probability that a given bit is set to 1. So what has to happen is that all k of the bits that indicate x's membership in the bloom filter have to be set to 1. So we just take the quantity we computed on the previous slide and raise it to the kth power, indicating that it has to happen k different times. So, believe it or not, we now have exactly what we wanted.
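The quantity just derived, (1 - e^(-k/b))^k, is easy to check numerically. Here is a sketch (all parameter values are illustrative, and the deterministic pseudo-random "hash" is my stand-in for k ideal hash functions) that compares the heuristic estimate against a simulated filter:

```python
import math
import random

def fp_estimate(b: float, k: int) -> float:
    # Heuristic false positive rate: each of k probed bits is 1 w.p. ~ 1 - e^(-k/b).
    return (1 - math.exp(-k / b)) ** k

n, s, k = 800, 100, 6                 # b = n/|S| = 8 bits per object

def positions(obj: int) -> list:
    # Model the k hash values of an object as independent uniform positions.
    return [random.Random(obj * 1_000_003 + i).randrange(n) for i in range(k)]

bits = [0] * n
for obj in range(s):                  # insert objects 0 .. |S|-1
    for pos in positions(obj):
        bits[pos] = 1

probes = 20000                        # probe objects that were never inserted
false_pos = sum(all(bits[p] for p in positions(obj))
                for obj in range(s, s + probes))
print(false_pos / probes, fp_estimate(8, 6))
```

With 8 bits per object and 6 hash functions, both numbers typically come out around two percent.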

What we set out to do was derive a qualitative understanding of the intuitive trade-off between, on the one hand, the space used and, on the other hand, the error probability, the false positive probability. So we're going to give this green circled quantity a name: we'll call it epsilon, for the error rate, and again, all errors are false positives. And again, as b goes to infinity, as we use more and more space, this exponent goes to zero, so 1 - e raised to that quantity goes to zero as well. And of course, once we raise it to the kth power, it gets even closer to zero. So the bigger b gets, the smaller this error rate epsilon gets. Now let's get to the punch line. Remember, the question is: is this data structure actually useful? Can we actually set all of the parameters in a way that gives us both really usefully small space and a tolerable error epsilon? And, of course, we wouldn't be giving this video if the answer wasn't yes. Now, one thing I've been alluding to all along is how we set k. How do we choose the number of hash functions?

I told you at the very beginning to think of k as a small constant, like 2, 3, 4, or 5. And now that we have this really nice qualitative version of how the error rate and the space trade off against each other, we can answer how to set k: namely, set k optimally. So what do I mean? Well, fix the number of bits that you're using per object: 8, 16, 24, whatever. For fixed b, you can just choose the k that minimizes this green quantity, that minimizes the error rate epsilon. So, how do you minimize this quantity? Well, you do it just like you learned in calculus, and I'll leave that as an exercise for you to do in the privacy of your own home. But for fixed b, the way to make this green quantity epsilon as small as possible is to set the number of hash functions k to be roughly the natural log of 2, a number less than 1, about 0.693, times b. So, in other words, the number of hash functions in the optimal implementation of the bloom filter scales linearly with the number of bits that you're using per object; it's about 0.693 times the bits per object. Of course, this is generally not going to be an integer, so you just pick k to be either this number rounded up or rounded down.
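If you'd rather skip the calculus exercise, you can check the claim numerically: for a fixed b, scan the integer values of k and watch the error estimate (1 - e^(-k/b))^k bottom out right next to b times the natural log of 2 (the values of b below are just illustrative):

```python
import math

def fp_estimate(b: float, k: int) -> float:
    # Heuristic false positive rate for b bits/object and k hash functions.
    return (1 - math.exp(-k / b)) ** k

for b in (8, 16, 24):
    # Find the integer k that empirically minimizes the error estimate.
    best_k = min(range(1, 3 * b), key=lambda k: fp_estimate(b, k))
    print(b, best_k, round(b * math.log(2), 2))   # best_k lands next to b*ln(2)
```

For b = 8 this prints a best k of 5 or 6, straddling 8 × 0.693 ≈ 5.55, matching the rounding advice above.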

Continuing the heuristic analysis, now that we know how to set k optimally to minimize the error for a given amount of space, we can plug that value of k back in and see how the space and the error rate trade off against each other, and we get a very nice answer. Specifically, we get that the error rate epsilon, under an optimal choice of the number of hash functions k, decreases exponentially in the number of bits that you use per object: it's roughly one half raised to the power 0.693 times b, where 0.693 is the natural log of 2 and b is the number of bits per object. But again, the key qualitative point here is that epsilon goes down really quickly as you scale b. If you double the number of bits that you're allocating per object, you're squaring the error rate, and for small error rates, squaring makes them much, much smaller. And of course, this is just one equation in two variables. If you prefer, you can solve this equation to express b, the space requirement, as a function of an error requirement. So if you know that the tolerance for false positives in your application is one percent, you can just solve for b and figure out how many bits per object you need to allocate. Rewriting, what you get is that the number of bits per object that you need is roughly 1.44 times the log base 2 of 1 over epsilon. So, as expected, as epsilon gets smaller and smaller, as you want fewer and fewer errors, the space requirement will

increase. So, the final question is: is it a useful data structure? Can you set all the parameters so that you get a really interesting space-error trade-off? And the answer is: totally. Let me give you an example. Let's go back to having eight bits of storage per object, which corresponds to b = 8. Then what this formula indicates is that we should use five or six hash functions, and already you have an error probability of something like two percent, which for a lot of the motivating applications we talked about is already good enough. And again, if you double the number of bits to, say, sixteen per object, then this error probability becomes really small, something like one in two thousand. So, to conclude: at least in this idealized analysis, which again you should check against any real-world implementation, although empirically it is definitely achievable with a well-implemented bloom filter on non-pathological data to get this kind of performance, even with a really ridiculously minuscule amount of space per object, much less, generally, than storing the objects themselves, you can get fast inserts and fast lookups. You do have to put up with false positives, but with a very controllable error rate, and that's what makes bloom filters a win in a number of applications.
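The numbers in this closing example are easy to reproduce. Here is a small sketch tying the whole trade-off together, using the formulas from this lecture (the helper function names are mine):

```python
import math

def optimal_k(b: float) -> float:
    # Error-minimizing number of hash functions: k* = b * ln(2), about 0.693*b.
    return b * math.log(2)

def error_rate(b: float) -> float:
    # False positive rate at the optimal k: (1/2)^(b * ln 2).
    return 0.5 ** (b * math.log(2))

def bits_needed(eps: float) -> float:
    # Invert: b = log2(1/eps) / ln(2), roughly 1.44 * log2(1/eps) bits/object.
    return math.log2(1 / eps) / math.log(2)

print(optimal_k(8))        # ~5.55, so round to 5 or 6 hash functions
print(error_rate(8))       # ~0.021, about the two percent quoted above
print(error_rate(16))      # doubling b squares the error: roughly 1 in 2,000
print(bits_needed(0.01))   # ~9.6 bits/object for a one percent tolerance
```

Note how error_rate(16) is essentially error_rate(8) squared, which is the "doubling the bits squares the error rate" point from the lecture.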
