This explores ideas for optimizing a bloom filter using a factional number of hash functions by applying them probabilistic-ally.

See https://en.wikipedia.org/wiki/Bloom_filter for info on bloom filters.

The optimal number of hash functions "k" for a bloom filter with "n" entries "m" bits is:

k = m/n*ln(2) ~= 0.7 * m/n

Which is only an integer if n/m is an integer multiple of ln(2). If it is not, that implies that using an integer number of hash functions is sub-optimal. This means to get the best performance for your bits you should pick an optimal n/m ratio for a k that gives you the desired false-positive rate. Note that ln(2) ~= 0.7 happens to also be the commonly recommended max load factor for hashtables...

In practice, this is not possible. The "n" number of entries depends on the dataset. The "m" bloom filter size is typically a power of 2 for performance reasons, and has an upper practical limit of the L1, L2, or RAM cache sizes. In practice the desired false-positive rate is "as low as possible". This means m and n are usually beyond our control, leaving only k as the tune-able setting. An example I've encountered is the bloom filter needs to fit inside the L2 cache to be worth using, but the "n" is so large the n/m ratio is > 1! If the numbers are telling us the optimum k is fractional, does that mean we could do better with a non-integer number of hash functions? But how can you use a non-integer number of hash functions?

By applying them probabalistic-ally and only setting the bloom filter bit for that hash function "P" fraction of the time. For an m=2^M sized bloom filter, a hash function needs to produce M bits to pick which bit in the bloom filter to set, and we can use some other bits of the hash as fixed point number "p" scaled between 0->1 and only set the bit if p>=P. We do the same when reading the filter; we only check the bit if p>=P and otherwise behave like it was set. This means that the bloom filter is only used for that hash function for a fraction of the entries, helping keep the bloom filter "sparse". This is similar to a technique to improve cache hitrates when the working set is larger than the cache; only bother caching a fraction of the data, because it's better to get some hits on a faction of the data than no hits on all of it.

How does this affect the performance? The false-positive rate for a normal bloom filter is:

f = (1 - e^(-k*n/m))^k

There are 3 different ways to partially apply "k" hash functions that give the following false-positive rates;

- apply all k checks p of the time together:

f = (1-p) + p * (1 - e^(-p*k*n/m))^k

- apply each check p of the time independently:

f = (1 - p*e^(-p*k*n/m))^k

- apply (k-1) checks every time, and the last check p of the time:

f = (1 - e^(-(k-1+p)*n/m))^(k-1) * (1 - p*e^(-(k-1+p)*n/m))

Note for each of these, putting p=1 gives the normal bloom-filter false-positive rate. Setting p=0 gives f=1 for 1. and 2. and for 3. its equivalent to applying k-1 hashes. When you plug these into https://www.derivative-calculator.net/ and play with the numbers it quickly becomes apparent that fractional application of hashes is only an improvement when the load factor of n/m is > 1.0.

What if we only apply a single hash p fraction of the time? The probability that a bit is not set by that partial hash is:

1 - p/m = 1 - 1/(m/p)

That hash applied for n entries gives a probability for that bit being not set of:

(1 - 1/(m/p))^n = ((1 - 1/(m/p))^(m/p))^(n*p/m) ~= e^(-p*n/m)

If we only bother checking that bit when we are matching p of the time, then the chance it being checked and found not set to give a true negative result is:

p * e^(-p*n/m)

Which gives the following probability of a false positive:

f = 1 - p*e^(-p*n/m)

If n and m are fixed, what p value between 0 -> 1 will minimize f? It's when this term is maximized:

p*e^(-p*n/m)

Which is when the derivative is zero:

0=e^(-p*n/m) - p*n/m*e^(-p*n/m) 0=1 - p*n/m p=m/n

Interestingly this means the performance is best when we push the "load factor" (ie added bits / bits space) to 1.0, which is quite different to the optimal k load factor of ~0.7. It turns out that this results in any sort of fractional k approach to try and improve things when the optimal k is > 1 doesn't help; the best result is to just round(k). However, when the ideal k is < 1.0, using a fractional k does improve things. How much?

n/m | k=1 | p=m/n |
---|---|---|

1.0 | 0.632 | 0.632 |

2.0 | 0.865 | 0.816 |

3.0 | 0.950 | 0.877 |

4.0 | 0.982 | 0.908 |

5.0 | 0.993 | 0.926 |

So you can have a bloom filter with 1/4 as many bits as you have entries and still avoid 9.2% of unnecessary lookups, or 1/3 the size and avoid 12.3% of unnecessary lookups. A normal bloom filter with k=1 would only prevent 1.8% or 5% respectively for those cases. That might not be worth it for many applications, but for some that can directly translate into shaving 9.2% or 12.3% off your execution time.