This explores ideas for optimizing a bloom filter using a fractional number of hash functions by applying them probabilistically.
See https://en.wikipedia.org/wiki/Bloom_filter for info on bloom filters.
The optimal number of hash functions "k" for a bloom filter with "n" entries and "m" bits is:
k = m/n*ln(2) ~= 0.7 * m/n
Which is only an integer if m/n is an integer multiple of 1/ln(2), i.e. if n/m = ln(2)/k for some whole k. If it is not, that implies that using an integer number of hash functions is sub-optimal. This means that to get the best performance for your bits you should pick an n/m ratio whose optimal k is a whole number and gives you the desired false-positive rate. Note that ln(2) ~= 0.7 also happens to be the commonly recommended max load factor for hashtables...
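As a quick illustration (Python, with an arbitrary example ratio; the numbers are just assumptions), the optimal k almost never lands on an integer:

    import math

    bits_per_entry = 12                    # hypothetical m/n ratio
    k = bits_per_entry * math.log(2)       # k = m/n * ln(2)
    print(k)                               # ~8.3 hashes -- not an integer
    # To make a whole k optimal you'd need m/n = k/ln(2), e.g. k=8 wants
    # m/n ~= 11.54 bits per entry.
    print(8 / math.log(2))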
In practice, tuning the n/m ratio like that is not possible. The "n" number of entries depends on the dataset. The "m" bloom filter size is typically a power of 2 for performance reasons, and has an upper practical limit of the L1, L2, or RAM cache sizes. And the desired false-positive rate is usually just "as low as possible". This means m and n are usually beyond our control, leaving only k as the tunable setting. An example I've encountered: the bloom filter needs to fit inside the L2 cache to be worth using, but "n" is so large that the n/m ratio is > 1! If the numbers are telling us the optimum k is fractional, does that mean we could do better with a non-integer number of hash functions? But how can you use a non-integer number of hash functions?
By applying them probabilistically and only setting the bloom filter bit for that hash function a "P" fraction of the time. For an m=2^M sized bloom filter, a hash function needs to produce M bits to pick which bit in the bloom filter to set, and we can use some other bits of the hash as a fixed-point number "p" scaled between 0->1 and only set the bit if p<P, so it gets set a P fraction of the time. We do the same when reading the filter; we only check the bit if p<P and otherwise behave as if it was set. This means that the bloom filter is only used for that hash function for a fraction of the entries, helping keep the bloom filter "sparse". This is similar to a technique to improve cache hit-rates when the working set is larger than the cache; only bother caching a fraction of the data, because it's better to get some hits on a fraction of the data than no hits on all of it.
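To make the idea concrete, here is a minimal sketch of one possible implementation in Python. It is not a tuned or definitive version; the choice of blake2b as the hash, the 64-bit digest split into an M-bit index plus a 32-bit fixed-point threshold, and the FractionalBloom name are all just assumptions for illustration:

    import hashlib

    class FractionalBloom:
        """Sketch of a bloom filter with a fractional number of hash functions.

        k is split into int(k) hashes that are always applied plus one extra
        hash applied only a (k - int(k)) fraction of the time.  Keys are bytes.
        Assumes M + 32 <= 64 so the index and threshold bits fit in one digest.
        """

        def __init__(self, M, k):
            self.M = M
            self.m = 1 << M                      # filter size in bits
            self.bits = bytearray((self.m + 7) // 8)
            self.whole = int(k)                  # hashes always applied
            self.frac = k - self.whole           # probability of the partial hash ("P")

        def _hash(self, key, i):
            # One 64-bit hash per index i: the low M bits pick the filter bit,
            # the next 32 bits become a fixed-point value p in [0, 1).
            h = int.from_bytes(
                hashlib.blake2b(key, digest_size=8,
                                salt=i.to_bytes(8, 'little')).digest(), 'little')
            bit = h & (self.m - 1)
            p = ((h >> self.M) & 0xffffffff) / 2**32
            return bit, p

        def _set(self, bit):
            self.bits[bit >> 3] |= 1 << (bit & 7)

        def _test(self, bit):
            return bool(self.bits[bit >> 3] & (1 << (bit & 7)))

        def add(self, key):
            for i in range(self.whole):
                self._set(self._hash(key, i)[0])
            # The partial hash only sets its bit a frac fraction of the time.
            bit, p = self._hash(key, self.whole)
            if p < self.frac:
                self._set(bit)

        def __contains__(self, key):
            for i in range(self.whole):
                if not self._test(self._hash(key, i)[0]):
                    return False
            # Only check the partial hash's bit when it would have been set;
            # otherwise behave as if it was set.
            bit, p = self._hash(key, self.whole)
            if p < self.frac and not self._test(bit):
                return False
            return True

For the n/m > 1 case above you would construct it with a k below 1, so the single hash only sets and checks bits for that fraction of the entries.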
How does this affect the performance? The false-positive rate for a normal bloom filter is:
f = (1 - e^(-k*n/m))^k
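As a small Python helper for playing with that formula (the function name and example numbers are just for illustration, and this is the approximation above, not a simulation):

    import math

    def bloom_fp(n, m, k):
        """False-positive rate of a standard bloom filter: (1 - e^(-k*n/m))^k."""
        return (1 - math.exp(-k * n / m)) ** k

    print(bloom_fp(1000, 8192, 6))   # ~0.02 for these example numbers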
But what if we only apply a hash a p fraction of the time? For a single entry, the probability that a given bit is not set by that partial hash is:
1 - p/m = 1 - 1/(m/p)
Applying that hash over all n entries gives a probability of that bit not being set of:
(1 - 1/(m/p))^n = ((1 - 1/(m/p))^(m/p))^(n*p/m) ~= e^(-p*n/m)
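A quick numeric sanity check of that approximation (with arbitrary example values):

    import math

    n, m, p = 100000, 1 << 16, 0.4
    exact = (1 - 1 / (m / p)) ** n      # (1 - p/m)^n
    approx = math.exp(-p * n / m)
    print(exact, approx)                # both ~0.543; they agree when m/p is large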
If we only bother checking that bit the same p fraction of the time when matching, then the chance of it being checked and found not set, giving a true negative result, is:
p * e^(-p*n/m)
Which gives the following probability of a false positive:
f = 1 - p*e^(-p*n/m)
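Continuing the Python sketch, here is that partial-hash false-positive rate next to the ordinary single-hash rate it reduces to at p=1 (again with assumed example numbers):

    import math

    def frac_bloom_fp(n, m, p):
        """False-positive rate with one hash applied a p fraction of the time."""
        return 1 - p * math.exp(-p * n / m)

    n, m = 200000, 1 << 16              # n/m ~= 3, so the usual optimum k < 1
    print(frac_bloom_fp(n, m, 1.0))     # p=1 is the ordinary single-hash filter
    print(frac_bloom_fp(n, m, 0.3))     # applying the hash less often does better here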
If n and m are fixed, what p value between 0 -> 1 will minimize f? It's when this term is maximized:
p*e^(-p*n/m)
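Taking the derivative, d/dp [p*e^(-p*n/m)] = (1 - p*n/m)*e^(-p*n/m), which is zero at p = m/n; so when n/m > 1 the maximum over 0 < p <= 1 sits at p = m/n, and otherwise at p = 1. A quick numeric scan (example numbers assumed) agrees:

    import math

    def term(p, n, m):
        return p * math.exp(-p * n / m)

    n, m = 3 * (1 << 20), 1 << 20       # n/m = 3, a case where the optimal k is fractional
    # Coarse scan of p over (0, 1]; the peak lands at p = m/n ~= 0.333.
    best_p = max((i / 1000 for i in range(1, 1001)), key=lambda p: term(p, n, m))
    print(best_p)                       # ~0.333
    print(1 - term(1.0, n, m))          # f with the hash always applied
    print(1 - term(best_p, n, m))       # f with the fractional hash: lower

For this example the fractional hash gives a noticeably lower false-positive rate than always applying the hash, which suggests the approach really can do better when n/m > 1.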