home contents changes issues index options help subscribe edit

Edit detail for RollsumChunking revision 4 of 32

	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Editor: DonovanBaarda Time: 2020/06/30 08:40:04 GMT+10
Note:

changed:
-Option 1. seems pretty minor, but they claim a noticeable speedup from `!(h&mask)` vs `(h&mask)==r`, presumably because the assembler `&` operation gives you a zero-check for free. Using a zero target is also more likely to hit degenerate cases if you are not careful with your byte-mapping. For example, if zero is mapped to zero (as it would be using eg murmurhash's mix32 finalizer), then you always match on runs of zero's.
Option 1. seems pretty minor, but they claim a noticeable speedup from `!(h&mask)` vs `(h&mask)==r`, presumably because the assembler `&` operation gives you a zero-check for free. Using a zero target is also more likely to hit degenerate cases if you are not careful with your byte-mapping. For example, if zero is mapped to zero (as it would be using eg murmurhash's mix32 finalizer), then you always match on runs of zero's. Also, `h <= r` is probably just as cheap and can give you arbitrary target block sizes, with arbitrary probabilities of matching.

changed:
-The "hard cliff" at the target-size feels a little ugly, but it does avoid complicating things. An incremental adjustment to the probabilities to give a more normal distribution would be interesting to try/analize.
-
-
The "hard cliff" at the target-size feels a little ugly, but it does avoid complicating things. An incremental adjustment to the probabilities to give a more normal distribution would be interesting to try/analize. This together with `h <= r` hash judgment can give a continuously variable match probability.

Rollsum Chunking

Using rollsums for chunking is a little different to how rsync/librsync uses them, and they have slightly different requirements.

Rsync uses rollsums as a fast hash of a whole fixed sized block for comparison against other blocks. This means it needs a large rolling "window", and a good hash with low collisions and clustering behaviour.

Chunking only needs a small rolling window with enough bytes (32, 48 or 64 bytes is common) to give a hash with enough meaningful bits to compare against a target value to identify the cut points. This means collisions and clustering probably don't matter as much, and pretty much the only thing that really matters is that the target hash value has a good random distribution with the right frequency across all the possible input-window values. Note that a generally good hash would also meet this requirement, but it is a weaker requirement than a good hash.

References

https://github.com/ipfs/specs/issues/227 - The IPFS issue tracking their chunking analysis.
https://github.com/ipfs/go-ipfs-chunker/issues/18 - Another IPFS issue about this.
https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10 - A long summary of deduplication ideas.
https://github.com/aidanhs/rollsum-tests - A project testing rollsums for chunking.
https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf - The FastCDC? paper.

Thoughts

The FastCDC? paper highlighted that typical chunking algorithms use N bits of the hash to get an average chunk size of 2^N, but the distribution is not a normal distribution around the target chunk size, but an exponential distribution with the most common block size being the smallest. Also, skipping over a min-block-size effectively shifts this distribution by the min size, which changes the average chunk size to min+2^N. We will call 2^N the "target-size", so the average-size = min-size + target-size.

The min chunk size also cause it to take longer for chunks to "re-align" after differences because it hides chunk-boundaries that are in the skipped data. The exponential distribution of chunk-sizes means that many chunk boundaries will fall within the min-chunk-size, and skipping them can skip over boundaries used on previous versions, delaying re-alignment., This means you really want your min-size to be small to avoid hurting the chunk re-alignment and thus de-duplication. I'd suggest 1/4 of the "target chunk size", which also ensures the average chunk size is only 25% larger than the target chunk size.

window size

The window size used needs to give you N bits of entropy in the N bits of the hash used for the chunking hash judgement. Compression can give a rough estimate of the bits of entropy per byte, and with text files 4:1 compression is common, and for things like html it reportedly can go to 10:1. Conservatively we could assume we get 1 bit of entropy per byte in the window, so for N bits in the hash, we need a window with at least N bytes. If we are selecting only N bits from a 32bit hash generated from the window bytes, then provided the hash is good, any N bits from that hash should include N bits of entropy from the N byte window. However, if the hash is not good, and the entropy is not evenly distributed across all the hash bits, and in particular over the N hash bits selected, we would need more than N bytes in the window. For a full 32bits of entropy in the 32bit hash we'd need at least 32bytes in the window.

If you map bytes to a 32bit value, you are smearing that byte's entropy over 32bits. If you then select a subset of that 32bit value, are you selecting only a subset of the initial bytes entropy? I think that depends on the mapping and how you select bits from it. For example, simply repeating the byte 4x to get 32bits, and then selecting any contiguous 8bits, you get all the original bit's and thus all the entropy in the original byte. However, I'm pretty sure if you select less than 8 bits, or select 8 bits that are not impacted by all 8 original bits (ie, selecting multiple copies of some bits), then you must be getting only a fraction of the original entropy.

If we assume the mapping is a "good hash", the entropy of the original byte is spread over all 32bits in a way that if you select any 8bits you get all the byte's original entropy, but if you select <8 bits you get a corresponding fraction of the entropy. However, we must assume the entropy within the original byte is not spread evenly over all 8 bits (so not a good hash), so selecting less than 8 bits gives you that fraction of the original entropy. So if we use only 3bits of a 32bit word mapped from one byte that contained only 2bits of entropy, we would get 2*3/8 bits of entropy. It would be interesting to figure out a formula for approximating the entropy you get from selecting n bits expanded to 32bits from m bits that contained e bits of entropy.

Gear

The gear rollsum has a really neat feature for chunking; there is no need to keep a sliding window because old bytes "expire" out of the hash as they get shifted out. It uses a lookup table to map bytes to the full hash width before adding them to the hash, and they slowly get shifted out with each byte added. This means that the window is limited to the size of the hash, with older bytes only represented in the more significant bits of the hash.

This means the normal technique of selecting the N least-significant bits with a mask also prunes the window size to N bytes. Importantly, it not only prunes the window size to N bytes, but it also only selects part of the entropy of the last 8 bytes of that window (assuming the original byte's entropy is perfectly spread across the hash by the mapping function). If we assume the entropy linearly decays for the last 8 bytes, it means we loose 4 bytes worth of entropy. So for N bits of entropy, assuming 1 bit of entropy per byte, we need at least N+4 bytes in the window.

The oldest bytes are included in less and less high end bits the older they get. So even though a 32bit hash includes 32bytes worth of window, the oldest byte is only represented in the single most-significant bit.

For windows larger than 32bits you need a larger hash (64bits), which means you need a larger byte-mapping table. These tables eat CPU memory cache, which can hurt the overall program even if the chunking bit is fast.

FastCDC?

Uses a Gear rollsum, but adds 3 things;

Simplifies the hash judgement check by making the target hash zero.
Expands the Gear window by zero-padding the mask 5 bits to the left, expanding the window size to N+5.
Uses "Normalized Chunking", using N+x bits of the hash before the target chunk size, and N-x bits after the target chunk size.

Option 1. seems pretty minor, but they claim a noticeable speedup from !(h&mask) vs (h&mask)==r, presumably because the assembler & operation gives you a zero-check for free. Using a zero target is also more likely to hit degenerate cases if you are not careful with your byte-mapping. For example, if zero is mapped to zero (as it would be using eg murmurhash's mix32 finalizer), then you always match on runs of zero's. Also, h <= r is probably just as cheap and can give you arbitrary target block sizes, with arbitrary probabilities of matching.

For 2. it seems strange that they didn't seem to realize that just using the most significant bits would give you the largest window size. Shifting the mask by 5 seems very arbitrary, and less effective than always using the most significant N bits. Also I'm not sure but !(h>>S) where S=32-N might be just as fast as !(h&M) where M=((~0) << (32-N)). However, it's also interesting to note that in the Gear entropy analysis above we need at least N+4 bytes in the window, so adding 5 bytes by shifting the mask 5 places should be sufficient to get your N bits of entropy. However, using the most significant bits would mean you are using a full window size of 32-4 = 28bytes, which is probably still better.

The idea of 3. is interesting, and they claim it speeds up re-syncing chunk boundaries faster after skipping over min-size bytes, giving better deduplication. It's not clear in the paper if the de-duplication wins are really because of this, or because they way they use it gives smaller chunks. There are several ways to think of what this does, and how it relates to min-size handling.

It means that the hash effectively includes history data before the window, since chunk boundary selection also depends on how far back the last chunk boundary was.

The chunk boundaries have a degree of "hardness" based on their hash. A hash with more zero bits is "harder" and more likely to be chosen as a chunk boundary even if there was another boundary before it. These "hard boundaries" force the synchronisation of chunk boundaries after changes harder.

The "hard cliff" at the target-size feels a little ugly, but it does avoid complicating things. An incremental adjustment to the probabilities to give a more normal distribution would be interesting to try/analize. This together with h <= r hash judgment can give a continuously variable match probability.