Rollsum Chunking
================

Using rollsums for chunking is a little different from how rsync/librsync uses them, and the two uses have slightly different requirements.

Rsync uses rollsums as a fast hash of a whole fixed-size block for comparison against other blocks. This means it needs a large rolling "window", and a good hash with low collision rates and little clustering.

Chunking only needs a small rolling window with enough bytes (32, 48 or 64 bytes is common) to give a hash with enough meaningful bits to compare against a target value to identify cut points. This means collisions and clustering probably don't matter as much; pretty much the only thing that really matters is that the target hash value is hit with the right frequency, and with a good random distribution, across all the possible input-window values. Note that a generally good hash would also meet this requirement, but it is a weaker requirement than a good hash.
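
To make the cut-point test concrete, something like this is all that is needed (the names and the choice of N here are just placeholders, not from any particular implementation):

```c
#include <stdint.h>

/* Illustrative only: with N bits selected, a well-distributed hash hits
 * the target value once every 2^N windows on average, giving chunks of
 * roughly 2^N bytes. */
#define CUT_BITS   13                      /* target-size = 2^13 = 8KiB */
#define CUT_MASK   ((1u << CUT_BITS) - 1)
#define CUT_TARGET 0x1fffu                 /* any fixed N-bit value works */

static inline int is_cut_point(uint32_t h)
{
    /* Compare the selected bits of the rolling hash against the target. */
    return (h & CUT_MASK) == CUT_TARGET;
}
```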

References
==========

* https://github.com/ipfs/specs/issues/227 - The IPFS issue tracking their chunking analysis.
* https://github.com/ipfs/go-ipfs-chunker/issues/18 - Another IPFS issue about this.
* https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10 - A long summary of deduplication ideas.
* https://github.com/aidanhs/rollsum-tests - A project testing rollsums for chunking.
* https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf - The FastCDC paper.

Thoughts
========

The FastCDC paper highlighted that typical chunking algorithms use N bits of the hash to get an average chunk size of 2^N, but the chunk sizes are not normally distributed around the target chunk size; they follow an exponential distribution, with the most common chunk size being the smallest. Also, skipping over a min-chunk-size effectively shifts this distribution by the min size, which changes the average chunk size to min+2^N. We will call 2^N the "target-size", so average-size = min-size + target-size.

The min chunk size also causes it to take longer for chunks to "re-align" after differences, because it hides chunk boundaries that fall inside the skipped data. The exponential distribution of chunk sizes means that many chunk boundaries will fall within the min-chunk-size. This means you really want your min-size to be small to avoid hurting chunk re-alignment and thus de-duplication. I'd suggest 1/4 of the target-size, which also ensures the average chunk size is only 25% larger than the target-size.
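
As a worked example of that arithmetic (the sizes are picked purely for illustration):

```c
#include <stdio.h>

/* Illustrative arithmetic only: skipping min-size bytes shifts the whole
 * exponential distribution, so average-size = min-size + target-size. */
int main(void)
{
    unsigned target_size = 1u << 13;            /* N=13 bits -> 8192 byte target */
    unsigned min_size = target_size / 4;        /* suggested min = target/4 = 2048 */
    unsigned avg_size = min_size + target_size; /* 10240 = 1.25 * target-size */
    printf("target=%u min=%u average=%u\n", target_size, min_size, avg_size);
    return 0;
}
```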

Gear
----

The Gear rollsum has a really neat feature for chunking: there is no need to keep a sliding window, because old bytes "expire" out of the hash as they get shifted out. It uses a lookup table to map each byte to the full hash width before adding it to the hash, and each byte's contribution slowly gets shifted out as later bytes are added. The window is therefore limited to the size of the hash, with older bytes only represented in the more significant bits of the hash, which means the normal technique of selecting the N least-significant bits with a mask also prunes the window size to N bytes.
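
A minimal sketch of the Gear update (the table name is a placeholder; a real implementation bakes in a fixed table of 256 random values):

```c
#include <stdint.h>

/* Assumed byte-mapping table: 256 random 32-bit values, generated once
 * (e.g. from a PRNG) and compiled into the program. */
extern const uint32_t gear_table[256];

/* Gear update: shift the hash and add the mapped new byte.  Old bytes
 * "expire" as the shift pushes their contribution out the top, so no
 * explicit window buffer or "remove oldest byte" step is needed. */
static inline uint32_t gear_update(uint32_t h, uint8_t b)
{
    return (h << 1) + gear_table[b];
}

/* Note: a byte added k steps ago has been shifted left k times, so it
 * only affects bits k and above.  Selecting the N least-significant
 * bits with a mask therefore only "sees" the last N bytes. */
```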

The older a byte gets, the fewer of the high-end bits it is included in. So even though a 32-bit hash includes 32 bytes worth of window, the oldest byte is only represented in the single most-significant bit.

For windows larger than 32 bytes you need a larger hash (64 bits), which means you need a larger byte-mapping table. These tables eat CPU memory cache, which can hurt the overall program even if the chunking itself is fast.

FastCDC
-------

FastCDC uses a Gear rollsum, but adds 3 things:

1. Simplifies the hash judgement check by making the target hash zero.
2. Expands the Gear window by zero-padding the mask 5 bits to the left, expanding the window size to N+5.
3. Uses "Normalized Chunking", using N+x bits of the hash before the target chunk size, and N-x bits after the target chunk size.

Option 1. seems pretty trivial, but they claim a noticeable speedup from `!(h&mask)` vs `(h&mask)==r`, presumably because the assembler `&` operation gives you a zero-check for free. Using a zero target is also more likely to hit degenerate cases if you are not careful with your byte-mapping. For example, if a zero byte is mapped to zero (as it would be using e.g. murmurhash's mix32 finalizer), then you always match on runs of zeros.
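
To spell out that degenerate case (gear_table is the assumed byte-mapping table from the sketch above, not anyone's actual table):

```c
#include <stdint.h>
#include <assert.h>

extern const uint32_t gear_table[256];  /* assumed byte-mapping table */

/* With a zero target, if gear_table[0] == 0 then once the input is a run
 * of zero bytes longer than the hash width, every earlier contribution
 * has been shifted out, h sticks at 0, and !(h & mask) fires on every
 * byte of the run.  A sanity check on the generated table rules it out. */
static inline void check_table(void)
{
    assert(gear_table[0] != 0);
}
```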

For 2. it seems strange that they didn't realize that using the most-significant bits gives you the largest window size. Shifting the mask by 5 seems very arbitrary, and less effective than always using the most significant N bits. Also, I'm not sure, but `!(h>>S)` where S=32-N might be just as fast as `!(h&M)` where M=((~0) << (32-N)).
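
For reference, the two forms of the most-significant-bits check would look something like this (my guess at how they'd be written, not code from the paper):

```c
#include <stdint.h>

/* Boundary test using the top N bits of a 32-bit Gear hash.  Both forms
 * select the same bits; which compiles to fewer instructions is the open
 * question above.  N is an assumed parameter. */
#define N 13

static inline int boundary_mask(uint32_t h)
{
    const uint32_t mask = ~UINT32_C(0) << (32 - N);  /* top N bits set */
    return (h & mask) == 0;
}

static inline int boundary_shift(uint32_t h)
{
    return (h >> (32 - N)) == 0;  /* same test written as a shift */
}
```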

The idea of 3. is interesting, and they claim it re-syncs chunk boundaries faster after skipping over min-size bytes, giving better deduplication. It's not clear in the paper if the de-duplication wins are really because of this, or because the way they use it gives smaller chunks. There are several ways to think about what this does and how it relates to min-size handling.

One way to think of it is that the hash effectively includes history data from before the window, since chunk-boundary selection also depends on how far back the last chunk boundary was.

Another is that chunk boundaries have a degree of "hardness" based on their hash. A hash with more zero bits is "harder" and more likely to be chosen as a chunk boundary even if there was another boundary shortly before it. These "hard boundaries" drive the re-synchronisation of chunk boundaries after changes more strongly.

The "hard cliff" at the target-size feels a little ugly, but it does avoid complicating things. An incremental adjustment to the probabilities to give a more normal distribution would be interesting to try/analize.


