## Rollsum Chunking

Using rollsums for chunking is a little different to how rsync/librsync uses them, and they have slightly different requirements.

Rsync uses rollsums as a fast hash of a whole fixed sized block for comparison against other blocks. This means it needs a large rolling "window", and a good hash with low collisions and clustering behaviour.

Chunking only needs a small rolling window with enough bytes (32, 48 or 64 bytes is common) to give a hash with enough meaningful bits to compare against a target value to identify the cut points. This means collisions and clustering probably don't matter as much, and pretty much the only thing that really matters is that the target hash value has a good random distribution with the right frequency across all the possible input-window values. Note that a generally good hash would also meet this requirement, but it is a weaker requirement than a good hash.

More conclusive analysis, code and experiments testing the ideas explored here can be found at;

## References

- https://github.com/ipfs/specs/issues/227 - The IPFS issue tracking their chunking analysis.
- https://github.com/ipfs/go-ipfs-chunker/issues/18 - Another IPFS issue about this.
- https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10 - A long summary of deduplication ideas.
- https://github.com/aidanhs/rollsum-tests - A project testing rollsums for chunking.
- https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf - The FastCDC? paper.
- https://www.usenix.org/system/files/conference/atc12/atc12-final293.pdf - MicroSoft? Deduplication paper.

## Thoughts

The FastCDC? paper highlighted that typical chunking algorithms use N bits of the hash to get an average chunk size of 2^N, but the distribution is not a normal distribution around the target chunk size, but an exponential distribution with the most common block size being the smallest. Also, skipping over a min-block-size effectively shifts this distribution by the min size, which changes the average chunk size to min+2^N. We will call 2^N the "target-size", so the average-size = min-size + target-size. There is a >22% chance of a chunk-boundary within 1/4 of, the median chunk size is about 70% of, and <13% are greater than 2x, the target-size.

The min chunk size also cause it to take longer for chunks to "re-align" after differences because it hides chunk-boundaries that are in the skipped data. The exponential distribution of chunk-sizes means that many chunk boundaries will fall within the min-chunk-size, and skipping them can skip over boundaries used on previous versions, delaying re-alignment., This means you really want your min-size to be small to avoid hurting the chunk re-alignment and thus de-duplication. I'd suggest 1/4 of the "target chunk size", which also ensures the average chunk size is only 25% larger than the target chunk size.

### window size

The window size used needs to give you N bits of entropy in the N bits of the hash used for the chunking hash judgement. Compression can give a rough estimate of the bits of entropy per byte, and with text files 4:1 compression is common, and for things like html it reportedly can go to 10:1. Conservatively we could assume we get 1 bit of entropy per byte in the window, so for N bits in the hash, we need a window with at least N bytes. If we are selecting only N bits from a 32bit hash generated from the window bytes, then provided the hash is good, any N bits from that hash should include N bits of entropy from the N byte window. However, if the hash is not good, and the entropy is not evenly distributed across all the hash bits, and in particular over the N hash bits selected, we would need more than N bytes in the window. For a full 32bits of entropy in the 32bit hash we'd need at least 32bytes in the window.

If you map bytes to a 32bit value, you are smearing that byte's entropy over 32bits. If you then select a subset of that 32bit value, are you selecting only a subset of the initial bytes entropy? I think that depends on the mapping and how you select bits from it. For example, simply repeating the byte 4x to get 32bits, and then selecting any contiguous 8bits, you get all the original bit's and thus all the entropy in the original byte. However, I'm pretty sure if you select less than 8 bits, or select 8 bits that are not impacted by all 8 original bits (ie, selecting multiple copies of some bits), then you must be getting only a fraction of the original entropy.

If we assume the mapping is a "good hash", the entropy of the original byte is spread over all 32bits in a way that if you select any 8bits you get all the byte's original entropy, but if you select <8 bits you get a corresponding fraction of the entropy. However, we must assume the entropy within the original byte is not spread evenly over all 8 bits (so not a good hash), so selecting less than 8 bits gives you that fraction of the original entropy. So if we use only 3bits of a 32bit word mapped from one byte that contained only 2bits of entropy, we would get 2*3/8 bits of entropy. It would be interesting to figure out a formula for approximating the entropy you get from selecting n bits expanded to 32bits from m bits that contained e bits of entropy.

## Algorithms

### PolyHash?

This is the Polynomial Hash AKA RabinKarp? (not Rabin Fingerprint) mentioned at https://en.wikipedia.org/wiki/Rolling_hash, and used by librsync. It gives a very good hash (low collisions, good distribution) without needing a byte-mapping table, which saves cache memory. However, it does use 2 multiplies per rotation, which makes it slower than rollsums that only use addition or xor like CyclicPoly?/Buzhash, Rollsum/Alder32/Bap, or Gear. Also unlike Gear, it does need to keep the sliding window of data.

### Gear

The gear rollsum has a really neat feature for chunking; there is no need to keep a sliding window because old bytes "expire" out of the hash as they get shifted out. It uses a lookup table to map bytes to the full hash width before adding them to the hash, and they slowly get shifted out with each byte added. This means that the window is limited to the size of the hash, with older bytes only represented in the more significant bits of the hash. This makes it completely useless as a fast rolling hash for rsync, but ideal for chunking.

This means the normal technique of selecting the N least-significant bits with a mask also prunes the window size to N bytes. Importantly, it not only prunes the window size to N bytes, but it also only selects only part of the entropy of the last 8 bytes of that window (assuming the original byte's entropy is perfectly spread across the hash by the mapping function). If we assume the entropy linearly decays for the last 8 bytes, it means we loose 4 bytes worth of entropy. So for N bits of entropy, assuming 1 bit of entropy per byte, we need at least N+4 bytes in the window.

The oldest bytes are included in less and less high end bits the older they get. So even though a 32bit hash includes 32bytes worth of window, the oldest byte is only represented in the single most-significant bit.

For windows larger than 32bits you need a larger hash (64bits), which means you need a larger byte-mapping table. These tables eat CPU memory cache, which can hurt the overall program even if the chunking bit is fast.

### FastCDC?

Uses a Gear rollsum, but adds 3 things;

- Simplifies the hash judgement check by making the target hash zero.
- Expands the Gear window by zero-padding the mask to the left, expanding the window size to N+z where z is the number of zero's padded.
- Uses "Normalized Chunking", using N+x bits of the hash before the target chunk size, and N-x bits after the target chunk size.

Option 1. seems pretty minor, but they claim a noticeable speedup from !(h&mask) vs (h&mask)==r, presumably because the assembler & operation gives you a zero-check for free. Using a zero target is also more likely to hit degenerate cases if you are not careful with your byte-mapping. For example, if zero is mapped to zero (as it would be using eg murmurhash's mix32 finalizer), then you always match on runs of zero's. Also, h <= r is probably just as cheap and can give you arbitrary target block sizes, with arbitrary probabilities of matching.

For 2. it seems strange that they didn't seem to realize that just using the most significant bits would give you the largest window size. The have picked masks with zero padding at apparently arbitrary points, though maybe these points avoid degenerate cases in their byte-mapping. It seems it would still be less effective than always using the most significant N bits, again providing your byte-mapping is good. Also I'm not sure but !(h>>S) where S=32-N might be just as fast as !(h&M) where M is a mask with N arbitrary bits set. However, it's also interesting to note that in the Gear entropy analysis above we need at least N+4 bytes in the window, so adding 5 bytes by shifting the mask 5 places should be sufficient to get your N bits of entropy. However, using the most significant bits would mean you are using a full window size of 32-4 = 28bytes, which is probably still better.

The idea of 3. is interesting, and they claim it speeds up re-syncing chunk boundaries faster after skipping over min-size bytes, giving better deduplication. It's not clear in the paper if the de-duplication wins are really because of this, or because they way they use it gives smaller chunks. There are several ways to think of what this does, and how it relates to min-size handling.

- It means that the hash judgement effectively includes history data before the window, since chunk boundary selection also depends on how far back the last chunk boundary was.
- The chunk boundaries have a degree of "hardness" based on their hash. A hash with more zero bits is "harder" and more likely to be chosen as a chunk boundary even if there was another boundary before it. These "hard boundaries" force the synchronisation of chunk boundaries after changes harder.

The "hard cliff" at the target-size feels a little ugly, but it does avoid complicating things. An incremental adjustment to the probabilities to give a more normal distribution would be interesting to try/analyze. This together with h <= r hash judgment can give a continuously variable match probability.

### Regression Chunking

The Microsoft Deduplication paper introduces "Regression Chunking" as a way to reduce the problems of max_len truncations. Instead of just truncating chunks at max_len, it "goes back" looking for a chunk boundary before the limit with a weaker boundary criteria. It uses bit-masks, and the weaker criteria uses a mask with N-k bits for k=1..4. This is implemented by keeping all the k=1..4 last weak-matches as it scans, then using the last of the strongest matches found when it hits max_len. This has the effect of re-destributing the tail max_len chunks over smaller chunk lengths, giving a nearly uniform distribution according to their graphs.

Also worth noting is they use chunker settings min_len=32K, tgt_len=64K, max_len=128K and get an average length of about 80K after Regression Chunking. These settings are equivalent to about min_len=0.33x, max_len=1.33x, which is a very low max_len setting. Their doc talks about this giving about 14% truncations without regression chunking, but they seem to miss the fact that min_len offsets the real average, and actually they would have got e^-1.5 = 22%, truncations which would make some kind of truncation handling improvements even more necessary. The Regression Chunking re-distribution of truncated chunks resulted in a 5% reduction of the average chunk size. It's possible that all the "Regression Chunking" deduplication wins are entirely due to this average chunk size reduction, something which could maybe have been achieved by setting better basic chunking settings.

I don't know how important it is to have tightly constrained chunk sizes. Just how bad would it have been to set max_len=256K instead of max_len=128K? Using min_len=32K, tgt_len=32K, max_len=128K without Regression Chunking would have had 5% chunks at max_len and an average length of 62K which would probably give better deduplication. Increasing max_len to 192K would mean less than 0.7% at max_len with an average length of 64K, probably giving slightly better deduplication because of avoided truncations.

It would be interesting to test Regression Chunking too. Using a more flexible h<p judgement criteria, A "better" regression chunking could be implemented by simply storing the offset with the minimum hash value, which is the "best chunk boundary", and using that as the chunk boundary when max_len is reached.

## Exponential Chunking

The normal chunking algorithms use a constant probability of a break point each byte, and typically have a min and max chunk size.

With no min or max chunksize, the average chunk size is the "target_size" when the probability P = lambda = 1/target_size. When you add a min and/or max chunk size, the average chunk size is modified. The min_size simply adds to the target_size. The max_size changes the size distribution and modifies the average like this:

target_size = 1/P = 1/lambda large_fract = 1 - CDF(max_size - min_size) = 1 - (1 - e^-(lambda*(max_size - min_size)) = e^((min_size - max_size)/target_size) avg_large_size = max_size + target_size avg_size = min_size + target_size - large_fract*avg_large_size + large_fract*max_size = min_size + target_size - large_fract*(avg_larg_size - max_size) = min_size + target_size - large_fract*target_size = min_size + (1 - large_fract)*target_size = min_size + (1 - e^((min_size - max_size)/target_size))*target_size

This is non-trivial to transform into function to calculate the target_size for a given average, min, and max size, so the easiest way is to solve the following numerically:

0 = min_size + (1 - e^-((max_size - min_size)/target_size))*target_size - avg_size

## Better Normalized Chunking

We can use a hash judgement of h < p where the hash h and probability p are treated as a fixed point numbers between 0.0 -> 1.0 scaled into a 32bit number. This lets use use arbitrary chunking probabilities based on the distance from the previous break-point to give whatever chunk-size distribution we want.

Using the following gives a pretty good approximation of a standard distribution centered on the target_size, except with hard min-size and max-size limits:

min_size=target_size/4 max_size=2.5*target_size + min_size median_size = min_size + target_size p = (2*(size - min_size)^2/target_size^3

You can see this graphed here;

https://docs.google.com/spreadsheets/d/1ofg3-ZX04mbGA-9HZAcu5shfkbPYEO_PGW_4UuKFQpc/edit?usp=sharing

This is calculated from this distribution equation:

PDF(x) = f(x) * (1 - CDF(x))

Which can be solved on https://onsolver.com/diff-equation.php and differentiated and graphed using https://www.derivative-calculator.net/ as the differential eqn::

y' = f(x) * (1 - y) y(0) = 0, y'(0) = 0

Which for the following f(x) gives:

f(x) = K*x^2 CDF(x) = 1 - e^(-K*x^3/3) PDF(x) = K*x^2*e^(-K*x^3/3)

This is apparently a Weibull distribution with k=3 and lamda=(3/K)^(1/3) (or b=K/3 using the alternate parameter form). This has distribution curve characteristics::

mode = (2/K)^(1/3) median = (ln(2)*3/K)^(1/3) ~= (2.08/K)^(1/3) ~= 1.01*mode mean = (3/K)^(1/3) * gamma(4/3) ~= (2.14/K)^(1/3) ~= 1.02*mode max ~= 2.13*mean

Which means you can calculate K for a target mode, median, or mean from:

K = 2/mode^3 K = ln(2)*3/median^3 ~= 2.08/median^3 K = 3*(gamma(4/3)/mean)^3 ~= 2.14/mean^3

Alternatively you can use a more slowly increasing chance of having a cut point using::

f(x) = K*x CDF(x) = 1 - e^(-K*x^2/2) PDF(x) = K*x*e^(-K*x^2/2)

This is a Weibull distribution with k=2 and lamda=(2/K)^(1/2). This has distribution curve characteristics::

mode = (1/K)^(1/2) median = (ln(2)*2/K)^(1/2) ~= (1.39/K)^(1/2) ~= 1.18*mode mean = (2/K)^(1/2) * gamma(3/2) ~= (1.57/K)^(1/2) ~= 1.25*mode max ~= 2.97*mean

Which means you can calculate K for a target mode, median, or mean from:

K = 1/mode^2 K = ln(2)*2/median^2 ~= 1.39/median^2 K = 2*(gamma(3/2)/mean)^2 ~= 1.57/mean^2

Note that p as a fixed point unsigned 32bit number can be calculated efficiently incrementally like this:

p = 0 step = 2 * (2^32 / target_size^3) incr = step * 2 for each iteration after min_size: h = <update rollsum> if h < p: <chunk breakpoint> p += step step += incr

Note that if target_size is not a power of 2 then this will be rounded down and will "stretch" the distribution to give longer chunks. If target_size is close or larger than 2**10 then incr will be too small to be accurate enough. If you want to calculate it more accurately you'd need integer fractions like this:

p = 0 denom = (target_size/1024)^3 step = (2 * 4) / denom step_frac = (2 * 4) % denom incr = step * 2 incr_fract = step_frac * 2 if incr_frac >= denom: incr += 1 incr_frac -= denom for each iteration after min_size: h = <update rollsum> if h < p: <chunk breakpoint> p += step p_frac += step_frac step += incr step_frac += incr_frac if step_frac >= denom: step += 1 step_frac -= denom

or (probably better) update p and step at a courser granularity like this every 256 bytes:

p = 0 step = (2 * 256^2 * 4) / (target_size/1024)^3 incr = step * 2 for each iteration after min_size: h = <update rollsum> if h < p: <chunk breakpoint> if size % 256 == 0: p += step step += incr

Comparing the hash to a probability like this probably requires that the hash used is fairly good, which is a stronger requirement than for (h&mask == r) style hash judgement. Things like Gear are probably sufficient provided the byte-mapping table is good. Buzhash and Rollsum/Adler/Bup without a decent byte-mapping table are not.

It's not entirely clear that doing this actually does improve re-syncing breakpoints, which is necessary to maximize duplicate detection, but it will give a much better chunk size distribution.

The (h&mask == r) hash judgment is equivalent to using a fixed probability (h < p) hash judgement. Using a fixed probability and skipping over min-size=target-size/4 means there is a 22% chance that any break-point masks another breakpoint after it. Also, with max-size=target-size*2, there is a 13% chance a long chunk is terminated at an arbitrary break point, and that break point has a 22% chance of masking a legitimate break point after that. Any change has a chance of introducing a new break-point, which can mask an old break-point after that, or removing a break-point, which can expose a previously masked break-point. These added/exposed/hidden breakpoints can affect the break-points after them, further delaying re-synchronizing with the old breakpoints. It's possible that in rare/degenerate cases they never re-synchronize. Synchronization is only guaranteed when there is a breakpoint after a run >min-size and <max-size without a breakpoint, which for any one chunk is a 100% - 22% - 13% = 65% chance. The only way to guarantee re-synchronizing is to not have any min-size, ensuring that all break points are found and synchronized on, at the expense of a large number of very small chunks.

Using a non-constant probability p to normalize the chunk size is a like using non-binary break-point masking, with breakpoints being masked based on their "strength" (how close to zero the hash is) and their distance from the previous break point. This means any added/removed breakpoint will modify what breakpoints will be detected after it, potentially delaying synchronization for the benefit of a better chunk-size distribution. My gut feeling is that this will be better than a fixed p with min-size skipping and max-size limit, but it could be worse. The FastCDC? paper suggests that it should be better, but I'm unconvinced it proves that; I think it's possible the de-duplication wins are from smaller chunk sizes, in particular a tighter chunk-size distribution with less very large chunks. Note that the variable p approach also gives a tighter chunk-size distribution, so it should get the same wins.

Note that although a variable p hash judgement approach does variable break-point masking, it doesn't gain any speed benefits from masking them like skipping over min-size does. It does include the speed benefits of skipping over min-size, but doesn't gain any speed benefits for skipping over later break-points that get masked because they were not "strong" enough to be below the p value for the distance from the previous break point. The main benefit of this is not speed, but a better chunk-size distribution, which may (but might not) also translate into better duplication detection.

## Updates

Some preliminary testing suggests that using normal exponential chunking (AKA weibull with k=1) with tgt_len == min_len for an average block size of 2*tgt_len gives better performance than Weibull with k=2 or k=3 and even fastcdc for the same average block size. It performs better than exponential chunking with no min_len. This appears to be because the larger block sizes matter too, and keeping them small improves things too. With min_len == tgt_len, the tgt_len is halved compared to min_len=0 for the same average block size, which results in 98% of blocks being under 2.5x the average length, compared to 4x the average length. It appears reducing the larger block lengths improves things more than the possibly masked block boundaries under min_len costs.

Reducing max_len below the 98th percentile clearly hurts deduplication for all distributions. Partly this is because it introduces bad/meaningless break points, but also because for the same avg block length you need to increase the tgt_len to offset the truncated blocks. This makes it less sensitive to break-points and increases the "expected" block size, increasing the number of truncations even further. Note the 98th percentile is approximately 4x/3x/2x tgt_len past min_len for weibull weibull with k=1/2/3 respectively.

Note k=2 does perform better than exponential with min_len=0 too, but not by as much. This is probably because k=2 reduces the max length to ~3x the average length. For k=3 the max block length is reduced to ~2x which also makes it better than exponential, but not as good as k=2. This suggest the more aggressive adjustments to the chunking probability based on the length mess with synchronizing after changes, since a previous chunk boundary might not be recognized because it's position relative to the block length has changed. Increasing min_len for k=2 makes it perform worse, and even more so for k=3, again probably not just because it can mask potential breakpoints, but because for the same avg_len it reduces tgt_len, increasing the dynamicness of the chunking probability.

I also suspect that in the case of exponential, the high probability immediately after the min_len helps it quickly find chunk points close to each other, speeding up synchronization. This suggests that a better approach for k=2 would be not to just "shift" the distribution by min_len, but "chop it off", by having the "hazzard function" f(x) start at the K*x^P value for x=min_len. This is what effectively happens with exponential because it's hazzard function is a constant. This changes the distribution to no longer match a weibull distribution, with the following differential equation solved on wolframalpha, and then the derivative solved on derivative-calculator as:

y' = f(x) * (1 - y) y(0) = 0, y'(0) = PDF(0) = M*C^(k-1)

Where:

C = min_len f(x) = M*(x+C)^(k-1) y(x) = CDF(x) y'(x) = PDF(x)

Giving:

CDF(x) = 1 - e^-(M/k*((x+C)^k - C^k)) PDF(x) = M*(x+C)^(k-1) * e^-(M/k*((x+C)^k - C^k))

Which we can calculate the the following properties:

T = max_len - min_len mean = integ((x+C) * PDF(x), 0, T) + (C+T)*(1-CDF(T)) = integ(M*(x+C)^k * e^-(M/k*((x+C)^k - C^k)), 0, T) + (C+T) * e^-(M/k*((T+C)^k - C^k))

The integral works out a little cleaner if we use the weibull distribution lambda=L instead of the hazzard function multiplier M where M=k/L^k, changing this to the following:

mean = integ((x+C) * PDF(x), 0, T) + (C+T)*(1-CDF(T)) = integ((k/L^k)*(x+C)^k * e^-((1/L^k)*((x+C)^k - C^k)), 0, T) + (C+T)*e^-((1/L^k)*((T+C)^k - C^k)) = integ(k*((x+C)/L)^k * e^-(((x+C)^k - C^k)/L^k), 0, T) + (C+T)*e^-(((T+C)^k - C^k)/L^k)

The integral (but not the definite integral) can be solved on the integral calculator giving:

integ(k*((x+C)/L)^k * e^-(((x+C)^k - C^k)/L^k)) = -L*e^(C^k/L^k)*gammaupper((k+1)/k,(x+C)^k/L^k) + const = -K*e^((C/L)^k)*gammaupper((k+1)/k,((x+C)/L)^k) + const

The definite integral evaluated between 0 and T thus becomes:

integ(k*((x+C)/L)^k * e^-(((x+C)^k - C^k)/L^k), 0, T) = -L*e^((C/L)^k) * gammaupper((k+1)/k,((T+C)/L)^k) - -L*e^((C/L)^k) * gammaupper((k+1)/k,((0+C)/L)^k) = L*e^((C/L)^k) * gammaupper((k+1)/k,(C/L)^k) - L*e^((C/L)^k) * gammaupper((k+1)/k,((T+C)/L)^k) = L*e^((C/L)^k) * (gammalower((k+1)/k,((T+C)/L)^k) - gammalower((k+1)/k,(C/L)^k))

giving the following mean for a weibull distribution limited between min_len=C and max_len=T+C:

mean = L*e^((C/L)^k) * (gammalower((k+1)/k,((T+C)/L)^k) - gammalower((k+1)/k,(C/L)^k)) + (C+T)*e^-(((T+C)^k - C^k)/L^k)

Note that with no max_len limit, T = infinity, giving:

mean = L*e^((C/L)^k) * (gamma((k+1)/k) - gammalower((k+1)/k,(C/L)^k))