UPDATE: the thoughts on this page have been experimented with more extensively with tests and documented as a github project at;

https://github.com/dbaarda/rollsum-tests

The conclusions after the more extensive tests have even better solutions than discussed here.

There are two ways to think about it; how bit information propagates in the s1 and s2 sums, and how the distribution of the s1 and s2 total sums ends up distributed in the truncated 16bit s1 and s2 values.

The modulus options

The truncation to 16 bits is doing a mod 0x10000 and chops of information stored in higher bits. Information in bits of bytes added together only propagate up to higher bits, never down into lower bits.

Doing mod 0xFFFF instead of just truncating to 16 bits is the same as using 1's compliment addition. This does an "end around carry", propagating information that overflows into the higher bits back into the lower bits. This ensures all bits get treated more evenly and preserves bit information overflowing past 16bits. This is what Fletcher uses.

Note mod 0xffff can be implemented as (v >> 16) + (v & 0xffff) for repeated summing, followed by "if v>0xffff then v-= 0xffff" at the end. This works because V % N can be implemented by subtracting N until V<N. So V % (N - 1) can be implemented by subtracting N and adding 1 until V<(N-1). The mask and shift sum operation is the same as this. See http://en.wikipedia.org/wiki/Fletcher%27s_checksum#Optimizations for examples. Also the zlib adler32 implementation does something similar for it's mod operation.

Doing a mod of the largest 16bit prime is what adler32 does. This has been shown to just reduce the range of possible values for no additional gains over mod 0xffff.

Some experimenting with rollsum variants for 1M of csv ASCII data giving 1047553 1K blocks showed using 0xffff is unexpectedly worse. The reason for this is the s1 sum for 1K ASCII data uses so little of the 16bit range, the mod size has no effect on it at all, and the chance of colliding in s1 is very high. This means most of the hash depends on the well distributed s2 sum. The sacrifice of 1 possible value in s2 hurts badly, since everything that previously had s2=0xffff is now spread as s2 collisions over other values, and is pretty much guaranteed to also have collisions in s1. This effect outweighs the benefit of the slightly better hashing with s2.

Using 0x10000 there were 12850 collisions. For s1 16bits only 2434 values were used, leaving 63102 values unused with min/avg/max/sdev = 0/15.9843902588/2434/159.743669365. For s2 the entire range was used with min/avg/max/sdev = 3/15.9843902588/34/3.99462726489.

Using base=0xffff there were 12883 collisions. The s1 16bits distribution was identical, but the s2 distribution had a little worse min->max range because of the missing 0xffff with min/avg/max/sdev = 0/15.9843902588/35/3.98825070239. However, even despite the reduced s2 range, the variance was lower, suggesting the distribution was actually a little better.

For almost random zip file data, using 0xffff did show some improvement with collisions going from 722 down to 718.

After identifying the narrow s1 range as being a problem, I experimented with trying to increase its dynamic range by using the square of the byte values. This had a huge effect, with significantly reduced collisions and much better hash distributions for both the whole rollsum and s1 and s2, looking close to optimal poisson distributions for all of them for both csv and zip data. With these better distributions using 0xffff was better than 0x10000 for both zip and csv data.

Using squared byte values, mod 0xffff, for the csv data collisions went down to 162, and for zip data down to 110. That's a huge drop from the original csv=12850 and zip=722 collisions respectively!

The +31 to each byte

The +31 is adding 0xf to each byte. On a byte by byte basis this is spreading the information in the lower 4 bits up into the higher bits faster. Without an end-around-carry this ensures the information overflows out of the s1 word faster. In in the s1 word of the checksum it is equivalent to adding 31*B to the sum of all the bytes. If the block size is a power of two, it's equivalent to adding 0xf << N to the sum. If the blocksize is > 2^12 some of those bits overflow out of s1 anyway.

For the s2 word it's equivalent to adding 31*B/2*(B+1) = 31*(B^2+B)/2 to the s2 sum. When B is a power of 2 this is equivalent to 0xf << (2N-1) + 0xf << (N-1). For B > 2^9 the larger term overflows out of the s2 word, so only the lower term counts.

For very small block sizes the +31 does help propagate information into higher order bits faster, which probably helps a naive truncation appear more evenly balanced when using the sum as a hashtable key, but for even normal sized blocks all it does is ensure the information overflows out of the sums faster.

So the +31 doesn't help. When truncating the sums to 16bits it overflows bit information out of the s1 and s2 sums faster. When doing mod 0xffff then information doesn't overflow out of the sums, but it doesn't make the sums any better.

Tests have shown that for fixed block sizes initializing s1 or adding offsets to bytes have zero impact on the rollsum quality. This is unsurprising, since they both effectively just add fixed offsets to s1 and s2.

Initializing s1=1 would be simpler and achieve the same objective of ensuring the length is encoded in the s2 value. This is what adler32 does.

Characteristics of a single random byte

Note for evenly distributed random bytes, for a single byte we can calculate::

sum,sum2 = 0,0
for i in range(256):
  sum += i
  sum2 += i*i
meanB=sum/256.0
varB=sum2/256.0 - meanB*meanB
stddevB=sqrt(varB)

Which gives meanB=127.5, varB=5461.25, stddevB=73.9.

A sample 1M fragment of a zip file gives stats very similar to this::

count=1048576 min/avg/max/sdev = 7/128.726448059/255/73.8443282326

However, for text data the distribution is way less. A sample 1M fragment of a csv file gives::

count=1048576 min/avg/max/sdev = 10/47.6125984192/120/5.88370775439

This equates to::

meanA=47.6
varA=34.6

For a sum of N bytes, we end up with a normal distribution with::

mean = N*meanB
var = N*varB.

For a single byte multiplied by N we get::

mean = N*meanB
var = N*N*varB

Characteristics of s1 sum

The s1 sum of bytes x1,x2,...,xB is just::

s1 = x1 + x2 + ... + xB

For N evenly distributed random bytes we have::

min = 0
max = N * 255
mean = N * meanB
     = N*127.5
var = N * varB
    = N*5461.25

For a blocksize of 1kB s1 is sum of 2^10 evenly distributed random bytes giving max=0x3fc00, mean=0x1fe00 and stddev=0x93c. Since the mean and max are more than 16 bits, bit information has already started overflowing out of a truncated sum.

Since 99.7% of possible sums will fall within a contiguous range of +-3*stddev, this means a range of 0x3768, or within less than 1/4 of the possible range of 16bit values, or only the bottom 14bits. Also, 68% will fall within +-1*stddev, or a range of only 0x1278, or just over 1/16 of the possible values, or the bottom 12bits. Any truncation or modulus of this sum just means the common range wraps around, and any offset added just shifts where the range starts. It's only when B>21843 or 21KB that 99.7% of samples use the full 16bit range, and B>196611 or 192kB till 68% of samples use the whole 16bit range.

So the s1 sum will have a poor hash distribution for normal block sizes. It's only for block sizes >21kB that it starts to look OK, and only for block sizes >192kB that it is pretty good.

Tests show that this holds true, and is even worse for ASCII data. Using the square of the byte values has a huge impact on this, but its affect on large block sizes is unknown.

Characteristics of s2 sum

The s2 sum of bytes x1,x2,...,xB is::

s2 = B*x1 + (B-1)*x2 + ... + xB

This means the second last byte is shifted left 1, the 4th last left 2, the 8th last 3, ... the first left N for block size B=2^N. After truncating to 16bits that means for a 1KB block the high end 2 bits of the first byte have overflowed out of the sum. Larger block sizes means even more of the first byte will have overflowed out, and for a 64KB block size there is no remnant of the first byte left in s2.

For N evenly distributed random bytes we have::

min = 0
max = N*255 + (N-1)*255 + ... + 1*255
    = (N + (N-1) + ... + 1) * 255
    = N*(N-1)/2 * 255
mean = N*meanB + (N-1)*meanB + ... 1*meanB
     = (N + (N-1) + ... + 1) * meanB
     = N*(N-1)/2 * meanB
     = N*(N-1)/2 * 127.5
var = N^2*varB + (N-1)^2*varB + ... + varB
    = (N^2 + (N-1)^2 + ... + 1^2)*varB
    = N*(N+1)*(2*N+1)/6 * varB
    = N*(N+1)*(2*N+1)/6 * 5461.25

See https://proofwiki.org/wiki/Sum_of_Sequence_of_Squares for why the variance sum reduces to that value.

For a blocksize of 1kB of 2^10 evenly distributed random bytes, s2 has max=0x7f60200, mean=0x3fb0100, stddev=0x15594a. Since the mean is 28bits, quite a lot of bit information has overflowed out of a truncated 16bit sum. Since the stddev is also much more than 16bits the truncated 16bit sums should be well distributed over the whole 16bit range.

So the s2 sum should have a pretty good hash distribution for even small block sizes.

Tests show that s2 is still quite good even for 1K ASCII data. Using the square of the bytes to improve s1 doesn't seem to adversely affect s2, particularly when using mod 0xffff.

Using the combined sum as hashtable index

The overall 32bit rollsum is rollsum = (s2 & 0xffff) << 16 + (s1 & 0xffff), so s2 is the upper 16bits, and s1 is the lower 16 bits.

Taking a naive modulus of the rollsum as the index into a hash table means for hashtables sized smaller than 2^16 you only use the s1 value, and only the lowest bits. The lowest bits should give good diversity, but since s1 only uses a fraction of the 16bit range the higher bits will not be well distributed. So for large hash tables and small block sizes you can easily end up only populating a fraction of the hashtable with lots of collisions.

Using a hashtable sizes of (2^N-1) and using a simple mod (2^N-1) of the combined 32bit rollsum will mix s2 into s1. Since s2 will have a good distribution, this should give a pretty good hash distribution and minimize collisions. This should be as good as any sort of integer hashing algorithm applied to the rollsum, and much simpler.

Tests show that using mod (2^N) of the current rollsum produces woeful results, particularly for ASCII data. Using mod (2^N-1) does produce near optimal poisson distributed results. However, when summing the square of the bytes the s1 sum becomes much better distributed, and using mod (2^N) for random zip data is actually better because of the extra value in the range. However, for ASCII data using squared bytes using mod (2^N - 1) is still significantly better.

Summary

To make the rollsum simpler and better we could:

  1. drop the +31 to each byte and replace it with initializing s1=1.
  2. change the sums to use mod 0xffff instead of just truncating to 16bits.
  3. change the sums to accumulate the square of the byte values.

These will simplify and improve the quality of the rollsum, but it's unclear by how much. These changes could be implemented easily with little extra overheads. However, they will break backwards compatibility with older versions. I'm not sure how to handle that.

When Using the rollsum as a hashtable index:

  1. Using 2^N hashtable sizes and just taking the lower N bits will give a bad hash distribution with lots of collisions.
  2. Using 2^N-1 hashtable sizes and doing mod 2^N-1 should give a good hash distribution, and will be much simpler than using integer hash functions.




subject:
  ( 1 subscriber )