Edit detail for LibrsyncRollsumAnalysis revision 3 of 8

Editor: DonovanBaarda
Time: 2016/08/05 13:55:15 GMT+10
Note:

changed:
-There are two ways to think about it; modulus of sums and bit information propagation.
There are two ways to think about it: how bit information propagates in the s0 and s1 sums, and how the distribution of the full s0 and s1 sums ends up spread over the truncated 16bit s0 and s1 values.

changed:
-The truncation to 16 bits is doing a mod 0x10000 and chops of information stored in higher bits. Information in bits of the bytes added only propagates up to higher bits, never down into lower bits.
-
-Doing a mod 0xFFFF instead of just truncating to 16 bits is the same as using 1's compliment addition. This does an "end around carry" effect, propagating information that overflows into the higher bits back into the lower bits. This ensures all bits get treated more evenly. This is what Fletcher uses.
The truncation to 16 bits is doing a mod 0x10000 and chops off information stored in higher bits. Information in the bits of the bytes added together only propagates up into higher bits, never down into lower bits.

Doing mod 0xFFFF instead of just truncating to 16 bits is the same as using 1's complement addition. This does an "end around carry", propagating information that overflows into the higher bits back into the lower bits. This ensures all bits get treated more evenly and preserves bit information overflowing past 16bits. This is what Fletcher uses.

changed:
-The +31 is adding 0xf to each byte. On byte by byte basis this is spreading the information in the lower 4 bits up into the higher bits faster. Without an end-around-carry this ensures the information overflows out of the s0 word faster.
-In in the s0 word of the checksum it is equivalent to adding 31*B to the sum of all the bytes. If the block size B is a power of two, it's equivalent to adding 0xf << N to the sum. If the blocksize is > 2^12 some of those bits overflow out of s0 anyway.
The +31 is adding 0x1f to each byte. On a byte by byte basis this is spreading the information in the lower 5 bits up into the higher bits faster. Without an end-around-carry this ensures the information overflows out of the s0 word faster.
In the s0 word of the checksum it is equivalent to adding 31*B to the sum of all the bytes. If the block size B=2^N is a power of two, it's equivalent to adding 0x1f << N to the sum. If the blocksize is >= 2^12 some of those bits overflow out of s0 anyway.

changed:
-For very small block sizes the +31 does help propagate information into higher order bits faster, which probably helps make the hash appear more evenly balanced when using the sum as a hashtable key, but for even normal sized blocks all it does is ensure the information overflows out of the sums faster.
-
-So the +31 doesn't help, at least when just truncating the sums to 16bits. Initializing s0=1 would be simpler and achieve the same objective of ensuring the length is encoded in the s1 value
-
For very small block sizes the +31 does help propagate information into higher order bits faster, which probably helps a naive truncation appear more evenly balanced when using the sum as a hashtable key, but for even normal sized blocks all it does is ensure the information overflows out of the sums faster.

So the +31 doesn't help. When truncating the sums to 16bits it overflows bit information out of the s0 and s1 sums faster. When doing mod 0xffff the information doesn't overflow out of the sums, but the +31 doesn't make the sums any better either.

Initializing s0=1 would be simpler and achieve the same objective of ensuring the length is encoded in the s1 value. This is what adler32 does.

Characteristics of a single random byte
---------------------------------------

Note for evenly distributed random bytes, for a single byte we can calculate:::

   from math import sqrt
   sum,sum2 = 0,0
   for i in range(256):
     sum += i
     sum2 += i*i
   meanB=sum/256.0
   varB=sum2/256.0 - meanB*meanB
   stddevB=sqrt(varB)

Which gives meanB=127.5, varB=5461.25, stddevB=73.9.

For a sum of N bytes, we end up with a normal distribution with:::

   mean = N*meanB
   var = N*varB.

For a single byte multiplied by N we get:::

   mean = N*meanB
   var = N*N*varB

changed:
-A sum of B evenly distributed random bytes will have a normal distribution with a min=0, a max=B*255, a mean=B*127.5 and a variance=B*5461.25. For B=2^10 this gives mean=0x1fe00 and stddev=0x93c. Since the mean is more than 16 bits, bit information has already started overflowing out of the sum. Since 99.7% of possible sums will fall within a contiguous range of +-3*stddev, this means a range 0x3768 or within less that 1/4 of the possible range of values. Also, 68% will fall within +-1*stddev, or a range of only 0x1278, or just over 1/16 of the possible values. Any truncation or modulus of this sum just means the range used wraps around, and any offset added just shifts where the range lies. It's only when B>21843 or 21KB that 99.7% of samples use the full 16bit range, and B>196611 or 192kB till 68% of samples use the whole 16bit range.
-
-Taking a naive modulus of the rollsum as the index into a hash table means for hashtables sized smaller than 2^16 you only use the s0 value, and only the lowest bits. The lowest bits should give good diversity, but for large hash tables and small block sizes you can easily end up only populating a fraction of the hashtable with lots of collisions.
-
The s0 sum of bytes x1,x2,...,xB is just:::

   s0 = x1 + x2 + ... + xB

For N evenly distributed random bytes we have:::

   min = 0
   max = N * 255
   mean = N * meanB
        = N*127.5
   var = N * varB
       = N*5461.25

For a 1kB block, s0 is the sum of 2^10 evenly distributed random bytes, giving max=0x3fc00, mean=0x1fe00 and stddev=0x93c. Since the mean and max are more than 16 bits, bit information has already started overflowing out of a truncated sum.

Since 99.7% of possible sums will fall within a contiguous range of +-3*stddev, this means a range of 0x3768, or less than 1/4 of the possible range of 16bit values. Also, 68% will fall within +-1*stddev, or a range of only 0x1278, just over 1/16 of the possible values. Any truncation or modulus of this sum just means the common range wraps around, and any offset added just shifts where the range starts. It's only when B>21843 (about 21kB) that 99.7% of samples use the full 16bit range, and only when B>196611 (about 192kB) that 68% of samples use the whole 16bit range.

So the s0 sum will have a poor hash distribution for normal block sizes. It's only for block sizes >21kB that it starts to look OK, and only for block sizes >192kB that it is pretty good.

changed:
-The s1 sum of bytes x1,x2,...,xB is B*x1 + (B-1)*x2 + ... + xB.
-
-This means the second last byte is shifted left 1, the 4th last left 2, the 8th last 3, ... the first left N for block size B=2^N. For a 16bit truncate that means for a 1KB block the high end 2 bits of the first byte have overflowed out of the sum.
-
-For B evenly distributed random bytes, the s1 sum, excluding any offsets, will have min=0, max=255*(B^2+B)/2, mean=127.5*(B^2+B)/2, and variance=???
-
-
The s1 sum of bytes x1,x2,...,xB is:::

   s1 = B*x1 + (B-1)*x2 + ... + xB

This means the second last byte is shifted left 1, the 4th last left 2, the 8th last 3, ... the first left N for block size B=2^N. After truncating to 16bits that means for a 1KB block the high end 2 bits of the first byte have overflowed out of the sum. Larger block sizes mean even more of the first byte will have overflowed out, and for a 64KB block size there is no remnant of the first byte left in s1.

For N evenly distributed random bytes we have:::

   min = 0
   max = N*255 + (N-1)*255 + ... + 1*255
       = (N + (N-1) + ... + 1) * 255
       = N*(N+1)/2 * 255
   mean = N*meanB + (N-1)*meanB + ... + 1*meanB
        = (N + (N-1) + ... + 1) * meanB
        = N*(N+1)/2 * meanB
        = N*(N+1)/2 * 127.5
   var = N^2*varB + (N-1)^2*varB + ... + varB
       = (N^2 + (N-1)^2 + ... + 1^2)*varB
       = N*(N+1)*(2*N+1)/6 * varB
       = N*(N+1)*(2*N+1)/6 * 5461.25

See https://proofwiki.org/wiki/Sum_of_Sequence_of_Squares for why the variance sum reduces to that value.

For a 1kB block of 2^10 evenly distributed random bytes, s1 has max=0x7f9fe00, mean=0x3fcff00, stddev=0x15594a. Since the mean is 26bits, quite a lot of bit information has overflowed out of a truncated 16bit sum. Since the stddev is also much more than 16bits the truncated 16bit sums should be well distributed over the whole 16bit range.

So the s1 sum should have a pretty good hash distribution for even small block sizes.

Using the combined sum as hashtable index
-----------------------------------------

The overall 32bit rollsum is rollsum = ((s1 & 0xffff) << 16) + (s0 & 0xffff), so s1 is the upper 16bits, and s0 is the lower 16 bits.

Taking a naive modulus of the rollsum as the index into a hash table means for hashtables sized smaller than 2^16 you only use the s0 value, and only the lowest bits. The lowest bits should give good diversity, but since s0 only uses a fraction of the 16bit range the higher bits will not be well distributed. So for large hash tables and small block sizes you can easily end up only populating a fraction of the hashtable with lots of collisions.

Using a hashtable size of (2^N-1) and indexing with a simple mod (2^N-1) of the combined 32bit rollsum will mix s1 into s0. Since s1 will have a good distribution, this should give a pretty good hash distribution and minimize collisions. This should be as good as any sort of integer hashing algorithm applied to the rollsum, and much simpler.

Summary
-------

To make the rollsum simpler and better we could:

  a) drop the +31 to each byte and replace it with initializing s0=1.

  b) change the sums to use mod 0xffff instead of just truncating to 16bits.

These will simplify and improve the quality of the rollsum, but it's unclear by how much. These changes could be implemented easily with little to no extra overhead (maybe even less). However, they will break backwards compatibility with older versions. I'm not sure how to handle that.

When using the rollsum as a hashtable index:

  a) Using 2^N hashtable sizes and just taking the lower N bits will give a bad hash distribution with lots of collisions.

  b) Using 2^N-1 hashtable sizes and doing mod 2^N-1 should give a good hash distribution, and will be much simpler than using integer hash functions.



There are two ways to think about it: how bit information propagates in the s0 and s1 sums, and how the distribution of the full s0 and s1 sums ends up spread over the truncated 16bit s0 and s1 values.

The modulus options
-------------------

The truncation to 16 bits is doing a mod 0x10000 and chops off information stored in higher bits. Information in the bits of the bytes added together only propagates up into higher bits, never down into lower bits.

Doing mod 0xFFFF instead of just truncating to 16 bits is the same as using 1's complement addition. This does an "end around carry", propagating information that overflows into the higher bits back into the lower bits. This ensures all bits get treated more evenly and preserves bit information overflowing past 16bits. This is what Fletcher uses.

Note mod 0xffff can be implemented as (v >> 16) + (v & 0xffff) for repeated summing, followed by "if v>0xffff then v -= 0xffff" at the end. This works because V % N can be implemented by subtracting N until V<N, so V % (N-1) can be implemented by subtracting N and adding 1 until V<(N-1). With N = 0x10000 the shift and mask sum does exactly this: it removes (v >> 16) multiples of 0x10000 and adds that count back in. See http://en.wikipedia.org/wiki/Fletcher%27s_checksum#Optimizations for examples. Also the zlib adler32 implementation does something similar for its mod operation.
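As a rough sketch (not librsync's actual code), the difference between plain truncation and the folded mod 0xffff looks like this in Python::

   def trunc16(v):
       # mod 0x10000: just chops off any information above bit 15.
       return v & 0xffff

   def mod_ffff(v):
       # Fold the high half back into the low half (end-around carry).
       # Each fold removes (v >> 16) multiples of 0x10000 and adds that
       # count back, i.e. it subtracts 0xffff that many times.
       while v > 0xffff:
           v = (v >> 16) + (v & 0xffff)
       return v

   print(hex(trunc16(0x12345)), hex(mod_ffff(0x12345)))  # 0x2345 0x2346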

Doing a mod of the largest 16bit prime (65521) is what adler32 does. This has been shown to just reduce the range of possible values for no additional gain over mod 0xffff.

The +31 to each byte
--------------------

The +31 is adding 0x1f to each byte. On a byte by byte basis this is spreading the information in the lower 5 bits up into the higher bits faster. Without an end-around-carry this ensures the information overflows out of the s0 word faster. In the s0 word of the checksum it is equivalent to adding 31*B to the sum of all the bytes. If the block size B=2^N is a power of two, it's equivalent to adding 0x1f << N to the sum. If the blocksize is >= 2^12 some of those bits overflow out of s0 anyway.

For the s1 word it's equivalent to adding 31*B*(B+1)/2 = 31*(B^2+B)/2 to the s1 sum. When B is a power of 2 this is equivalent to 0x1f << (2N-1) + 0x1f << (N-1). For B >= 2^9 the larger term overflows out of the s1 word, so only the lower term counts.
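A quick sanity check of those two equivalences (a throwaway sketch; the sums() helper here is hypothetical, not librsync's API)::

   import random

   def sums(data, offset=0):
       # Plain s0/s1 sums with an optional per-byte offset, no truncation.
       s0 = s1 = 0
       for x in data:
           s0 += x + offset
           s1 += s0
       return s0, s1

   B = 1024
   block = [random.randrange(256) for _ in range(B)]
   s0, s1 = sums(block)
   t0, t1 = sums(block, offset=31)
   assert t0 == s0 + 31*B
   assert t1 == s1 + 31*B*(B + 1)//2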

For very small block sizes the +31 does help propagate information into higher order bits faster, which probably helps a naive truncation appear more evenly balanced when using the sum as a hashtable key, but for even normal sized blocks all it does is ensure the information overflows out of the sums faster.

So the +31 doesn't help. When truncating the sums to 16bits it overflows bit information out of the s0 and s1 sums faster. When doing mod 0xffff the information doesn't overflow out of the sums, but the +31 doesn't make the sums any better either.

Initializing s0=1 would be simpler and achieve the same objective of ensuring the length is encoded in the s1 value. This is what adler32 does.
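A minimal sketch of that alternative (an assumed variant for illustration, not librsync's current rollsum): drop the per-byte +31 and seed s0 with 1, so s1 still picks up the block length::

   def simple_rollsum(data):
       # Hypothetical variant: no +31 per byte, s0 seeded with 1 so that
       # every byte processed adds an extra +1 into s1 (i.e. the length).
       s0, s1 = 1, 0
       for x in data:
           s0 = (s0 + x) & 0xffff
           s1 = (s1 + s0) & 0xffff
       return (s1 << 16) | s0

   # Blocks with equal byte sums but different lengths now differ in s1.
   print(hex(simple_rollsum(b"\x02")), hex(simple_rollsum(b"\x01\x01")))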

Characteristics of a single random byte
---------------------------------------

Note for evenly distributed random bytes, for a single byte we can calculate::

from math import sqrt
sum,sum2 = 0,0
for i in range(256):
  sum += i
  sum2 += i*i
meanB=sum/256.0
varB=sum2/256.0 - meanB*meanB
stddevB=sqrt(varB)

Which gives meanB=127.5, varB=5461.25, stddevB=73.9.

For a sum of N bytes, we end up with a normal distribution with::

mean = N*meanB
var = N*varB.

For a single byte multiplied by N we get::

mean = N*meanB
var = N*N*varB
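A rough empirical check of both claims (a throwaway sketch, nothing to do with librsync itself)::

   import random

   N, samples = 64, 20000
   meanB, varB = 127.5, 5461.25

   # Sum of N random bytes: mean and variance should scale by N.
   sums = [sum(random.randrange(256) for _ in range(N)) for _ in range(samples)]
   m = sum(sums)/samples
   v = sum((s - m)**2 for s in sums)/samples
   print(m, N*meanB, v, N*varB)

   # A single random byte multiplied by N: variance scales by N*N.
   vals = [N*random.randrange(256) for _ in range(samples)]
   m = sum(vals)/samples
   v = sum((x - m)**2 for x in vals)/samples
   print(m, N*meanB, v, N*N*varB)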

Characteristics of s0 sum
-------------------------

The s0 sum of bytes x1,x2,...,xB is just::

s0 = x1 + x2 + ... + xB

For N evenly distributed random bytes we have::

min = 0
max = N * 255
mean = N * meanB
     = N*127.5
var = N * varB
    = N*5461.25

For a 1kB block, s0 is the sum of 2^10 evenly distributed random bytes, giving max=0x3fc00, mean=0x1fe00 and stddev=0x93c. Since the mean and max are more than 16 bits, bit information has already started overflowing out of a truncated sum.

Since 99.7% of possible sums will fall within a contiguous range of +-3*stddev, this means a range of 0x3768, or less than 1/4 of the possible range of 16bit values. Also, 68% will fall within +-1*stddev, or a range of only 0x1278, just over 1/16 of the possible values. Any truncation or modulus of this sum just means the common range wraps around, and any offset added just shifts where the range starts. It's only when B>21843 (about 21kB) that 99.7% of samples use the full 16bit range, and only when B>196611 (about 192kB) that 68% of samples use the whole 16bit range.
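Those numbers can be reproduced from the single byte statistics above (a quick Python check)::

   from math import sqrt

   meanB, varB = 127.5, 5461.25
   B = 2**10
   stddev = sqrt(B*varB)
   print(hex(B*255), hex(int(B*meanB)), hex(int(stddev)))  # 0x3fc00 0x1fe00 0x93c
   # Fraction of the 16bit range covered by +-3 and +-1 stddev.
   print(6*stddev/0x10000, 2*stddev/0x10000)  # about 0.22 and 0.07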

So the s0 sum will have a poor hash distribution for normal block sizes. It's only for block sizes >21kB that it starts to look OK, and only for block sizes >192kB that it is pretty good.

Characteristics of s1 sum
-------------------------

The s1 sum of bytes x1,x2,...,xB is::

s1 = B*x1 + (B-1)*x2 + ... + xB

This means the second last byte is shifted left 1, the 4th last left 2, the 8th last 3, ... the first left N for block size B=2^N. After truncating to 16bits that means for a 1KB block the high end 2 bits of the first byte have overflowed out of the sum. Larger block sizes mean even more of the first byte will have overflowed out, and for a 64KB block size there is no remnant of the first byte left in s1.
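For example, for B=2^10 the first byte's contribution to s1 is shifted up 10 bits, so its top 2 bits fall outside a truncated 16bit sum::

   B = 2**10
   contribution = B*0xff           # first byte's worst case contribution to s1
   print(hex(contribution), hex(contribution & 0xffff))  # 0x3fc00 0xfc00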

For N evenly distributed random bytes we have::

min = 0
max = N*255 + (N-1)*255 + ... + 1*255
    = (N + (N-1) + ... + 1) * 255
    = N*(N+1)/2 * 255
mean = N*meanB + (N-1)*meanB + ... + 1*meanB
     = (N + (N-1) + ... + 1) * meanB
     = N*(N+1)/2 * meanB
     = N*(N+1)/2 * 127.5
var = N^2*varB + (N-1)^2*varB + ... + varB
    = (N^2 + (N-1)^2 + ... + 1^2)*varB
    = N*(N+1)*(2*N+1)/6 * varB
    = N*(N+1)*(2*N+1)/6 * 5461.25

See https://proofwiki.org/wiki/Sum_of_Sequence_of_Squares for why the variance sum reduces to that value.

For a 1kB block of 2^10 evenly distributed random bytes, s1 has max=0x7f9fe00, mean=0x3fcff00, stddev=0x15594a. Since the mean is 26bits, quite a lot of bit information has overflowed out of a truncated 16bit sum. Since the stddev is also much more than 16bits the truncated 16bit sums should be well distributed over the whole 16bit range.
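These values follow directly from the formulas above (a quick Python check)::

   from math import sqrt

   meanB, varB = 127.5, 5461.25
   B = 2**10
   coeffs = B*(B + 1)//2               # B + (B-1) + ... + 1
   squares = B*(B + 1)*(2*B + 1)//6    # B^2 + (B-1)^2 + ... + 1
   print(hex(coeffs*255), hex(int(coeffs*meanB)), hex(int(sqrt(squares*varB))))
   # -> 0x7f9fe00 0x3fcff00 0x15594a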

So the s1 sum should have a pretty good hash distribution for even small block sizes.

Using the combined sum as hashtable index
-----------------------------------------

The overall 32bit rollsum is rollsum = ((s1 & 0xffff) << 16) + (s0 & 0xffff), so s1 is the upper 16bits, and s0 is the lower 16 bits.

Taking a naive modulus of the rollsum as the index into a hash table means for hashtables sized smaller than 2^16 you only use the s0 value, and only the lowest bits. The lowest bits should give good diversity, but since s0 only uses a fraction of the 16bit range the higher bits will not be well distributed. So for large hash tables and small block sizes you can easily end up only populating a fraction of the hashtable with lots of collisions.

Using a hashtable size of (2^N-1) and indexing with a simple mod (2^N-1) of the combined 32bit rollsum will mix s1 into s0. Since s1 will have a good distribution, this should give a pretty good hash distribution and minimize collisions. This should be as good as any sort of integer hashing algorithm applied to the rollsum, and much simpler.
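For example (a sketch; the table size and packing helper are just for illustration)::

   def rollsum_pack(s0, s1):
       # Combined 32bit rollsum: s1 in the upper 16 bits, s0 in the lower.
       return ((s1 & 0xffff) << 16) + (s0 & 0xffff)

   r = rollsum_pack(0x0123, 0xabcd)

   # 2^16 sized table: the index is just s0, s1 is ignored completely.
   print(hex(r % (1 << 16)))           # 0x123
   # (2^16 - 1) sized table: 2^16 == 1 (mod 0xffff), so the index is
   # effectively (s1 + s0) % 0xffff, mixing s1 into the result.
   print(hex(r % ((1 << 16) - 1)))     # 0xacf0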

Summary
-------

To make the rollsum simpler and better we could:

  1. drop the +31 to each byte and replace it with initializing s0=1.
  2. change the sums to use mod 0xffff instead of just truncating to 16bits.

These will simplify and improve the quality of the rollsum, but it's unclear by how much. These changes could be implemented easily with little to no extra overhead (maybe even less). However, they will break backwards compatibility with older versions. I'm not sure how to handle that.

When using the rollsum as a hashtable index:

  1. Using 2^N hashtable sizes and just taking the lower N bits will give a bad hash distribution with lots of collisions.
  2. Using 2^N-1 hashtable sizes and doing mod 2^N-1 should give a good hash distribution, and will be much simpler than using integer hash functions.