jjhash is a pretty mediocre string hash function. Here are some reasons why you would want to use it nevertheless:
- It’s simple, and the implementation is very small and consists of a single header file. If you don't like including some fancy code, the inner workings of which you don’t completely understand, into your project, this might be your thing.
- It’s faster than FNV-2: 6x faster on pointer-and-length strings, 3x faster on null-terminated strings (see bench directory).
- We believe that it is either on par with, or better than, FNV-2 on statistical properties (see quality directory).
- It’s streamable, by which we mean
hash(A concatenated with B)can be calculated inO(length(B))if we knowA’s hashing state (which must beO(1)in space) and have access toA’s length and content. Note than FNV-2 is also streamable under this definition. See jjhashx.h and jjhash_64/jjhashx64.h. - The implementation does not contain any compiler- or platform-dependent code (in particular, checks for endianness), free of any kinds of undefined, implementation-defined and/or unspecified behaviors according to both C and C++ standards, and conforming to C99 and C++98.
- The hashes are consistent between little-endian and big-endian platforms.
- It has both 32-bit hash and 64-bit hash variants (see jjhash_64 directory for 64-bit version of jjhash).
- It’s licensed under public domain-like license (Unlicense).
jjhash is defined as follows:
uint64_t a = JJ_OFFSET;
for each 4-byte chunk of input as 'v' (as little-endian):
a ^= v;
a *= JJ_PRIME;
if any bytes of input left:
v = remaining bytes of input (as little-endian, unused bytes set to zero);
a ^= v;
a *= JJ_PRIME;
a ^= a >> 16;
a ^= a >> 8;
return a & MASK;
where:
JJ_OFFSETis 0x100000000;JJ_PRIMEis 2752750471;HASH_TYPEisuint32_tfor 32-bit hash version anduint64_tfor 64-bit hash version;MASKisUINT32_MAX(0xFFFFFFFF) for 32-bit hash version andUINT64_MAX(0xFFFFFFFFFFFFFFFF) for 64-bit hash version.
We check the following things:
jjhash_s(function to hash a null-terminated string) andjjhash_b(function to hash a pointer-and-length string) agree on the hash of the same string;- calculating the hash of concatenation from the previous state and the new string works as expected;
- the functions do not make unsafe reads past their data (this may lead to a segmentation fault in a real-world program):
we check it by placing the string just before a “poisoned page” (first we allocate two normal pages with
mmap(), then poison the second page withmprotect(..., prot=PROT_NONE)); - all the properties above are invariant over the alignment of the pointer to the beginning of the string.
See validate directory for more information.
We generate 200 random words of approximately given length L (namely, of length L - random() % 4),
then measure the time it takes to hash all the generated words N times,
where N = floor(15000000 / L).
We perform this measurement:
- for various
Ls: namely, of formL = round_up_4(floor(1.6 ^ i))fori=3...23, whereround_up_4(n) = n + (4 - n % 4) % 4rounds up its argument to a multiple of 4; - for both jjhash and FNV-2;
- for both hashing a null-terminated string and for hashing a pointer-and-length string.
See bench directory for more information and instructions on how to reproduce.
We use the following χ² (chi-squared) test given in the Dragon Book to evaluate the quality of a given hash function:
$$
\frac{
\sum\limits_{j=0}^{m-1} b_j (b_j + 1) / 2
}{
(n / 2m)(n + 2m - 1)
},$$where $n$ is the number of words, $m$ is the number of buckets, $b_j$ is the number of words in the bucket $j$.
Here’s a rendered formula of the above LaTeX code:
We test on the corpus of W=146'728 words extracted from the OANC American English language corpus.
To prevent “overfitting”, we chose JJ_PRIME from a large set of prime numbers based on the performance on a smaller dictionary of 27k words;
this evaluation is thus a “validation”.
For each i=1...30, we downsample the corpus to min(2^i, W) words (i.e. choose a random subset of this size) and
map these words into 2^i buckets with the hash function.
We then calculate the χ² statistic using the LaTeX formula above.
For illustrative purposes, we then take base-2 logarithm of the result and then divide it by 1.5^(30-i).
See quality directory for more information and instructions on how to reproduce.
The definition and implementation of jjhash, and the rest of the code in this repo, is licensed under Unlicense, a public domain-like license.





