MinimalPerfectHash

A minimal perfect hash function tool. It needs about 1.98 bits per key.

The algorithm is recursive: sets that contain zero or one entry are not processed, as no conflicts are possible. For sets that contain between 2 and 12 entries, a number of hash functions are tested to check whether they can store the data without conflict. If no such function is found, and for larger sets, the set is split into a (possibly large) number of smaller sets, which are processed recursively. The average size of a top-level bucket is about 216 entries, and the maximum recursion depth is typically 5.

At the end of the generation process, the data is compressed using a general-purpose compression tool (Deflate / Huffman coding) down to 2.0 bits per key. The uncompressed data is around 2.2 bits per key; with arithmetic coding, about 1.9 bits per key are needed. Generating the hash function takes about 2.5 seconds per million keys with 8 cores (multithreaded). The algorithm automatically scales with the number of available CPUs, using as many threads as there are processors. At the expense of processing time, a lower number of bits per key is possible (for example 1.84 bits per key with 100000 keys, using 32 seconds of generation time, with Huffman coding). The memory needed to efficiently calculate hash values is around 2.5 bits per key (the space for the uncompressed description, plus 8 bytes for every top-level bucket).

At each level, only one user-defined hash function call per object is made (about 3 hash function calls per key in total). The result is further processed using a supplemental hash function, so the default user-defined hash function does not need to be sophisticated: it does not need to be non-linear, have a good avalanche effect, or generate random-looking data; it just should produce few conflicts if possible. To protect against hash flooding and similar attacks, a secure random seed per hash table is used.
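The small-set case described above can be sketched as follows: for a set of 2 to 12 entries, seeds for a hash function are tried until one maps every key to a distinct slot. The mixing function and the seed range below are illustrative assumptions, not the tool's actual supplemental hash.

```java
import java.util.*;

public class SmallSetSeedSearch {

    // Supplemental integer mixing function (an assumption; any reasonable
    // avalanche mix works here, since the tool supplies its own).
    static int mix(int x, int seed) {
        x ^= seed;
        x = (x ^ (x >>> 16)) * 0x45d9f3b;
        x = (x ^ (x >>> 16)) * 0x45d9f3b;
        return x ^ (x >>> 16);
    }

    // Try seeds 0..maxTries-1 and return the first seed that maps all keys
    // to distinct slots of an array of size keys.size(), or -1 if none is
    // found, in which case the set would be split and processed recursively.
    static int findPerfectSeed(List<Integer> keys, int maxTries) {
        int size = keys.size();
        for (int seed = 0; seed < maxTries; seed++) {
            boolean[] used = new boolean[size];
            boolean ok = true;
            for (int k : keys) {
                int slot = Math.floorMod(mix(k, seed), size);
                if (used[slot]) {
                    ok = false;
                    break;
                }
                used[slot] = true;
            }
            if (ok) {
                return seed;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(3, 17, 42, 99, 256);
        int seed = findPerfectSeed(keys, 1 << 16);
        System.out.println("found seed: " + seed);
    }
}
```

Only the successful seed needs to be stored in the hash function description, which is why small sets can be encoded in very few bits.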
For further protection, cryptographically secure hash functions such as SipHash or SHA-256 can be used. However, such (slower) functions only need to be used if the regular hash functions produce too many conflicts. This case is detected while generating the perfect hash function, by checking whether there are too many conflicts (more than 2160 entries in one top-level bucket); in that case, the next hash function is used. That way, in the normal case, where no attack is happening, only fast but less secure hash functions are called. It is fine to use the regular hashCode method as the level 0 hash function. However, just relying on the regular hashCode method does not work if the key has more than 32 bits, because the risk of collisions is too high. Incorrect universal hash functions are detected (an exception is thrown if there are more than 32 recursion levels).
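The escalation from a fast hash function to a more secure one can be sketched like this. The `UniversalHash` interface, the `FAST` lambda, and the bucket parameters are illustrative assumptions; only the 2160-entry threshold comes from the text.

```java
import java.util.*;

public class HashEscalation {

    // A universal hash family: 'index' selects which function in the family
    // to use (index 0 fast, higher indexes slower but more secure).
    interface UniversalHash<K> {
        int hashCode(K key, int index, int seed);
    }

    // Fast level-0 hash backed by the key's regular hashCode (an assumption
    // for illustration; the mixing constant is arbitrary).
    static final UniversalHash<Integer> FAST =
            (k, index, seed) -> k.hashCode() ^ (index * 0x9e3779b9) ^ seed;

    // Pick the first hash function index whose top-level buckets all stay
    // at or below the conflict threshold; throw if the whole family fails,
    // mirroring the "incorrect universal hash" detection in the text.
    static <K> int selectIndex(List<K> keys, UniversalHash<K> h, int seed,
                               int bucketCount, int threshold, int maxIndex) {
        for (int index = 0; index < maxIndex; index++) {
            int[] counts = new int[bucketCount];
            boolean ok = true;
            for (K k : keys) {
                int b = Math.floorMod(h.hashCode(k, index, seed), bucketCount);
                if (++counts[b] > threshold) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return index;
            }
        }
        throw new IllegalStateException("incorrect universal hash: too many conflicts");
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i * 31);
        }
        // In the normal (non-attack) case, the fast function at index 0 wins.
        System.out.println("selected index: "
                + selectIndex(keys, FAST, 42, 16, 2160, 32));
    }
}
```

The slower, secure functions are thus only ever evaluated when the fast ones demonstrably misbehave, which keeps the common case cheap.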
In-place updating of the hash table is not implemented but possible in
theory, by patching the hash function description. With a small change,
non-minimal perfect hash functions can be calculated (for example 1.22 bits
per key at a fill rate of 81%).
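The non-minimal variant trades a sparser value table for a smaller function description: with n keys at fill rate f, the table has ceil(n / f) slots instead of exactly n. A minimal sketch of that sizing (the method name is hypothetical):

```java
public class NonMinimalSizing {

    // Table size for a non-minimal perfect hash: n keys spread over
    // ceil(n / fillRate) slots, e.g. 81% fill as cited above.
    static int tableSize(int n, double fillRate) {
        return (int) Math.ceil(n / fillRate);
    }

    public static void main(String[] args) {
        // 100000 keys at 81% fill need a table larger than 100000 slots,
        // but the function description itself shrinks (about 1.22 bits/key).
        System.out.println(tableSize(100000, 0.81));
    }
}
```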