
Conversation

@zuiderkwast (Contributor) commented Mar 25, 2021

A drop-in replacement for the hash tables in dict, affecting the key space and all other uses of the dict. Fixes #8698.

Status: It's working-ish. The dict tests are passing, but the Redis test suites are not. Most tests pass occasionally, but Redis seems to hang at various points until it's killed by the test runner. Debugging help is most welcome.

The dict in hiredis gets its function names prefixed with hi_ to avoid collisions. This needs to be done upstream in hiredis...

Performance comparison

Comparison using ./redis-server test dict 10000000 (compiled using make CFLAGS=-DREDIS_TEST) on my laptop (64-bit Linux, 32G RAM).

Classic dict:

Inserting: 10000000 items in 4746 ms
Linear access of existing elements: 10000000 items in 2474 ms
Linear access of existing elements (2nd round): 10000000 items in 2459 ms
Random access of existing elements: 10000000 items in 3726 ms
Accessing random keys: 10000000 items in 1287 ms
Accessing missing: 10000000 items in 2638 ms
Removing and adding: 10000000 items in 4516 ms

HAMT:

Inserting: 10000000 items in 3253 ms
Linear access of existing elements: 10000000 items in 2793 ms
Linear access of existing elements (2nd round): 10000000 items in 2805 ms
Random access of existing elements: 10000000 items in 3676 ms
Accessing random keys: 10000000 items in 1103 ms
Accessing missing: 10000000 items in 2240 ms
Removing and adding: 10000000 items in 6467 ms

Structure

+--------------------+
| dict               |
|--------------------|   +--------------------+   +--------------------+
| bitmaps | children---->| bitmaps | children---->| key     | value    |
| size               |   | key     | value    |   | ...     | ...      |
| dictType           |   | key     | value    |   +--------------------+
| privdata           |   | key     | value    |
+--------------------+   | key     | value    |   +--------------------+
                         | bitmaps | children---->| key     | value    |
                         | key     | value    |   | bitmaps | children--->..
                         | ...     | ...      |   | value   | key      |
                         +--------------------+   | ...     | ...      |
                                                  +--------------------+

At each level, 5 bits of the hash are used as an index into the children. A
bitmap indicates which of the children exist (1) and which don't (0). Whether
the child is a leaf or a subnode is indicated by a second bitmap, the "leafmap".

hash(key) = 0010 10010 01011 00101 01001 10110 11101
(64 bits)   10101 11001 01001 01101 10111 11010
                                          ^^^^^
                                            26

bitmap    = 00100101 00000010 00010100 01100001
(level 0)        ^
                 bit 26 is 1, so there is a child node or leaf

leafmap   = 00000100 00000010 00010000 01100000
(level 0)        ^
                 bit 26 is 1, so the child is a key-value entry

The children array is compact and its length is equal to the number of 1s in
the bitmap. In the bitmap above, our bit is the 2nd 1-bit from the left, so
we look at the 2nd child.

children  = +--------------------+
            | bitmaps | children |
            | key     | value    | <-- The 2nd child is a key-value entry. If
            | bitmaps | children |     the key matches our key, we have found
            | ...     | ...      |     our entry. Otherwise, the dict doesn't
            +--------------------+     contain our key.

If, instead, our child is a subnode, we use the next 5 bits from the hash to
index into the next level of children.
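
To make the lookup step concrete, here is a minimal sketch in C of how one level could be traversed. The struct layout, field names and the helper function are illustrative assumptions, not the actual code in this PR:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical node layout: two bitmaps followed by a compact array of
 * children. Each child is either a pointer to a subnode or a key-value
 * entry; the real node packs entries inline, this sketch simplifies. */
typedef struct hamtNode {
    uint32_t bitmap;    /* which of the 32 slots exist */
    uint32_t leafmap;   /* which existing slots are key-value entries */
    void *children[];   /* compact: one element per 1-bit in bitmap */
} hamtNode;

/* Traverse one level: 'shift' selects which 5 bits of the hash to use.
 * Returns the child (subnode or entry) or NULL if the slot is empty. */
static void *hamtLookupLevel(hamtNode *n, uint64_t hash, int shift,
                             bool *is_leaf) {
    unsigned idx = (hash >> shift) & 0x1f;   /* 5-bit index, 0..31 */
    uint32_t bit = 1u << idx;
    if (!(n->bitmap & bit)) return NULL;     /* no child here: key absent */
    /* Position in the compact children array = number of 1-bits below our
     * bit (popcount). The diagram above counts 1-bits from the other end;
     * either convention works as long as it's used consistently. */
    unsigned pos = __builtin_popcount(n->bitmap & (bit - 1));
    *is_leaf = (n->leafmap & bit) != 0;
    return n->children[pos];
}

The caller compares keys when *is_leaf is set, or recurses into the returned subnode with shift increased by 5 otherwise.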

The classic dict has this structure:

+------+
| dict |
|------|    +-------+     +-------+       +----------+
| ht[0]---->| slot ------>| key   |   ,-->| key      |
| ...  |    | NULL  |     | value |  /    | value    |
+------+    | slot  |     | next----'     | next=NULL|
            | ...   |     +-------+       +----------+
            +-------+

Memory comparison: The classic dict has an entry pointer for every slot and a
'next' pointer in every entry. The HAMT stores the entries within the nodes, so
the only overhead is the subnode pointers and the bitmaps for each subnode. The
number of subnodes is small compared to the number of keys (each subnode holds
up to 32 children), and the depth of the tree is logarithmic in the number of
keys.

Possible improvements

To increase speed for large HAMTs, a possible optimization is to resize the root
table. This can be done incrementally and without rehashing the keys, so it's
much faster than resizing a regular hash table. It's explained in the paper.
It's not implemented (yet) and the performance comparison above was done without
this optimization.

@kukey (Contributor) commented Apr 30, 2021

Cool!

@yossigo (Collaborator) commented May 2, 2021

@zuiderkwast This is really great work!
I think we need to think and come up with a plan of what it takes to get it ready to ship in Redis 7, as it's a non-trivial change naturally. The immediate things I can consider before giving that any real thought are these:

  • Consider/test impact on memory fragmentation and maybe CoW
  • Run some more real-world benchmarks to demonstrate/measure where HAMT shines and where it fails
  • Do we necessarily want it as an all-in replacement for dict, or have them co-exist? The benefits for the large global dicts are clear, but what about hash keys or other specific use cases like the command table?

@zuiderkwast (Contributor, Author)

> @zuiderkwast This is really great work!

@yossigo I'm happy you like the idea.

> I think we need to think and come up with a plan of what it takes to get it ready to ship in Redis 7, as it's a non-trivial change naturally. The immediate things I can consider before giving that any real thought are these:
>
> • Consider/test impact on memory fragmentation and maybe CoW

The defrag code should be updated to defrag/reallocate the HAMT nodes. I consider it a TODO. The impact on copy-on-write should be just positive. It would be good to be able to test and compare though, of course.

> • Run some more real-world benchmarks to demonstrate/measure where HAMT shines and where it fails

Yes, I'd need help with that. Do you have any realistic test data to run such tests?

> • Do we necessarily want it as an all-in replacement for dict, or have them co-exist? The benefits for the large global dicts are clear, but what about hash keys or other specific use cases like the command table?

I made a drop-in replacement because it was easier (a smaller change) than breaking out the key space dict (and the expire dict) from the other ones. The alternative would involve much more work and more changes AFAICT. Large hashes probably benefit from a HAMT just like the main key space (while small hashes use ziplist/listpack) and for the remaining dicts (command table, etc.), I guess it doesn't really matter if they're changed or not, but the total complexity would be reduced if we can get rid of the old dict altogether. Any pros and cons should be considered though.

The main blocker right now IMO is the failing builds/tests. Possibly, there's a Heisenbug. They fail at seemingly random test cases and I don't know why. Redis nodes hang and need to be killed by the test runner.

One more TODO is to handle the (unlikely) scenario of 64-bit hash collisions. It can be handled with an array of the duplicates on the last level (or alternatively by using a second hash function when we run out of bits). A rough sketch of the first option is shown below.
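
A rough sketch of that first option, with hypothetical names and layout (not code from this PR): the deepest level holds a compact array of all entries whose full 64-bit hashes collided, searched linearly on lookup.

#include <stdint.h>

/* Hypothetical collision node for the deepest level: entries whose 64-bit
 * hashes are identical are stored in a small linear array. */
typedef struct hamtCollisionNode {
    uint32_t count;            /* number of colliding entries */
    struct {
        void *key;
        void *val;
    } entries[];               /* expected to stay very short */
} hamtCollisionNode;

/* Lookup scans the array using a key-compare callback (returns 1 on match). */
static void *collisionLookup(hamtCollisionNode *c, const void *key,
                             int (*keyCompare)(const void *, const void *)) {
    for (uint32_t i = 0; i < c->count; i++) {
        if (keyCompare(c->entries[i].key, key)) return c->entries[i].val;
    }
    return NULL;
}

Since full 64-bit collisions should be extremely rare, the cost of the linear scan is negligible.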

@yossigo (Collaborator) commented May 3, 2021

> The defrag code should be updated to defrag/reallocate the HAMT nodes. I consider it a TODO. The impact on copy-on-write should be just positive. It would be good to be able to test and compare though, of course.

I was actually referring to the impact on fragmentation itself, due to more allocations and deallocations per key operation. Anyway yes, it's really down to testing.

> I made a drop-in replacement because it was easier (a smaller change) than breaking out the key space dict (and the expire dict) from the other ones. The alternative would involve much more work and more changes AFAICT. Large hashes probably benefit from a HAMT just like the main key space (while small hashes use ziplist/listpack) and for the remaining dicts (command table, etc.), I guess it doesn't really matter if they're changed or not, but the total complexity would be reduced if we can get rid of the old dict altogether. Any pros and cons should be considered though.

I agree with this reasoning as a way to get to a working HAMT dict with as little friction as possible, and I'd probably take the same approach. But once it's there and stabilized and we need to consider the next step, I'm not sure the simplicity of having just a single dict will outweigh the potential drawbacks. This will take some analysis, I believe.

> The main blocker right now IMO is the failing builds/tests. Possibly, there's a Heisenbug. They fail at seemingly random test cases and I don't know why. Redis nodes hang and need to be killed by the test runner.

Did you try to kill them with SIGSEGV and look at the stack trace? Also, did you try to fuzz the new implementation at a low level (i.e. directly calling dict*)?

@romange commented May 4, 2021

@zuiderkwast Would you like to add your implementation to https://github.com/romange/hash-table-shootout/tree/simple ?
I added redis-dict there; you can see it under the "redis" directory. The wrapper code is https://github.com/romange/hash-table-shootout/blob/simple/src/redis_dict.cc

Once yours is added, I will be able to share an extensive benchmark report showing how well HAMT performs compared to vanilla redis-dict across dozens of use cases.

@zuiderkwast (Contributor, Author)

@romange Nice, can you show some examples of the stats you can produce? When I add it, I'd need to rename all the functions so they don't clash with the original dict functions you added, right? Anyway, I want to make it work properly (Redis tests pass) before focusing on benchmarking further.

@yossigo I've tried killing it with SIGSEGV and got stack traces at seemingly random locations. I've tried running the tests with gdb attached and looked at the stack traces at random times, also without finding anything useful. I've also tried compiling with address sanitizer and other sanitizers.

Fragmentation may be higher than before and the paper actually recommends using active defrag, so it's nice that we have the defrag framework in place already.

@romange commented May 4, 2021

@zuiderkwast you can read the article that uses this framework to benchmark other hash tables:
https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
Btw, you may also find this repo useful for finding bugs in your code.

You do not have to rename the symbols, because each hash table is built as a separate binary.

@JimB123 (Contributor) left a comment

Left a few review comments.

Regarding the performance analysis, I would note the following:

  • This structure is likely superior in memory usage, as the (mostly unused) next pointer is removed, and the base hash table's load factor was 0.5-1.0, which meant that, on average, half a pointer (4 bytes) was wasted per entry. The new structure essentially replaces this with 8 bytes of bitmaps, which are shared across 32 possible entries.
  • On a test of insertions into a fresh map, I would expect this structure to outperform the existing dict as the hash table will undergo several rounds of rehashing as the table expands. However, once the hash table reaches steady-state, I would not expect this to outperform the existing dict. The measurement for 10,000,000 insertions is suspect for this reason.
  • I don't understand why the measurements show HAMT as being faster in several scenarios.
    • The existing dict performs a hashing operation, and then dereferences 3 pointers to reach the dictEntry. Finally the key is directly compared. (We could reduce this to 2 dereferences by simply combining dictht into dict.)
    • The HAMT will perform the same hashing operation and then dereference a pointer for each level in the tree. For 10M items, this requires 5 dereferences (32^5 = 33M). Finally, the key is directly compared.
    • Possibly the difference is due to some cases where there is a hash collision in the existing dict. This would require traversal to the next item and a second string comparison against the 2nd key. Note that with a load factor of 0.5-1.0, we don't expect many hash collisions, but there will be some measurable percentage of buckets with more than 1 key.

Do we have any additional understanding or intuition on this?

Have we done any representative memory analysis? My intuition says that the HAMT will use less memory; however, I have no sense of how much in realistic scenarios.

PROS:

  • more memory efficient
  • eliminates sudden memory allocations for rehashing
  • eliminates CPU overhead for rehashing
  • eliminates some code complexity due to rehashing
  • "miss" lookups might be a little faster
  • iteration is cleaner/simpler

CONS:

  • slightly slower overall for most typical operations
  • code is more complex (to understand and maintain) than the existing dict
  • returning dictEntry* is risky since object locations may change

NOTE: The (backwards compatible) API exposes a dictEntry pointer in several cases. However, with the HAMT, any insertion or deletion has the potential to invalidate existing dictEntry pointers (by moving the physical values). Since only some dictEntrys might be affected, this could create difficult-to-debug problems when code holds a dictEntry* while making additional modifications, or calls functions or recursive actions which make such modifications.
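
A hypothetical illustration of that hazard (not code from this PR), using the existing dict API:

/* Assumes dict.h. With the chaining dict, 'de' stays valid across
 * unrelated inserts; with the HAMT, the insert below may move entries
 * inside a node and leave 'de' dangling. */
void riskyPattern(dict *d, void *keyA, void *keyB, void *valB, void *newVal) {
    dictEntry *de = dictFind(d, keyA);   /* pointer into a HAMT node */
    dictAdd(d, keyB, valB);              /* may reallocate/split that node */
    if (de) dictSetVal(d, de, newVal);   /* 'de' may now be invalid */
}

With the current chaining dict this pattern is safe, which is why it may be hiding in existing callers.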

src/dict.h Outdated
Comment on lines 72 to 85
Contributor

This is a perfect time to start improving design practices by doing a better job at data hiding through the use of opaque data structures. Is there any reason that the internal node structures need to be exposed in the .h file? Can these be moved into the .c file? (Maybe only leaving a forward declaration in the .h if needed.)

Contributor Author

Yes, agreed. I'll move as much as possible into the .c file after more urgent issues have been solved. Since exposing dictEntry pointers is risky, as you say, it would be good to change the API to avoid exposing it and make dictEntry opaque too. Not sure whether that's feasible.

src/dict.c Outdated
Contributor

Please remove this cruft from the .c file. It makes it harder to review the new code.

If there are functions (like dictDisableResize) which need to be cleaned from the old code, take care of that at the end of the .h file with either a #define or a static inline.

Contributor Author

I kept this for now, since there is a possibility to resize the root table (see the paper). It adds complexity but may improve speed. Not sure we need it. If not, I'll delete this for sure.

src/dict.c Outdated
Contributor

As mentioned above, this would be easier to read if the function was totally cleaned from the .c file.

In the .h file, just do something like:

#define dictRehash(d, n) 0

Contributor Author

Yes, I'll even remove all calls to it.

src/dict.c Outdated
Comment on lines +331 to +303
Contributor

Can't forget to implement this. Also, at this point, there are 4 bits of the original hash still unused.

Contributor Author

True. I didn't do this exactly right yet. There are 12 levels of 5 bits each and one level of 4 bits. Then the next level will be a flat sequence of collisions.

src/dict.c Outdated
Contributor

Probably more readable as a for loop:

for (; level >= 0;  level--) {

src/dict.c Outdated
Comment on lines +830 to +800
Contributor

Can we (should we?) add the additional hash bits to the cursor value? Could the cursor maintain the full (extended) hash of the last value returned?

@zuiderkwast (Contributor, Author) May 15, 2021

There can be 64-bit hash collisions, for which storing the duplicates is not yet implemented. (The idea is to just store all the entries in level 13 (is it?) in a compact array and use one of the bitmaps for its length.)

Contributor

OK - so you're planning to diverge from the HAMT strategy of generating a 2nd hash to add more bits.

In this case, for iteration, maybe consider keeping the key and/or index and/or next-key pointer in the iterator. Then, when the iterator hits such a list you scan until you hit the key you last iterated on. This would avoid the weird pattern where an iterator can return multiple items in this rare case. It's simpler for users if the iterator only returns 1 item each time.

If you're worried about the values changing/moving during iteration, maybe you could clone the list of keys into the iterator at this point, then, for each key, check that it still exists in the map before returning it.

Contributor Author

Good ideas. Yes, using a second hash function was the initial idea, but I think just storing the collisions is fine and simpler. I don't have a strong preference though.

I don't think the scan cursor should be more than 64 bits (maybe only 32 bits for 32-bit builds?) so SCAN would need to sometimes return more than the requested number of elements. An iterator should of course only return one element each time though, so it would need to store the duplicate index or a clone of the duplicates list in the iterator, yes.

@zuiderkwast (Contributor, Author)

Thanks @JimB123 for review and analysis! I agree with your points and I haven't really done any more analysis than this. For smaller test runs I get similar results, but it's worth doing more.

I think the root level will always be in the CPU cache, so the root-level lookup is likely faster than the other levels. Perhaps the overall lower memory usage allows more of the table to be in the cache, which might explain why it's faster in some cases.

Thanks for pointing out that returned dictEntry pointers become invalid whenever anything is added or deleted. I've tried (earlier) to check all code which calls the dict to verify that it's safe, but perhaps refactoring the API to not expose dictEntry at all would be safer.

src/dict.c Outdated
Contributor

Could we make the root always be a subnode, so that we can remove the special case d->size == 1?

Contributor Author

I guess it's possible, but then we need some other special cases in other parts of the code, I think...

Contributor

I feel there would be fewer special cases. Maybe I'm wrong.

Contributor Author

Thanks for the suggestion. I will look into it, but first I want to make the build pass. :-)

Contributor

This PR is difficult; it is taking me some time to understand it.

src/dict.c Outdated
Contributor

#include <stdarg.h>
#include <limits.h>

These 2 header files are not used.

@CaixinGong commented Aug 13, 2021

The above tests lack the memory usage of the Redis dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?

@zuiderkwast (Contributor, Author)

> The above tests lack the memory usage of the Redis dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?

@CaixinGong I need help. How do you suggest doing this test?

@CaixinGong commented Aug 13, 2021

>> The above tests lack the memory usage of the Redis dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?

> @CaixinGong I need help. How do you suggest doing this test?

The default Redis benchmark, redis-benchmark, may help for a simple test.
For a more complicated test, you can use the YCSB tool: https://help.aliyun.com/document_detail/185197.html?spm=a2c4g.11186623.6.888.5c0e5fe1BEEbtE

@zuiderkwast (Contributor, Author)

I've rebased and squashed the commits. It's not ready to merge, but I've performed a ...

Memory usage test

Redis was started with the default config and an empty db, then populated with 1M keys
using debug populate 1000000. Then info memory.

Branch     used_memory   used_memory_dataset    used_memory_dataset_perc
unstable   81237016      32020368               39.81%
dict-hamt  70521704      37708784 (incorrect)   54.08% (incorrect)
Diff       10715312 (13% save)

Note: The dataset memory usage reported for the HAMT by dictEstimateMem() is incorrect. That's new code that I haven't focused on yet.


@zuiderkwast deleted the dict-hamt branch August 27, 2024 06:49

Development

Successfully merging this pull request may close these issues.

[NEW] Replace dict's hash table with HAMT

8 participants