
Conversation

@zuiderkwast (Contributor) commented Mar 25, 2021

A drop-in replacement for the hash tables in dict, affecting the key space and all other uses of the dict. Fixes #8698.

Status: It's working-ish. The dict tests are passing, but the Redis test suites are not. Most tests pass occasionally, but Redis seems to hang at various points until it's killed by the test runner. Debugging help is most welcome.

The dict in hiredis gets its function names prefixed with hi_ to avoid collisions. This needs to be done upstream in hiredis...

Performance comparison

Comparison using ./redis-server test dict 10000000 (compiled using make CFLAGS=-DREDIS_TEST) on my laptop (64-bit Linux, 32G RAM).

Classic dict:

Inserting: 10000000 items in 4746 ms
Linear access of existing elements: 10000000 items in 2474 ms
Linear access of existing elements (2nd round): 10000000 items in 2459 ms
Random access of existing elements: 10000000 items in 3726 ms
Accessing random keys: 10000000 items in 1287 ms
Accessing missing: 10000000 items in 2638 ms
Removing and adding: 10000000 items in 4516 ms

HAMT:

Inserting: 10000000 items in 3253 ms
Linear access of existing elements: 10000000 items in 2793 ms
Linear access of existing elements (2nd round): 10000000 items in 2805 ms
Random access of existing elements: 10000000 items in 3676 ms
Accessing random keys: 10000000 items in 1103 ms
Accessing missing: 10000000 items in 2240 ms
Removing and adding: 10000000 items in 6467 ms

Structure

+--------------------+
| dict               |
|--------------------|   +--------------------+   +--------------------+
| bitmaps | children---->| bitmaps | children---->| key     | value    |
| size               |   | key     | value    |   | ...     | ...      |
| dictType           |   | key     | value    |   +--------------------+
| privdata           |   | key     | value    |
+--------------------+   | key     | value    |   +--------------------+
                         | bitmaps | children---->| key     | value    |
                         | key     | value    |   | bitmaps | children--->..
                         | ...     | ...      |   | value   | key      |
                         +--------------------+   | ...     | ...      |
                                                  +--------------------+

At each level, 5 bits of the hash are used as an index into the children. A
bitmap indicates which of the children exist (1) and which don't (0). Whether
the child is a leaf or a subnode is indicated by a second bitmap, the "leafmap".

hash(key) = 0010 10010 01011 00101 01001 10110 11101
(64 bits)   10101 11001 01001 01101 10111 11010
                                          ^^^^^
                                            26

bitmap    = 00100101 00000010 00010100 01100001
(level 0)        ^
                 bit 26 is 1, so there is a child node or leaf

leafmap   = 00000100 00000010 00010000 01100000
(level 0)        ^
                 bit 26 is 1, so the child is a key-value entry

The children array is compact and its length is equal to the number of 1s in
the bitmap. In the bitmap above, our bit is the 2nd 1-bit from the left, so
we look at the 2nd child.

children  = +--------------------+
            | bitmaps | children |
            | key     | value    | <-- The 2nd child is a key-value entry. If
            | bitmaps | children |     the key matches our key, we have found
            | ...     | ...      |     our entry. Otherwise, the dict doesn't
            +--------------------+     contain our key.

If, instead, our child is a subnode, we use the next 5 bits from the hash to
index into the next level of children.
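
To make the lookup step concrete, here is a minimal sketch in C of how one level could be traversed. The struct layout, field names and the helper function are illustrative assumptions, not the actual code in this PR:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical node layout: two bitmaps followed by a compact array of
 * children. Each child is either a pointer to a subnode or a key-value
 * entry; the real node packs entries inline, this sketch simplifies. */
typedef struct hamtNode {
    uint32_t bitmap;    /* which of the 32 slots exist */
    uint32_t leafmap;   /* which existing slots are key-value entries */
    void *children[];   /* compact: one element per 1-bit in bitmap */
} hamtNode;

/* Traverse one level: 'shift' selects which 5 bits of the hash to use.
 * Returns the child (subnode or entry) or NULL if the slot is empty. */
static void *hamtLookupLevel(hamtNode *n, uint64_t hash, int shift,
                             bool *is_leaf) {
    unsigned idx = (hash >> shift) & 0x1f;   /* 5-bit index, 0..31 */
    uint32_t bit = 1u << idx;
    if (!(n->bitmap & bit)) return NULL;     /* no child here: key absent */
    /* Position in the compact children array = number of 1-bits below our
     * bit (popcount). The diagram above counts 1-bits from the other end;
     * either convention works as long as it's used consistently. */
    unsigned pos = __builtin_popcount(n->bitmap & (bit - 1));
    *is_leaf = (n->leafmap & bit) != 0;
    return n->children[pos];
}

The caller compares keys when *is_leaf is set, or recurses into the returned subnode with shift increased by 5 otherwise.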

The classic dict has this structure:

+------+
| dict |
|------|    +-------+     +-------+       +----------+
| ht[0]---->| slot ------>| key   |   ,-->| key      |
| ...  |    | NULL  |     | value |  /    | value    |
+------+    | slot  |     | next----'     | next=NULL|
            | ...   |     +-------+       +----------+
            +-------+

Memory comparison: The classic dict has an entry pointer for every slot and a
'next' pointer in every entry. The HAMT stores the entries within the nodes, so
the only overhead is the subnode pointers and the bitmaps for each subnode. The
number of subnodes is small compared to the number of keys (each subnode holds
up to 32 children), and the depth of the tree is logarithmic in the number of
keys.

Possible improvements

To increase speed for large HAMTs, a possible optimization is to resize the root
table. This can be done incrementally and without rehashing the keys, so it's
much faster than resizing a regular hash table. It's explained in the paper.
It's not implemented (yet) and the performance comparison above was done without
this optimization.

@kukey (Contributor) commented Apr 30, 2021

Cool!

@yossigo (Collaborator) commented May 2, 2021

@zuiderkwast This is really great work!
I think we need to think and come up with a plan of what it takes to get it ready to ship in Redis 7, as it's a non-trivial change naturally. The immediate things I can consider before giving that any real thought are these:

  • Consider/test impact on memory fragmentation and maybe CoW
  • Run some more real-world benchmarks to demonstrate/measure where HAMT shines and where it fails
  • Do we necessarily want it as an all-in replacement for dict, or have them co-exist? The benefits for the large global dicts are clear, but what about hash keys or other specific use cases like the command table?

@zuiderkwast (Contributor, Author)

> @zuiderkwast This is really great work!

@yossigo I'm happy you like the idea.

> I think we need to think and come up with a plan of what it takes to get it ready to ship in Redis 7, as it's a non-trivial change naturally. The immediate things I can consider before giving that any real thought are these:
>
> • Consider/test impact on memory fragmentation and maybe CoW

The defrag code should be updated to defrag/reallocate the HAMT nodes. I consider it a TODO. The impact on copy-on-write should be just positive. It would be good to be able to test and compare though, of course.

> • Run some more real-world benchmarks to demonstrate/measure where HAMT shines and where it fails

Yes, I'd need help with that. Do you have any realistic test data to run such tests?

> • Do we necessarily want it as an all-in replacement for dict, or have them co-exist? The benefits for the large global dicts are clear, but what about hash keys or other specific use cases like the command table?

I made a drop-in replacement because it was easier (a smaller change) than breaking out the key space dict (and the expire dict) from the other ones. The alternative would involve much more work and more changes AFAICT. Large hashes probably benefit from a HAMT just like the main key space (while small hashes use ziplist/listpack) and for the remaining dicts (command table, etc.), I guess it doesn't really matter if they're changed or not, but the total complexity would be reduced if we can get rid of the old dict altogether. Any pros and cons should be considered though.

The main blocker right now IMO is the failing builds/tests. Possibly, there's a Heisenbug. They fail at seemingly random test cases and I don't know why. Redis nodes hang and need to be killed by the test runner.

One more TODO is to handle the (unlikely) scenario of 64-bit hash collisions. It can be handled with an array of the duplicates on the last level (or alternatively by using a second hash function when we run out of bits). A rough sketch of the first option is shown below.
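
A rough sketch of that first option, with hypothetical names and layout (not code from this PR): the deepest level holds a compact array of all entries whose full 64-bit hashes collided, searched linearly on lookup.

#include <stdint.h>

/* Hypothetical collision node for the deepest level: entries whose 64-bit
 * hashes are identical are stored in a small linear array. */
typedef struct hamtCollisionNode {
    uint32_t count;            /* number of colliding entries */
    struct {
        void *key;
        void *val;
    } entries[];               /* expected to stay very short */
} hamtCollisionNode;

/* Lookup scans the array using a key-compare callback (returns 1 on match). */
static void *collisionLookup(hamtCollisionNode *c, const void *key,
                             int (*keyCompare)(const void *, const void *)) {
    for (uint32_t i = 0; i < c->count; i++) {
        if (keyCompare(c->entries[i].key, key)) return c->entries[i].val;
    }
    return NULL;
}

Since full 64-bit collisions should be extremely rare, the cost of the linear scan is negligible.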

@yossigo (Collaborator) commented May 3, 2021

> The defrag code should be updated to defrag/reallocate the HAMT nodes. I consider it a TODO. The impact on copy-on-write should be just positive. It would be good to be able to test and compare though, of course.

I was actually referring to the impact on fragmentation itself, due to more allocations and deallocations per key operation. Anyway yes, it's really down to testing.

> I made a drop-in replacement because it was easier (a smaller change) than breaking out the key space dict (and the expire dict) from the other ones. The alternative would involve much more work and more changes AFAICT. Large hashes probably benefit from a HAMT just like the main key space (while small hashes use ziplist/listpack) and for the remaining dicts (command table, etc.), I guess it doesn't really matter if they're changed or not, but the total complexity would be reduced if we can get rid of the old dict altogether. Any pros and cons should be considered though.

I agree with this reasoning as a way to get to a working HAMT dict with as little friction as possible, and I'd probably take the same approach. But once it's there and stabilized and we need to consider the next step, I'm not sure the simplicity of having just a single dict will outweigh the potential drawbacks. This will take some analysis, I believe.

> The main blocker right now IMO is the failing builds/tests. Possibly, there's a Heisenbug. They fail at seemingly random test cases and I don't know why. Redis nodes hang and need to be killed by the test runner.

Did you try to kill them with SIGSEGV and look at the stack trace? Also, did you try to fuzz the new implementation at a low level (i.e. directly calling dict*)?

@romange commented May 4, 2021

@zuiderkwast Would you like to add your implementation to https://github.com/romange/hash-table-shootout/tree/simple ?
I added redis-dict there; you can see it under the "redis" directory. The wrapper code is https://github.com/romange/hash-table-shootout/blob/simple/src/redis_dict.cc

Once yours is added, I will be able to share an extensive benchmark report showing how well HAMT performs compared to vanilla redis-dict across dozens of use cases.

@zuiderkwast (Contributor, Author)

@romange Nice, can you show some examples of the stats you can produce? When I add it, I'd need to rename all the functions so they don't clash with the original dict functions you added, right? Anyway, I want to make it work properly (Redis tests pass) before focusing on benchmarking further.

@yossigo I've tried killing it with SIGSEGV and got stack traces at seemingly random locations. I've tried running the tests with gdb attached and looked at the stack traces at random times, also without finding anything useful. I've also tried compiling with address sanitizer and other sanitizers.

Fragmentation may be higher than before and the paper actually recommends using active defrag, so it's nice that we have the defrag framework in place already.

@romange commented May 4, 2021

@zuiderkwast you can read the article that uses this framework to benchmark other hash tables:
https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
Btw, you may also find this repo useful for finding bugs in your code.

You do not have to rename the symbols, because each hash table is built as a separate binary.

@JimB123 (Contributor) left a comment

Left a few review comments.

Regarding the performance analysis, I would note the following:

  • This structure is likely superior in memory usage, as the (mostly unused) next pointer is removed, and the base hash table's load factor was 0.5-1.0, which meant that, on average, half a pointer (4 bytes) was wasted per entry. The new structure essentially replaces this with 8 bytes of bitmaps, which are shared across 32 possible entries.
  • On a test of insertions into a fresh map, I would expect this structure to outperform the existing dict as the hash table will undergo several rounds of rehashing as the table expands. However, once the hash table reaches steady-state, I would not expect this to outperform the existing dict. The measurement for 10,000,000 insertions is suspect for this reason.
  • I don't understand why the measurements show HAMT as being faster in several scenarios.
    • The existing dict performs a hashing operation, and then dereferences 3 pointers to reach the dictEntry. Finally the key is directly compared. (We could reduce this to 2 dereferences by simply combining dictht into dict.)
    • The HAMT will perform the same hashing operation and then dereference a pointer for each level in the tree. For 10M items, this requires 5 dereferences (32^5 = 33M). Finally, the key is directly compared.
    • Possibly the difference is due to some cases where there is a hash collision in the existing dict. This would require traversal to the next item and a second string comparison against the 2nd key. Note that with a load factor of 0.5-1.0, we don't expect many hash collisions, but there will be some measurable percentage of buckets with more than 1 key.

Do we have any additional understanding or intuition on this?

Have we done any representative memory analysis? My intuition says that the HAMT will use less memory; however, I have no sense of how much in realistic scenarios.

PROS:

  • more memory efficient
  • eliminates sudden memory allocations for rehashing
  • eliminates CPU overhead for rehashing
  • eliminates some code complexity due to rehashing
  • "miss" lookups might be a little faster
  • iteration is cleaner/simpler

CONS:

  • slightly slower overall for most typical operations
  • code is more complex (to understand and maintain) than the existing dict
  • returning dictEntry* is risky since object locations may change

NOTE: The (backwards compatible) API exposes a dictEntry pointer in several cases. However, with the HAMT, any insertion or deletion has the potential to invalidate existing dictEntry pointers (by moving the physical values). Since only some dictEntrys might be affected, this could create difficult-to-debug problems when code holds a dictEntry* while making additional modifications, or calls functions or recursive actions which make such modifications.
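
A hypothetical illustration of that hazard (not code from this PR), using the existing dict API:

/* Assumes dict.h. With the chaining dict, 'de' stays valid across
 * unrelated inserts; with the HAMT, the insert below may move entries
 * inside a node and leave 'de' dangling. */
void riskyPattern(dict *d, void *keyA, void *keyB, void *valB, void *newVal) {
    dictEntry *de = dictFind(d, keyA);   /* pointer into a HAMT node */
    dictAdd(d, keyB, valB);              /* may reallocate/split that node */
    if (de) dictSetVal(d, de, newVal);   /* 'de' may now be invalid */
}

With the current chaining dict this pattern is safe, which is why it may be hiding in existing callers.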

src/dict.h Outdated
Comment on lines 72 to 85
Contributor

This is a perfect time to start improving design practices by doing a better job at data hiding through the use of opaque data structures. Is there any reason that the internal node structures need to be exposed in the .h file? Can these be moved into the .c file? (Maybe only leaving a forward declaration in the .h if needed.)

Contributor Author

Yes, agreed. I'll move as much as possible into the .c file after more urgent issues have been solved. Since exposing dictEntry pointers is risky, as you say, it would be good to change the API to avoid exposing it and make dictEntry opaque too. Not sure whether that's feasible.

src/dict.c Outdated
Contributor

Please remove this cruft from the .c file. It makes it harder to review the new code.

If there are functions (like dictDisableResize) which need to be cleaned from the old code, take care of that at the end of the .h file with either a #define or a static inline.

Contributor Author

I kept this for now, since there is a possibility to resize the root table (see the paper). It adds complexity but may improve speed. Not sure we need it. If not, I'll delete this for sure.

src/dict.c Outdated
Contributor

As mentioned above, this would be easier to read if the function was totally cleaned from the .c file.

In the .h file, just do something like:

#define dictRehash(d, n) 0

Contributor Author

Yes, I'll even remove all calls to it.

src/dict.c Outdated
Comment on lines +331 to +303
Contributor

Can't forget to implement this. Also, at this point, there are 4 bits of the original hash still unused.

Contributor Author

True. I didn't do this exactly right yet. There are 12 levels of 5 bits each and one level of 4 bits. Then the next level will be a flat sequence of collisions.

src/dict.c Outdated
Contributor

Probably more readable as a for loop:

for (; level >= 0;  level--) {

src/dict.c Outdated
Comment on lines +830 to +800
Contributor

Can we (should we?) add the additional hash bits to the cursor value? Could the cursor maintain the full (extended) hash of the last value returned?

@zuiderkwast (Contributor, Author) May 15, 2021

There can be 64-bit hash collisions, for which storing the duplicates is not yet implemented. (The idea is to just store all the entries in level 13 (is it?) in a compact array and use one of the bitmaps for its length.)

Contributor

OK - so you're planning to diverge from the HAMT strategy of generating a 2nd hash to add more bits.

In this case, for iteration, maybe consider keeping the key and/or index and/or next-key pointer in the iterator. Then, when the iterator hits such a list you scan until you hit the key you last iterated on. This would avoid the weird pattern where an iterator can return multiple items in this rare case. It's simpler for users if the iterator only returns 1 item each time.

If you're worried about the values changing/moving during iteration, maybe you could clone the list of keys into the iterator at this point, then, for each key, check that it still exists in the map before returning it.

Contributor Author

Good ideas. Yes, using a second hash function was the initial idea, but I think just storing the collisions is fine and simpler. I don't have a strong preference though.

I don't think the scan cursor should be more than 64 bits (maybe only 32 bits for 32-bit builds?) so SCAN would need to sometimes return more than the requested number of elements. An iterator should of course only return one element each time though, so it would need to store the duplicate index or a clone of the duplicates list in the iterator, yes.

@zuiderkwast (Contributor, Author)

Thanks @JimB123 for review and analysis! I agree with your points and I haven't really done any more analysis than this. For smaller test runs I get similar results, but it's worth doing more.

I think the root level will always be in the CPU cache, so the root-level lookup is likely faster than the other levels. Perhaps the overall lower memory usage allows more of the table to be in the cache, which might explain why it's faster in some cases.

Thanks for pointing out that returned dictEntry pointers become invalid whenever anything is added or deleted. I've tried (earlier) to check all code which calls the dict to verify that it's safe, but perhaps refactoring the API to not expose dictEntry at all would be safer.

src/dict.c Outdated
Contributor

Could we make the root always be a subnode, so that we can remove the special case d->size == 1?

Contributor Author

I guess it's possible, but then we need some other special cases in other parts of the code, I think...

Contributor

I feel there would be fewer special cases. Maybe I'm wrong.

Contributor Author

Thanks for the suggestion. I will look into it, but first I want to make the build pass. :-)

Contributor

This PR is difficult; it is taking me some time to understand it.

src/dict.c Outdated
Contributor

#include <stdarg.h>
#include <limits.h>

These 2 header files are not used.

@CaixinGong commented Aug 13, 2021

The above tests lack the memory usage of the Redis dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?

@zuiderkwast (Contributor, Author)

> The above tests lack the memory usage of the Redis dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?

@CaixinGong I need help. How do you suggest doing this test?

@CaixinGong commented Aug 13, 2021

>> The above tests lack the memory usage of the Redis dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?

> @CaixinGong I need help. How do you suggest doing this test?

The default Redis benchmark, redis-benchmark, may help for a simple test.
For a more complicated test, you can use the YCSB tool: https://help.aliyun.com/document_detail/185197.html?spm=a2c4g.11186623.6.888.5c0e5fe1BEEbtE

@zuiderkwast (Contributor, Author)

I've rebased and squashed the commits. It's not ready to merge, but I've performed a ...

Memory usage test

Redis was started with the default config and an empty db, then populated with 1M keys
using debug populate 1000000. Then info memory.

Branch     used_memory   used_memory_dataset    used_memory_dataset_perc
unstable   81237016      32020368               39.81%
dict-hamt  70521704      37708784 (incorrect)   54.08% (incorrect)
Diff       10715312 (13% save)

Note: The dataset memory usage reported for the HAMT by dictEstimateMem() is incorrect. That's new code that I haven't focused on yet.


@zuiderkwast deleted the dict-hamt branch August 27, 2024 06:49

Development

Successfully merging this pull request may close these issues.

[NEW] Replace dict's hash table with HAMT

8 participants