Replace dict's hash table with HAMT #8700
Conversation
Cool!
@zuiderkwast This is really great work!
@yossigo I'm happy you like the idea.
The defrag code should be updated to defrag/reallocate the HAMT nodes. I consider it a TODO. The impact on copy-on-write should be just positive. It would be good to be able to test and compare though, of course.
Yes, I'd need help with that. Do you have any realistic test data to run such tests?
I made a drop-in replacement because it was easier (a smaller change) than breaking out the key space dict (and the expire dict) from the other ones. The alternative would involve much more work and more changes AFAICT. Large hashes probably benefit from a HAMT just like the main key space (while small hashes use ziplist/listpack), and for the remaining dicts (command table, etc.) I guess it doesn't really matter if they're changed or not, but the total complexity would be reduced if we can get rid of the old dict altogether. Any pros and cons should be considered though.

The main blocker right now IMO is the failing builds/tests. Possibly there's a Heisenbug. They fail at seemingly random test cases and I don't know why. Redis nodes hang and need to be killed by the test runner.

One more TODO is to handle the (unlikely) scenario of 64-bit hash collisions. It can be handled with an array of the duplicates on the last level (or alternatively by using a second hash function when we run out of bits).
I was actually referring to the impact on fragmentation itself, due to more allocations and deallocations per key operation. Anyway yes, it's really down to testing.
I agree with this reasoning as a way to get to a working HAMT dict with as little friction as possible, and I'd probably take the same approach. But once it's there and stabilized and we need to consider the next step, I'm not sure if the simplicity of having just a single dict will surpass potential drawbacks. This will take some analysis I believe.
Did you try to kill them with SIGSEGV to get a stack trace?
@zuiderkwast Would you like to add your implementation to https://github.com/romange/hash-table-shootout/tree/simple ? Once yours is added, I will be able to share an extensive benchmark report showing how well HAMT performs compared to the vanilla redis-dict on dozens of use cases.
@romange Nice, can you show some examples of the stats you can produce? When I add it, I'd need to rename all the functions so they don't clash with the original dict functions you added, right? Anyway, I want to make it work properly (Redis tests pass) before focusing on benchmarking further.

@yossigo I've tried killing it with SIGSEGV and got stack traces at seemingly random locations. I've tried running the tests with gdb attached and looked at the stack traces at random times, also without finding anything useful. I've also tried compiling with address sanitizer and other sanitizers. Fragmentation may be higher than before, and the paper actually recommends using active defrag, so it's nice that we have the defrag framework in place already.
@zuiderkwast You can read the article that uses this framework to benchmark other hash tables. You do not have to rename the symbols, because each hash table is represented via a separate binary.
JimB123 left a comment:
Left a few review comments.
Regarding the performance analysis, I would note the following:
- This structure is likely superior on memory usage, as the (mostly unused) `next` pointer is removed, and the base hash table loading was 0.5-1.0, which meant on average a half pointer (4 bytes) was wasted for each entry. The new structure essentially replaces this with 8 bytes of bitmap which are shared across 32 possible entries.
- On a test of insertions into a fresh map, I would expect this structure to outperform the existing `dict`, as the hash table will undergo several rounds of rehashing as the table expands. However, once the hash table reaches steady state, I would not expect this to outperform the existing `dict`. The measurement for 10,000,000 insertions is suspect for this reason.
- I don't understand why the measurements show HAMT as being faster in several scenarios.
  - The existing `dict` performs a hashing operation, and then dereferences 3 pointers to reach the `dictEntry`. Finally the key is directly compared. (We could reduce this to 2 dereferences by simply combining `dictht` into `dict`.)
  - The HAMT will perform the same hashing operation and then dereference a pointer for each level in the tree. For 10M items, this requires 5 dereferences (32^5 = 33M). Finally, the key is directly compared.
  - Possibly the difference is due to some cases where there is a hash collision in the existing `dict`. This would require traversal to the `next` item and a second string comparison against the 2nd key. Note that with a load factor of 0.5-1.0, we don't expect many hash collisions, but there will be some measurable percentage of buckets with more than 1 key.
Do we have any additional understanding or intuition on this?
Have we done any representative memory analysis? My intuition says that the HAMT will use less memory, however I have no understanding of how much during realistic scenarios.
PROS:
- more memory efficient
- eliminates sudden memory allocations for rehashing
- eliminates CPU overhead for rehashing
- eliminates some code complexity due to rehashing
- "miss" lookups might be a little faster
- iteration is cleaner/simpler
CONS:
- slightly slower overall for most typical operations
- code is more complex (to understand and maintain) than the existing `dict`
- returning `dictEntry*` is risky since object locations may change
NOTE: The (backwards compatible) API exposes a `dictEntry` pointer in several cases. However, with the HAMT, any insertion or deletion has the potential to invalidate existing `dictEntry` pointers (by moving the physical values). Since only some `dictEntry`s might be affected, this could create difficult-to-debug problems when code holds a `dictEntry*` while making additional modifications, or calls functions or recursive actions which make such modifications.
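To make the hazard concrete, here is a minimal sketch (not code from this PR, assuming a `dict *d` and some keys/values in scope) of the caller pattern that becomes unsafe once entries can move:

```c
/* Illustrative only: with the HAMT, any insertion or deletion may move
 * entries, so the pointer obtained from dictFind() can dangle. With the
 * classic dict, this pattern is safe because entries never move. */
dictEntry *de = dictFind(d, key1);   /* caller keeps a dictEntry*         */
dictAdd(d, key2, val2);              /* may reshape the node holding 'de' */
void *val = dictGetVal(de);          /* ...so this may read moved memory  */
```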
src/dict.h
Outdated
This is a perfect time to start improving design practices by doing a better job at data hiding through the use of opaque data structures. Is there any reason that the internal node structures need to be exposed in the .h file? Can these be moved into the .c file? (Maybe only leaving a forward declaration in the .h if needed.)
Yes, agree, I'll move as much as possible into the .c file, after more urgent issues have been solved. Since exposing dictEntry pointers is risky, as you say, it would be good to change the API to avoid exposing it and make dictEntry opaque too. Not sure if it's feasible or not.
src/dict.c
Outdated
Please remove this cruft from the .c file. It makes it harder to review the new code.
If there are functions (like dictDisableResize) which need to be cleaned from the old code, take care of that at the end of the .h file with either a #define or a static inline.
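A sketch of what that could look like for the resize controls (illustrative only; these are the existing function names, reduced to no-ops since the HAMT never rehashes):

```c
/* Illustrative: with rehashing gone, the old resize controls could be
 * reduced to no-ops in dict.h rather than kept as dead code in dict.c. */
static inline void dictEnableResize(void)  { /* nothing to do without rehashing */ }
static inline void dictDisableResize(void) { /* nothing to do without rehashing */ }
```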
I kept this for now, since there is a possibility to resize the root table (see the paper). It adds complexity but may improve speed. I'm not sure we need it; if not, I'll delete this for sure.
src/dict.c
Outdated
As mentioned above, this would be easier to read if the function was totally cleaned from the .c file.
In the .h file, just do something like:
#define dictRehash(d, n) 0
Yes, I'll even remove all calls to it.
src/dict.c
Outdated
Can't forget to implement this. Also, at this point, there are 4 bits of the original hash still unused.
True. I didn't do this exactly right yet. There are 12 levels of 5 bits each and one level of 4 bits (12 × 5 + 4 = 64, so the whole hash is consumed). Then the next level will be a flat sequence of collisions.
src/dict.c
Outdated
Probably more readable as a for loop:
for (; level >= 0; level--) {
src/dict.c
Outdated
Can we, and should we, add the additional hash bits to the cursor value? Could the cursor maintain the full (extended) hash of the last value returned?
There can be 64-bit hash collisions, for which storing the duplicates is not yet implemented. (The idea is to just store all the entries in level 13 (is it?) in a compact array and use one of the bitmaps for its length.)
OK - so you're planning to diverge from the HAMT strategy of generating a 2nd hash to add more bits.
In this case, for iteration, maybe consider keeping the key and/or index and/or next-key pointer in the iterator. Then, when the iterator hits such a list you scan until you hit the key you last iterated on. This would avoid the weird pattern where an iterator can return multiple items in this rare case. It's simpler for users if the iterator only returns 1 item each time.
If you're worried about the values changing/moving during iteration, maybe you could clone the list of keys into the iterator at this point, then, for each key, check that it still exists in the map before returning it.
Good ideas. Yes, using a second hash function was the initial idea, but I think just storing the collisions is fine and simpler. I don't have a strong preference though.
I don't think the scan cursor should be more than 64 bits (maybe only 32 bits for 32-bit builds?) so SCAN would need to sometimes return more than the requested number of elements. An iterator should of course only return one element each time though, so it would need to store the duplicate index or a clone of the duplicates list in the iterator, yes.
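A minimal sketch of that iterator state (hypothetical names, not from this PR), where the iterator remembers which duplicate it returned last so a collision array still yields one entry per call:

```c
#include <stdint.h>

/* Hypothetical iterator state for resuming inside a collision array at
 * the deepest level; the names are illustrative, not from this PR. */
typedef struct hamtIterator {
    struct dict *d;
    uint64_t hash;    /* full 64-bit hash of the last entry returned     */
    int dupIndex;     /* position within that hash's collision array,
                         or -1 when the last entry had no duplicates     */
} hamtIterator;

/* A next() step would walk down using 'hash', and if it lands on a
 * collision array, return the entry at dupIndex + 1 instead of the
 * whole array, resetting dupIndex to -1 once the array is exhausted. */
```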
Thanks @JimB123 for the review and analysis! I agree with your points and I haven't really done any more analysis than this. For smaller test runs I get similar results, but it's worth doing more. I think the root level will always be in the CPU cache, so the root level lookup is likely faster than the other levels. Perhaps the overall lower memory usage might allow more of the table to be in the cache, which might explain why it's faster in some cases.

Thanks for pointing out that returned dictEntry pointers become invalid whenever anything is added or deleted. I've tried (earlier) to check all code which calls the dict to see that it's safe, but perhaps refactoring the API to not expose the dictEntry at all would be safer.
src/dict.c
Outdated
Could we make the root always be a subnode, so that we can remove the special case `d->size == 1`?
I guess it's possible, but then we need some other special cases in other parts of the code, I think...
I feel there would be fewer special cases. Maybe I'm wrong.
Thanks for the suggestion. I will look into it, but first I want to make the build pass. :-)
This PR is difficult. It takes me some time to understand it.
src/dict.c
Outdated
#include <stdarg.h>
#include <limits.h>
These two header files are not used.
The above tests lack the memory usage of the Redis hash dict and the HAMT, which is the most important metric for the HAMT. Could you add test results for memory usage?
@CaixinGong I need help. How do you suggest doing this test?
The default Redis benchmark tool, redis-benchmark, may help for a simple test.
Prefix hiredis dict funcs with hi
I've rebased and squashed the commits. It's not ready to merge, but I've performed a ...

Memory usage test

Redis started with default config, empty db and then populated with 1M keys.

Note: The dataset memory usage reported for the HAMT by ...
A drop-in replacement of the hash tables in dict, affecting the key space and all other uses of the dict. Fixes #8698.
Status: It's working-ish. The dict tests are passing, but the Redis test suites are not passing. Most tests are passing occasionally, but Redis seems to hang at various times until it's killed by the test runner. Debugging help is most welcome.
The dict in hiredis gets its function names prefixed with `hi_` to avoid collisions. This needs to be done upstream in hiredis...

Performance comparison
Comparison using `./redis-server test dict 10000000` (compiled using `make CFLAGS=-DREDIS_TEST`) on my laptop (64-bit Linux, 32G RAM).

Classic dict:
HAMT:
Structure
In each level, 5 bits of the hash are used as an index into the children. A
bitmap indicates which of the children exist (1) and which don't (0). Whether
the child is a leaf or a subnode is indicated by a second bitmap, "leafmap".
The children array is compact and its length is equal to the number of 1s in
the bitmap. For example, if our bit is the 2nd 1-bit from the left in the
bitmap, we look at the 2nd child.
If, instead, our child is a subnode, we use the next 5 bits from the hash to
index into the next level of children.
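As a rough illustration of one level of this lookup (a sketch with hypothetical names; whether hash bits and bitmap positions are consumed from the low or the high end is an assumption, not taken from the PR):

```c
#include <stdint.h>

#define HAMT_BITS 5
#define HAMT_MASK ((1u << HAMT_BITS) - 1)

/* Illustrative node layout: two 32-bit bitmaps (8 bytes total) followed
 * by a compact children array with one slot per set bit in 'bitmap'. */
typedef struct hamtNode {
    uint32_t bitmap;                 /* which of the 32 children exist    */
    uint32_t leafmap;                /* 1 = child is a leaf entry         */
    void *children[];                /* compact: one slot per set bit     */
} hamtNode;

static void *hamtChild(hamtNode *n, uint64_t hash, int shift) {
    uint32_t idx = (uint32_t)(hash >> shift) & HAMT_MASK; /* 5 bits per level */
    uint32_t bit = 1u << idx;
    if (!(n->bitmap & bit)) return NULL;  /* no child here: key is absent */
    /* Position in the compact array = number of set bits before ours
     * (GCC/Clang builtin popcount). */
    int pos = __builtin_popcount(n->bitmap & (bit - 1));
    return n->children[pos];              /* leaf or subnode, per leafmap */
}
```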
The classic dict has this structure:
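Roughly, abbreviated from the existing src/dict.h for comparison (the value union and some bookkeeping fields are omitted here):

```c
/* Abbreviated sketch of the existing layout, not the full definitions. */
typedef struct dictEntry {
    void *key;
    void *val;                  /* really a union of val/u64/s64/d        */
    struct dictEntry *next;     /* chains entries hashing to the same bucket */
} dictEntry;

typedef struct dictht {
    dictEntry **table;          /* array of bucket pointers (power of two) */
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

typedef struct dict {
    dictType *type;             /* key/value callbacks (definition omitted) */
    void *privdata;
    dictht ht[2];               /* two tables, for incremental rehashing    */
    long rehashidx;             /* -1 when no rehash is in progress         */
} dict;
```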
Memory comparison: The classic dict has an entry pointer for every slot and a
'next' pointer in every entry. The HAMT stores the entries within the nodes, so
the only overhead is the subnode pointers and the bitmaps for each subnode. The
number of subnodes is small compared to the number of keys (each subnode is
shared by up to 32 children), and the depth is logarithmic in the number of keys.
Possible improvements
To increase speed for large HAMTs, a possible optimization is to resize the root
table. This can be done incrementally and without rehashing the keys (the larger
root is simply indexed by more bits of the same hash), so it's much faster than
resizing a regular hash table. It's explained in the paper. It's not implemented
(yet) and the performance comparison above is done without this optimization.