Key embedded in dictEntry - CPU cache friendly #9464
Conversation
Whenever a key is looked up, the dictEntry (hashtable bucket) and the key are accessed together. This PR embeds the key in the same allocation as the dictEntry, which makes sure they are in the same cache line, thus saving one slow memory access. This compensates for the potential regression from #9356 for key lengths 15-22 in cluster mode. Furthermore, latency is expected to improve for other key lengths in cluster mode as well as in plain Redis. Benchmarks will show.

The callbacks in dictType are changed in the following ways (a sketch follows below):

* dictEntryMetadataBytes is replaced by createEntry, which (if present) takes care of allocating the entry and setting the key.
* keyDestructor and valueDestructor are passed the dictEntry as an extra argument, so they can check if the key is embedded in the dictEntry and then not try to free it.
* repairEntry is added, which is called from active defrag code when an entry has been reallocated, moving custom logic from the active defrag code to the dict type.

TODO:

* [x] Working implementation. Tests passing.
* [ ] Benchmarks for cluster/plain Redis, key lengths of different allocation sizes.
* [ ] Memory usage comparison.

Open questions:

* What's the maximum key length we want to embed in a dictEntry?
* Do we want a similar optimization for other dicts, e.g. hashes?
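For orientation, here is a rough sketch of what the changed dictType callbacks could look like. The callback names come from the description above; the exact signatures, parameter order, and surrounding fields are my assumptions, not the PR's actual code.

```c
#include <stdint.h>

typedef struct dict dict;
typedef struct dictEntry dictEntry;

/* Sketch only: signatures are assumed, not taken from the PR. */
typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    /* ... existing callbacks such as keyDup, valDup, keyCompare ... */

    /* Replaces dictEntryMetadataBytes: if present, the type allocates the
     * entry itself and stores the key, possibly embedded in the same
     * allocation as the entry. */
    dictEntry *(*createEntry)(dict *d, void *key);

    /* The destructors now also receive the entry, so they can detect an
     * embedded key and skip freeing it as a separate allocation. */
    void (*keyDestructor)(dict *d, dictEntry *de, void *key);
    void (*valDestructor)(dict *d, dictEntry *de, void *val);

    /* Called from active defrag when an entry has been reallocated, moving
     * type-specific fixups out of the defrag code and into the dict type. */
    void (*repairEntry)(dict *d, dictEntry *de);
} dictType;
```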
It's unfortunate that we're paying 8 bytes for the key ptr even though it doesn't do anything. We could stuff a bit in the bottom 3 bits on 8-byte-aligned systems, since their pointers always end in 000, which could be used to indicate whether it's the new or the old address scheme. We would need to go through and make sure everyone only uses deGetKey, which would handle the pointer maintenance. I'm not sure if that would be too much over-optimizing though. Other than that I think it makes sense; I'll take a complete look once the tests pass (:
I'm concerned that we may be going too far towards coupling more and more parts of Redis with that dict, which we also wanna replace one day (due to the rehashing memory issue).
@madolson An sds string pointer is an odd address, since its header (an odd number of bytes) sits right before the pointer. There might still be bits to use though, for example in the 'value' or the 'next' pointer. (We'd need to move the dictEntry fields around, since the embedded key needs to be last.) We have the same waste of bits in the robj with embedded sds (many DB values). I think it can be a different PR.
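To make the idea concrete, here is a minimal illustrative layout, assuming the entry allocation itself is 8-byte aligned so the low bit of the next pointer is free. All names, the field order, and the tagging choice are hypothetical, not code from this PR.

```c
#include <stdint.h>
#include <string.h>

#define DE_KEY_EMBEDDED ((uintptr_t)1)  /* low bit of 'next' marks an inline key */

typedef struct dictEntry {
    void *v;                    /* value */
    struct dictEntry *next;     /* tagged pointer: low bit set => key is embedded */
    unsigned char keydata[];    /* either a stored key pointer or the inline key bytes (last, as noted above) */
} dictEntry;

static inline int deKeyIsEmbedded(dictEntry *de) {
    return ((uintptr_t)de->next & DE_KEY_EMBEDDED) != 0;
}

static inline dictEntry *deNextEntry(dictEntry *de) {
    return (dictEntry *)((uintptr_t)de->next & ~DE_KEY_EMBEDDED);
}

/* Every reader goes through one accessor, as suggested above. */
static inline void *deGetKey(dictEntry *de) {
    if (deKeyIsEmbedded(de))
        return de->keydata;                      /* key bytes live inline at the end of the entry */
    void *key;
    memcpy(&key, de->keydata, sizeof(key));      /* otherwise the first bytes hold a key pointer */
    return key;
}
```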
@oranagra I'm well aware. My #8700 was stalled since I couldn't make the build pass, and it has now diverged since #9356 was merged. It's still possible to rebase the HAMT. We'd need to wrap the key-with-cluster-metadata-and-maybe-embedded-key in another struct (basically the dictEntry without the value and next-pointer). The build's passing now, so it's ready for (pre-)review and benchmarks.
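As a rough illustration of that wrapper (a sketch only; the name and fields are hypothetical), it would be roughly a dictEntry stripped of its value and next pointer:

```c
/* Hypothetical wrapper sketch: a dictEntry minus value and next pointer. */
typedef struct keyWithMeta {
    void *key;                 /* key pointer, unused when the key is embedded */
    /* optional cluster slot-to-key metadata would go here */
    unsigned char embkey[];    /* embedded key bytes, when small enough */
} keyWithMeta;
```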
Memory usage analysis (TL;DR: fewer allocations, no impact on total usage).
We have the following allocator bin sizes: 8, 16, 24, 32, 40, 48, 56, 64. In non-cluster mode the dictEntry is 24 bytes and in cluster mode it's 40 bytes. Keys up to length 31 have 2 bytes of overhead (SDS type 5) and lengths 32-255 have 4 bytes of overhead (SDS type 8). We only embed SDS type 5 keys and only if the combined size is up to 64 bytes.
With these sizes, the total allocation size is the same for combined and separate allocations in the cases where we use embedded keys.
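As a sanity check of that claim, here is a small standalone program that rounds sizes up to the 8-byte-spaced bins listed above (valid for sizes up to 64 bytes) and compares separate vs. combined allocation totals for the key lengths used in the benchmarks below. The bin spacing, struct sizes, and SDS overhead are taken from this comment; the program itself is only an illustration.

```c
#include <stdio.h>

/* Round up to the next allocator bin; the bins listed above (8..64) are
 * 8 bytes apart, so a simple round-up-to-8 works in that range. */
static size_t bin(size_t n) { return (n + 7) / 8 * 8; }

int main(void) {
    const size_t de_plain = 24, de_cluster = 40;  /* dictEntry sizes from above */
    const size_t keylens[] = {14, 17, 22};

    for (int i = 0; i < 3; i++) {
        size_t sds = keylens[i] + 2;  /* SDS type 5: 2 bytes of overhead */
        printf("key %zu: non-cluster separate %zu+%zu, combined %zu; "
               "cluster separate %zu+%zu, combined %zu\n",
               keylens[i],
               bin(de_plain), bin(sds), bin(de_plain + sds),
               bin(de_cluster), bin(sds), bin(de_cluster + sds));
    }
    return 0;
}
```

For a 14-byte key, for example, this prints 24+16 vs. 40 in non-cluster mode and 40+16 vs. 56 in cluster mode: the combined allocation lands in the same total as the two separate ones.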
Benchmark: 14-byte keys, integer values, non-cluster (on laptop). TL;DR: 10% latency win!
This PR: (results omitted)
Unstable: (results omitted)
Cluster benchmark: 17-byte keys, integer values (on laptop). TL;DR: It's pretty good.
Disclaimer: the cluster nodes and redis-benchmark ran on the same laptop. (Throughput chart and detailed numbers omitted.) Cluster setup: 3 masters, no replicas, no AOF, no RDB.
This PR ("embedded key"): (results omitted)
Unstable ("dict metadata"): (results omitted)
Old unstable (just before #9356 "dict metadata" was merged): (results omitted)
Cluster benchmark: 22-byte keys, integer values (on laptop). This is the one with a regression in #9356. (Detailed numbers omitted.)
Explanation: in this case, "old unstable" has a 24-byte dictEntry and a 24-byte key allocation. The later builds have either 40+24 bytes in two allocations or a single 64-byte allocation. The extra slot-to-key metadata in the dictEntry is not used for GET.
Cluster setup: 3 masters, no replicas, no AOF, no RDB.
This PR: (results omitted)
Unstable ("dict metadata"): (results omitted)
Old unstable (just before #9356 was merged): (results omitted)
Benchmark: 17-byte keys, integer values, non-cluster (on laptop). TL;DR: 10% better than unstable here too.
This PR: (results omitted)
Unstable: (results omitted)
Don't we expect to see a regression in unstable and an improvement in this PR (rather than two improvements)?
In that benchmark, there's a regression for GET and no improvement in this PR for that case. I didn't get a regression for updates. I think it depends on how the benchmark was done. |
I think I'm interested, but I don't have the attention span for it right now (too many big things to move in order to get RC1 out in time).
BTW, I'm a bit curious that there seems to be nearly no performance improvement for GET compared with "dict metadata", which is different from the 17-byte-key case above. The cache miss introduced by "dict metadata" is expected to be mitigated here, since the dictEntry and key are allocated together.