Slot-to-keys using dict entry metadata #9356
Conversation
Force-pushed from 5ee0d4d to 41a830d.
madolson left a comment
Implementation looks solid!
I didn't initially think of this, but I think we also need a change in defrag. See line 898 in 1221f7c.
madolson left a comment
@oranagra Does this help make it more concrete what the value of the metadata is?
@redis/core-team Probably worth pinging the group about this, as it's a rather intrusive change and I can see the argument that the complexity is not worthwhile. @zuiderkwast If you get a chance, I would also expect this to perform better. If you don't get a chance, I can try running some benchmarks tomorrow.
oranagra left a comment
LGTM, a few minor suggestions for improvement.
Benchmarks
Cluster of 3 nodes, no replicas, running locally on a laptop.
Latency
Result: No significant difference compared to [Edit] I forgot to use -r here, so only one key was created. :-/ See new benchmark below!
Memory
The keys generated by
Force-pushed from db3fc88 to 0790f5f.
Haven't done a CR, but this looks good as long as there's no regression in performance - the rax was put there to replace a skip list that didn't perform well (1409c54).
yossigo left a comment
Have only quickly looked at the code, but LGTM.
Regarding the performance regression concerns, I increased the numbers and ran some more benchmarks.
Benchmark: Set and get
Ooh, that is more like what I was expecting. So it looks like we see roughly a 40-50% speedup for writes (which is great) at the cost of about a 10% GET performance regression. For R=1000, almost everything is probably in the L1/L2 CPU cache, which likely explains why there wasn't much difference there. I think that 10% regression is acceptable for the memory savings and the write speed improvement. @redis/core-team Going to ping everyone again since this is definitely something to discuss.
The read performance degradation is marginal and not a major concern IMO. On the other hand, the current approach of using a RAX to store the slot-to-keys structure ensures that the keys in a slot are strictly ordered lexicographically. With this approach that property is forgone. There is no longer any ordering given a specific key, which makes async iteration over the keys in a slot harder when we need to resume the operation later, after the slot has been modified. Looked through
More specifically, I was thinking of a use case where we iterate the keys in a slot incrementally with an iterator. When the iterator is somewhere in the middle of the slot, is there an efficient way to determine whether a given item comes "before" or "after" the iterator? With a RAX, where the keys are ordered, we can simply do a
Does anyone know why
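To illustrate the ordering point above, here is a minimal sketch, assuming keys are compared as plain byte strings (as a rax does): with an ordered structure, deciding whether a key falls before or after an iterator's resume position is a single lexicographic compare, which an unordered per-slot linked list cannot offer. The helper name is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `key` sorts after the iterator's
 * current position `cursor` in lexicographic (byte-wise) order, else 0.
 * This only works because the rax keeps slot keys strictly ordered. */
int keySortsAfterCursor(const char *key, size_t keylen,
                        const char *cursor, size_t cursorlen) {
    size_t minlen = keylen < cursorlen ? keylen : cursorlen;
    int cmp = memcmp(key, cursor, minlen);
    if (cmp != 0) return cmp > 0;
    return keylen > cursorlen; /* a longer key sorts after its own prefix */
}
```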
@madolson I agree that, considering the memory improvement and the write speed improvement, we can accept the read degradation, but maybe we should give it a quick optimization attempt to see if we can solve that too.
If you do look into the performance degradation of the read operations, I would suggest finding out where the additional latency comes from. Looking at this CR there is nothing that changed directly in the
Thinking about this more, the key range of 1000 where everything fits into the L1/L2 cache is a very small Redis instance and quite an unusual usage scenario nowadays, especially given the increase in memory capacity over time from various service offerings. It is very likely that people store a much larger number of keys in Redis, in the range of millions, where they will see this 10% slowdown in the read path.
I've updated it. Does it clarify why you decided to accept them? If not, please update.
Whenever a key is looked up, the dictEntry (hashtable bucket) and the key are accessed together. This PR embeds the key in the same allocation as the dictEntry, which makes sure they are in the same cache line, thus saving one slow memory access. This compensates for the potential regression from redis#9356 for key lengths 15-22 in cluster mode. Furthermore, latency is expected to improve for other key lengths in cluster mode as well as in plain Redis. Benchmarks will show.
The callbacks in dictType are changed in the following ways:
* dictEntryMetadataBytes is replaced by createEntry, which (if present) takes care of allocating the entry and setting the key.
* keyDestructor and valueDestructor are passed the dictEntry as an extra argument, so they can check if the key is embedded in the dictEntry and then not try to free it.
* repairEntry is added, which is called from active defrag code when an entry has been reallocated, moving custom logic from the active defrag code to the dict type.
TODO:
* [x] Working implementation. Tests passing.
* [ ] Benchmarks for cluster/plain Redis, key lengths of different allocation sizes.
* [ ] Memory usage comparison.
Open questions:
* What's the maximum key length we want to embed in a dictEntry?
* Do we want a similar optimization for other dicts, e.g. hashes?
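A hedged, abridged sketch of what a dict type carrying these callbacks might look like; the struct and callback names below are illustrative assumptions based on the description above, not the PR's exact definitions.

```c
struct dict;       /* opaque */
struct dictEntry;  /* opaque */

/* Abridged, hypothetical dict type: only the callbacks discussed above
 * are shown; the real dictType has more fields (hashing, key compare, ...). */
typedef struct exampleDictType {
    /* Allocates the entry and stores the key, possibly embedding the key
     * in the same allocation as the entry (replaces dictEntryMetadataBytes). */
    struct dictEntry *(*createEntry)(struct dict *d, void *key);
    /* Destructors receive the entry so they can detect an embedded key
     * and avoid freeing it separately. */
    void (*keyDestructor)(struct dict *d, struct dictEntry *de, void *key);
    void (*valDestructor)(struct dict *d, struct dictEntry *de, void *val);
    /* Called from active defrag after an entry has been reallocated, so
     * type-specific pointers into the old entry can be fixed up. */
    void (*repairEntry)(struct dict *d, struct dictEntry *de);
} exampleDictType;
```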
* Enhance dict to support arbitrary metadata carried in dictEntry
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
* Rewrite slot-to-keys mapping to linked lists using dict entry metadata
This is a memory enhancement for Redis Cluster. The radix tree slots_to_keys (which duplicates all key names prefixed with their slot number) is replaced with a linked list for each slot. The dict entries of the same cluster slot form a linked list and the pointers are stored as metadata in each dict entry of the main DB dict. This commit also moves the slot-to-key API from db.c to cluster.c.
Co-authored-by: Jim Brunner <brunnerj@amazon.com>
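A minimal sketch of the mechanism described in this commit message, with a simplified stand-in for the real dictEntry; the struct layout and function names (exampleEntry, slotToKeyAdd, slotToKeyDel) are illustrative assumptions.

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Simplified stand-in for a main-dict entry. The real implementation
 * stores the two pointers as metadata bytes appended to the dictEntry
 * allocation; here they are shown as an ordinary embedded struct. */
typedef struct exampleEntry {
    void *key;
    void *val;
    struct exampleEntry *next_in_bucket;   /* hash-bucket chaining */
    struct {
        struct exampleEntry *prev;         /* previous key in the same slot */
        struct exampleEntry *next;         /* next key in the same slot */
    } slotMeta;
} exampleEntry;

/* Per-slot list heads and key counts, one pair per hash slot. */
static exampleEntry *slot_head[CLUSTER_SLOTS];
static unsigned long slot_count[CLUSTER_SLOTS];

/* Link a newly created entry into its slot's list: O(1), no key copy. */
void slotToKeyAdd(int slot, exampleEntry *de) {
    de->slotMeta.prev = NULL;
    de->slotMeta.next = slot_head[slot];
    if (slot_head[slot]) slot_head[slot]->slotMeta.prev = de;
    slot_head[slot] = de;
    slot_count[slot]++;
}

/* Unlink an entry when its key is deleted: also O(1). */
void slotToKeyDel(int slot, exampleEntry *de) {
    if (de->slotMeta.prev)
        de->slotMeta.prev->slotMeta.next = de->slotMeta.next;
    else
        slot_head[slot] = de->slotMeta.next;
    if (de->slotMeta.next)
        de->slotMeta.next->slotMeta.prev = de->slotMeta.prev;
    slot_count[slot]--;
}
```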
For the record, I just realized that this change also fixes an issue.
Nice! Maybe that's why my colleagues didn't see any benefit from defrag and turned it off. (They also wanted to turn off the allocator statistics because of the cost of keeping them updated on every alloc and free. I think I mentioned this once.)
That's awesome! We've also seen limited benefit in the past; good to know this might have been part of the problem.
Can someone have a look at the above-mentioned ticket and confirm it makes sense that this is due to the cluster slots mapping? IIRC it's Redis 4, which used to use a rax, and the fragmentation was mainly in the 24-byte bin.
Excuse me for being late...
@funny-falcon Maybe it is simpler, but now the other solution is already done. :-) Do you want to investigate the dicts idea and benchmark it? I think 16384 dicts (or 5400 dicts, a third of the slots) is a lot of memory allocations, especially if there is only one key in each slot.
The question this solves is: "give me all the keys in slot X". Mapping slot-to-key is only used rarely - for things like slot migration. Normally, we look up keys without regard to slot. Having 16k dictionaries (in place of the main dictionary) would increase lookup time, as we'd need to find the correct dictionary first. It would also be fairly disruptive to the code base, as there are a lot of operations which use the main dictionary and would have to be redesigned to hash to slot first and then look up. Anything that iterates over the main dictionary would have to be redesigned to iterate over 16k dictionaries (or the portion of 16k which are on a given node).
Having 16k dicts in addition to the main dictionary is much more memory than the linked list solution, and the direct item access isn't needed for the problem being solved.
IMO stringing a linked list through the existing dictionary is a good solution. And, of course, this is FAR more efficient (both time & space) than the old RAX table implementation.
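A hedged sketch of the "give me all the keys in slot X" query under the linked-list design, reusing the illustrative exampleEntry/slot_head stand-ins from the sketch earlier in this thread; the cost is proportional only to the number of keys in that slot.

```c
/* Walk every key in a slot by following the per-entry links.
 * Assumes the exampleEntry/slot_head stand-ins from the earlier sketch. */
void forEachKeyInSlot(int slot, void (*fn)(void *key, void *ctx), void *ctx) {
    for (exampleEntry *de = slot_head[slot]; de != NULL; de = de->slotMeta.next)
        fn(de->key, ctx);
}
```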
Sorry, I had to deal with Redis instances with 6-10 GB per instance - ~10M keys. And ~500 slots per instance (yes, 33 partitions).
If we replace the db dict with these 16K dicts, then I can agree it doesn't use more memory. I assumed you wanted to use separate dicts just to reference the set of keys in each slot. If we want a two-level hash table anyway to avoid huge allocations, this could be a possible way forward. It is similar to #9517. However, in the extreme case that a node has only one slot (16K shards) there will be only one dict again.
@funny-falcon I'm going to promote this to its own issue; I think it's a good idea: #10589.
finally followed.
This solution will increase the CPU overhead of the child process in step 2, but considering the overall ROI, compared with the performance and memory that can be gained in the main process most of the time, this overhead is worth it.
@uvletter Cool job! If convenient, can you tell me about the memory savings after
I don't know… migrating with full replication is mainly for mitigating the impact of the original migration, such as blocking on a big key or resource contention between business traffic and operations. Removing
Yes, I had only roughly viewed the raw proposal before, as the discussion list is quite long... I think the general direction and HLD are pretty good (silence is also an attitude, I guess); looking forward to the next move. I will look into the whole discussion when I have some time.
…for shard channels. (#12804)
We have achieved replacing the `slots_to_keys` radix tree with a key->slot linked list (#9356), and then replacing the list with slot-specific dictionaries for keys (#11695). Shard channels behave just like keys in many ways, and we also need a slots->channels mapping. Currently this is still done using a radix tree. So we should split `server.pubsubshard_channels` into 16384 dicts and drop the radix tree, just like what we did to DBs.
Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only when removing channels in a slot. But this is potentially more useful: sometimes we need to know how many channels there are in a specific slot when doing slot migration. Counting is now implemented by traversing the radix tree, and with this PR it will be as simple as calling `dictSize`, from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically split by slots, thus making it easier to migrate, remove or add as a whole.
4. Avoid rehashing a big dict when there is a large number of channels.
Drawbacks:
1. Takes more memory than using a radix tree when there are relatively few shard channels.
What this PR does:
1. In cluster mode, split `server.pubsubshard_channels` into 16384 dicts; in standalone mode, still use only one dict.
2. Drop the `slots_to_channels` radix tree.
3. To save memory (to address the drawback above), all 16384 dicts are created lazily, which means a dict is only initialized when a channel is about to be inserted into it, and when all its channels are deleted, the dict deletes itself.
4. Use `server.shard_channel_count` to keep track of the number of all shard channels.
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
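A hedged sketch of the lazy-creation scheme described in point 3 above, with a trivial stand-in for the per-slot dict; the names (miniDict, getSlotChannels, releaseSlotIfEmpty) are illustrative assumptions, not the PR's API.

```c
#include <stdlib.h>

#define CLUSTER_SLOTS 16384

/* Trivial stand-in for a per-slot channel dict. */
typedef struct miniDict { unsigned long used; /* ... */ } miniDict;

static miniDict *shard_channels[CLUSTER_SLOTS]; /* all NULL until first insert */
static unsigned long shard_channel_count;       /* total channels across slots,
                                                   kept in sync by insert/remove
                                                   code (not shown) */

/* Return the slot's dict, creating it only when an insert needs it. */
miniDict *getSlotChannels(int slot, int create_if_missing) {
    if (shard_channels[slot] == NULL && create_if_missing)
        shard_channels[slot] = calloc(1, sizeof(miniDict));
    return shard_channels[slot];
}

/* Free the per-slot dict when its last channel is removed, so empty
 * slots cost no memory. */
void releaseSlotIfEmpty(int slot) {
    miniDict *d = shard_channels[slot];
    if (d != NULL && d->used == 0) {
        free(d);
        shard_channels[slot] = NULL;
    }
}
```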
This PR replaces the radix tree used for the mapping from cluster slot to keys.
In cluster mode, there is a mapping from cluster slot to the keys in the slot. Until now, this was stored in a radix tree, duplicating the keys. This PR replaces the parallel structure with a new structure with the following design: the dict entries of the same cluster slot form a linked list, and the prev/next pointers are stored as metadata in each dict entry of the main DB dict.
This new structure saves up to 20% memory and reduces the latency for adding and deleting keys by up to 50% (for keys of length 16 and very small values). This is at the cost of a latency regression of 5-10% for accessing and updating existing keys. For details, see the benchmark results posted in comments below in this PR.
Another side effect of this change is that this data structure now gets implicitly defragged when defrag.c reallocates the dictEntries, so it solves a problem where earlier versions may have been unable to resolve fragmentation issues when Redis was used in cluster mode.
This work is based on #8210 which prepares for allocating extra memory in the dict entries.
Additionally, the slot-to-keys API is moved to cluster.c and cluster.h (from server.h and db.c).
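For orientation, a hedged sketch of what the relocated declarations in cluster.h might look like; the names and signatures below are assumptions based on this description, not necessarily the exact ones in the PR.

```c
/* cluster.h (sketch): slot-to-keys API moved out of server.h/db.c.
 * Hypothetical prototypes for illustration only. */
struct redisDb;
struct dictEntry;

void slotToKeyInit(struct redisDb *db);                               /* set up per-slot heads        */
void slotToKeyAddEntry(struct dictEntry *de, struct redisDb *db);     /* a key was created            */
void slotToKeyDelEntry(struct dictEntry *de, struct redisDb *db);     /* a key was deleted            */
void slotToKeyReplaceEntry(struct dictEntry *de, struct redisDb *db); /* entry reallocated (defrag)   */
void slotToKeyFlush(struct redisDb *db);                              /* empty all slots (FLUSHDB)    */
```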