Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented Aug 11, 2021

This PR replaces the radix tree used for the mapping from cluster slot to keys.

In cluster mode, there is a mapping from cluster slot to the keys in the slot. Until now, this was stored in a radix tree, duplicating the keys. This PR replaces the parallel structure with a new structure which has the following design:

  • In each dictEntry (hashtable bucket) in the main DB dict, memory for two extra pointers is allocated.
  • The entries with keys belonging to the same cluster slot form a linked list using these pointers.
  • For each of the 16384 cluster slots, there is a pointer to the head of its list (see the sketch below).
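
In rough terms, the new structure can be pictured with the following self-contained sketch. The field and function names are simplified stand-ins (the PR stores the two pointers in metadata appended to the dictEntry allocation rather than as named struct fields):

```c
/* Illustrative sketch only -- simplified names, not the PR's actual code. */
#include <stddef.h>

#define CLUSTER_SLOTS 16384

typedef struct dictEntry {
    void *key;
    void *val;
    struct dictEntry *next;      /* next entry in the hash bucket chain */
    /* Extra pointers allocated with the entry in cluster mode: */
    struct dictEntry *slot_prev; /* previous entry whose key maps to the same slot */
    struct dictEntry *slot_next; /* next entry whose key maps to the same slot */
} dictEntry;

/* One list head per cluster slot. */
static dictEntry *slots_to_keys[CLUSTER_SLOTS];

/* Link a newly added entry into its slot's list (O(1)). */
static void slotToKeyAddEntry(dictEntry *de, int slot) {
    de->slot_prev = NULL;
    de->slot_next = slots_to_keys[slot];
    if (de->slot_next) de->slot_next->slot_prev = de;
    slots_to_keys[slot] = de;
}

/* Unlink an entry when its key is deleted (O(1)). */
static void slotToKeyDelEntry(dictEntry *de, int slot) {
    if (de->slot_prev) de->slot_prev->slot_next = de->slot_next;
    else slots_to_keys[slot] = de->slot_next;
    if (de->slot_next) de->slot_next->slot_prev = de->slot_prev;
}
```

Getting all keys in a slot is then a walk of the list starting at slots_to_keys[slot], with no key names duplicated anywhere.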

This new structure saves up to 20% memory and reduces the latency for adding and deleting keys by up to 50% (for keys of length 16 and very small values). This is at the cost of a latency regression of 5-10% for accessing and updating existing keys. For details, see the benchmark results posted in comments below in this PR.

Another side effect of this change is that this data structure now gets implicitly defragged when defrag.c reallocates the dictEntries, which solves a problem where earlier versions could be unable to resolve fragmentation when Redis was used in cluster mode.

This work is based on #8210 which prepares for allocating extra memory in the dict entries.

Additionally, the slot-to-keys API is moved to cluster.c and cluster.h (from server.h and db.c).

@zuiderkwast zuiderkwast requested a review from madolson August 11, 2021 08:07
@zuiderkwast zuiderkwast marked this pull request as draft August 11, 2021 10:57
@zuiderkwast zuiderkwast force-pushed the slot-to-keys-using-dict-entry-metadata branch 3 times, most recently from 5ee0d4d to 41a830d on August 11, 2021 13:01
@zuiderkwast zuiderkwast marked this pull request as ready for review August 11, 2021 13:02
@madolson madolson requested a review from JimB123 August 11, 2021 15:41
Contributor

@madolson madolson left a comment

Implementation looks solid!

@madolson
Contributor

I didn't initially think of this, but I think we also need a change in defrag. Here

void defragDictBucketCallback(void *privdata, dictEntry **bucketref) {
we might reallocate the dictEntry, but the cluster metadata pointers (in the neighbouring entries and the slot head) will still point to the previous, now-invalid location.
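
To make the concern concrete, here is the kind of fix-up that is needed after an entry has moved, expressed in terms of the simplified sketch in the PR description above (illustrative only, not the actual defrag.c/cluster.c code):

```c
/* Sketch: after defrag copies an entry to a new allocation 'de', the slot-list
 * neighbours (and possibly the slot head) still point to the old allocation
 * and must be re-pointed at 'de'. The entry's own slot_prev/slot_next are
 * still valid because the metadata was copied together with the entry. */
static void slotToKeyReplaceEntry(dictEntry *de, int slot) {
    if (de->slot_prev) de->slot_prev->slot_next = de;
    else slots_to_keys[slot] = de;   /* the moved entry was the head of the slot's list */
    if (de->slot_next) de->slot_next->slot_prev = de;
}
```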

Contributor

@madolson madolson left a comment

@oranagra Does this help make it more concrete what the value of the metadata is?

@madolson
Contributor

@redis/core-team Probably worth pinging the group about this, as it's a rather intrusive change and I can see the argument that the complexity is not worthwhile.

@zuiderkwast If you get a chance to benchmark this, I would also expect it to perform better. If you don't get a chance, I can try running some benchmarks tomorrow.

Member

@oranagra oranagra left a comment

LGTM, a few minor suggestions for improvement.

@oranagra oranagra added the release-notes, state:major-decision and approval-needed labels on Aug 13, 2021
@zuiderkwast
Contributor Author

zuiderkwast commented Aug 16, 2021

Benchmarks

Cluster of 3 nodes, no replicas, running locally on laptop.

Latency

redis-benchmark -p 30001 --cluster -t set,get -n 1000000 --threads 8 -P 100 -q

Result: No significant difference compared to unstable.

[Edit] I forgot to use -r here, so only one key was created. :-/ See new benchmark below!

Memory

DEBUG POPULATE 1000000 on one of the cluster nodes. Then INFO MEMORY.

| Result | used_memory_human |
|---|---|
| Unstable | 116.37M |
| This PR | 93.51M |
| Difference | 22.86M (20% saved) |

The keys generated by DEBUG POPULATE are of the form key:NNNNNN. With a longer common key prefix (10 bytes), the memory saving is around 18%. For long keys without a shared prefix, the saving is expected to be even higher, but I haven't written any script for generating such data.

@zuiderkwast zuiderkwast force-pushed the slot-to-keys-using-dict-entry-metadata branch from db3fc88 to 0790f5f on August 16, 2021 22:50
@madolson
Contributor

@yossigo @itamarhaber @soloestoy ?

@itamarhaber
Member

Haven't done a CR, but this looks good as long as there's no regression in performance - the rax was put there to replace a skip list that didn't perform well (1409c54).

@madolson madolson added the state:to-be-merged label and removed the approval-needed label on Aug 24, 2021
Collaborator

@yossigo yossigo left a comment

Have only quickly looked at the code, but LGTM.

@zuiderkwast
Contributor Author

Regarding the performance regression concerns, I increased the numbers and ran some more benchmarks.

Benchmark: Set and get
Number of requests (-n): 10,000,000
Keyspace length (-r): 1G, 1M, 1K (see table)
Cluster: 3 masters, 0 replicas
Command: redis-benchmark -p 30001 --cluster -t set,get -n 10000000 --threads 8 -P 100 -q -r $r
The database was flushed on all nodes between each run.

| Keyspace length (-r) | Command | unstable | This PR |
|---|---|---|---|
| 1,000,000,000 | SET | 1327140.00 requests per second, p50=3.351 msec | 1990049.75 requests per second, p50=2.239 msec |
| 1,000,000,000 | GET | 3619254.50 requests per second, p50=1.159 msec | 3317850.00 requests per second, p50=1.239 msec |
| 1,000,000 | SET | 1473839.38 requests per second, p50=2.167 msec | 2094240.75 requests per second, p50=2.167 msec |
| 1,000,000 | GET | 3620564.75 requests per second, p50=1.263 msec | 3320053.00 requests per second, p50=1.367 msec |
| 1,000 | SET | 2844950.25 requests per second, p50=1.623 msec | 2840909.00 requests per second, p50=1.647 msec |
| 1,000 | GET | 4424779.00 requests per second, p50=0.879 msec | 4422822.00 requests per second, p50=0.895 msec |

@madolson
Contributor

Ooh, that is more like what I was expecting.

So it looks like we see about a 40-50% speedup for writes (which is great) at the cost of about a 10% GET performance regression. For r=1,000, almost everything probably fits in the L1/L2 CPU cache, which likely explains why there wasn't much difference there.

I think that 10% regression is acceptable for the memory savings and the write speed improvement. @redis/core-team Going to ping everyone again since this is definitely something to discuss.

@QuChen88
Contributor

QuChen88 commented Aug 25, 2021

The read performance degradation is marginal and not a major concern IMO.

On the other hand, the current approach of using a RAX to store the slot-to-keys structure ensures that the keys in a slot are kept in strict lexicographic order. With this approach that property is forgone: there is no longer any ordering given a specific key, which makes async iteration over the keys in a slot harder when we need to resume the operation later, after the slot has been modified. I looked through db.c but can't seem to find a place where Redis currently relies on that today, though.

More specifically, I was thinking of a use case where we iterate over the keys in a slot incrementally with an iterator. When the iterator is somewhere in the middle of the slot, is there an efficient way to determine whether a given item comes "before" or "after" the iterator? With the RAX, where the keys are ordered, we can simply do a > comparison between a key name and the iterator's key name, which I think is kind of nice. Also, today the raxIterator stores the current key name, so even if the underlying slot_to_keys changes between iterator runs, we can do a quick re-seek the next time around and resume iteration (see the sketch below). I don't think we will have the same safe iteration mechanism available on the doubly linked list with this change.
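
For illustration, this is roughly how a resumable iteration works with the current rax-based structure, using the existing rax iterator API (a sketch: `process_key` is a placeholder, and the fact that the real slots_to_keys rax prefixes each key with its 2-byte slot number is glossed over here):

```c
#include "rax.h"   /* Redis radix tree API */

void process_key(unsigned char *key, size_t len);  /* placeholder callback */

/* Resume iterating a slot's keys from where a previous run stopped. */
void resumeSlotIteration(rax *slot_keys, unsigned char *last_key, size_t last_key_len) {
    raxIterator iter;
    raxStart(&iter, slot_keys);
    /* Seek to the first key strictly greater than the one we stopped at.
     * Because the keys are ordered, this works even if the tree was
     * modified between runs. */
    raxSeek(&iter, ">", last_key, last_key_len);
    while (raxNext(&iter))
        process_key(iter.key, iter.key_len);
    raxStop(&iter);
}
```

With an unordered per-slot linked list there is no equivalent "greater than" re-seek; a cursor into the list can be invalidated when the entry it points at is deleted.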

Does anyone know why slot_to_keys is ordered today? There might be a reason for it that I am not aware of. I wonder if there are potential future use cases for keeping the keys in a slot ordered that will be harder to support after this change.

@oranagra
Member

@madolson I agree that, considering the memory improvement and the write speed improvement, we can accept the read degradation, but maybe we should give it a quick optimization attempt to see if we can solve that too.
i.e. invest a small effort rather than give up without trying.
It could be an extra indirection that's unavoidable (although the RAX had an abundance of these), but maybe there's something simple that can be done to make it more cache friendly, or some function inlining trick / macro.
@zuiderkwast can you do some profiling to see if you happen to find something?

@QuChen88
Contributor

QuChen88 commented Aug 25, 2021

If you do look into the performance degradation of the read operations, I would suggest finding out where the additional latency comes from. Looking at this CR, nothing changed directly in the lookupKeyRead code path apart from the additional metadata in the dictEntry, so I am curious about it. It could potentially be coming from the way the benchmark was set up, but I'm not sure, assuming you kept all other factors the same in your test runs besides the redis-server binary.

Thinking about this more, a key range of 1000, where everything fits into the L1/L2 cache, is a very small Redis instance and quite an unusual usage scenario nowadays, especially given the increase in memory capacity over time across various service offerings. It is very likely that people store a much larger number of keys in Redis, in the range of millions, where they will see this 10% slowdown in the read path.

@zuiderkwast
Contributor Author

> meanwhile, can you please update the summary at the top (it will be linked in the release notes) with a summary of what was done, what the implications are, and why we decided to accept them.

I've updated it. Does it clarify why you decided to accept them? If not, please update.

zuiderkwast added a commit to zuiderkwast/redis that referenced this pull request Sep 6, 2021
Whenever a key is looked up, the dictEntry (hashtable bucket) and the key
are accessed together. This PR embeds the key in the same allocation as the
dictEntry, which makes sure they are in the same cache line, thus saving
one slow memory access.

This compensates for the potential regression from redis#9356 for key lengths
15-22 in cluster mode. Furthermore, latency is expected to improve for
other key lengths in cluster mode as well as in plain Redis. Benchmarks
will show.

The callbacks in dictType are changed in the following ways:

* dictEntryMetadataBytes is replaced by createEntry, which (if present)
  takes care of allocating the entry and setting the key.
* keyDestructor and valueDestructor are passed the dictEntry as an extra
  argument, so they can check if the key is embedded in the dictEntry and
  then not try to free it.
* repairEntry is added, which is called from active defrag code when an
  entry has been reallocated, moving custom logic from the active defrag
  code to the dict type.

TODO:

* [x] Working implementation. Tests passing.
* [ ] Benchmarks for cluster/plain Redis, key lengths of different
      allocation sizes.
* [ ] Memory usage comparison.

Open questions:

* What's the maximum key length we want to embed in a dictEntry?
* Do we want a similar optimization for other dicts, e.g. hashes?
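
The embedded-key idea can be pictured with something like the following layout (a rough sketch under the assumption of a simple length-prefixed key copied into the entry; the names are made up, not the actual types from that commit):

```c
/* Rough sketch of a dict entry with the key embedded in the same allocation. */
#include <stdlib.h>
#include <string.h>

typedef struct embeddedDictEntry {
    void *val;
    struct embeddedDictEntry *next;  /* hash bucket chain */
    unsigned char key_len;           /* only short keys are embedded */
    char key[];                      /* key bytes follow, in the same cache line */
} embeddedDictEntry;

static embeddedDictEntry *createEmbeddedEntry(const char *key, size_t len, void *val) {
    embeddedDictEntry *de = malloc(sizeof(*de) + len + 1);
    de->val = val;
    de->next = NULL;
    de->key_len = (unsigned char)len;
    memcpy(de->key, key, len);
    de->key[len] = '\0';
    return de;
}
```

Since the key lives inside the entry's allocation, the key destructor must be able to tell that the key is embedded and not free it separately, which is why the commit message above passes the dictEntry to the destructors.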
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
* Enhance dict to support arbitrary metadata carried in dictEntry

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

* Rewrite slot-to-keys mapping to linked lists using dict entry metadata

This is a memory enhancement for Redis Cluster.

The radix tree slots_to_keys (which duplicates all key names prefixed with their
slot number) is replaced with a linked list for each slot. The dict entries of
the same cluster slot form a linked list and the pointers are stored as metadata
in each dict entry of the main DB dict.

This commit also moves the slot-to-key API from db.c to cluster.c.

Co-authored-by: Jim Brunner <brunnerj@amazon.com>
@oranagra
Member

For the record, I just realized that this change also fixes an issue.
In the past, defrag.c wasn't defragging the slot-to-keys structure, so these allocations could have prevented the allocator from releasing pages, resulting in a defragger that consumes a lot of CPU and isn't gaining anything, see #9773.
Now, since they're part of the dictEntry, they get implicitly defragged with it, so problem solved!
I'll update the top comment for the sake of the release notes.

@zuiderkwast
Contributor Author

Nice! Maybe that's why my colleagues didn't see any benefit from defrag and turned it off. (They also wanted to turn off the allocator statistics because of the cost of keeping them updated on every alloc and free. I think I mentioned this once.)

@madolson
Contributor

That's awesome! We've also seen limited benefit in the past, good to know this might have been part of the problem.

@zuiderkwast zuiderkwast deleted the slot-to-keys-using-dict-entry-metadata branch November 29, 2021 19:14
@oranagra
Member

Can someone have a look at the above-mentioned ticket and confirm it makes sense that this is due to the cluster slots mapping? IIRC it's Redis 4, which used to use the rax, and the fragmentation was mainly in the 24-byte bin.

@funny-falcon
Contributor

funny-falcon commented Mar 31, 2022

Excuse me for being late...
Wasn't it simpler to have 16384 dicts (in the worst case), one dict per cluster slot? An empty slot could live without a dict.

@zuiderkwast
Contributor Author

@funny-falcon Maybe it is simpler but now the other solution is already done. :-) Do you want to investigate the dicts idea and benchmark it?

I think 16384 dicts (or 5400 dicts, a 3rd of the slots) is a lot of memory allocations, especially if there is only one key in each slot.

@JimB123
Contributor

JimB123 commented Mar 31, 2022

Wasn't it simpler to have 16384 dicts (in worst case) - one dict for cluster slot? Empty slot could live without dict.

The question this solves is: "give me all the keys in slot X".

Mapping slot-to-key is only used rarely - for things like slot-migration. Normally, we look up keys without regard to slot.

Having 16k dictionaries (in place of the main dictionary) would increase lookup time as we'd need to find the correct dictionary first. It would also be fairly disruptive to the code base as there are a lot of operations which use the main dictionary which would have to be redesigned to hash to slot first and then lookup. Anything that iterates on the main dictionary would have to be redesigned to iterate over 16k dictionaries (or the portion of 16k which are on a given node).

Having 16k dicts in addition to the main dictionary is much more memory than the linked list solution and the direct item access isn't needed for the problem being solved.

IMO stringing a linked list through the existing dictionary is a good solution.

And, of course, this is FAR more efficient (both time & space) than the old RAX table implementation.

@funny-falcon
Contributor

Sorry, I have dealt with Redis instances of 6-10GB per instance, ~10M keys, and ~500 slots per instance (yes, 33 partitions).
I doubt that the linked-list links (16 bytes * 10M keys = 160MB) consume less memory than per-slot dictionaries.

@zuiderkwast
Contributor Author

If we replace the db dict with these 16K dicts, then I can agree it doesn't use more memory. I assumed you wanted to use separate dicts just to reference the set of keys in each slot.

If we want a two-level hash table anyway to avoid huge allocations, this could be a possible way forward. It is similar to #9517. However, in the extreme case that a node has only one slot (16K shards) there will be only one dict again.

@madolson
Contributor

@funny-falcon I'm going to promote this to its own issue, I think it's a good idea: #10589.

@judeng
Contributor

judeng commented Oct 20, 2022

Finally followed this.
We could fundamentally reconsider the application scenario of slot-to-keys: it is basically only needed for cluster rehashing, and that function is used infrequently (in our practical experience, 90% of clusters do not rehash even once a year).
We could make a tradeoff and use a new migration mechanism to completely remove the storage and performance overhead of slot-to-keys.
This is a preliminary design:

  1. Fork a child process.
  2. Scan the keys in the child process, find the keys belonging to the slot and write them to the destination node.
  3. Synchronize the incremental data of this slot.

This solution increases the CPU overhead of the child process in step 2, but considering the overall ROI, compared with the performance and memory that the main process gains most of the time, this overhead is worth it.

@uvletter
Contributor

@judeng Your idea is somewhat like what #10933 proposes, or the implementations in Tencent's and Bytedance's internal forks, which treat migration as a special full-sync replication.

@judeng
Contributor

judeng commented Nov 7, 2022

@uvletter Cool job! If convenient, can you tell me about the memory savings after slot-to-keys is deleted? :-)

@uvletter
Contributor

uvletter commented Nov 7, 2022

I don't know… migrating with full replication is mainly for mitigating the impact of the original migration mechanism, such as blocking on big keys or resource contention between business traffic and operations. Removing slot-to-keys is not the main concern, but of course slot-to-keys can be removed that way, depending on whether you care more about memory or about CPU/time consumption.

@madolson
Contributor

madolson commented Nov 7, 2022

@uvletter I didn't see any comments from you on #10933, did you have any additional insights you wanted to provide?

@uvletter
Contributor

uvletter commented Nov 8, 2022

> @uvletter I didn't see any comments from you on #10933, did you have any additional insights you wanted to provide?

Yes, I had only roughly viewed the original proposal before, since the discussion thread is quite long... I think the general direction and the HLD are pretty good (silence is also a form of agreement, I guess); looking forward to the next move. I will look into the whole discussion when I have some time.

soloestoy pushed a commit that referenced this pull request Dec 27, 2023
…for shard channels. (#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (#9356), and then replacing the list with slot specific
dictionaries for keys (#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is now implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means a dict is only initialized when a channel is
about to be inserted into it, and when all of its channels are deleted,
the dict deletes itself (see the sketch below).
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
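
The lazy-creation pattern described in item 3 above can be sketched roughly like this (illustrative; the array name, helper names and dict type are placeholders, not the code added by #12804):

```c
#include "dict.h"   /* Redis dict API */

#define CLUSTER_SLOTS 16384

static dict *shard_channels[CLUSTER_SLOTS];  /* all NULL until first use */
extern dictType shardChannelDictType;        /* placeholder dict type */

/* Return the per-slot dict, creating it lazily on the insert path only. */
static dict *getShardChannelDict(unsigned int slot, int create) {
    if (shard_channels[slot] == NULL && create)
        shard_channels[slot] = dictCreate(&shardChannelDictType);
    return shard_channels[slot];
}

/* After removing a channel: release the dict once it becomes empty. */
static void pruneShardChannelDict(unsigned int slot) {
    dict *d = shard_channels[slot];
    if (d != NULL && dictSize(d) == 0) {
        dictRelease(d);
        shard_channels[slot] = NULL;
    }
}
```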
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…for shard channels. (redis#12804)

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 14, 2025
Pick from redis#9356

* Enhance dict to support arbitrary metadata carried in dictEntry

* Rewrite slot-to-keys mapping to linked lists using dict entry metadata

funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…for shard channels. (redis#12804)
