
Conversation

@vitarb
Contributor

@vitarb vitarb commented Jan 6, 2023

This is an implementation of #10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the dictEntry. The expire dictionary is also split up so that each slot is logically decoupled, and in subsequent revisions we will be able to atomically flush a slot of data.

Important changes

  • Incremental rehashing - one big change here is that it's no longer one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms.
  • getRandomKey - now needs to not only select a random key from a random bucket, but also to select a random dictionary first. Fairness is a major concern here, as it's possible that keys are unevenly distributed across the slots. To address this, we introduced a binary index that allows us to search the cumulative frequencies of slot ranges (a better explanation is in the code comments, and a minimal sketch follows this list). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
  • Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; it allows us to find the slot for a specific key index. For example, if there are 10 keys in slot 0, we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree.
  • scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we pack the slot id into the LSBs of the cursor so it can be passed around between the client and the server. This has an interesting side effect: you can now start scanning a specific slot by simply providing its slot id as the cursor value. The plan is to not document this as defined behavior, however. It's also worth noting that the SCAN API is now technically incompatible with previous versions, although in practice we don't believe this is an issue.
  • Checksum calculation optimizations - during command execution we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the slot id cached in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
  • Slot info in RDB - in order to size individual dictionaries correctly while loading an RDB, it's not enough to know the total number of keys (we could approximate the number of keys per slot, of course, but it wouldn't be precise). To address this, we've added additional metadata to the RDB that contains the number of keys in each slot, which can be used as a sizing hint during loading.
  • DB size - besides the DBSIZE API, we need to know the size of the DB in many places. To avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field in redisDb that keeps track of key_count. This way we keep the DBSIZE operation O(1). The same is kept for an O(1) expires count as well.
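
To make the binary index idea above a bit more concrete, here is a minimal, self-contained sketch (illustrative only, not the PR's actual code; the `slotIndex` names are made up) of a Fenwick / binary indexed tree over per-slot key counts, together with the binary search that maps a global key index to a slot:

```c
#include <stdlib.h>

#define NUM_SLOTS 16384

typedef struct {
    long long tree[NUM_SLOTS + 1]; /* 1-based Fenwick array of key counts */
} slotIndex;

/* Add delta to the key count of the given slot (0-based). */
static void slotIndexAdd(slotIndex *idx, int slot, long long delta) {
    for (int i = slot + 1; i <= NUM_SLOTS; i += i & -i)
        idx->tree[i] += delta;
}

/* Cumulative number of keys in slots 0..slot (inclusive). */
static long long slotIndexCumulative(slotIndex *idx, int slot) {
    long long sum = 0;
    for (int i = slot + 1; i > 0; i -= i & -i)
        sum += idx->tree[i];
    return sum;
}

/* Smallest slot whose cumulative key count exceeds k, i.e. the slot that
 * contains the key with global index k (0-based).  Each probe costs
 * O(log(slot count)), so the search is O(log^2(slot count)) overall. */
static int slotIndexFindByKeyIndex(slotIndex *idx, long long k) {
    int lo = 0, hi = NUM_SLOTS - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (slotIndexCumulative(idx, mid) > k)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

/* Fair random slot selection: draw k uniformly from [0, total_keys) and map
 * it to a slot, so slots are weighted by how many keys they hold.  rand() is
 * just a stand-in for a proper random source. */
static int slotIndexRandomSlot(slotIndex *idx, long long total_keys) {
    long long k = rand() % total_keys;
    return slotIndexFindByKeyIndex(idx, k);
}
```

The same structure covers the iteration case above: to skip empty slots, jump straight to the slot that holds the next key index instead of walking the slots one by one.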

Performance

This change improves SET performance in cluster mode by ~5%; most of the gains come from not having to maintain the per-slot linked lists. Non-cluster mode should have the same performance. See the benchmarks in the comments below.

Interface changes

  • Removed overhead.hashtable.slot-to-keys from MEMORY STATS.
  • The SCAN API now requires 64 bits to store the cursor, even on 32-bit systems, as the slot information is stored in it (see the sketch below).
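
As a rough illustration of the cursor layout described above (the constants and helper names are assumptions made for this sketch, not the PR's actual code), packing the 14-bit slot id into the low bits leaves the upper bits for the per-dictionary scan cursor:

```c
#include <stdint.h>
#include <assert.h>

#define SLOT_BITS 14                        /* 16384 slots */
#define SLOT_MASK ((1ULL << SLOT_BITS) - 1)

/* Combine a per-dictionary scan cursor and a slot id into one SCAN cursor. */
static uint64_t buildScanCursor(uint64_t dict_cursor, int slot) {
    assert(slot >= 0 && slot < (1 << SLOT_BITS));
    return (dict_cursor << SLOT_BITS) | (uint64_t)slot;
}

static int scanCursorSlot(uint64_t cursor) {
    return (int)(cursor & SLOT_MASK);
}

static uint64_t scanCursorPosition(uint64_t cursor) {
    return cursor >> SLOT_BITS;
}
```

With a layout like this, a bare slot id passed as the cursor (dictionary position 0) starts the scan at that slot, which is the undocumented side effect mentioned above.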

Follow up items:

@sjpotter
Contributor

sjpotter commented Jan 7, 2023

> getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. In my implementation this uses shuffling algorithm that randomly selects any non-empty owned dict with the same probability and then applies existing logic for random key selection there.

Note, this doesn't make keys evenly distributed; as you note, different buckets can have different numbers of keys. This might not be a big deal, but it should be noted.

Contributor

@zuiderkwast zuiderkwast left a comment


I have skimmed through the diff. This looks very promising!

Regarding the topics in the top comment:

Incremental rehashing

I think your solution looks good.

dbRandomKey

I think it needs to be more fair. I have commented on the diff.

scan API

See comment on the diff.

Avoid calculating hash every time

Yes I think we need to do this, at least for checking if it affects performance. A simple way could be to cache the last computed hash inside keyHashSlot itself:

```c
unsigned int keyHashSlot(char *key, int keylen) {
    static char *lastkey = NULL;
    static int lastslot = 0;
    if (key == lastkey) return lastslot;
    lastkey = key;
    int slot = 0;
    /* ... compute the slot ... */
    lastslot = slot;
    return slot;
}
```

> Skip slots that aren't owned during iteration. In this implementation I've decided for simplicity, to always allocate an array of 16k dicts in the main dictionary, this results in a bit of extra memory as well as makes iteration over all dictionaries slightly slower.

For simplicity, I like that we have all the dicts allocated, but if we use an array of dict structs (see also my comment in the diff) instead of an array of dict pointers, we probably get a bit faster, as there are fewer pointers to follow and thus fewer memory accesses.
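
A tiny sketch of the two layouts being compared here (the struct contents are placeholders, not Redis' actual dict definition):

```c
typedef struct dict {
    void *ht_table[2];      /* placeholder fields only */
    unsigned long ht_used[2];
} dict;

struct db_with_pointers {
    dict *dicts[16384];     /* one extra pointer dereference (and a likely
                               cache miss) every time a slot's dict is touched */
};

struct db_with_structs {
    dict dicts[16384];      /* dict headers laid out contiguously, so walking
                               all the slots accesses memory linearly */
};
```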

@vitarb
Contributor Author

vitarb commented Jan 10, 2023

@zuiderkwast @sjpotter thanks for looking at this and providing feedback. As it stands, a couple of areas for improvement are:

  • Making random more fair.
  • Caching/pushing through slot id so we don't need to calculate hashes again.

Another small debt on my side might be writing some tests for SCAN in cluster mode, to ensure that the combined cursor+slot logic works properly (I've manually tested it so far and was relying on a few tests that already exist; I haven't checked whether this case is covered by existing tests).

I'll try to address these issues, meanwhile let's continue discussion in the comments and see if any other area needs to be improved as well.

@vitarb
Contributor Author

vitarb commented Jan 10, 2023

@zuiderkwast I've tested your approach of caching the hash for the last key, and at first I couldn't believe the numbers: read performance doubled and write performance increased 20-30%! But then I realized that it's unfortunately broken; the problem seems to be that addresses are getting reused, so pointers being the same doesn't actually mean the keys are the same.

[Screenshot from 2023-01-10 15-47-34]

In this screenshot you can see that lastkey is pointing to the same place as key, but the hash is actually different from the one calculated before. (I've added an extra call to _keyHashSlot just to catch this issue more easily.)

In fact, jemalloc seems to be reusing the same address over and over again for different keys,
so we'll probably need to use a different approach for this.
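
Just to illustrate one possible direction (a sketch with made-up names, not what the PR ended up doing): comparing the cached key by contents instead of by address avoids the reuse problem, though the extra memcmp eats into the saving for short keys:

```c
#include <string.h>

/* Placeholder for the real CRC16-based slot computation. */
static unsigned int computeSlot(const char *key, int keylen) {
    unsigned int h = 5381;
    for (int i = 0; i < keylen; i++) h = h * 33 + (unsigned char)key[i];
    return h & 16383;
}

unsigned int keyHashSlotCached(const char *key, int keylen) {
    static char lastkey[64];
    static int lastlen = -1;
    static unsigned int lastslot = 0;

    /* Compare the bytes, not the pointer: the allocator may hand out the
     * same address for a different key. */
    if (keylen == lastlen && memcmp(key, lastkey, keylen) == 0)
        return lastslot;

    unsigned int slot = computeSlot(key, keylen);
    if (keylen <= (int)sizeof(lastkey)) {
        memcpy(lastkey, key, keylen);
        lastlen = keylen;
        lastslot = slot;
    }
    return slot;
}
```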


@molian6 molian6 left a comment


Thanks for publishing the PR @vitarb. I'm really excited for this change!
I reviewed db.c and left some questions; I'll continue reviewing the remaining files tomorrow.

@vitarb
Contributor Author

vitarb commented Jan 11, 2023

@molian6

> +1 that 48 bits should be good enough, the public doc suggested max number of keys is 2^32 (though I didn't find this hard limit in the code, I guess it is the limit in a 32 bit system)
> I'm curious where the other 2 bits goes. We should have 64 bits in the unsigned long long, if cursor takes 48 bits and slot takes 14 bits, that adds up to 62 bits.

If we do what @zuiderkwast suggested and move the slot id to the LSBs, then the cursor can occupy the remaining 50 bits. My reasoning for using 48 was to leave the sign bit alone, to make sure that numbers look the same on the client even if a signed long is used there; I also left 1 extra bit for parity and potential future improvements.

@molian6

molian6 commented Jan 11, 2023

> If we do what @zuiderkwast suggested and move the slot id to the LSBs, then the cursor can occupy the remaining 50 bits. My reasoning for using 48 was to leave the sign bit alone, to make sure that numbers look the same on the client even if a signed long is used there; I also left 1 extra bit for parity and potential future improvements.

@vitarb Got it. I do like the idea of the LSB because in reality the cursor will not take all 48 or 50 bits, so we would have more MSBs left for other potential uses in the future. It gives us more flexibility.

@madolson
Contributor

> Yes I think we need to do this, at least for checking if it affects performance. A simple way could be to cache the last computed hash inside keyHashSlot itself:

We had some conversation in another thread about a key handle that would cache the crc and the hash value after the initial lookup. Re-using this handle would remove the extra computation. I think this is still the most promising way forward.
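
A rough sketch of what such a key handle could look like (all names are hypothetical; computeHash/computeSlot stand in for the real dict hash and CRC16 slot functions):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct keyHandle {
    const char *key;
    size_t keylen;
    uint64_t hash;        /* dict hash value, filled in lazily */
    int slot;             /* cluster slot, filled in lazily */
    unsigned hash_valid:1;
    unsigned slot_valid:1;
} keyHandle;

/* Placeholders for the real hashing / CRC16 slot functions. */
static uint64_t computeHash(const char *key, size_t len) {
    uint64_t h = 5381;                            /* illustrative hash only */
    for (size_t i = 0; i < len; i++) h = h * 33 + (unsigned char)key[i];
    return h;
}
static int computeSlot(const char *key, size_t len) {
    return (int)(computeHash(key, len) & 16383);  /* real code uses CRC16 */
}

/* Compute once on first use, then reuse for every later lookup that is
 * given the same handle. */
static uint64_t keyHandleHash(keyHandle *h) {
    if (!h->hash_valid) { h->hash = computeHash(h->key, h->keylen); h->hash_valid = 1; }
    return h->hash;
}
static int keyHandleSlot(keyHandle *h) {
    if (!h->slot_valid) { h->slot = computeSlot(h->key, h->keylen); h->slot_valid = 1; }
    return h->slot;
}
```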

@vitarb vitarb force-pushed the dict-split-by-slot branch from f1e26f0 to 16d4eb3 on January 13, 2023 06:55

@vitarb vitarb force-pushed the dict-split-by-slot branch 2 times, most recently from 6070a1a to 58800d6 on January 17, 2023 20:08
@vitarb vitarb marked this pull request as ready for review January 17, 2023 20:50
@madolson madolson added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Jan 17, 2023
@vitarb vitarb force-pushed the dict-split-by-slot branch 2 times, most recently from 8c4bdee to a20c56b on January 25, 2023 07:03
@vitarb
Contributor Author

vitarb commented Jan 27, 2023

Folks, I think I've addressed all the feedback; the last remaining bit is performance. I'm going to run a set of benchmarks and try to optimize any rough edges. Please take another look and let me know if there are any more suggestions or concerns.

@vitarb
Contributor Author

vitarb commented Feb 2, 2023

As promised before, I've done some performance analysis.

Methodology
I've tested GETs/SETs with cluster mode enabled and disabled, using request pipelining to make any dictionary-related differences more pronounced, with 50M keys and 100 bytes of data per entry, using redis-benchmark (see the exact commands in the screenshots below). Tests were run on a c5.9xlarge x86_64 Linux cloud machine.

Here is a summary of results.

Cluster mode enabled

  • SETs are ~7-8% faster on this branch.
  • GETs are ~1% faster on this branch.

Benchmark comparison in cluster mode:

Here are flamegraphs for SETs in cluster mode:
[flamegraph: this branch]
[flamegraph: unstable]

Looking at the flamegraphs, we can see that the primary reason SETs are faster is that slotToKeyAddEntry has been eliminated, which was consuming ~5% of the CPU time.

GETs:
[flamegraph: this branch]
[flamegraph: unstable]

One curious thing to note here is that dictSdsKeyCompare dropped by almost 8%, but other code paths, including dictFind, grew slightly. I need to look into this a bit more; maybe there is some potential to improve it even further.

Cluster mode disabled

  • SETs have about the same performance.
  • GETs are ~1-2% slower on this branch.

[Screenshot from 2023-02-01 22-44-06]

Looking at the flamegraphs, I couldn't pinpoint the reason for the GET slowdown.
[flamegraph: this branch]
[flamegraph: unstable]

I'm going to investigate this added GET slowness when cluster mode is disabled; meanwhile, let me know if you have any thoughts or recommendations on the testing process or methodology.

@zuiderkwast
Contributor

@filipecosta90 might have an idea about performance. Do any of the benchmarks on grafana run in cluster mode?

@oranagra
Member

oranagra commented Feb 3, 2023

Hi, i quickly skimmed through the discussion (not the code or the code-review comments).
i'm quite concerned about this change, here's a random set of concerns:

  1. as you said, we have an impact on dbsize, which used to be very fast and maybe callers abuse it. let's see a benchmark and consider optimizing it.
  2. the effects on incremental rehashing, random and probably other things we didn't realize yet need some thinking and may take time till we can fully trust this.
  3. i would like to learn more about the possible negative impacts of this on non-clustered deployments. the first thing that comes to mind is memory usage, specifically on small databases.
  4. i didn't yet look at the code and the attempt to avoid re-hashing or re-finding the slot (if we have a cache that is cleared after each command). i worry that maybe you made some optimizations to reduce the negative impacts of this PR in some cases, but these tricks can't be used in other cases, so they're left unoptimized and the PR has a negative impact there, e.g. multi-slot scripts, MGET, SUNIONSTORE, etc.

i'd note that in both 6.2 and 7.2 we added some severe performance regressions for certain (even popular) workloads without realizing it, and it took some time until the community reported these regressions, so this drastic change worries me and i would want to seek more confidence.

@zuiderkwast
Contributor

Some thoughts on points 3-4:

3: I don't think there's any memory impact in standalone mode. There's only one dict, not 16K as in cluster mode.

4: MGET and SUNIONSTORE raise -CROSSSLOT if keys are of different slots, so that shouldn't be an issue. Multi-slot scripts might be a problem though, if they're supported in cluster mode. I'm not sure what we allow in this area.

@oranagra
Member

oranagra commented Feb 3, 2023

Ohh. I didn't know that in non-cluster mode there's one dictionary (didn't look at the code yet).
So what impact does it have on non-cluster mode, or any other negative impact on cluster mode?

@zuiderkwast
Contributor

For performance impact in cluster and non-cluster mode, see vitarb's comment above, the one with the flame graphs. (Theoretically, there shouldn't be any impact on non-cluster. I don't know where it comes from.)

For cluster mode, there is a large fixed memory impact (16K dicts), but a 16-bytes-per-key saving and improved latency.
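
As a back-of-envelope illustration (assuming an empty dict header is on the order of 100 bytes, which is an assumption rather than a measured number): 16384 dicts cost roughly 16384 × 100 B ≈ 1.6 MB of fixed overhead, while the 16 bytes/key saving breaks even at around 100k keys and saves about 16 MB at 1M keys.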

Contributor

@madolson madolson left a comment


Took another look. The only thing I still don't really like is the random key selection, but I'm not sure how much it really matters. Do we have any performance testing for eviction? That might be more heavily impacted by the increased time spent selecting a random DB. I would still bias towards making the random db selection faster, and random over a long time frame, rather than necessarily accurate at a given point in time.

soloestoy pushed a commit that referenced this pull request Dec 27, 2023
…for shard channels. (#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (#9356), and then replacing the list with slot specific
dictionaries for keys (#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is currently implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
oranagra pushed a commit that referenced this pull request Jan 29, 2024
)

The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from #11695); this was a serious problem until the
change in #12850, since it meant that if all keys are deleted, we won't shrink
the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.

What this PR does:

1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since #12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
oranagra pushed a commit that referenced this pull request Feb 5, 2024
…re (#12822)

# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (#11695) and turn it into a new data structure, `kvstore`, i.e.
it's a class that represents an array of dictionaries.

# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.

# Notes
1. This PR reverts the part of #12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
#12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currently configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.

Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.

before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```

after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
After redis#11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.

This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.

Additionally, there are other modifications:

1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in redis#12705.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…for shard channels. (redis#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (redis#9356), and then replacing the list with slot specific
dictionaries for keys (redis#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is now implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…is#12819)

The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from redis#11695); this was a serious problem until the
change in redis#12850, since it meant that if all keys are deleted, we won't shrink
the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.

What this PR does:

1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since redis#12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…re (redis#12822)

# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (redis#11695) and turn it into a new data structure, `kvstore`, i.e.
it's a class that represents an array of dictionaries.

# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.

# Notes
1. This PR reverts the part of redis#12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
redis#12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currently configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.

Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.

before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```

after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
oranagra added a commit that referenced this pull request Feb 11, 2024
Fail CI:
https://github.com/redis/redis/actions/runs/7837608438/job/21387609715

## Why defragment tests only failed under 32-bit

First of all, under 32-bit jemalloc will allocate more small bins and
less large bins, which will also lead to more external fragmentation,
therefore, the fragmentation ratio is higher in 32-bit than in 64-bit,
so the defragment tests(`Active defrag eval scripts: cluster` and
`Active defrag big keys: cluster`) always fails in 32-bit.

## Why defragment tests only failed with cluster
The following is the result of the `Active defrag eval scripts: cluster`
test.

1) Before #11695, the fragmentation ratio is 3.11%.

2) After #11695, the fragmentation ratio grew to 4.58%.
Since we are using per-slot dictionary to manage slots, we will only
defragment the contents of these dictionaries (keys, values), but not
the dictionaries' struct and ht_table, which means that frequent
shrinking and expanding of the dictionaries, will make more fragments.

3) After #12850 and #12948, in cluster mode, a large number of cluster
slot dicts will be shrunk, creating additional fragmentation, and the
dictionary will not be defragged.

## Solution
* Add defragmentation of the per-slot dictionary's own structures, dict
struct and ht_table.

## Other change
* Increase floating point print precision of `frags` and `rss` in debug
logs for defrag

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request Feb 27, 2024
without timeout

Fix an oversight in redis#11695: if defragmentation doesn't reach the end time, we should wait for the current db's keys and expires to finish before leaving; currently it's possible to exit early when only the keys are defragmented.
oranagra added a commit that referenced this pull request Mar 4, 2024
After #13013

### This PR makes an effort to defrag the pubsub kvstore in the following
ways:

1. Till now server.pubsub(shard)_channels only share channel name obj
with the first subscribed client, now change it so that the clients and
the pubsub kvstore share the channel name robj.
This would save a lot of memory when there are many subscribers to the
same channel.
It also means that we only need to defrag the channel name robj in the
pubsub kvstore, and then update
all client references for the current channel, avoiding the need to
iterate through all the clients to do the same things.
    
2. Refactor the code to defragment pubsub(shard) in the same way as
defragment of keys and EXPIRES, with the exception that we only
defragment pubsub(without shard) when slot is zero.


### Other
Fix an oversight in #11695: if defragmentation doesn't reach the end time, we
should wait for the current
db's keys and expires, pubsub and pubsubshard to finish before leaving,
now it's possible to exit
early when the keys are defragmented.

---------

Co-authored-by: oranagra <oran@redislabs.com>
oranagra added a commit that referenced this pull request Mar 13, 2024
…13072)

Currently (following #11695 and #12822), the keys kvstore and expires
kvstore are both flagged with ON_DEMAND, which means that a cluster node will
only allocate a dict when the slot is assigned to it and populated,
but on the other hand, when the slot is unassigned, the dict will
remain allocated.

We considered releasing the dict when the slot is unassigned, but it
causes complications on replicas. On the other hand, from benchmarks
we conducted, it looks like the performance impact of releasing the
dict when it becomes empty and re-allocate it when a key is added
again, isn't huge.

This PR adds KVSTORE_FREE_EMPTY_DICTS to the cluster mode keys / expires
kvstore.

The impact is about a 2% performance drop, for this hopefully
uncommon scenario.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
sundb added a commit that referenced this pull request Jun 18, 2024
… reset (#13315)

this PR fixes two crashes:

1. Fix missing slotToKeyInit() when using `flushdb async` under cluster
mode.
    #13205

2. Fix missing expires_cursor reset when stopping active defrag in the
middle of defragment.
    #13307
If we stop active defrag in the middle of defragging db->expires, if
`expires_cursor` is not reset to 0, the next time we enable active
defrag again, defragLaterStep(db, ...) will be entered. However, at this
time, `db` has been reset to NULL, which results in a crash.

The affected code was removed by #11695 and #13058 in unstable, so we
just need to backport this to 7.2.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 25, 2025
This is an implementation of redis#10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries should be relatively small. The expire dictionary is also split up so that each slot is logically decoupled, and in subsequent revisions we will be able to atomically flush a slot of data.

* Incremental rehashing - one big change here is that it's no longer one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key from a random bucket, but also to select a random dictionary first. Fairness is a major concern here, as it's possible that keys are unevenly distributed across the slots. To address this, we introduced a binary index tree. With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; it allows us to find the slot for a specific key index. For example, if there are 10 keys in slot 0, we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we pack the slot id into the LSBs of the cursor so it can be passed around between the client and the server. This has an interesting side effect: you can now start scanning a specific slot by simply providing its slot id as the cursor value. The plan is to not document this as defined behavior, however. It's also worth noting that the SCAN API is now technically incompatible with previous versions, although in practice we don't believe this is an issue.
* Checksum calculation optimizations - during command execution we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the slot id cached in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
* Slot info in RDB - in order to size individual dictionaries correctly while loading an RDB, it's not enough to know the total number of keys (we could approximate the number of keys per slot, of course, but it wouldn't be precise). To address this, we've added additional metadata to the RDB that contains the number of keys in each slot, which can be used as a sizing hint during loading.
* DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. To avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field in `redisDb` that keeps track of `key_count`. This way we keep the DBSIZE operation O(1). The same is kept for an O(1) expires count as well.

This change improves SET performance in cluster mode by ~5%; most of the gains come from not having to maintain the per-slot linked lists. Non-cluster mode has the same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead of finding keys to evict.

RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.

* Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS`
* The SCAN API now requires 64 bits to store the cursor, even on 32-bit systems, as the slot information is stored in it.
* New RDB version to support the new op code for SLOT information.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 25, 2025
Dictionary iterator logic in the `tryResizeHashTables` method is picking the next
(incorrect) dictionary while the cursor is at a given slot. This could lead to some
dictionary/slot getting skipped from resizing.

Also stabilize the test.

problem introduced recently in redis#11695
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 25, 2025
Fixing issues described in redis#12672, started after redis#11695
Related to redis#12674

Fixes the `defrag didn't stop' issue.

In some cases of how the keys were stored in memory
defrag_later_item_in_progress was not getting reset once we finish
defragging the later items and we move to the next slot. This stopped
the scan to happen in the later slots and did not get
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…for shard channels. (redis#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (redis#9356), and then replacing the list with slot specific
dictionaries for keys (redis#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is now implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…is#12819)

The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from redis#11695); this was a serious problem until the
change in redis#12850, since it meant that if all keys are deleted, we won't shrink
the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.

What this PR does:

1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since redis#12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…re (redis#12822)

# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (redis#11695) and turn it into a new data structure, `kvstore`, i.e.
it's a class that represents an array of dictionaries.

# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.

# Notes
1. This PR reverts the part of redis#12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
redis#12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currently configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.

Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.

before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```

after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
Fail CI:
https://github.com/redis/redis/actions/runs/7837608438/job/21387609715

## Why defragment tests only failed under 32-bit

First of all, under 32-bit jemalloc will allocate more small bins and
fewer large bins, which will also lead to more external fragmentation;
therefore, the fragmentation ratio is higher in 32-bit than in 64-bit,
so the defragment tests (`Active defrag eval scripts: cluster` and
`Active defrag big keys: cluster`) always fail in 32-bit.

## Why defragment tests only failed with cluster
The following is the result of the `Active defrag eval scripts: cluster`
test.

1) Before redis#11695, the fragmentation ratio is 3.11%.

2) After redis#11695, the fragmentation ratio grew to 4.58%.
Since we are using per-slot dictionary to manage slots, we will only
defragment the contents of these dictionaries (keys, values), but not
the dictionaries' struct and ht_table, which means that frequent
shrinking and expanding of the dictionaries, will make more fragments.

3) After redis#12850 and redis#12948, in cluster mode, a large number of cluster
slot dicts will be shrunk, creating additional fragmentation, and the
dictionary will not be defragged.

## Solution
* Add defragmentation of the per-slot dictionary's own structures, dict
struct and ht_table.

## Other change
* Increase floating point print precision of `frags` and `rss` in debug
logs for defrag

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
After redis#13013

### This PR makes an effort to defrag the pubsub kvstore in the following
ways:

1. Till now server.pubsub(shard)_channels only share channel name obj
with the first subscribed client, now change it so that the clients and
the pubsub kvstore share the channel name robj.
This would save a lot of memory when there are many subscribers to the
same channel.
It also means that we only need to defrag the channel name robj in the
pubsub kvstore, and then update
all client references for the current channel, avoiding the need to
iterate through all the clients to do the same things.
    
2. Refactor the code to defragment pubsub(shard) in the same way as
defragment of keys and EXPIRES, with the exception that we only
defragment pubsub(without shard) when slot is zero.


### Other
Fix an oversight in redis#11695: if defragmentation doesn't reach the end time, we
should wait for the current
db's keys and expires, pubsub and pubsubshard to finish before leaving,
now it's possible to exit
early when the keys are defragmented.

---------

Co-authored-by: oranagra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…edis#13072)

Currently (following redis#11695 and redis#12822), the keys kvstore and expires
kvstore are both flagged with ON_DEMAND, which means that a cluster node will
only allocate a dict when the slot is assigned to it and populated,
but on the other hand, when the slot is unassigned, the dict will
remain allocated.

We considered releasing the dict when the slot is unassigned, but it
causes complications on replicas. On the other hand, from benchmarks
we conducted, it looks like the performance impact of releasing the
dict when it becomes empty and re-allocate it when a key is added
again, isn't huge.

This PR adds KVSTORE_FREE_EMPTY_DICTS to the cluster mode keys / expires
kvstore.

The impact is about a 2% performance drop, for this hopefully
uncommon scenario.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>