
Conversation

@vitarb
Contributor

@vitarb vitarb commented Jan 6, 2023

This is an implementation of #10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the dictEntry. The expire dictionary is also split up so that each slot is logically decoupled, and in subsequent revisions we will be able to atomically flush a slot of data.

Important changes

  • Incremental rehashing - one big change here is that it's no longer one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms.
  • getRandomKey - now needs to not only select a random key from a random bucket, but also to select a random dictionary first. Fairness is a major concern here, as it's possible that keys are unevenly distributed across the slots. To address this, we introduced a binary index that allows us to search the cumulative frequencies of slot ranges (a better explanation is in the code comments, and a minimal sketch follows this list). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
  • Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; it allows us to find the slot for a specific key index. For example, if there are 10 keys in slot 0, we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree.
  • scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we pack the slot id into the LSBs of the cursor so it can be passed around between the client and the server. This has an interesting side effect: you can now start scanning a specific slot by simply providing its slot id as the cursor value. The plan is to not document this as defined behavior, however. It's also worth noting that the SCAN API is now technically incompatible with previous versions, although in practice we don't believe this is an issue.
  • Checksum calculation optimizations - during command execution we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the slot id cached in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
  • Slot info in RDB - in order to size individual dictionaries correctly while loading an RDB, it's not enough to know the total number of keys (we could approximate the number of keys per slot, of course, but it wouldn't be precise). To address this, we've added additional metadata to the RDB that contains the number of keys in each slot, which can be used as a sizing hint during loading.
  • DB size - besides the DBSIZE API, we need to know the size of the DB in many places. To avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field in redisDb that keeps track of key_count. This way we keep the DBSIZE operation O(1). The same is kept for an O(1) expires count as well.
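
To make the binary index idea above a bit more concrete, here is a minimal, self-contained sketch (illustrative only, not the PR's actual code; the `slotIndex` names are made up) of a Fenwick / binary indexed tree over per-slot key counts, together with the binary search that maps a global key index to a slot:

```c
#include <stdlib.h>

#define NUM_SLOTS 16384

typedef struct {
    long long tree[NUM_SLOTS + 1]; /* 1-based Fenwick array of key counts */
} slotIndex;

/* Add delta to the key count of the given slot (0-based). */
static void slotIndexAdd(slotIndex *idx, int slot, long long delta) {
    for (int i = slot + 1; i <= NUM_SLOTS; i += i & -i)
        idx->tree[i] += delta;
}

/* Cumulative number of keys in slots 0..slot (inclusive). */
static long long slotIndexCumulative(slotIndex *idx, int slot) {
    long long sum = 0;
    for (int i = slot + 1; i > 0; i -= i & -i)
        sum += idx->tree[i];
    return sum;
}

/* Smallest slot whose cumulative key count exceeds k, i.e. the slot that
 * contains the key with global index k (0-based).  Each probe costs
 * O(log(slot count)), so the search is O(log^2(slot count)) overall. */
static int slotIndexFindByKeyIndex(slotIndex *idx, long long k) {
    int lo = 0, hi = NUM_SLOTS - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (slotIndexCumulative(idx, mid) > k)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

/* Fair random slot selection: draw k uniformly from [0, total_keys) and map
 * it to a slot, so slots are weighted by how many keys they hold.  rand() is
 * just a stand-in for a proper random source. */
static int slotIndexRandomSlot(slotIndex *idx, long long total_keys) {
    long long k = rand() % total_keys;
    return slotIndexFindByKeyIndex(idx, k);
}
```

The same structure covers the iteration case above: to skip empty slots, jump straight to the slot that holds the next key index instead of walking the slots one by one.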

Performance

This change improves SET performance in cluster mode by ~5%; most of the gains come from not having to maintain the per-slot linked lists. Non-cluster mode should have the same performance. See the benchmarks in the comments below.

Interface changes

  • Removed overhead.hashtable.slot-to-keys from MEMORY STATS.
  • The SCAN API now requires 64 bits to store the cursor, even on 32-bit systems, as the slot information is stored in it (see the sketch below).
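
As a rough illustration of the cursor layout described above (the constants and helper names are assumptions made for this sketch, not the PR's actual code), packing the 14-bit slot id into the low bits leaves the upper bits for the per-dictionary scan cursor:

```c
#include <stdint.h>
#include <assert.h>

#define SLOT_BITS 14                        /* 16384 slots */
#define SLOT_MASK ((1ULL << SLOT_BITS) - 1)

/* Combine a per-dictionary scan cursor and a slot id into one SCAN cursor. */
static uint64_t buildScanCursor(uint64_t dict_cursor, int slot) {
    assert(slot >= 0 && slot < (1 << SLOT_BITS));
    return (dict_cursor << SLOT_BITS) | (uint64_t)slot;
}

static int scanCursorSlot(uint64_t cursor) {
    return (int)(cursor & SLOT_MASK);
}

static uint64_t scanCursorPosition(uint64_t cursor) {
    return cursor >> SLOT_BITS;
}
```

With a layout like this, a bare slot id passed as the cursor (dictionary position 0) starts the scan at that slot, which is the undocumented side effect mentioned above.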

Follow up items:

@sjpotter
Contributor

sjpotter commented Jan 7, 2023

> getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. In my implementation this uses shuffling algorithm that randomly selects any non-empty owned dict with the same probability and then applies existing logic for random key selection there.

Note, this doesn't make keys evenly distributed; as you note, different buckets can have different numbers of keys. This might not be a big deal, but it should be noted.

Contributor

@zuiderkwast zuiderkwast left a comment


I have skimmed through the diff. This looks very promising!

Regarding the topics in the top comment:

Incremental rehashing

I think your solution looks good.

dbRandomKey

I think it needs to be more fair. I have commented on the diff.

scan API

See comment on the diff.

Avoid calculating hash every time

Yes I think we need to do this, at least for checking if it affects performance. A simple way could be to cache the last computed hash inside keyHashSlot itself:

```c
unsigned int keyHashSlot(char *key, int keylen) {
    static char *lastkey = NULL;
    static int lastslot = 0;
    if (key == lastkey) return lastslot;
    lastkey = key;
    int slot = 0;
    /* ... compute the slot ... */
    lastslot = slot;
    return slot;
}
```

> Skip slots that aren't owned during iteration. In this implementation I've decided for simplicity, to always allocate an array of 16k dicts in the main dictionary, this results in a bit of extra memory as well as makes iteration over all dictionaries slightly slower.

For simplicity, I like that we have all the dicts allocated, but if we use an array of dict structs (see also my comment in the diff) instead of an array of dict pointers, we probably get a bit faster, as there are fewer pointers to follow and thus fewer memory accesses.
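
A tiny sketch of the two layouts being compared here (the struct contents are placeholders, not Redis' actual dict definition):

```c
typedef struct dict {
    void *ht_table[2];      /* placeholder fields only */
    unsigned long ht_used[2];
} dict;

struct db_with_pointers {
    dict *dicts[16384];     /* one extra pointer dereference (and a likely
                               cache miss) every time a slot's dict is touched */
};

struct db_with_structs {
    dict dicts[16384];      /* dict headers laid out contiguously, so walking
                               all the slots accesses memory linearly */
};
```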

@vitarb
Contributor Author

vitarb commented Jan 10, 2023

@zuiderkwast @sjpotter thanks for looking at this and providing feedback. As it stands, a couple of areas for improvement are:

  • Making random more fair.
  • Caching/pushing through slot id so we don't need to calculate hashes again.

Another small debt on my side might be writing some tests for SCAN in cluster mode, to ensure that the combined cursor+slot logic works properly (I've manually tested it so far and was relying on a few tests that already exist; I haven't checked whether this case is covered by existing tests).

I'll try to address these issues, meanwhile let's continue discussion in the comments and see if any other area needs to be improved as well.

@vitarb
Contributor Author

vitarb commented Jan 10, 2023

@zuiderkwast I've tested your approach of caching the hash for the last key, and at first I couldn't believe the numbers: read performance doubled and write performance increased 20-30%! But then I realized that it's unfortunately broken; the problem seems to be that addresses are getting reused, so pointers being the same doesn't actually mean the keys are the same.

[Screenshot from 2023-01-10 15-47-34]

In this screenshot you can see that lastkey is pointing to the same place as key, but the hash is actually different from the one calculated before. (I've added an extra call to _keyHashSlot just to catch this issue more easily.)

In fact, jemalloc seems to be reusing the same address over and over again for different keys,
so we'll probably need to use a different approach for this.
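
Just to illustrate one possible direction (a sketch with made-up names, not what the PR ended up doing): comparing the cached key by contents instead of by address avoids the reuse problem, though the extra memcmp eats into the saving for short keys:

```c
#include <string.h>

/* Placeholder for the real CRC16-based slot computation. */
static unsigned int computeSlot(const char *key, int keylen) {
    unsigned int h = 5381;
    for (int i = 0; i < keylen; i++) h = h * 33 + (unsigned char)key[i];
    return h & 16383;
}

unsigned int keyHashSlotCached(const char *key, int keylen) {
    static char lastkey[64];
    static int lastlen = -1;
    static unsigned int lastslot = 0;

    /* Compare the bytes, not the pointer: the allocator may hand out the
     * same address for a different key. */
    if (keylen == lastlen && memcmp(key, lastkey, keylen) == 0)
        return lastslot;

    unsigned int slot = computeSlot(key, keylen);
    if (keylen <= (int)sizeof(lastkey)) {
        memcpy(lastkey, key, keylen);
        lastlen = keylen;
        lastslot = slot;
    }
    return slot;
}
```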


@molian6 molian6 left a comment


Thanks for publishing the PR @vitarb. I'm really excited for this change!
I reviewed db.c and left some questions; I'll continue reviewing the remaining files tomorrow.

@vitarb
Contributor Author

vitarb commented Jan 11, 2023

@molian6

> +1 that 48 bits should be good enough, the public doc suggested max number of keys is 2^32 (though I didn't find this hard limit in the code, I guess it is the limit in a 32 bit system)
> I'm curious where the other 2 bits goes. We should have 64 bits in the unsigned long long, if cursor takes 48 bits and slot takes 14 bits, that adds up to 62 bits.

If we do what @zuiderkwast suggested and move the slot id to the LSBs, then the cursor can occupy the remaining 50 bits. My reasoning for using 48 was to leave the sign bit alone, to make sure that numbers look the same on the client even if a signed long is used there; I also left 1 extra bit for parity and potential future improvements.

@molian6

molian6 commented Jan 11, 2023

> If we do what @zuiderkwast suggested and move the slot id to the LSBs, then the cursor can occupy the remaining 50 bits. My reasoning for using 48 was to leave the sign bit alone, to make sure that numbers look the same on the client even if a signed long is used there; I also left 1 extra bit for parity and potential future improvements.

@vitarb Got it. I do like the idea of the LSB because in reality the cursor will not take all 48 or 50 bits, so we would have more MSBs left for other potential uses in the future. It gives us more flexibility.

@madolson
Contributor

> Yes I think we need to do this, at least for checking if it affects performance. A simple way could be to cache the last computed hash inside keyHashSlot itself:

We had some conversation in another thread about a key handle that would cache the crc and the hash value after the initial lookup. Re-using this handle would remove the extra computation. I think this is still the most promising way forward.
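
A rough sketch of what such a key handle could look like (all names are hypothetical; computeHash/computeSlot stand in for the real dict hash and CRC16 slot functions):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct keyHandle {
    const char *key;
    size_t keylen;
    uint64_t hash;        /* dict hash value, filled in lazily */
    int slot;             /* cluster slot, filled in lazily */
    unsigned hash_valid:1;
    unsigned slot_valid:1;
} keyHandle;

/* Placeholders for the real hashing / CRC16 slot functions. */
static uint64_t computeHash(const char *key, size_t len) {
    uint64_t h = 5381;                            /* illustrative hash only */
    for (size_t i = 0; i < len; i++) h = h * 33 + (unsigned char)key[i];
    return h;
}
static int computeSlot(const char *key, size_t len) {
    return (int)(computeHash(key, len) & 16383);  /* real code uses CRC16 */
}

/* Compute once on first use, then reuse for every later lookup that is
 * given the same handle. */
static uint64_t keyHandleHash(keyHandle *h) {
    if (!h->hash_valid) { h->hash = computeHash(h->key, h->keylen); h->hash_valid = 1; }
    return h->hash;
}
static int keyHandleSlot(keyHandle *h) {
    if (!h->slot_valid) { h->slot = computeSlot(h->key, h->keylen); h->slot_valid = 1; }
    return h->slot;
}
```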

@vitarb vitarb force-pushed the dict-split-by-slot branch from f1e26f0 to 16d4eb3 on January 13, 2023 06:55

@vitarb vitarb force-pushed the dict-split-by-slot branch 2 times, most recently from 6070a1a to 58800d6 on January 17, 2023 20:08
@vitarb vitarb marked this pull request as ready for review January 17, 2023 20:50
@madolson madolson added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Jan 17, 2023
@vitarb vitarb force-pushed the dict-split-by-slot branch 2 times, most recently from 8c4bdee to a20c56b on January 25, 2023 07:03
@vitarb
Contributor Author

vitarb commented Jan 27, 2023

Folks, I think I've addressed all the feedback; the last remaining bit is performance. I'm going to run a set of benchmarks and try to optimize any rough edges. Please take another look and let me know if there are any more suggestions or concerns.

@vitarb
Contributor Author

vitarb commented Feb 2, 2023

As promised before, I've done some performance analysis.

Methodology
I've tested GETs/SETs with cluster mode enabled and disabled, using request pipelining to make any dictionary-related differences more pronounced, with 50M keys and 100 bytes of data per entry, using redis-benchmark (see the exact commands in the screenshots below). Tests were run on a c5.9xlarge x86_64 Linux cloud machine.

Here is a summary of results.

Cluster mode enabled

  • SETs are ~7-8% faster on this branch.
  • GETs are ~1% faster on this branch.

Benchmark comparison in cluster mode:

Here are flamegraphs for SETs in cluster mode:
[flamegraph: this branch]
[flamegraph: unstable]

Looking at the flamegraphs, we can see that the primary reason SETs are faster is that slotToKeyAddEntry has been eliminated, which was consuming ~5% of the CPU time.

GETs:
[flamegraph: this branch]
[flamegraph: unstable]

One curious thing to note here is that dictSdsKeyCompare dropped by almost 8%, but other code paths, including dictFind, grew slightly. I need to look into this a bit more; maybe there is some potential to improve it even further.

Cluster mode disabled

  • SETs have about the same performance.
  • GETs are ~1-2% slower on this branch.

[Screenshot from 2023-02-01 22-44-06]

Looking at the flamegraphs, I couldn't pinpoint the reason for the GET slowdown.
[flamegraph: this branch]
[flamegraph: unstable]

I'm going to investigate this added GET slowness when cluster mode is disabled; meanwhile, let me know if you have any thoughts or recommendations on the testing process or methodology.

@zuiderkwast
Contributor

@filipecosta90 might have an idea about performance. Do any of the benchmarks on grafana run in cluster mode?

@oranagra
Member

oranagra commented Feb 3, 2023

Hi, i quickly skimmed through the discussion (not the code or the code-review comments).
i'm quite concerned about this change, here's a random set of concerns:

  1. as you said, we have an impact on dbsize, which used to be very fast and maybe callers abuse it. let's see a benchmark and consider optimizing it.
  2. the effects on incremental rehashing, random and probably other things we didn't realize yet need some thinking and may take time till we can fully trust this.
  3. i would like to learn more about the possible negative impacts of this on non-clustered deployments. the first thing that comes to mind is memory usage, specifically on small databases.
  4. i didn't yet look at the code and the attempt to avoid re-hashing or re-finding the slot (if we have a cache that is cleared after each command). i worry that maybe you made some optimizations to reduce the negative impacts of this PR in some cases, but these tricks can't be used in other cases, so they're left unoptimized and the PR has a negative impact there, e.g. multi-slot scripts, MGET, SUNIONSTORE, etc.

i'd note that in both 6.2 and 7.2 we added some severe performance regressions for certain (even popular) workloads without realizing it, and it took some time until the community reported these regressions, so this drastic change worries me and i would want to seek more confidence.

@zuiderkwast
Contributor

Some thoughts on points 3-4:

3: I don't think there's any memory impact in standalone mode. There's only one dict, not 16K as in cluster mode.

4: MGET and SUNIONSTORE raise -CROSSSLOT if keys are of different slots, so that shouldn't be an issue. Multi-slot scripts might be a problem though, if they're supported in cluster mode. I'm not sure what we allow in this area.

@oranagra
Member

oranagra commented Feb 3, 2023

Ohh. I didn't know that in non-cluster mode there's one dictionary (didn't look at the code yet).
So what impact does it have on non-cluster mode, or any other negative impact on cluster mode?

@zuiderkwast
Contributor

For performance impact in cluster and non-cluster mode, see vitarb's comment above, the one with the flame graphs. (Theoretically, there shouldn't be any impact on non-cluster. I don't know where it comes from.)

For cluster mode, there is a large fixed memory impact (16K dicts), but a 16-bytes-per-key saving and improved latency.
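
As a back-of-envelope illustration (assuming an empty dict header is on the order of 100 bytes, which is an assumption rather than a measured number): 16384 dicts cost roughly 16384 × 100 B ≈ 1.6 MB of fixed overhead, while the 16 bytes/key saving breaks even at around 100k keys and saves about 16 MB at 1M keys.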

Contributor

@madolson madolson left a comment


Took another look. The only thing I still don't really like is the random key selection, but I'm not sure how much it really matters. Do we have any performance testing for eviction? That might be more heavily impacted by the increased time spent selecting a random DB. I would still bias towards making the random db selection faster, and random over a long time frame, rather than necessarily accurate at a given point in time.

soloestoy pushed a commit that referenced this pull request Dec 27, 2023
…for shard channels. (#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (#9356), and then replacing the list with slot specific
dictionaries for keys (#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is currently implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
oranagra pushed a commit that referenced this pull request Jan 29, 2024
)

The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from #11695); this was a serious problem until the
change in #12850, since it meant that if all keys are deleted, we won't shrink
the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.

What this PR does:

1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since #12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
oranagra pushed a commit that referenced this pull request Feb 5, 2024
…re (#12822)

# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (#11695) and turn it into a new data structure, `kvstore`, i.e.
it's a class that represents an array of dictionaries.

# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.

# Notes
1. This PR reverts the part of #12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
#12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currently configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.

Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.

before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```

after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
After redis#11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.

This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.

Additionally, there are other modifications:

1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in redis#12705.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…for shard channels. (redis#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (redis#9356), and then replacing the list with slot specific
dictionaries for keys (redis#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is now implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…is#12819)

The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from redis#11695); this was a serious problem until the
change in redis#12850, since it meant that if all keys are deleted, we won't shrink
the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.

What this PR does:

1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since redis#12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…re (redis#12822)

# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (redis#11695) and turn it into a new data structure, `kvstore`, i.e.
it's a class that represents an array of dictionaries.

# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.

# Notes
1. This PR reverts the part of redis#12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
redis#12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currently configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.

Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.

before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```

after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
oranagra added a commit that referenced this pull request Feb 11, 2024
Fail CI:
https://github.com/redis/redis/actions/runs/7837608438/job/21387609715

## Why defragment tests only failed under 32-bit

First of all, under 32-bit jemalloc will allocate more small bins and
less large bins, which will also lead to more external fragmentation,
therefore, the fragmentation ratio is higher in 32-bit than in 64-bit,
so the defragment tests(`Active defrag eval scripts: cluster` and
`Active defrag big keys: cluster`) always fails in 32-bit.

## Why defragment tests only failed with cluster
The following is the result of the `Active defrag eval scripts: cluster`
test.

1) Before #11695, the fragmentation ratio is 3.11%.

2) After #11695, the fragmentation ratio grew to 4.58%.
Since we are using per-slot dictionary to manage slots, we will only
defragment the contents of these dictionaries (keys, values), but not
the dictionaries' struct and ht_table, which means that frequent
shrinking and expanding of the dictionaries, will make more fragments.

3) After #12850 and #12948, in cluster mode, a large number of cluster
slot dicts will be shrunk, creating additional fragmentation, and the
dictionary will not be defragged.

## Solution
* Add defragmentation of the per-slot dictionary's own structures, dict
struct and ht_table.

## Other change
* Increase floating point print precision of `frags` and `rss` in debug
logs for defrag

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request Feb 27, 2024
without timeout

Fix an oversight in redis#11695: if defragmentation doesn't reach the end time, we should wait for the current db's keys and expires to finish before leaving; currently it's possible to exit early when only the keys are defragmented.
oranagra added a commit that referenced this pull request Mar 4, 2024
After #13013

### This PR makes an effort to defrag the pubsub kvstore in the following
ways:

1. Till now server.pubsub(shard)_channels only share channel name obj
with the first subscribed client, now change it so that the clients and
the pubsub kvstore share the channel name robj.
This would save a lot of memory when there are many subscribers to the
same channel.
It also means that we only need to defrag the channel name robj in the
pubsub kvstore, and then update
all client references for the current channel, avoiding the need to
iterate through all the clients to do the same things.
    
2. Refactor the code to defragment pubsub(shard) in the same way as
defragment of keys and EXPIRES, with the exception that we only
defragment pubsub(without shard) when slot is zero.


### Other
Fix an oversight in #11695: if defragmentation doesn't reach the end time, we
should wait for the current
db's keys and expires, pubsub and pubsubshard to finish before leaving,
now it's possible to exit
early when the keys are defragmented.

---------

Co-authored-by: oranagra <oran@redislabs.com>
oranagra added a commit that referenced this pull request Mar 13, 2024
…13072)

Currently (following #11695 and #12822), the keys kvstore and expires
kvstore are both flagged with ON_DEMAND, which means that a cluster node will
only allocate a dict when the slot is assigned to it and populated,
but on the other hand, when the slot is unassigned, the dict will
remain allocated.

We considered releasing the dict when the slot is unassigned, but it
causes complications on replicas. On the other hand, from benchmarks
we conducted, it looks like the performance impact of releasing the
dict when it becomes empty and re-allocate it when a key is added
again, isn't huge.

This PR adds KVSTORE_FREE_EMPTY_DICTS to the cluster mode keys / expires
kvstore.

The impact is about a 2% performance drop, for this hopefully
uncommon scenario.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
sundb added a commit that referenced this pull request Jun 18, 2024
… reset (#13315)

this PR fixes two crashes:

1. Fix missing slotToKeyInit() when using `flushdb async` under cluster
mode.
    #13205

2. Fix missing expires_cursor reset when stopping active defrag in the
middle of defragment.
    #13307
If we stop active defrag in the middle of defragging db->expires, if
`expires_cursor` is not reset to 0, the next time we enable active
defrag again, defragLaterStep(db, ...) will be entered. However, at this
time, `db` has been reset to NULL, which results in a crash.

The affected code was removed by #11695 and #13058 in unstable, so we
just need to backport this to 7.2.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 25, 2025
This is an implementation of redis#10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries should be relatively small. The expire dictionary is also split up so that each slot is logically decoupled, and in subsequent revisions we will be able to atomically flush a slot of data.

* Incremental rehashing - one big change here is that it's no longer one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key from a random bucket, but also to select a random dictionary first. Fairness is a major concern here, as it's possible that keys are unevenly distributed across the slots. To address this, we introduced a binary index tree. With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; it allows us to find the slot for a specific key index. For example, if there are 10 keys in slot 0, we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we pack the slot id into the LSBs of the cursor so it can be passed around between the client and the server. This has an interesting side effect: you can now start scanning a specific slot by simply providing its slot id as the cursor value. The plan is to not document this as defined behavior, however. It's also worth noting that the SCAN API is now technically incompatible with previous versions, although in practice we don't believe this is an issue.
* Checksum calculation optimizations - during command execution we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the slot id cached in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
* Slot info in RDB - in order to size individual dictionaries correctly while loading an RDB, it's not enough to know the total number of keys (we could approximate the number of keys per slot, of course, but it wouldn't be precise). To address this, we've added additional metadata to the RDB that contains the number of keys in each slot, which can be used as a sizing hint during loading.
* DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. To avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field in `redisDb` that keeps track of `key_count`. This way we keep the DBSIZE operation O(1). The same is kept for an O(1) expires count as well.

This change improves SET performance in cluster mode by ~5%; most of the gains come from not having to maintain the per-slot linked lists. Non-cluster mode has the same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead of finding keys to evict.

RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.

* Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS`
* The SCAN API now requires 64 bits to store the cursor, even on 32-bit systems, as the slot information is stored in it.
* New RDB version to support the new op code for SLOT information.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 25, 2025
Dictionary iterator logic in the `tryResizeHashTables` method is picking the next
(incorrect) dictionary while the cursor is at a given slot. This could lead to some
dictionary/slot getting skipped from resizing.

Also stabilize the test.

problem introduced recently in redis#11695
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 25, 2025
Fixing issues described in redis#12672, started after redis#11695
Related to redis#12674

Fixes the `defrag didn't stop' issue.

In some cases of how the keys were stored in memory
defrag_later_item_in_progress was not getting reset once we finish
defragging the later items and we move to the next slot. This stopped
the scan to happen in the later slots and did not get
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…for shard channels. (redis#12804)

We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (redis#9356), and then replacing the list with slot specific
dictionaries for keys (redis#11695).

Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.

Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is now implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.

Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.

What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
4. use `server.shard_channel_count` to keep track of the number of all
shard channels.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…is#12819)

The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from redis#11695); this was a serious problem until the
change in redis#12850, since it meant that if all keys are deleted, we won't shrink
the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.

What this PR does:

1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since redis#12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…re (redis#12822)

# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (redis#11695) and turn it into a new data structure, `kvstore`, i.e.
it's a class that represents an array of dictionaries.

# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.

# Notes
1. This PR reverts the part of redis#12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
redis#12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currently configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.

Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.

before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```

after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
Fail CI:
https://github.com/redis/redis/actions/runs/7837608438/job/21387609715

## Why defragment tests only failed under 32-bit

First of all, under 32-bit jemalloc will allocate more small bins and
fewer large bins, which will also lead to more external fragmentation;
therefore, the fragmentation ratio is higher in 32-bit than in 64-bit,
so the defragment tests (`Active defrag eval scripts: cluster` and
`Active defrag big keys: cluster`) always fail in 32-bit.

## Why defragment tests only failed with cluster
The following is the result of the `Active defrag eval scripts: cluster`
test.

1) Before redis#11695, the fragmentation ratio is 3.11%.

2) After redis#11695, the fragmentation ratio grew to 4.58%.
Since we are using per-slot dictionary to manage slots, we will only
defragment the contents of these dictionaries (keys, values), but not
the dictionaries' struct and ht_table, which means that frequent
shrinking and expanding of the dictionaries, will make more fragments.

3) After redis#12850 and redis#12948, in cluster mode, a large number of cluster
slot dicts will be shrunk, creating additional fragmentation, and the
dictionary will not be defragged.

## Solution
* Add defragmentation of the per-slot dictionary's own structures, dict
struct and ht_table.

## Other change
* Increase floating point print precision of `frags` and `rss` in debug
logs for defrag

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
After redis#13013

### This PR makes an effort to defrag the pubsub kvstore in the following
ways:

1. Till now server.pubsub(shard)_channels only share channel name obj
with the first subscribed client, now change it so that the clients and
the pubsub kvstore share the channel name robj.
This would save a lot of memory when there are many subscribers to the
same channel.
It also means that we only need to defrag the channel name robj in the
pubsub kvstore, and then update
all client references for the current channel, avoiding the need to
iterate through all the clients to do the same things.
    
2. Refactor the code to defragment pubsub(shard) in the same way as
defragment of keys and EXPIRES, with the exception that we only
defragment pubsub(without shard) when slot is zero.


### Other
Fix an oversight in redis#11695: if defragmentation doesn't reach the end time, we
should wait for the current
db's keys and expires, pubsub and pubsubshard to finish before leaving,
now it's possible to exit
early when the keys are defragmented.

---------

Co-authored-by: oranagra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…edis#13072)

Currently (following redis#11695 and redis#12822), the keys kvstore and expires
kvstore are both flagged with ON_DEMAND, which means that a cluster node will
only allocate a dict when the slot is assigned to it and populated,
but on the other hand, when the slot is unassigned, the dict will
remain allocated.

We considered releasing the dict when the slot is unassigned, but it
causes complications on replicas. On the other hand, from benchmarks
we conducted, it looks like the performance impact of releasing the
dict when it becomes empty and re-allocate it when a key is added
again, isn't huge.

This PR adds KVSTORE_FREE_EMPTY_DICTS to the cluster mode keys / expires
kvstore.

The impact is about a 2% performance drop, for this hopefully
uncommon scenario.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>