Skip to content

keyspace - Unify key and value & use dict no_value=1#13806

Merged
moticless merged 35 commits into
redis:unstablefrom
moticless:unify-keys-and-vals
May 12, 2025
Merged

keyspace - Unify key and value & use dict no_value=1#13806
moticless merged 35 commits into
redis:unstablefrom
moticless:unify-keys-and-vals

Conversation

@moticless

@moticless moticless commented Feb 16, 2025

Copy link
Copy Markdown
Collaborator

The idea of packing the key (sds), value (robj) and optionally TTL into a single struct in memory was mentioned a few times in the past by the community in various flavors. This approach improves memory efficiency, reduces pointer dereferences for faster lookups, and simplifies expiration management by keeping all relevant data in one place. This change goes along with setting keyspace's dict to no_value=1, and saving considerable amount of memory.

Two more motivations that well aligned with this unification are:

  • Prepare the groundwork for replacing EXPIRE scan based implementation and evaluate instead new ebuckets data structure that was introduced as part of Hash Field Expiration feature. Using this data structure requires embedding the ExpireMeta structure within each object.
  • Consider replacing dict with a more space efficient open addressing approach hash table that might rely on keeping a single pointer to object.

Before this PR, I POC'ed on a variant of open addressing hash-table and was surprised to find that dict with no_value actually could provide a good balance between performance, memory efficiency, and simplicity. This realization prompted the separation of the unification step from the evaluation of a new hash table to avoid introducing too many changes at once and to evaluate its impact independently before considering replacement of existing hash-table. On an earlier commit I extended dict no_value optimization (which saves keeping dictEntry where possible) to be relevant also for objects with even addresses in memory. Combining it with this unification saves a considerable amount of memory for keyspace.

kvobj

This PR adopts Valkey’s packing layout and logic for key, value, and TTL. However, unlike Valkey implementation, which retained a common robj throughout the project, this PR distinguishes between the general-purpose, overused robj, and the new kvobj, which embeds both the key and value and used by the keyspace. Conceptually, robj serves as a base class, while kvobj acts as a derived class.

Two new flags introduced into redis object, iskvobj and expirable:

struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS;
    unsigned iskvobj : 1;             /* new flag */
    unsigned expirable : 1;           /* new flag */
    unsigned refcount : 30;           /* modified: 32bits->30bits */
    void *ptr;
};

typedef struct redisObject robj;
typedef struct redisObject kvobj;

When the iskvobj flag is set, the object includes also the key and it is appended to the end of the object. If the expirable flag is set, an additional 8 bytes are added to the object. If the object is of type string, and the string is rather short, then it will be embedded as well.

As a result, all keys in the keyspace are promoted to be of type kvobj. This term attempts to align with the existing Redis object, robj, and the kvstore data structure.

EXPIRE Implementation

As kvobj embeds expiration time as well, looking up expiration times is now an O(1) operation. And the hash-table of EXPIRE is set now to be no_value mode, directly referencing kvobj entries, and in turn, saves memory.

Next, I plan to evaluate replacing the EXPIRE implementation with the ebuckets data structure, which would eliminate keyspace scans for expired keys. This requires embedding ExpireMeta within each kvobj of each key with expiration. In such implementation, the expirable flag will be shifted to indicate whether ExpireMeta is attached.

Benchmark

Tested briefly on my laptop. From the limited tests I conducted, there is no change in RPS, but the memory savings is significant for short strings:

NumOfKeys SET_Ops/sec (unstable) GET_Ops/sec (unstable) used_memory_human (unstable) SET_Ops/sec (unified) GET_Ops/sec (unified) used_memory_human (unified)
1,000,000 1,093,691.04 1,483,019.43 77.34M 1,157,560.14 1,448,104.29 59.87M
2,000,000 1,058,650.84 1,440,737.43 161.63M 1,137,901.74 1,415,035.60 119.07M
3,000,000 1,030,358.48 1,436,057.39 253.93M 1,080,256.58 1,380,673.88 183.08M
4,000,000 1,079,633.79 1,422,600.57 330.22M 1,087,372.86 1,362,898.23 237.47M
5,000,000 1,023,470.64 1,401,456.06 438.51M 1,068,068.43 1,272,556.06 312.37M
6,000,000 953,786.65 1,274,957.56 514.81M 1,040,961.13 1,352,156.86 365.47M
7,000,000 949,587.36 1,200,983.37 591.10M 1,029,271.15 1,265,204.82 419.46M
8,000,000 954,916.48 1,228,501.61 667.39M 931,313.21 1,111,849.72 474.25M
9,000,000 917,554.47 1,119,246.37 807.69M 950,059.29 1,019,576.66 571.71M
10,000,000 898,903.31 1,209,029.03 883.98M 908,114.36 1,058,442.09 624.07M

Implementation notes

Manipulating keyspace (find, modify, insert)

Initially, unifying the key and value into a single object and storing it in dict with no_value optimization seemed like a quick win. However, it (quickly) became clear that this change required deeper modifications to how keys are manipulated. The challenge was handling cases where a dictEntry is opt-out due to no_value optimization. In such cases, many of the APIs that return the dictEntry from a lookup become insufficient, as it just might be the key itself. To address this issue, a new-old approach of returning a "link" to the looked-up key's dictEntry instead of the dictEntry itself. The term link was already somewhat available in dict API, and is well aligned with the new dictEntLink declaration:

typedef dictEntry **dictEntLink;

This PR introduces two new function APIs to dict to leverage returned link from the search:

dictEntLink dictFindLink(dict *d, const void *key, dictEntLink *bucket);
void dictSetKeyAtLink(dict *d, void *key, dictEntLink *link, int newItem);

After calling link = dictFindLink(...), any necessary updates must be performed immediately after by calling dictSetKeyAtLink() without any intervening operations on given dict. Otherwise, dictEntLink may become invalid. Example:

/* replace existing key */
link = dictFindLink(d, key, &bucket, 0);
// ... Do something, but don't modify the dict ...
// assert(link != NULL);
dictSetKeyAtLink(d, kv, &link, 0);
     
/* Add new value (If no space for the new key, dict will be expanded and 
   bucket will be looked up again.) */  
link = dictFindLink(d, key, &bucket);
// ... Do something, but don't modify the dict ...
// assert(link == NULL);
dictSetKeyAtLink(d, kv, &bucket, 1);

dict.h

  • The dict API has became cluttered with many unused functions. I have removed these from dict.h.
  • Additionally, APIs specifically related to hash maps (no_value=0), primarily those handling key-value access, have been gathered and isolated.
  • Removed entirely internal functions ending with “*ByHash()” that were originally added for optimization and not required any more.
  • Few other legacy dict functions were adapted at API level to work with the term dictEntLink as well.
  • Simplified and generalized an optimization that related to comparison of length of keys of type strings.

Hash Field Expiration

Until now each hash object with expiration on fields needed to maintain a reference to its key-name (of the hash object), such that in case it will be active-expired, then it will be possible to resolve the key-name for the notification sake. Now there is no need anymore.

@ShooterIT ShooterIT left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t fully mastered it yet, but it looks interesting. I believe it could improve both memory usage and performance, especially for datasets with expiration times on many keys. Maybe the introduction of this PR may provide us with the opportunity to use other data types to represent the expire key dictionary. Since the kvobject already contains the expiration time, the expire key data structure only needs iteration functionality. For instance, using an rax tree to represent the expire dictionary could save memory, with each element being a combination of [expiretime + kvobject_addr].

Comment thread src/sds.c
Comment thread src/server.c Outdated
Comment thread src/sds.c Outdated
Comment thread src/object.c Outdated
Comment thread src/object.c Outdated
Comment thread src/object.c Outdated
Comment thread src/object.c Outdated
@moticless

Copy link
Copy Markdown
Collaborator Author

Since the kvobject already contains the expiration time, the expire key data structure only needs iteration functionality. For instance, using an rax tree to represent the expire dictionary could save memory, with each element being a combination of [expiretime + kvobject_addr].

@ShooterIT, that's exactly the direction I'm heading! However, the approach you suggested was already evaluated long ago by Salvatore and didn’t prove effective. My goal is to assess the ebuckets data structure for this purpose, as it has demonstrated low memory consumption per key while maintaining decent performance.

@moticless moticless added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 3, 2025
@redis redis deleted a comment from fcostaoliveira Mar 5, 2025
@redis redis deleted a comment from fcostaoliveira Mar 5, 2025
@moticless moticless added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels Mar 5, 2025
@moticless

Copy link
Copy Markdown
Collaborator Author

Did some basic benchmarking. I run memtier_benchmark (-c 1 -t 1 --pipeline 1000) of SET and GET with short strings on c6i.2xlarge :

numitems Test        Step Cmd        Ops/sec_unify  Ops/sec_unstable  Ops/sec_diff_%  used_memory_unify  used_memory_unstable  used_memory_diff_%                                                                                                            
100000   SINGLE_HT   1    Hsets         1276959.81        1506863.76           18.00            7086088               6562392               -7.39
                     2    Hgets         1733102.25        1813697.04            4.65            6635864               6636456                0.01
                     3    Hdels         1507386.19        1642548.58            8.97             811888                812464                0.07
         MULTIPLE_HT 1    Hsets          293292.12         294659.30            0.47           15931872              17061008                7.09
                     2    Hgets         1683416.66        1744683.08            3.64           15931872              17061008                7.09
                     3    Hdels         1715589.56        1765754.95            2.92           15931872              17061008                7.09
         STRING      1    Sets          1553735.96        1525483.20           -1.82            6364712               8285848               30.18
                     2    Sets          1688533.17        1780087.94            5.42            6364712               8285848               30.18
                     3    Gets          2177368.43        2182214.95            0.22            6389400               8310536               30.07
                     4    Dels          2149197.27        2174291.18            1.17             886104                886680                0.07
         STRING_EX   1    Sets           829730.92         883946.64            6.53            8742168              11783864               34.79
                     2    Sets           981807.11        1176456.75           19.83            8742168              11783864               34.79
                     3    Dels          1690960.13        1557389.81           -7.90             886200                886776                0.06
         INCR        1    Incrs         1449968.83        1632972.99           12.62            6430776               5959912               -7.32
         LIST        1    Lpushs        1234674.60        1209935.99           -2.00            8055464               9184600               14.02
                     2    Lranges       1727891.63        1766441.15            2.23            8080152               9209288               13.97
         SORTED_SET  1    Zadds          862039.24         895768.39            3.91            8904992              10034128               12.68
                     2    Zranges       1547628.26        1737076.15           12.24            8929680              10058816               12.64
         SET         1    Sadds         1205865.33        1141135.66           -5.37            8154368               9283656               13.85
                     2    Smemberss     1702533.37        1921451.08           12.86            8179056               9308344               13.81
                     3    Srems         1256407.68        1258114.84            0.14            1083760               1084488                0.07
500000   SINGLE_HT   1    Hsets         1080557.74        1253333.87           15.99           29278152              29278896                0.00
                     2    Hgets         1908528.07        1993135.64            4.43           29278152              29278896                0.00
                     3    Hdels         1503772.97        1807671.03           20.21            1083760               1084488                0.07
         MULTIPLE_HT 1    Hsets          308362.74         309097.99            0.24           76113392              81278760                6.79
                     2    Hgets         1876426.08        1785880.12           -4.83           76113392              81278760                6.79
                     3    Hdels         1556018.21        1658495.81            6.59           76113392              81278760                6.79
         STRING      1    Sets          1350234.67        1466882.20            8.64           28121544              37278912               32.56
                     2    Sets          1578920.78        1773848.51           12.35           28121544              37278912               32.56
                     3    Gets          2182872.31        1981783.45           -9.21           28121544              37278912               32.56
                     4    Dels          2041841.41        2052300.83            0.51            1083912               1084640                0.07
         STRING_EX   1    Sets           702617.25         821823.69           16.97           39159272              53473280               36.55
                     2    Sets           934191.79        1128734.42           20.82           39159272              53473280               36.55
                     3    Dels          1575468.07        1479942.34           -6.06            1084008               1084736                0.07
         INCR        1    Incrs         1430828.05        1514114.58            5.82           28113544              25278912              -10.08
         LIST        1    Lpushs        1067229.03        1057636.99           -0.90           36113696              41279064               14.30
                     2    Lranges       1664851.98        1679249.85            0.86           36113696              41279064               14.30
         SORTED_SET  1    Zadds          813232.93         783066.96           -3.71           40113848              45279216               12.88
                     2    Zranges       1623160.55        1614465.61           -0.54           40113848              45279216               12.88
         SET         1    Sadds         1088191.38        1146828.45            5.39           36114000              41279368               14.30
                     2    Smemberss     1898513.08        1935561.29            1.95           36114000              41279368               14.30
                     3    Srems         1230808.62        1215690.18           -1.23            1084368               1085096                0.07
1000000  SINGLE_HT   1    Hsets         1176701.78        1312120.72           11.51           57473064              57473808                0.00
                     2    Hgets         1817497.78        1856034.80            2.12           57473064              57473808                0.00
                     3    Hdels         1469903.00        1683079.50           14.50            1084368               1085096                0.07
         MULTIPLE_HT 1    Hsets          298590.00         299179.50            0.20          151153344             161473672                6.83
                     2    Hgets         1729969.12        1737613.42            0.44          151153344             161473672                6.83

Takeaway

  • Memory Savings: This PR considerably reduces used_mem by 25-37% for short strings.
  • Performance Tradeoff: This table also reflects the tradeoff between maintaining simple dict with key and value versus optimized dict with no value. The expected RPS gain from avoiding an extra pointer indirection is offset by additional maintenance overhead in the PR’s packed key-value design.
  • Room to optimize:
    • Packing key and value efficiency and readability.
    • Consider replacing the hash table in the future.
    • SDS is pricy. Consider use in the future plain string and fix 4 bytes length for keys.

@moticless moticless added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels Mar 18, 2025
@redis redis deleted a comment from fcostaoliveira Mar 18, 2025
@moticless moticless added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels Mar 18, 2025
@redis redis deleted a comment from fcostaoliveira Mar 18, 2025
@moticless moticless added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels Mar 18, 2025
@redis redis deleted a comment from fcostaoliveira Mar 18, 2025
@moticless moticless added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels Mar 18, 2025
Comment thread src/sds.h Outdated
Comment thread src/dict.c Outdated
Comment thread src/t_string.c Outdated
Comment thread src/sentinel.c Outdated
Comment thread src/sds.c
Comment thread src/server.h Outdated
Comment thread src/dict.h Outdated
Comment thread src/t_string.c
Comment thread src/server.c Outdated
Comment thread src/t_string.c Outdated
Comment thread src/expire.c Outdated
Comment thread src/dict.c Outdated
@sundb

sundb commented May 7, 2025

Copy link
Copy Markdown
Collaborator

@moticless moticless merged commit e1789e4 into redis:unstable May 12, 2025
@moticless moticless deleted the unify-keys-and-vals branch May 28, 2025 03:43
ShooterIT added a commit that referenced this pull request Jun 5, 2025
This PR is based on: valkey-io/valkey#861

> ### Memory Access Amortization
> (Designed and implemented by [dan
touitou](https://github.com/touitou-dan))
> 
> Memory Access Amortization (MAA) is a technique designed to optimize
the performance of dynamic data structures by reducing the impact of
memory access latency. It is applicable when multiple operations need to
be executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
> 
> Rather than executing operations sequentially, this approach
interleaves the execution of all operations. This is done in such a way
that whenever a memory access is required during an operation, the
program prefetches the necessary memory and transitions to another
operation. This ensures that when one operation is blocked awaiting
memory access, other memory accesses are executed in parallel, thereby
reducing the average access latency.
> 
> We applied this method in the development of dictPrefetch, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, dictPrefetch is an interleaved execution of dictFind for
all the keys.

### Implementation of Redis
When the main thread processes clients with ready-to-execute commands
(i.e., clients for which the IO thread has parsed the commands), a batch
of up to 16 commands is created. Initially, the command's argv, which
were allocated by the IO thread, is prefetched to the main thread's L1
cache. Subsequently, all the dict entries and values required for the
commands are prefetched from the dictionary before the command
execution.

#### Memory prefetching for main hash table
As shown in the picture, after #13806
, we unify key value and the dict uses no_value optimization, so the
memory prefetching has 4 steps:

1. prefetch the bucket of the hash table
2. prefetch the entry associated with the given key's hash
3. prefetch the kv object of the entry
4. prefetch the value data of the kv object

we also need to handle the case that the dict entry is the pointer of kv
object, just skip step 3.

MAA can improves single-threaded memory access efficiency by
interleaving the execution of multiple independent operations, allowing
memory-level parallelism and better CPU utilization. Its key point is
batch-wise interleaved execution. Split a batch of independent
operations (such as multiple key lookups) into multiple state machines,
and interleave their progress within a single thread to hide the memory
access latency of individual requests.

The difference between serial execution and interleaved execution:
**naive serial execution**
```
key1: step1 → wait → step2 → wait → done
key2: step1 → wait → step2 → wait → done
```
**interleaved execution**
```
key1: step1   → step2   → done
key2:   step1 → step2   → done
key3:     step1 → step2 → done
         ↑ While waiting for key1’s memory, progress key2/key3
```

#### New configuration
This PR involves a new configuration `prefetch-batch-max-size`, but we
think it is a low level optimization, so we hide this config:
When multiple commands are parsed by the I/O threads and ready for
execution, we take advantage of knowing the next set of commands and
prefetch their required dictionary entries in a batch. This reduces
memory access costs. The optimal batch size depends on the specific
workflow of the user. The default batch size is 16, which can be
modified using the 'prefetch-batch-max-size' config.
When the config is set to 0, prefetching is disabled.

---------

Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
sundb added a commit that referenced this pull request Jul 30, 2025
…4233)

Introduced by #13806

Fixed a crash in the MOVE command when moving hash objects that have
both key expiration and field expiration.

The issue occurred in the following scenario:
1. A hash has both key expiration and field expiration.
2. During MOVE command, `setExpireByLink()` is called to set the
expiration time for the target hash, which may reallocate the kvobj of
hash.
3. Since the hash has field expiration, `hashTypeAddToExpires()` is
called to update the minimum field expiration time

Issue:
However, the kvobj pointer wasn't updated with the return value from
`setExpireByLink()`, causing `hashTypeAddToExpires()` to use freed
memory.
@sundb sundb removed the action:run-benchmark Triggers the benchmark suite for this Pull Request label Sep 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants