Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented Sep 6, 2021

Whenever a key is looked up, the dictEntry (hashtable bucket) and the key
are accessed together. This PR embeds the key in the same allocation as the
dictEntry, which makes sure they are in the same cache line, thus saving
one slow memory access.

This compensates for the potential regression from #9356 for key lengths
15-22 in cluster mode. Furthermore, latency is expected to improve for
other key lengths in cluster mode as well as in plain Redis. Benchmarks
will show.

The callbacks in dictType are changed in the following ways:

  • dictEntryMetadataBytes is replaced by createEntry, which (if present)
    takes care of allocating the entry and setting the key.
  • keyDestructor and valueDestructor are passed the dictEntry as an extra
    argument, so they can check if the key is embedded in the dictEntry and
    then not try to free it.
  • repairEntry is added, which is called from active defrag code when an
    entry has been reallocated, moving custom logic from the active defrag
    code to the dict type.
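
As a rough illustration of the idea (not the exact code in this PR; the trailing flexible array member and the helper function are assumptions), an entry with the key embedded at the end of the same allocation could look like this:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only: a dict entry whose key bytes live in the same
 * allocation, so a lookup touching entry->key stays close to the entry in
 * memory. The 'keybuf' field and function names are hypothetical. */
typedef struct embeddedKeyEntry {
    void *key;                     /* points into keybuf[] when embedded */
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct embeddedKeyEntry *next; /* next entry in the same bucket */
    char keybuf[];                 /* embedded key bytes (must be last) */
} embeddedKeyEntry;

/* A createEntry-style callback would do one allocation for entry + key and
 * point 'key' at the embedded bytes, instead of allocating the key separately. */
static embeddedKeyEntry *createEntryWithKey(const char *keybytes, size_t keysize) {
    embeddedKeyEntry *e = malloc(sizeof(*e) + keysize);
    if (!e) return NULL;
    memcpy(e->keybuf, keybytes, keysize);
    e->key = e->keybuf;
    e->next = NULL;
    return e;
}
```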

TODO:

  • [x] Working implementation. Tests passing.
  • [ ] Benchmarks for cluster/plain Redis, key lengths of different
        allocation sizes.
  • [ ] Memory usage comparison.

Open questions:

  • What's the maximum key length we want to embed in a dictEntry?
  • Do we want a similar optimization for other dicts, e.g. hashes?

@zuiderkwast zuiderkwast marked this pull request as draft September 6, 2021 10:30
@madolson
Contributor

madolson commented Sep 6, 2021

It's unfortunate that we're paying 8 bytes for the key ptr even though it doesn't do anything. We could stuff a bit into the bottom 3 bits on 8-byte-aligned systems, since their pointers always end in 000, which could be used to indicate whether it's the new or the old address scheme. We would need to go through and make sure everyone only uses deGetKey, which would handle the pointer maintenance. I'm not sure if that would be over-optimizing things though.
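
(A minimal sketch of that tagging trick, assuming 8-byte-aligned pointers; the flag name and helpers below are made up and not part of this PR:)

```c
#include <stdint.h>

/* Illustrative only: use the lowest bit of an 8-byte-aligned pointer as a
 * flag (e.g. "key is embedded"), since such pointers always end in 000. */
#define PTR_FLAG_EMBEDDED ((uintptr_t)1)

static inline void *ptrSetFlag(void *p)   { return (void *)((uintptr_t)p | PTR_FLAG_EMBEDDED); }
static inline int   ptrHasFlag(void *p)   { return ((uintptr_t)p & PTR_FLAG_EMBEDDED) != 0; }
static inline void *ptrClearFlag(void *p) { return (void *)((uintptr_t)p & ~PTR_FLAG_EMBEDDED); }
```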

Besides that, I think it makes sense. I'll take a complete look once the tests pass (:

@oranagra
Member

oranagra commented Sep 7, 2021

I'm concerned that we may be going too far towards coupling more and more parts of Redis with that dict, which we also want to replace one day (due to the rehashing memory issue).
That's no reason not to do incremental changes, but I'm just stressing that when the day comes we may have more areas in the code that are tightly coupled and will need more work.
That said, let's proceed with this task, see where it leads us, and make the call when ready.

@zuiderkwast
Contributor Author

It's unfortunate that we're paying 8 bytes for the key ptr even though it doesn't do anything. We could stuff a bit into the bottom 3 bits on 8-byte-aligned systems, since their pointers always end in 000, which could be used to indicate whether it's the new or the old address scheme.

@madolson An sds string pointer is an odd address, since its header (an odd number of bytes) sits directly before the pointer. There might still be bits to use though, for example in the 'value' or the 'next' pointer. (We'd need to move the dictEntry fields around, since the embedded key needs to be last.) We have the same waste of bits in the robj with an embedded sds (many DB values). I think it can be a different PR.

I'm concerned that we may be going too far towards coupling more and more parts of Redis with that dict, which we also wanna replace one day (due to the rehashing memory issue)

@oranagra I'm well aware. My #8700 was stalled since I couldn't make the build pass, and now it has diverged since #9356 was merged. It's still possible to rebase the HAMT. We'd need to wrap the key-with-cluster-metadata-and-maybe-embedded-key in another struct (basically the dictEntry without the value and next-pointer).

The build's passing now, so it's ready for (pre-)review and benchmarks.

@zuiderkwast
Contributor Author

Memory usage analysis

Details (TL;DR Fewer allocations, no impact on total usage)

We have the following allocator bin sizes: 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, ...

In non-cluster mode, the dictEntry is 24 bytes and in cluster mode, it's 40 bytes.

Keys up to length 31 have 2 bytes overhead (SDS type 5) and lengths 32-255 have 4 bytes overhead (SDS type 8). We only embed sds type 5 keys and only if the combined size is up to 64 bytes.
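
To make that rule concrete, here is a small illustrative helper (the function name is made up; the 24/40-byte entry sizes and the 64-byte limit are the figures above) that decides whether a key would be embedded; it matches the tables below.

```c
#include <stddef.h>

/* Illustrative only: would a key of 'keylen' bytes be embedded?
 * Only sds type 5 keys (length < 32) are embedded, and only if entry + key
 * fit in a 64-byte allocation. dictEntrySize is 24 (non-cluster) or 40 (cluster). */
static int wouldEmbedKey(size_t keylen, size_t dictEntrySize) {
    if (keylen >= 32) return 0;           /* sds type 8 or larger: not embedded */
    size_t sdsSize = 1 + keylen + 1;      /* type-5 header + string + null terminator */
    return dictEntrySize + sdsSize <= 64; /* must fit the 64-byte bin */
}

/* e.g. wouldEmbedKey(22, 40) == 1 (cluster, 15-22 row),
 *      wouldEmbedKey(23, 40) == 0 (cluster, 23-30 row). */
```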

| Key size (bytes) | non-cluster, separate allocations | non-cluster, combined allocation |
|------------------|-----------------------------------|----------------------------------|
| 1-6              | 8 + 24 = 32                       | 32                               |
| 7-14             | 16 + 24 = 40                      | 40                               |
| 15-22            | 24 + 24 = 48                      | 48                               |
| 23-30            | 32 + 24 = 56                      | 56                               |
| 31               | 40 + 24 = 64                      | 64                               |
| 32-36            | 40 + 24 = 64                      | 64 (not used)                    |

| Key size (bytes) | cluster, separate allocations | cluster, combined allocation |
|------------------|-------------------------------|------------------------------|
| 1-6              | 8 + 40 = 48                   | 48                           |
| 7-14             | 16 + 40 = 56                  | 56                           |
| 15-22            | 24 + 40 = 64                  | 64                           |
| 23-30            | 32 + 40 = 72                  | 80 (not used)                |

As seen above, the total allocation size is the same for combined and separate allocations, for the cases where we use embedded keys.

@zuiderkwast
Contributor Author

Benchmark: 14 bytes keys, integer values, non-cluster (on laptop)

TL;DR 10% latency win!

This PR:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int__
set k:__rand_int__ 1__rand_int__: 908843.00 requests per second, p50=0.479 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int__
set k:__rand_int__ 1__rand_int__: 1025325.50 requests per second, p50=0.431 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int__
set k:__rand_int__ 1__rand_int__: 973899.50 requests per second, p50=0.455 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get k:__rand_int__
get k:__rand_int__: 1210067.75 requests per second, p50=0.375 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del k:__rand_int__
del k:__rand_int__: 1140901.25 requests per second, p50=0.383 msec

Unstable:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int
set k:__rand_int__ 1__rand_int: 815793.75 requests per second, p50=0.503 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int
set k:__rand_int__ 1__rand_int: 814730.31 requests per second, p50=0.535 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int
set k:__rand_int__ 1__rand_int: 867980.19 requests per second, p50=0.519 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get k:__rand_int__
get k:__rand_int__: 975419.44 requests per second, p50=0.463 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del k:__rand_int__

@zuiderkwast zuiderkwast marked this pull request as ready for review September 8, 2021 07:06
@zuiderkwast
Contributor Author

zuiderkwast commented Sep 8, 2021

Cluster Benchmark: 17 bytes keys, integer values (on laptop)

TL;DR It's pretty good

Disclaimer: The cluster nodes and redis-benchmark ran on the same laptop.

Here's a throughput chart. Details below.

[throughput chart]

Cluster: 3 masters, no replicas, no AOF, no rdb.

redis-benchmark options: OPTS='--cluster -p 30001 --threads 2 -P 10 -r 10000000 -n 10000000 -q'.

This PR ("embedded key"):

$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1904036.62 requests per second, p50=0.135 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1901140.62 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1904036.62 requests per second, p50=0.135 msec
$ ./redis-benchmark $OPTS get '{tag}__rand_int__'
get {tag}__rand_int__: 1995609.62 requests per second, p50=0.135 msec
$ ./redis-benchmark $OPTS del '{tag}__rand_int__'
del {tag}__rand_int__: 1996406.50 requests per second, p50=0.127 msec

Unstable ("dict metadata"):

$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1662786.88 requests per second, p50=0.159 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1537988.38 requests per second, p50=0.183 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1666111.38 requests per second, p50=0.167 msec
$ ./redis-benchmark $OPTS get '{tag}__rand_int__'
get {tag}__rand_int__: 1901140.62 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS del '{tag}__rand_int__'
del {tag}__rand_int__: 1904399.25 requests per second, p50=0.151 msec

Old unstable (just before #9356 "dict metadata" was merged):

$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1537751.88 requests per second, p50=0.247 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1478633.75 requests per second, p50=0.247 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1599232.38 requests per second, p50=0.191 msec
$ ./redis-benchmark $OPTS get '{tag}__rand_int__'
get {tag}__rand_int__: 1817851.38 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS del '{tag}__rand_int__'
del {tag}__rand_int__: 1596933.88 requests per second, p50=0.271 msec

@zuiderkwast
Contributor Author

zuiderkwast commented Sep 8, 2021

Cluster Benchmark: 22 bytes keys, integer values (on laptop)

This is the one with a regression in #9356

Details below.

[throughput chart]

Explanation: In this case, "old unstable" has a 24-byte dictEntry plus a 24-byte key allocation. The later ones have either 40 + 24 bytes in separate allocations (unstable with dict metadata) or a single 64-byte allocation (this PR). The extra slot-to-key metadata in the dictEntry is not used for GET.

Cluster: 3 masters, no replicas, no AOF, no rdb.

$ OPTS='--cluster -p 30001 --threads 2 -P 10 -r 10000000 -n 10000000 -q'

This PR:

$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1813894.38 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1736111.00 requests per second, p50=0.151 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1814552.62 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS get 'key:{tag}:__rand_int__'
get key:{tag}:__rand_int__: 1900779.25 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS del 'key:{tag}:__rand_int__'
del key:{tag}:__rand_int__: 1904036.62 requests per second, p50=0.135 msec

Unstable ("dict metadata")

$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1477759.62 requests per second, p50=0.295 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1428163.38 requests per second, p50=0.295 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1595659.75 requests per second, p50=0.279 msec
$ ./redis-benchmark $OPTS get 'key:{tag}:__rand_int__'
get key:{tag}:__rand_int__: 1899335.25 requests per second, p50=0.223 msec
$ ./redis-benchmark $OPTS del 'key:{tag}:__rand_int__'
del key:{tag}:__rand_int__: 1734605.38 requests per second, p50=0.263 msec

Old unstable (just before #9356 was merged):

$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1329787.25 requests per second, p50=0.335 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1330141.00 requests per second, p50=0.343 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1477978.12 requests per second, p50=0.311 msec
$ ./redis-benchmark $OPTS get 'key:{tag}:__rand_int__'
get key:{tag}:__rand_int__: 1994813.50 requests per second, p50=0.215 msec
$ ./redis-benchmark $OPTS del 'key:{tag}:__rand_int__'
del key:{tag}:__rand_int__: 1425110.38 requests per second, p50=0.319 msec

@zuiderkwast
Contributor Author

Benchmark: 17 bytes keys, integer values, non-cluster (on laptop)

TL;DR 10% better than unstable here too

[throughput chart]

This PR:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 888573.00 requests per second, p50=0.479 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 887075.31 requests per second, p50=0.479 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 998302.88 requests per second, p50=0.439 msec 
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get key::__rand_int__
get key::__rand_int__: 1142596.00 requests per second, p50=0.383 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del key::__rand_int__
del key::__rand_int__: 1052188.62 requests per second, p50=0.415 msec

Unstable:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 814730.31 requests per second, p50=0.511 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 784006.25 requests per second, p50=0.535 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 868130.94 requests per second, p50=0.503 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get key::__rand_int__
get key::__rand_int__: 999700.06 requests per second, p50=0.439 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del key::__rand_int__
del key::__rand_int__: 929800.06 requests per second, p50=0.479 msec

@oranagra
Member

oranagra commented Sep 9, 2021

This is the one with a regression in #9356

Don't we expect to see a regression in unstable and an improvement in this PR (rather than two improvements)?

@zuiderkwast
Contributor Author

This is the one with a regression in #9356

Don't we expect to see a regression in unstable and an improvement in this PR (rather than two improvements)?

In that benchmark, there's a regression for GET and no improvement in this PR for that case. I didn't get a regression for updates. I think it depends on how the benchmark was done.

@zuiderkwast
Contributor Author

@oranagra @madolson Are you interested in merging this PR? It has been ready for a while but now there are merge conflicts. I can solve them if you're interested but otherwise I'll close this PR.

@oranagra
Member

oranagra commented Dec 7, 2021

I think I'm interested, but I don't have the attention span for it right now (too many big things to move in order to get RC1 out in time).
Let's keep this on the table for later.

@oranagra oranagra changed the title Key embedded in dictEntry Key embedded in dictEntry - CPU cache friendly Dec 7, 2021
@zuiderkwast zuiderkwast added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Dec 13, 2021
@uvletter
Contributor

Cluster Benchmark: 22 bytes keys, integer values (on laptop)

BTW, I'm a bit curious that there seems to be almost no performance improvement for GET compared with "dict metadata", unlike the 17-byte-key case above; the cache miss introduced by "dict metadata" was expected to be mitigated since the dictEntry and key are allocated together.
Do you have any ideas, @zuiderkwast? Thanks :-)
