Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented Sep 6, 2021

Whenever a key is looked up, the dictEntry (hashtable bucket) and the key
are accessed together. This PR embeds the key in the same allocation as the
dictEntry, which makes sure they are in the same cache line, thus saving
one slow memory access.

This compensates for the potential regression from #9356 for key lengths
15-22 in cluster mode. Furthermore, latency is expected to improve for
other key lengths in cluster mode as well as in plain Redis. Benchmarks
will show.

The callbacks in dictType are changed in the following ways:

  • dictEntryMetadataBytes is replaced by createEntry, which (if present)
    takes care of allocating the entry and setting the key.
  • keyDestructor and valueDestructor are passed the dictEntry as an extra
    argument, so they can check if the key is embedded in the dictEntry and
    then not try to free it.
  • repairEntry is added, which is called from active defrag code when an
    entry has been reallocated, moving custom logic from the active defrag
    code to the dict type.
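
As a rough illustration of the idea (not the exact code in this PR; the trailing flexible array member and the helper function are assumptions), an entry with the key embedded at the end of the same allocation could look like this:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only: a dict entry whose key bytes live in the same
 * allocation, so a lookup touching entry->key stays close to the entry in
 * memory. The 'keybuf' field and function names are hypothetical. */
typedef struct embeddedKeyEntry {
    void *key;                     /* points into keybuf[] when embedded */
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct embeddedKeyEntry *next; /* next entry in the same bucket */
    char keybuf[];                 /* embedded key bytes (must be last) */
} embeddedKeyEntry;

/* A createEntry-style callback would do one allocation for entry + key and
 * point 'key' at the embedded bytes, instead of allocating the key separately. */
static embeddedKeyEntry *createEntryWithKey(const char *keybytes, size_t keysize) {
    embeddedKeyEntry *e = malloc(sizeof(*e) + keysize);
    if (!e) return NULL;
    memcpy(e->keybuf, keybytes, keysize);
    e->key = e->keybuf;
    e->next = NULL;
    return e;
}
```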

TODO:

  • [x] Working implementation. Tests passing.
  • [ ] Benchmarks for cluster/plain Redis, key lengths of different
        allocation sizes.
  • [ ] Memory usage comparison.

Open questions:

  • What's the maximum key length we want to embed in a dictEntry?
  • Do we want a similar optimization for other dicts, e.g. hashes?

@zuiderkwast zuiderkwast marked this pull request as draft September 6, 2021 10:30
@madolson
Contributor

madolson commented Sep 6, 2021

It's unfortunate that we're paying 8 bytes for the key ptr even though it doesn't do anything. We could stuff a bit into the bottom 3 bits on 8-byte-aligned systems, since their pointers always end in 000, which could be used to indicate whether it's the new or the old address scheme. We would need to go through and make sure everyone only uses deGetKey, which would handle the pointer maintenance. I'm not sure if that would be over-optimizing things though.
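
(A minimal sketch of that tagging trick, assuming 8-byte-aligned pointers; the flag name and helpers below are made up and not part of this PR:)

```c
#include <stdint.h>

/* Illustrative only: use the lowest bit of an 8-byte-aligned pointer as a
 * flag (e.g. "key is embedded"), since such pointers always end in 000. */
#define PTR_FLAG_EMBEDDED ((uintptr_t)1)

static inline void *ptrSetFlag(void *p)   { return (void *)((uintptr_t)p | PTR_FLAG_EMBEDDED); }
static inline int   ptrHasFlag(void *p)   { return ((uintptr_t)p & PTR_FLAG_EMBEDDED) != 0; }
static inline void *ptrClearFlag(void *p) { return (void *)((uintptr_t)p & ~PTR_FLAG_EMBEDDED); }
```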

Besides that, I think it makes sense. I'll take a complete look once the tests pass (:

@oranagra
Member

oranagra commented Sep 7, 2021

I'm concerned that we may be going too far towards coupling more and more parts of Redis with that dict, which we also want to replace one day (due to the rehashing memory issue).
That's no reason not to do incremental changes, but I'm just stressing that when the day comes we may have more areas in the code that are tightly coupled and will need more work.
That said, let's proceed with this task, see where it leads us, and make the call when ready.

@zuiderkwast
Contributor Author

It's unfortunate that we're paying 8 bytes for the key ptr even though it doesn't do anything. We could stuff a bit into the bottom 3 bits on 8-byte-aligned systems, since their pointers always end in 000, which could be used to indicate whether it's the new or the old address scheme.

@madolson An sds string pointer is an odd address, since its header (an odd number of bytes) sits directly before the pointer. There might still be bits to use though, for example in the 'value' or the 'next' pointer. (We'd need to move the dictEntry fields around, since the embedded key needs to be last.) We have the same waste of bits in the robj with an embedded sds (many DB values). I think it can be a different PR.

I'm concerned that we may be going too far towards coupling more and more parts of Redis with that dict, which we also wanna replace one day (due to the rehashing memory issue)

@oranagra I'm well aware. My #8700 was stalled since I couldn't make the build pass, and now it has diverged since #9356 was merged. It's still possible to rebase the HAMT. We'd need to wrap the key-with-cluster-metadata-and-maybe-embedded-key in another struct (basically the dictEntry without the value and next-pointer).

The build's passing now, so it's ready for (pre-)review and benchmarks.

@zuiderkwast
Contributor Author

Memory usage analysis

Details (TL;DR Fewer allocations, no impact on total usage)

We have the following allocator bin sizes: 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, ...

In non-cluster mode, the dictEntry is 24 bytes and in cluster mode, it's 40 bytes.

Keys up to length 31 have 2 bytes overhead (SDS type 5) and lengths 32-255 have 4 bytes overhead (SDS type 8). We only embed sds type 5 keys and only if the combined size is up to 64 bytes.
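
To make that rule concrete, here is a small illustrative helper (the function name is made up; the 24/40-byte entry sizes and the 64-byte limit are the figures above) that decides whether a key would be embedded; it matches the tables below.

```c
#include <stddef.h>

/* Illustrative only: would a key of 'keylen' bytes be embedded?
 * Only sds type 5 keys (length < 32) are embedded, and only if entry + key
 * fit in a 64-byte allocation. dictEntrySize is 24 (non-cluster) or 40 (cluster). */
static int wouldEmbedKey(size_t keylen, size_t dictEntrySize) {
    if (keylen >= 32) return 0;           /* sds type 8 or larger: not embedded */
    size_t sdsSize = 1 + keylen + 1;      /* type-5 header + string + null terminator */
    return dictEntrySize + sdsSize <= 64; /* must fit the 64-byte bin */
}

/* e.g. wouldEmbedKey(22, 40) == 1 (cluster, 15-22 row),
 *      wouldEmbedKey(23, 40) == 0 (cluster, 23-30 row). */
```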

| Key size (bytes) | non-cluster, separate allocations | non-cluster, combined allocation |
|------------------|-----------------------------------|----------------------------------|
| 1-6              | 8 + 24 = 32                       | 32                               |
| 7-14             | 16 + 24 = 40                      | 40                               |
| 15-22            | 24 + 24 = 48                      | 48                               |
| 23-30            | 32 + 24 = 56                      | 56                               |
| 31               | 40 + 24 = 64                      | 64                               |
| 32-36            | 40 + 24 = 64                      | 64 (not used)                    |

| Key size (bytes) | cluster, separate allocations | cluster, combined allocation |
|------------------|-------------------------------|------------------------------|
| 1-6              | 8 + 40 = 48                   | 48                           |
| 7-14             | 16 + 40 = 56                  | 56                           |
| 15-22            | 24 + 40 = 64                  | 64                           |
| 23-30            | 32 + 40 = 72                  | 80 (not used)                |

As seen above, the total allocation size is the same for combined and separate allocations, for the cases where we use embedded keys.

@zuiderkwast
Contributor Author

Benchmark: 14 bytes keys, integer values, non-cluster (on laptop)

TL;DR 10% latency win!

This PR:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int__
set k:__rand_int__ 1__rand_int__: 908843.00 requests per second, p50=0.479 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int__
set k:__rand_int__ 1__rand_int__: 1025325.50 requests per second, p50=0.431 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int__
set k:__rand_int__ 1__rand_int__: 973899.50 requests per second, p50=0.455 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get k:__rand_int__
get k:__rand_int__: 1210067.75 requests per second, p50=0.375 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del k:__rand_int__
del k:__rand_int__: 1140901.25 requests per second, p50=0.383 msec

Unstable:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int
set k:__rand_int__ 1__rand_int: 815793.75 requests per second, p50=0.503 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int
set k:__rand_int__ 1__rand_int: 814730.31 requests per second, p50=0.535 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set k:__rand_int__ 1__rand_int
set k:__rand_int__ 1__rand_int: 867980.19 requests per second, p50=0.519 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get k:__rand_int__
get k:__rand_int__: 975419.44 requests per second, p50=0.463 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del k:__rand_int__

@zuiderkwast zuiderkwast marked this pull request as ready for review September 8, 2021 07:06
@zuiderkwast
Contributor Author

zuiderkwast commented Sep 8, 2021

Cluster Benchmark: 17 bytes keys, integer values (on laptop)

TL;DR It's pretty good

Disclaimer: The cluster nodes and redis-benchmark ran on the same laptop.

Here's a throughput chart. Details below.

[throughput chart]

Cluster: 3 masters, no replicas, no AOF, no rdb.

redis-benchmark options: OPTS='--cluster -p 30001 --threads 2 -P 10 -r 10000000 -n 10000000 -q'.

This PR ("embedded key"):

$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1904036.62 requests per second, p50=0.135 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1901140.62 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1904036.62 requests per second, p50=0.135 msec
$ ./redis-benchmark $OPTS get '{tag}__rand_int__'
get {tag}__rand_int__: 1995609.62 requests per second, p50=0.135 msec
$ ./redis-benchmark $OPTS del '{tag}__rand_int__'
del {tag}__rand_int__: 1996406.50 requests per second, p50=0.127 msec

Unstable ("dict metadata"):

$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1662786.88 requests per second, p50=0.159 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1537988.38 requests per second, p50=0.183 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1666111.38 requests per second, p50=0.167 msec
$ ./redis-benchmark $OPTS get '{tag}__rand_int__'
get {tag}__rand_int__: 1901140.62 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS del '{tag}__rand_int__'
del {tag}__rand_int__: 1904399.25 requests per second, p50=0.151 msec

Old unstable (just before #9356 "dict metadata" was merged):

$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1537751.88 requests per second, p50=0.247 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1478633.75 requests per second, p50=0.247 msec
$ ./redis-benchmark $OPTS set '{tag}__rand_int__' 1__rand_int__
set {tag}__rand_int__ 1__rand_int__: 1599232.38 requests per second, p50=0.191 msec
$ ./redis-benchmark $OPTS get '{tag}__rand_int__'
get {tag}__rand_int__: 1817851.38 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS del '{tag}__rand_int__'
del {tag}__rand_int__: 1596933.88 requests per second, p50=0.271 msec

@zuiderkwast
Contributor Author

zuiderkwast commented Sep 8, 2021

Cluster Benchmark: 22 bytes keys, integer values (on laptop)

This is the one with a regression in #9356

Details below.

[throughput chart]

Explanation: In this case, "old unstable" has a 24-byte dictEntry plus a 24-byte key allocation. The later ones have either 40 + 24 bytes in separate allocations (unstable with dict metadata) or a single 64-byte allocation (this PR). The extra slot-to-key metadata in the dictEntry is not used for GET.

Cluster: 3 masters, no replicas, no AOF, no rdb.

$ OPTS='--cluster -p 30001 --threads 2 -P 10 -r 10000000 -n 10000000 -q'

This PR:

$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1813894.38 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1736111.00 requests per second, p50=0.151 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1814552.62 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS get 'key:{tag}:__rand_int__'
get key:{tag}:__rand_int__: 1900779.25 requests per second, p50=0.143 msec
$ ./redis-benchmark $OPTS del 'key:{tag}:__rand_int__'
del key:{tag}:__rand_int__: 1904036.62 requests per second, p50=0.135 msec

Unstable ("dict metadata")

$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1477759.62 requests per second, p50=0.295 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1428163.38 requests per second, p50=0.295 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1595659.75 requests per second, p50=0.279 msec
$ ./redis-benchmark $OPTS get 'key:{tag}:__rand_int__'
get key:{tag}:__rand_int__: 1899335.25 requests per second, p50=0.223 msec
$ ./redis-benchmark $OPTS del 'key:{tag}:__rand_int__'
del key:{tag}:__rand_int__: 1734605.38 requests per second, p50=0.263 msec

Old unstable (just before #9356 was merged):

$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1329787.25 requests per second, p50=0.335 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1330141.00 requests per second, p50=0.343 msec
$ ./redis-benchmark $OPTS set 'key:{tag}:__rand_int__' 1__rand_int__
set key:{tag}:__rand_int__ 1__rand_int__: 1477978.12 requests per second, p50=0.311 msec
$ ./redis-benchmark $OPTS get 'key:{tag}:__rand_int__'
get key:{tag}:__rand_int__: 1994813.50 requests per second, p50=0.215 msec
$ ./redis-benchmark $OPTS del 'key:{tag}:__rand_int__'
del key:{tag}:__rand_int__: 1425110.38 requests per second, p50=0.319 msec

@zuiderkwast
Contributor Author

Benchmark: 17 bytes keys, integer values, non-cluster (on laptop)

TL;DR 10% better than unstable here too

[throughput chart]

This PR:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 888573.00 requests per second, p50=0.479 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 887075.31 requests per second, p50=0.479 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 998302.88 requests per second, p50=0.439 msec 
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get key::__rand_int__
get key::__rand_int__: 1142596.00 requests per second, p50=0.383 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del key::__rand_int__
del key::__rand_int__: 1052188.62 requests per second, p50=0.415 msec

Unstable:

$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 814730.31 requests per second, p50=0.511 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 784006.25 requests per second, p50=0.535 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q set key::__rand_int__ 1__rand_int__
set key::__rand_int__ 1__rand_int__: 868130.94 requests per second, p50=0.503 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q get key::__rand_int__
get key::__rand_int__: 999700.06 requests per second, p50=0.439 msec
$ ./redis-benchmark --threads 2 -P 10 -r 10000000 -n 10000000 -q del key::__rand_int__
del key::__rand_int__: 929800.06 requests per second, p50=0.479 msec

@oranagra
Member

oranagra commented Sep 9, 2021

This is the one with a regression in #9356

Don't we expect to see a regression in unstable and an improvement in this PR (rather than two improvements)?

@zuiderkwast
Contributor Author

This is the one with a regression in #9356

Don't we expect to see a regression in unstable and an improvement in this PR (rather than two improvements)?

In that benchmark, there's a regression for GET and no improvement in this PR for that case. I didn't get a regression for updates. I think it depends on how the benchmark was done.

@zuiderkwast
Contributor Author

@oranagra @madolson Are you interested in merging this PR? It has been ready for a while but now there are merge conflicts. I can solve them if you're interested but otherwise I'll close this PR.

@oranagra
Member

oranagra commented Dec 7, 2021

I think I'm interested, but I don't have the attention span for it right now (too many big things to move in order to get RC1 out in time).
Let's keep this on the table for later.

@oranagra oranagra changed the title Key embedded in dictEntry Key embedded in dictEntry - CPU cache friendly Dec 7, 2021
@zuiderkwast zuiderkwast added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Dec 13, 2021
@uvletter
Contributor

Cluster Benchmark: 22 bytes keys, integer values (on laptop)

BTW, I'm a bit curious that there seems to be almost no performance improvement for GET compared with "dict metadata", unlike the 17-byte-key case above; the cache miss introduced by "dict metadata" was expected to be mitigated since the dictEntry and key are allocated together.
Do you have any ideas, @zuiderkwast? Thanks :-)
