Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented Nov 3, 2022

This PR refactors the abstraction of the dictEntry by making it opaque. This enables future optimizations of the dict implementation without affecting the code using it.

The PR contains 5 commits. More detailed commit messages are found in each commit.

  • Make dictEntry opaque
  • Let active expire cycle use dictScan instead of messing with internals
  • activeDefragSdsDict use scan instead of iterator and drop dictSetNext
  • Remove the bucket-cb from dictScan and move dictEntry defrag to dictScanDefrag
  • Move stat_active_defrag_hits increment to activeDefragAlloc

@zuiderkwast zuiderkwast marked this pull request as ready for review November 3, 2022 12:14
Member

@oranagra oranagra left a comment

i like the direction here, but i'm hesitant to merge it without actually using it (knowing that the next PR that relies on it is in its final stages).
also, let's revisit this after #9406 is merged.

src/db.c Outdated
Comment on lines +241 to +255
Member

this is replacing one ugly hack with another (accessing the dict->type pointer).
maybe just do dictGetVal and then decrRefcount?
your approach is better since it doesn't assume what the main dict's destructor function is, but the access to type is ugly.
maybe we can add a wrapper (like we have for dictFreeUnlinkedEntry)

Contributor Author

Making dict opaque would prevent accessing dict->type. Making dictType opaque would prevent accessing type->valDestructor. Those are two separate pieces of work that would be useful.

Here I focused on the minimal changes to allow making dictEntry opaque.

src/dict.c Outdated
Comment on lines +548 to +629
Member

it could be that now that that's not a macro, we lose some performance benefits.
let's make some simple benchmark to try to rule that out.

Contributor Author

Good point. Perhaps with -O3 this will be inlined? We could check the assembly...

Contributor Author

We compile with -flto (link-time optimization), which turns on optimizations such as inlining across compilation units, so this function is most likely inlined.

I tried adding an assert(0) in it to see the stack trace. The function dictSetKey was not part of the stack trace, meaning it was inlined.

Contributor

Yeah, I don't think we should try to micro-optimize this. We did some brief internal testing at AWS, and found that moving away from #define macros sometimes improved performance as we got better branch prediction. Let's just trust the compiler.

Contributor

IMHO, adding an explicit inline keyword, like inline void dictSetNext(), to those short functions should be good for performance.

Shall we do this, or just make inlining a compile option?

Contributor Author

Fine with me. So far, the only inline functions we have are static inline. Not sure how much the compilers care about inline. Does anyone know?

Contributor

@madolson madolson Nov 16, 2022

I will stand by my earlier statement, let's just trust the compiler to do the inlining. We have -O3 and -flto enabled. However, I would also be okay with moving these to the header file and making them static inline. They are all so short, it will probably help with readability to have them in the header.

Contributor Author

@madolson If we move them to the header file, we can't make the struct opaque.

Contributor

@madolson madolson Nov 16, 2022

Excellent point, sort of defeats the purpose of the whole endeavor.

Member

i agree we want to leave that to the compiler.. just thought it's a good idea to make a minimal effort to benchmark that, so that we don't find a regression later on and start looking for the cause.
hopefully no action will be needed, but if there is, we can discuss it.

@zuiderkwast
Contributor Author

i like the direction here, but i'm hesitant to merge it without actually using it (knowing that the next PR that relies on it is in its final stages).

That's exactly why I had this lying around for a year without opening a PR.

The branch I never finished was about encoding something in the dictEntry pointer, in this case a 'has next' flag, so we can allocate 8 bytes less for the dictEntry if it doesn't have a next field.

This PR enables encoding various stuff in the pointer, since nobody can access the fields directly. The pointer manipulation can be plugged into the accessor functions.

@judeng
Contributor

judeng commented Nov 8, 2022

The branch I never finished was about encoding something in the dictEntry pointer, in this case a 'has next' flag, so we can allocate 8 bytes less for the dictEntry if it doesn't have a next field.

Recently, I was thinking of mixing the next pointer and the expire time into these 8 bytes. The malloc of jemalloc and glibc always guarantees at least 8-byte alignment, so if a pointer is stored, the lowest bit is always 0; when a key has an expire time and there is no bucket collision, we can set the lowest bit to 1 and use the remaining 63 bits to store the time. Is this possible?

@zuiderkwast
Contributor Author

zuiderkwast commented Nov 8, 2022

Recently, I was thinking of mixing the next pointer and the expire time into these 8 bytes. The malloc of jemalloc and glibc always guarantees at least 8-byte alignment, so if a pointer is stored, the lowest bit is always 0; when a key has an expire time and there is no bucket collision, we can set the lowest bit to 1 and use the remaining 63 bits to store the time. Is this possible?

@judeng Sure, I think it's possible. If an entry has both a next pointer and an expire time, the expire time would have to be stored somewhere else?

Another difficulty is how to structure this nicely. Expire time only applies to the db main dict. The dict structure is not aware of this feature. Dicts are used in many places (hash, set, zset, the expire dict and many other places in redis) and every dict can have next entry, but only some dicts have expire time. (Yet another thing is LRU which now is on the redisObject.)

@judeng
Contributor

judeng commented Nov 8, 2022

If an entry has both a next pointer and an expire time, the expire time would have to be stored somewhere else?

A few scenarios I can think of:
  • The key has no expire time and no bucket collision: next = 0x0
  • The key has an expire time but no bucket collision: next = (unixtime << 1) | 0x1
  • The key has no expire time but there is a bucket collision: next = next_entry_ptr
  • The key has an expire time and there is a bucket collision: next = new_listpack_ptr, where each entry in the new listpack holds key_ptr:val_ptr:unixtime

Another difficulty is how to structure this nicely. Expire time only applies to the db main dict. The dict structure is not aware of this feature. Dicts are used in many places (hash, set, zset, the expire dict and many other places in redis) and every dict can have next entry, but only some dicts have expire time. (Yet another thing is LRU which now is on the redisObject.)

I haven't thought about whether other dicts should do similar optimizations, and the benefits may not be obvious. In my use case, redis is mostly used as a cache, and all of the keys have an expire time, which brings a lot of memory and performance overhead. Considering that there are few bucket collisions in the db, reusing the next pointer could save about 20% of memory and give a performance boost.

@zuiderkwast
Contributor Author

@judeng Keys need to be indexed by expire time too, so we still need a separate structure for that. I think we should move the expire conversation to #10802.

@zuiderkwast zuiderkwast added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Nov 21, 2022
Member

@oranagra oranagra left a comment

WOW.. ~500 lines, that came out much bigger than i expected.
i still think we should keep it. not because it's a cleanup, but because it actually solves a problem for which we didn't have a solution (incrementing the defragged in a few cases).

but since it's so big, i'd like to introduce it in a separate commit, which means i'm not gonna squash-merge this PR, and instead i'll ask to squash other commits in this PR into groups according to topic.

src/defrag.c Outdated
Comment on lines 451 to 431
Contributor Author

@zuiderkwast zuiderkwast Nov 21, 2022

Now we could actually just use the dict * as privdata and cut a few more lines. Similar for other scan data. Should I ...?

WOW.. ~500 lines, that came out much bigger than i expected.

Actually this commit is 3 files changed, 118 insertions(+), 175 deletions(-). But you thought it's "some 30 lines" 😆 I think the code got much cleaner without passing around the counter and I don't think we should avoid making simplifications just for the git blame.

Member

yes, i only counted the lines doing the increments, and not the ones passing the pointer, and function declarations.
apparently when looking at a commit, the GitHub line stats still show the whole PR.

not sure we have to proceed cleaning the struct. maybe we'll need them someday soon. your call.

either way since we have several different topics in this PR, each with quite a lot of changed lines, i'd like to separate that change from others, so anyone reading the diff can focus on the important things (like the other changes we did in defrag.c and dict.c)

Contributor Author

Ok, but it can be hard to reorder some of the commits since they depend on each other, especially when I've merged master.

Perhaps I can factor out this defrag hits cleanup and put it first, then opaque dict after that, so two commits in total?

Member

we can probably do some git squashing and rebasing effort when that PR is ready.
the other alternative is to start a new branch, and copy changes from the old one in stages, but that's gonna take some work to find them.
note that i don't think we have to make sure each commit stands for itself and compiles.
i want it mainly for the purpose of code review (now and in the future), so it's easier to focus on important changes by topic without running into a huge refactoring diff.

taking it into a separate PR is also a possibility. but maybe there would still be other things in this PR that are worth splitting?

FYI, some examples where we did such things (on bigger projects): #9320, #9780, #7807

Contributor Author

OK, I squashed them down to 6 commits. I think this is ready, unless you have other comments.

I guess you don't want to merge it until there is a use for it, so I'll start working on some kind of dict improvement based on this. :-)

src/defrag.c Outdated
Contributor

@zeekling zeekling Nov 23, 2022

stat_active_defrag_hits may not be accurate. You can see function defragSet in defrag.c

long defragSet(redisDb *db, dictEntry *kde) {
    long defragged = 0;
    robj *ob = dictGetVal(kde);
    dict *d, *newd;
    serverAssert(ob->type == OBJ_SET && ob->encoding == OBJ_ENCODING_HT);
    d = ob->ptr;
    if (dictSize(d) > server.active_defrag_max_scan_fields)
        defragLater(db, kde);
    else
        defragged += activeDefragSdsDict(d, DEFRAG_SDS_DICT_NO_VAL);
    /* handle the dict struct */
    if ((newd = activeDefragAlloc(ob->ptr)))
        defragged++, ob->ptr = newd;
    /* defrag the dict tables */
    defragged += dictDefragTables(ob->ptr);
    return defragged;
}

Member

why?

p.s. note that the main purpose of that field is not just statistics, but also the defrag loop itself (having a sense of how much work was done, and when it's time to call mstime and check the exit conditions).

Contributor

@madolson madolson left a comment

LGTM now, I guess we're just waiting on the perf to validate?

@zuiderkwast
Contributor Author

@madolson I can't see any regression. The label-triggered benchmark results are here: https://benchmarksredisio.grafana.net/d/1fWbtb7nz/experimental-oss-spec-benchmarks.

I think Oran is waiting for another PR actually making use of this before merging.

@oranagra
Member

so what's the next step?
listpack entries in small dict buckets?
reduce value overhead for dict based sets?
i think once we get confidence with a POC with one of these, we can merge it.

@zuiderkwast
Contributor Author

so what's the next step?

I'm experimenting with some ideas ATM. Multiple dictEntries in one allocation (bulk chaining), dictEntries without next, encoding stuff in the LSB of the bucket pointer...

listpack entries in small dict buckets?

This seems hard to do. I don't know about the API and how to fall back to a regular dictEntry. Possibly as a pluggable bucket implementation... (bucket callbacks in the dictType)

reduce value overhead for dict based sets?

Yes, probably this. When there are no values and no hash conflict, the key pointer can be stored directly in the hashtable without a dictEntry, saving 24 bytes per set element.

i think once we get confidence with a POC with one of these, we can merge it.

Sounds good! 😄

@oranagra
Member

listpack entries in small dict buckets?

This seems hard to do. I don't know about the API and how to fall back to a regular dictEntry. Possibly as a pluggable bucket implementation... (bucket callbacks in the dictType)

it may mean that the interface isn't abstracted enough. i.e. making dictEntry opaque still means we need to return something that's unique but not specifically allocated for the caller (since we didn't require the caller to release it).

but where does it actually cause a problem? (IIRC we never use the next pointer in any way now, other than with a dictIterator)
i.e. if we would have returned a pointer to the middle of the listpack (as a dictEntry), what's missing?

or maybe i misunderstood you and you're talking of another complication?

@zuiderkwast
Contributor Author

but where does it actually cause a problem? (IIRC we never use the next pointer in any way now, other than with a dictIterator) i.e. if we would have returned a pointer to the middle of the listpack (as a dictEntry), what's missing?

or maybe i misunderstood you and you're talking of another complication?

Sure, we could return a pointer to the middle of a listpack, but then things like dictSetValue(de) need a changed API. It would need a pointer to the beginning of the listpack.

The various string types (sds, robj, and others) are another complication. Currently, a dict key is just a pointer and the dictType callbacks decide how it's interpreted. We'd need to change all of that, or add things like dictType->getKeyLength(key) and dictType->getKeyChars(key) and then reconstruct the key/value as an sds (etc.) in dictGetKey and dictGetValue. It needs some amount of API design.

@oranagra
Member

ohh, right. i forgot about the problem with listpack values being non "sds compatible".

the other comment you made about dictSetValue needing a reference to the listpack is maybe an indication that the interface isn't abstract enough.

maybe the way around it is to introduce an alternative interface in dict.h (one that can work side by side with the current one), and then convert just t_set.c and then t_hash.c etc to use it. but actually that's not that different than just creating an alternative dict2.c.

i guess that either way, the sds problem is a huge blocker to proceed in that direction.

@zuiderkwast
Contributor Author

maybe the way around it is to introduce an alternative interface in dict.h (one that can work side by side with the current one), and then convert just t_set.c and then t_hash.c etc to use it. but actually that's not that different than just creating an alternative dict2.c.

There's value in not duplicating code though. The good bits to keep are scan and incremental rehash. We'll see the path of least resistance when we start implementing. 👁️

i guess that either way, the sds problem is a huge blocker to proceed in that direction.

Yes, I'm doing dicts without a value first, for sets (and potentially useful for the keyspace dict in the future if we move the db key into the robj). No embedded keys, just pointer bit tricks.

@zuiderkwast
Contributor Author

Rebased (trivial merge conflicts)

Use functions for all accesses to dictEntry (except in dict.c). Dict abuses
e.g. in defrag.c have been replaced by support functions provided by dict.
Also delete unused function activeDefragSdsListAndDict
…canDefrag

This change deletes the dictGetNext and dictGetNextRef functions, so the
dict API doesn't expose the next field at all.

The bucket function in dictScan is deleted. A separate dictScanDefrag function
is added which takes a defrag alloc function to defrag-reallocate the dict entries.

"Dirty" code accessing the dict internals in active defrag is removed.

An 'afterReplaceEntry' is added to dictType, which allows the dict user
to keep the dictEntry metadata up to date after reallocation/defrag/move.

Additionally, for updating the cluster slot-to-key mapping, after a dictEntry
has been reallocated, we need to know which db a dict belongs to, so we store
a pointer to the db in a new metadata section in the dict struct, which is
a new mechanism similar to dictEntry metadata. This adds some complexity but
provides better isolation.
instead of passing it around to every defrag function
@oranagra oranagra merged commit 12826fa into redis:unstable Jan 11, 2023
@zuiderkwast zuiderkwast deleted the opaque-dictEntry branch January 19, 2023 10:07
@zuiderkwast zuiderkwast removed the action:run-benchmark Triggers the benchmark suite for this Pull Request label Feb 2, 2023
sundb added a commit that referenced this pull request Aug 14, 2025
Fix #13612
This bug was introduced by #11465

This PR added incremental defragmentation for db->expires.
If the time for the defragmentation cycle has not arrived, it will exit
unless both cursor and expires_cursor are 0, meaning both keys and
expires have been defragmented.
However, the expires_cursor check has been overlooked, which means we
will exit immediately even if the defragmentation hasn't reached the end
time, causing the efficiency of defragmentation to become very low.

Note that this bug was already fixed in version 7.4 (#13058).