Conversation

@zuiderkwast
Contributor

@zuiderkwast zuiderkwast commented Nov 3, 2022

This PR refactors the abstraction of the dictEntry by making it opaque. This enables future optimizations of the dict implementation without affecting the code using it.

The PR contains 5 commits. More detailed commit messages are found in each commit.

  • Make dictEntry opaque
  • Let active expire cycle use dictScan instead of messing with internals
  • activeDefragSdsDict use scan instead of iterator and drop dictSetNext
  • Remove the bucket-cb from dictScan and move dictEntry defrag to dictScanDefrag
  • Move stat_active_defrag_hits increment to activeDefragAlloc

@zuiderkwast zuiderkwast marked this pull request as ready for review November 3, 2022 12:14
Member

@oranagra oranagra left a comment

i like the direction here, but i'm hesitant to merge it without actually using it (knowing that the next PR that relies on it is in its final stages).
also, let's revisit this after #9406 is merged.

src/db.c Outdated
Comment on lines +241 to +255
Member

this is replacing one ugly hack with another (accessing the dict->type pointer).
maybe just do dictGetVal and then decrRefcount?
your approach is better since it doesn't assume what the main dict's destructor function is, but the access to type is ugly.
maybe we can add a wrapper (like we have for dictFreeUnlinkedEntry)

Contributor Author

Making dict opaque would prevent accessing dict->type. Making dictType opaque would prevent accessing type->valDestructor. Those are two separate pieces of work that would be useful.

Here I focused on the minimal changes to allow making dictEntry opaque.

src/dict.c Outdated
Comment on lines +548 to +629
Member

it could be that now that that's not a macro, we lose some performance benefits.
let's make some simple benchmark to try to rule that out.

Contributor Author

Good point. Perhaps with -O3 this will be inlined? We could check the assembly...

Contributor Author

We compile with -flto (link-time optimization), which turns on optimizations such as inlining across compilation units, so this function is most likely inlined.

I tried adding an assert(0) in it to see the stack trace. The function dictSetKey was not part of the stack trace, meaning it was inlined.

Contributor

Yeah, I don't think we should try to micro-optimize this. We did some brief internal testing at AWS, and found that moving away from #define macros sometimes improved performance as we got better branch prediction. Let's just trust the compiler.

Contributor

IMHO, adding an explicit inline keyword, like inline void dictSetNext(), to those short functions should be good for performance.

Shall we do this, or just make inlining a compile option?

Contributor Author

Fine with me. So far, the only inline functions we have are static inline. Not sure how much the compilers care about inline. Does anyone know?

Contributor

@madolson madolson Nov 16, 2022

I will stand by my earlier statement, let's just trust the compiler to do the inlining. We have -O3 and -flto enabled. However, I would also be okay with moving these to the header file and making them static inline. They are all so short, it will probably help with readability to have them in the header.

Contributor Author

@madolson If we move them to the header file, we can't make the struct opaque.

Contributor

@madolson madolson Nov 16, 2022

Excellent point, sort of defeats the purpose of the whole endeavor.

Member

i agree we want to leave that to the compiler.. just thought it's a good idea to make a minimal effort to benchmark that, so that we don't find a regression later on and start looking for the cause.
hopefully no action will be needed, but if there is, we can discuss it.

@zuiderkwast
Contributor Author

i like the direction here, but i'm hesitant to merge it without actually using it (knowing that the next PR that relies on it is in its final stages).

That's exactly why I had this lying around for a year without opening a PR.

The branch I never finished was about encoding something in the dictEntry pointer, in this case a 'has next' flag, so we can allocate 8 bytes less for the dictEntry if it doesn't have a next field.

This PR enables encoding various stuff in the pointer, since nobody can access the fields directly. The pointer manipulation can be plugged into the accessor functions.

@judeng
Contributor

judeng commented Nov 8, 2022

The branch I never finished was about encoding something in the dictEntry pointer, in this case a 'has next' flag, so we can allocate 8 bytes less for the dictEntry if it doesn't have a next field.

Recently, I was thinking of mixing the next pointer and the expire time into these 8 bytes. The malloc of jemalloc and glibc always guarantees at least 8-byte alignment, so if a pointer is stored, the lowest bit is always 0; when a key has an expire time and there is no bucket collision, we can set the lowest bit to 1 and use the remaining 63 bits to store the time. Is this possible?

@zuiderkwast
Contributor Author

zuiderkwast commented Nov 8, 2022

Recently, I was thinking of mixing the next pointer and the expire time into these 8 bytes. The malloc of jemalloc and glibc always guarantees at least 8-byte alignment, so if a pointer is stored, the lowest bit is always 0; when a key has an expire time and there is no bucket collision, we can set the lowest bit to 1 and use the remaining 63 bits to store the time. Is this possible?

@judeng Sure, I think it's possible. If an entry has both a next pointer and an expire time, the expire time would have to be stored somewhere else?

Another difficulty is how to structure this nicely. Expire time only applies to the db main dict. The dict structure is not aware of this feature. Dicts are used in many places (hash, set, zset, the expire dict and many other places in redis) and every dict can have next entry, but only some dicts have expire time. (Yet another thing is LRU which now is on the redisObject.)

@judeng
Contributor

judeng commented Nov 8, 2022

If an entry has both a next pointer and an expire time, the expire time would have to be stored somewhere else?

A few scenarios I can think of:
  • The key has no expire time and no bucket collision: next = 0x0
  • The key has an expire time but no bucket collision: next = (unixtime << 1) | 0x1
  • The key has no expire time but there is a bucket collision: next = next_entry_ptr
  • The key has an expire time and there is a bucket collision: next = new_listpack_ptr, where each entry in the new listpack holds key_ptr:val_ptr:unixtime

Another difficulty is how to structure this nicely. Expire time only applies to the db main dict. The dict structure is not aware of this feature. Dicts are used in many places (hash, set, zset, the expire dict and many other places in redis) and every dict can have next entry, but only some dicts have expire time. (Yet another thing is LRU which now is on the redisObject.)

I haven't thought about whether other dicts should do similar optimizations, and the benefits may not be obvious. In my use case, redis is mostly used as a cache, and all of the keys have an expire time, which brings a lot of memory and performance overhead. Considering that there are few bucket collisions in the db, reusing the next pointer could save about 20% of memory and give a performance boost.

@zuiderkwast
Contributor Author

@judeng Keys need to be indexed by expire time too, so we still need a separate structure for that. I think we should move the expire conversation to #10802.

@zuiderkwast zuiderkwast added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Nov 21, 2022
Member

@oranagra oranagra left a comment

WOW.. ~500 lines, that came out much bigger than i expected.
i still think we should keep it. not because it's a cleanup, but because it actually solves a problem for which we didn't have a solution (incrementing the defragged in a few cases).

but since it's so big, i'd like to introduce it in a separate commit, which means i'm not gonna squash-merge this PR, and instead i'll ask to squash other commits in this PR into groups according to topic.

src/defrag.c Outdated
Comment on lines 451 to 431
Contributor Author

@zuiderkwast zuiderkwast Nov 21, 2022

Now we could actually just use the dict * as privdata and cut a few more lines. Similar for other scan data. Should I ...?

WOW.. ~500 lines, that came out much bigger than i expected.

Actually this commit is 3 files changed, 118 insertions(+), 175 deletions(-). But you thought it's "some 30 lines" 😆 I think the code got much cleaner without passing around the counter and I don't think we should avoid making simplifications just for the git blame.

Member

yes, i only counted the lines doing the increments, and not the ones passing the pointer, and function declarations.
apparently when looking at a commit, the GitHub line stats still show the whole PR.

not sure we have to proceed cleaning the struct. maybe we'll need them someday soon. your call.

either way since we have several different topics in this PR, each with quite a lot of changed lines, i'd like to separate that change from others, so anyone reading the diff can focus on the important things (like the other changes we did in defrag.c and dict.c)

Contributor Author

Ok, but it can be hard to reorder some of the commits since they depend on each other, especially when I've merged master.

Perhaps I can factor out this defrag hits cleanup and put it first, then opaque dict after that, so two commits in total?

Member

we can probably do some git squashing and rebasing effort when that PR is ready.
the other alternative is to start a new branch, and copy changes from the old one in stages, but that's gonna take some work to find them.
note that i don't think we have to make sure each commit stands for itself and compiles.
i want it mainly for the purpose of code review (now and in the future), so it's easier to focus on important changes by topic without running into a huge refactoring diff.

taking it into a separate PR is also a possibility. but maybe there would still be other things in this PR that are worth splitting?

FYI, some examples where we did such things (on bigger projects): #9320, #9780, #7807

Contributor Author

OK, I squashed them down to 6 commits. I think this is ready, unless you have other comments.

I guess you don't want to merge it until there is a use for it, so I'll start working on some kind of dict improvement based on this. :-)

src/defrag.c Outdated
Contributor

@zeekling zeekling Nov 23, 2022

stat_active_defrag_hits may not be accurate. You can see function defragSet in defrag.c

long defragSet(redisDb *db, dictEntry *kde) {
    long defragged = 0;
    robj *ob = dictGetVal(kde);
    dict *d, *newd;
    serverAssert(ob->type == OBJ_SET && ob->encoding == OBJ_ENCODING_HT);
    d = ob->ptr;
    if (dictSize(d) > server.active_defrag_max_scan_fields)
        defragLater(db, kde);
    else
        defragged += activeDefragSdsDict(d, DEFRAG_SDS_DICT_NO_VAL);
    /* handle the dict struct */
    if ((newd = activeDefragAlloc(ob->ptr)))
        defragged++, ob->ptr = newd;
    /* defrag the dict tables */
    defragged += dictDefragTables(ob->ptr);
    return defragged;
}

Member

why?

p.s. note that the main purpose of that field is not just statistics, but also the defrag loop itself (having a sense of how much work was done, and when it's time to call mstime and check the exit conditions).

Contributor

@madolson madolson left a comment

LGTM now, I guess we're just waiting on the perf to validate?

@zuiderkwast
Contributor Author

@madolson I can't see any regression. The label-triggered benchmark results are here: https://benchmarksredisio.grafana.net/d/1fWbtb7nz/experimental-oss-spec-benchmarks.

I think Oran is waiting for another PR actually making use of this before merging.

@oranagra
Member

so what's the next step?
listpack entries in small dict buckets?
reduce value overhead for dict based sets?
i think once we get confidence with a POC with one of these, we can merge it.

@zuiderkwast
Contributor Author

so what's the next step?

I'm experimenting with some ideas ATM. Multiple dictEntries in one allocation (bulk chaining), dictEntries without next, encoding stuff in the LSB of the bucket pointer...

listpack entries in small dict buckets?

This seems hard to do. I don't know about the API and how to fall back to a regular dictEntry. Possibly as a pluggable bucket implementation... (bucket callbacks in the dictType)

reduce value overhead for dict based sets?

Yes, probably this. When there are no values and no hash conflict, the key pointer can be stored directly in the hashtable without a dictEntry, saving 24 bytes per set element.

i think once we get confidence with a POC with one of these, we can merge it.

Sounds good! 😄

@oranagra
Member

listpack entries in small dict buckets?

This seems hard to do. I don't know about the API and how to fall back to a regular dictEntry. Possibly as a pluggable bucket implementation... (bucket callbacks in the dictType)

it may mean that the interface isn't abstracted enough. i.e. making dictEntry opaque still means we need to return something that's unique but not specifically allocated for the caller (since we didn't require the caller to release it).

but where does it actually cause a problem? (IIRC we never use the next pointer in any way now, other than with a dictIterator)
i.e. if we would have returned a pointer to the middle of the listpack (as a dictEntry), what's missing?

or maybe i misunderstood you and you're talking of another complication?

@zuiderkwast
Contributor Author

but where does it actually cause a problem? (IIRC we never use the next pointer in any way now, other than with a dictIterator) i.e. if we would have returned a pointer to the middle of the listpack (as a dictEntry), what's missing?

or maybe i misunderstood you and you're talking of another complication?

Sure, we could return a pointer to the middle of a listpack, but then things like dictSetValue(de) need a changed API. It would need a pointer to the beginning of the listpack.

The various string types (sds, robj, and others) are another complication. Currently, a dict key is just a pointer and the dictType callbacks decide how it's interpreted. We'd need to change all of that, or add things like dictType->getKeyLength(key) and dictType->getKeyChars(key) and then reconstruct the key/value as an sds (etc.) in dictGetKey and dictGetValue. It needs some amount of API design.

@oranagra
Member

ohh, right. i forgot about the problem with listpack values being non "sds compatible".

the other comment you made about dictSetValue needing a reference to the listpack is maybe an indication that the interface isn't abstract enough.

maybe the way around it is to introduce an alternative interface in dict.h (one that can work side by side with the current one), and then convert just t_set.c and then t_hash.c etc to use it. but actually that's not that different than just creating an alternative dict2.c.

i guess that either way, the sds problem is a huge blocker to proceed in that direction.

@zuiderkwast
Contributor Author

maybe the way around it is to introduce an alternative interface in dict.h (one that can work side by side with the current one), and then convert just t_set.c and then t_hash.c etc to use it. but actually that's not that different than just creating an alternative dict2.c.

There's value in not duplicating code though. The good bits to keep are scan and incremental rehash. We'll see the path of least resistance when we start implementing. 👁️

i guess that either way, the sds problem is a huge blocker to proceed in that direction.

Yes, I'm doing dicts without a value first, for sets (and potentially useful for the keyspace dict in the future if we move the db key into the robj). No embedded keys, just pointer bit tricks.

@zuiderkwast
Contributor Author

Rebased (trivial merge conflicts)

Use functions for all accesses to dictEntry (except in dict.c). Dict abuses
e.g. in defrag.c have been replaced by support functions provided by dict.
Also delete unused function activeDefragSdsListAndDict
…canDefrag

This change deletes the dictGetNext and dictGetNextRef functions, so the
dict API doesn't expose the next field at all.

The bucket function in dictScan is deleted. A separate dictScanDefrag function
is added which takes a defrag alloc function to defrag-reallocate the dict entries.

"Dirty" code accessing the dict internals in active defrag is removed.

An 'afterReplaceEntry' is added to dictType, which allows the dict user
to keep the dictEntry metadata up to date after reallocation/defrag/move.

Additionally, for updating the cluster slot-to-key mapping, after a dictEntry
has been reallocated, we need to know which db a dict belongs to, so we store
a pointer to the db in a new metadata section in the dict struct, which is
a new mechanism similar to dictEntry metadata. This adds some complexity but
provides better isolation.
instead of passing it around to every defrag function
@oranagra oranagra merged commit 12826fa into redis:unstable Jan 11, 2023
@zuiderkwast zuiderkwast deleted the opaque-dictEntry branch January 19, 2023 10:07
@zuiderkwast zuiderkwast removed the action:run-benchmark Triggers the benchmark suite for this Pull Request label Feb 2, 2023
sundb added a commit that referenced this pull request Aug 14, 2025
Fix #13612
This bug was introduced by #11465

This PR added incremental defragmentation for db->expires.
If the time for the defragmentation cycle has not arrived, it will exit
unless both cursor and expires_cursor are 0, meaning both keys and
expires have been defragmented.
However, the expires_cursor check has been overlooked, which means we
will exit immediately even if the defragmentation hasn't reached the end
time, causing the efficiency of defragmentation to become very low.

Note that this bug was already fixed in version 7.4 (#13058).