-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Fix the failure of defrag test under 32-bit #13013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
using the following example to verify the effect of dict shrink on external fragmentation: dictType dt = {
hashCallback, /* hash function */
NULL, /* key dup */
NULL, /* val dup */
compareCallback, /* key compare */
NULL, /* key destructor */
NULL, /* val destructor */
NULL /* allow to expand */
};
dict **ds = zmalloc(sizeof(dict*) * 50000);
for (int i = 0; i < 50000; i++) {
ds[i] = dictCreate(&dt);
dictAdd(ds[i], (void*)1, NULL);
dictAdd(ds[i], (void*)2, NULL);
dictAdd(ds[i], (void*)3, NULL);
dictAdd(ds[i], (void*)4, NULL);
dictAdd(ds[i], (void*)5, NULL);
}
je_malloc_stats_print(NULL, NULL, NULL);
for (int i = 0; i < 50000; i++) {
dictDelete(ds[i], (void*)1);
dictDelete(ds[i], (void*)2);
dictDelete(ds[i], (void*)3);
dictDelete(ds[i], (void*)4);
dictDelete(ds[i], (void*)5);
dictDefragTables(ds[i]);
}
je_malloc_stats_print(NULL, NULL, NULL);
printf("server.stat_active_defrag_misses: %lld\n", server.stat_active_defrag_misses);output (unstable and build with 32bit): we can see that the |
|
first, i'm not yet certain that the solution for this test failure is to adjust the threshold.
i'm not certain the test you conducted is correct, since it allocated the dicts one by one (after populating the previous one), and in #11695 we allocate all the dict structs at once so that's less susceptible to fragmentation.
i see utilization of 50% in the quote you posted (not low), but it could be the stagnation problem that #7289 tried to solve and failed (#10586 attempts to solve it, but is too risky) |
|
@oranagra Here are the fragmentation rates under 32-bit for different commits (at the end of the test before #11695 : 3.11% |
|
It looks to me that the big jump from roughly 3.0 to roughly 5.0 was when we moved to per-slot dictionaries. I'd like to try to defrad the dict structs themselves and see how it affects us. Specifically now after the kvstore pr when they are sparsely allocated, I think it's a must (for 64 bit builds too). |
|
From what I can tell, the recent commits don't defrag the dict struct itself ( |
|
@oranagra commit e1a6a09 (without defragging dict struct) fragmentation rate is 1.037 commit 2af05fa (with defragging dict struct) fragmentation rate is 1.032 My concern is that the dict* arrays in the kvstore are basically never changed again, which means defragging them will only work the first time. |
|
last commit(398cfe0) CI with 32bit: https://github.com/sundb/redis/actions/runs/7840599733/job/21395466548 |
|
It's nice to see we solved the problem with just the tables (i thought we already handled them). |
|
Shall we make the effort to defrag the entries in the pubsub dicts in this pr? Or leave it for another one? |
|
Actually, it's quite a lot of work, and also a specific test, let's leave it for another pr, and in that case let's exclude all changes regarding pubsub from this one. |
yeah, i'm not done with it yet which involves multiple loop iterator(kvsotre, channel dict and clients dict). |
This reverts commit 36f7197.
|
@oranagra Update the method name and comments. |
|
@sundb what bothers me with the pubsub is not to see that it is able to undo any fragmentation, but rather to make the code reachable and make sure it doesn't crash. although arguably in order to really make sure it doesn't crash, we need to make sure it re-allocates some pointers. i think we can remove the two pubsub related lines we added in this PR, and then in another one, add code to defrag the keys / values in these, and add some test. |
After #13013 ### This PR make effort to defrag the pubsub kvstore in the following ways: 1. Till now server.pubsub(shard)_channels only share channel name obj with the first subscribed client, now change it so that the clients and the pubsub kvstore share the channel name robj. This would save a lot of memory when there are many subscribers to the same channel. It also means that we only need to defrag the channel name robj in the pubsub kvstore, and then update all client references for the current channel, avoiding the need to iterate through all the clients to do the same things. 2. Refactor the code to defragment pubsub(shard) in the same way as defragment of keys and EXPIRES, with the exception that we only defragment pubsub(without shard) when slot is zero. ### Other Fix an overlook in #11695, if defragment doesn't reach the end time, we should wait for the current db's keys and expires, pubsub and pubsubshard to finish before leaving, now it's possible to exit early when the keys are defragmented. --------- Co-authored-by: oranagra <oran@redislabs.com>
Fail CI: https://github.com/redis/redis/actions/runs/7837608438/job/21387609715 ## Why defragment tests only failed under 32-bit First of all, under 32-bit jemalloc will allocate more small bins and less large bins, which will also lead to more external fragmentation, therefore, the fragmentation ratio is higher in 32-bit than in 64-bit, so the defragment tests(`Active defrag eval scripts: cluster` and `Active defrag big keys: cluster`) always fails in 32-bit. ## Why defragment tests only failed with cluster The fowllowing is the result of `Active defrag eval scripts: cluster` test. 1) Before redis#11695, the fragmentation ratio is 3.11%. 2) After redis#11695, the fragmentation ratio grew to 4.58%. Since we are using per-slot dictionary to manage slots, we will only defragment the contents of these dictionaries (keys, values), but not the dictionaries' struct and ht_table, which means that frequent shrinking and expanding of the dictionaries, will make more fragments. 3) After redis#12850 and redis#12948, In cluster mode, a large number of cluster slot dicts will be shrunk, creating additional fragmention, and the dictionary will not be defragged. ## Solution * Add defragmentation of the per-slot dictionary's own structures, dict struct and ht_table. ## Other change * Increase floating point print precision of `frags` and `rss` in debug logs for defrag --------- Co-authored-by: Oran Agra <oran@redislabs.com>
After redis#13013 ### This PR make effort to defrag the pubsub kvstore in the following ways: 1. Till now server.pubsub(shard)_channels only share channel name obj with the first subscribed client, now change it so that the clients and the pubsub kvstore share the channel name robj. This would save a lot of memory when there are many subscribers to the same channel. It also means that we only need to defrag the channel name robj in the pubsub kvstore, and then update all client references for the current channel, avoiding the need to iterate through all the clients to do the same things. 2. Refactor the code to defragment pubsub(shard) in the same way as defragment of keys and EXPIRES, with the exception that we only defragment pubsub(without shard) when slot is zero. ### Other Fix an overlook in redis#11695, if defragment doesn't reach the end time, we should wait for the current db's keys and expires, pubsub and pubsubshard to finish before leaving, now it's possible to exit early when the keys are defragmented. --------- Co-authored-by: oranagra <oran@redislabs.com>
…after defragment (redis#13231) Introducted by redis#13013 After defragmenting the dictionary in the kvstore, if the dict is reallocated, the value of its node in the kvstore rehashing list must be updated.
Fail CI: https://github.com/redis/redis/actions/runs/7837608438/job/21387609715
Why defragment tests only failed under 32-bit
First of all, under 32-bit jemalloc will allocate more small bins and less large bins, which will also lead to more external fragmentation, therefore, the fragmentation ratio is higher in 32-bit than in 64-bit, so the defragment tests(
Active defrag eval scripts: clusterandActive defrag big keys: cluster) always fails in 32-bit.Why defragment tests only failed with cluster
The fowllowing is the result of
Active defrag eval scripts: clustertest.Before Replace cluster metadata with slot specific dictionaries #11695, the fragmentation ratio is 3.11%.
After Replace cluster metadata with slot specific dictionaries #11695, the fragmentation ratio grew to 4.58%.
Since we are using per-slot dictionary to manage slots, we will only defragment the contents of these dictionaries (keys, values), but not the dictionaries' struct and ht_table, which means that frequent shrinking and expanding of the dictionaries, will make more fragments.
After Shrink dict when deleting dictEntry #12850 and Change the threshold of dict expand, shrink and rehash #12948, In cluster mode, a large number of cluster slot dicts will be shrunk, creating additional fragmention, and the dictionary will not be defragged.
Solution
Other change
fragsandrssin debug logs for defrag