Fix Active Defrag HFE with large_ebrax test #14344
Conversation
🎉 Snyk checks have passed. No issues have been found so far.
tests/unit/memefficiency.tcl
Outdated
  # wait for the active defrag to stop working
- wait_for_defrag_stop 500 100 1.05
+ wait_for_defrag_stop 500 100 1.06
i've seen cases (in ROF) where it was 1.07 (so needs 1.08 here)
actually, i see that was with ebrax and i don't recall if we already did something about it since then.
i'm ok increasing it slightly and keeping an eye for later.
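For readers outside the test suite: the real `wait_for_defrag_stop` helper lives in the Tcl test framework; the sketch below is a hedged Python analogue of what the thread is tuning (the name, argument order, and semantics here are my assumption, not the actual helper). It polls a fragmentation ratio until it drops below the threshold (e.g. 1.05 vs 1.08), so raising the last argument loosens the pass condition.

```python
import time

def wait_for_defrag_stop(get_frag_ratio, timeout_ms=500, interval_ms=100,
                         max_ratio=1.06):
    """Poll until the allocator fragmentation ratio (allocated / used)
    drops below max_ratio, or fail once timeout_ms has elapsed."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    ratio = get_frag_ratio()
    while ratio > max_ratio:
        if time.monotonic() > deadline:
            raise AssertionError(
                f"defrag didn't reach {max_ratio}, stuck at {ratio}")
        time.sleep(interval_ms / 1000.0)
        ratio = get_frag_ratio()
    return ratio
```

With this shape, the debate above is purely about `max_ratio`: a run whose ratio settles at 1.06 passes a 1.07 threshold but fails a 1.05 one.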
i guess not, i don't have that in my branch yet.
i think i saw 1.06 too with large_ebrax.
i'm ok setting it to 1.06 or 1.07 and waiting for the next incident.
or we need to change the test to get a good distance from that threshold (more data, or different bins)
note that if you're certain the 1.06 you posted is with the up to date code, then setting it to 1.06 is insufficient.
@oranagra There were no failures across 20 full CI runs with just this test, but I still raised the threshold to 1.07.
Co-authored-by: oranagra <oran@redislabs.com>
i saw another failed CI where it was 1.06, so i increased the threshold to 1.08.
  }
- foreach {eb_container fields n} {eblist 16 3000 ebrax 30 1600 large_ebrax 1600 30}
+ foreach {eb_container fields n} {eblist 16 3000 ebrax 30 1600 large_ebrax 500 100}
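My reading of the parameters this foreach drives (a hypothetical reconstruction, not the actual Tcl test body): each `{container fields n}` triple creates `n` hash keys with `fields` HFE fields each, so the change shrinks each large_ebrax dict while raising the key count, keeping the total field volume in the same ballpark.

```python
# (container, fields-per-key, number-of-keys) as driven by the foreach.
cases = {
    "eblist": (16, 3000),
    "ebrax": (30, 1600),
    "large_ebrax": (500, 100),  # was (1600, 30) before this PR
}

for container, (fields, n) in cases.items():
    total = fields * n  # total HFE fields exercised for this container
    print(container, total)
```

The old large_ebrax shape (1600 x 30 = 48000 fields) and the new one (500 x 100 = 50000) exercise a similar volume, but each dict now stays far below the size where rehashing dominated.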
@oranagra Since I modified this test, I'm still verifying whether CI will fail.
I reduced the length of each dictionary and increased the number of dicts.
ok, i'm not following this closely.. just merge when you feel you're ready.. and if needed we can always add further adjustments later.
i just reduced the length of the dict to reduce the rehashing time, and increased the allocation of bin400 (2000 times).
feel like it's fine.
From the malloc-stats reports of both failures and successes, we can see that the additional fragments mainly come from bin24.
By analysis, those fragments come mainly from the entries of the dict: since the
large_ebrax test uses a dictionary with 1600 elements, it moves a large number of entries during the rehashing process, and we do not perform defragmentation on dict entries. In #13842 we changed the test to use two dicts alternately to generate fragmentation. Normally, the entries should also alternate, but rehashing disrupted this, which resulted in bin24 fragmentation that can't be defragged.
Solution
In this PR, the length of a single dictionary was reduced from 1600 to 500 to avoid excessive rehashing, and the threshold was also lowered.
The difference between the failed and the successful report:
total frag bytes: 567608 bytes (1335552 - 767944)
bin24 frag bytes: 529929 bytes ((1253064 / 0.621) - (1261728 / 0.848))
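The two deltas above can be re-derived directly from the malloc-stats figures quoted in the reports (for bin24, dividing live bytes by the utilization fraction gives allocated bytes per run):

```python
# Total fragmentation delta between the failed and successful runs.
total_failed, total_success = 1335552, 767944
total_frag = total_failed - total_success
print("total frag bytes:", total_frag)

# bin24: live bytes / utilization = allocated bytes; the difference
# between runs is the extra bin24 fragmentation in the failed run.
failed_alloc = 1253064 / 0.621    # allocated bin24 bytes, failed run
success_alloc = 1261728 / 0.848   # allocated bin24 bytes, successful run
bin24_frag = int(failed_alloc - success_alloc)
print("bin24 frag bytes:", bin24_frag)
```

So roughly 530 KB of the ~568 KB total delta sits in bin24, which is what points the analysis at the 24-byte dict entries.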
failed log:
success log:
failed CI: https://github.com/redis/redis/actions/runs/17567591780/job/49897482080