
Fix Active Defrag HFE with large_ebrax test #14344

Merged
sundb merged 6 commits into redis:unstable from sundb:fix-large-ebrax-defrag
Sep 11, 2025

Conversation

sundb (Collaborator) commented Sep 9, 2025

From the malloc-stats reports of both failed and successful runs, we can see that the additional fragmentation mainly comes from bin 24.
Analysis shows these fragments come mainly from dict entries: since the large_ebrax test uses dictionaries with 1600 elements, a large number of entries are moved during rehashing, and we do not perform defragmentation on dict entries.

In #13842 we switched to using two dicts alternately to generate fragmentation. Normally the entries would also alternate, but rehashing disrupted this, resulting in bin 24 fragmentation that can't be defragged.
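The effect of allocation order on slab fragmentation can be sketched with a toy model (my own illustration, not actual Redis or jemalloc internals): when entries from two dicts end up interleaved in the same size-class bin, freeing one dict leaves every slab half-occupied instead of emptying whole slabs, and because active defrag does not move dict entries, that space stays fragmented.

```python
# Toy model (illustration only): a size-class bin with 512 regions per
# slab. Entries belong to two dicts, A and B; we then free all of B.
REGS_PER_SLAB = 512

def slab_utils(tags, freed):
    """Per-slab utilization after freeing every region tagged `freed`."""
    live = [None if t == freed else t for t in tags]
    slabs = [live[i:i + REGS_PER_SLAB] for i in range(0, len(live), REGS_PER_SLAB)]
    return [sum(t is not None for t in s) / len(s) for s in slabs]

N = 512 * 8  # 8 slabs worth of regions

# Grouped order: all of A's entries first, then all of B's.
grouped = ['A'] * (N // 2) + ['B'] * (N // 2)
# Interleaved order: A and B allocate alternately (what rehash churn can cause).
interleaved = ['A' if i % 2 == 0 else 'B' for i in range(N)]

# Grouped: freeing B empties 4 whole slabs, which can be reused.
print(slab_utils(grouped, 'B'))      # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
# Interleaved: every slab ends up half-empty; no slab can be reclaimed
# unless the surviving entries are moved, which defrag won't do for dicts.
print(slab_utils(interleaved, 'B'))  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```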

Solution

In this PR, the length of a single dictionary is reduced from 1600 to 500 to avoid excessive rehashing, and the defrag-stop threshold is also adjusted.

The difference between the failed and successful reports:

total frag bytes: 567608 (1335552 - 767944)
bin24 frag bytes: 529929 ((1253064 / 0.621) - (1261728 / 0.848))
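The two quoted deltas can be recomputed directly from the stats (a quick sanity check using the allocated and util columns of the bin-24 rows in the logs below):

```python
# Recompute the numbers quoted above from the two malloc-stats reports.
failed_frag_bytes, success_frag_bytes = 1335552, 767944
print(failed_frag_bytes - success_frag_bytes)  # 567608

# For a bin, allocated / util gives the total bytes backing its slabs;
# the gap between the failed and successful runs is the extra bin-24 frag.
failed_bin24 = 1253064 / 0.621   # allocated / util, failed run
success_bin24 = 1261728 / 0.848  # allocated / util, successful run
print(int(failed_bin24 - success_bin24))  # 529929
```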

failed log:

allocator_frag_ratio:1.05
allocator_frag_bytes:1335552

bins:           size ind    allocated      nmalloc (#/sec)      ndalloc (#/sec)    nrequests   (#/sec)  nshards      curregs     curslabs  nonfull_slabs regs pgs   util       nfills (#/sec)     nflushes (#/sec)       nslabs     nreslabs (#/sec)      n_lock_ops (#/sec)       n_waiting (#/sec)      n_spin_acq (#/sec)  n_owner_switch (#/sec)   total_wait_ns   (#/sec)     max_wait_ns  max_n_thds
                   8   0       390960       368833    6046       319963    5245       402688      6601        1        48870           99              5  512   1  0.964         3417      56         2425      39          562          292       4        96609896 1583768               0       0               0       0          168052    2754               0         0               0           0
                  16   1       237984        52164     855        37290     611      3271366     53628        1        14874           67             16  256   1  0.867          734      12          595       9          171          223       3         6656416  109121               0       0               0       0           23826     390               0         0               0           0
                  24   2      1253064       378673    6207       326462    5351      2127089     34870        1        52211          164            109  512   3  0.621         3650      59         2582      42          594          258       4       101280691 1660339               0       0               1       0          154680    2535               0         0               0           0

success log:

allocator_frag_ratio:1.03
allocator_frag_bytes:767944

bins:           size ind    allocated      nmalloc (#/sec)      ndalloc (#/sec)    nrequests   (#/sec)  nshards      curregs     curslabs  nonfull_slabs regs pgs   util       nfills (#/sec)     nflushes (#/sec)       nslabs     nreslabs (#/sec)      n_lock_ops (#/sec)       n_waiting (#/sec)      n_spin_acq (#/sec)  n_owner_switch (#/sec)   total_wait_ns   (#/sec)     max_wait_ns  max_n_thds
                   8   0       389344       369432   46179       320764   40095       402569     50321        1        48668          103             13  512   1  0.922         3464     433         1460     182          561          289      36          599955   74994               0       0               3       0          161405   20175               0         0               0           0
                  16   1       228384        50082    6260        35808    4476      3268897    408612        1        14274           63             14  256   1  0.885          745      93          465      58          164          226      28           58898    7362               0       0               0       0           11023    1377               0         0               0           0
                  24   2      1261728       391927   48990       339355   42419      1958980    244872        1        52572          121             31  512   3  0.848         3623     452         1578     197          592          294      36          625633   78204               0       0               4       0          166227   20778               0         0               0           0

failed CI: https://github.com/redis/redis/actions/runs/17567591780/job/49897482080

snyk-io (bot) commented Sep 9, 2025

🎉 Snyk checks have passed. No issues have been found so far.

@sundb sundb requested a review from oranagra September 9, 2025 14:20

# wait for the active defrag to stop working
-wait_for_defrag_stop 500 100 1.05
+wait_for_defrag_stop 500 100 1.06
oranagra (Member):

i've seen cases (in ROF) where it was 1.07 (so needs 1.08 here)

oranagra (Member):

actually, i see that was with ebrax and i don't recall if we already did something about it since then.
i'm ok increasing it slightly and keeping an eye on it for later.

sundb (Collaborator, Author), Sep 10, 2025:

done with 2a58a00
Did you see it after #14303?

oranagra (Member):

i guess not, i don't have that in my branch yet.
i think i saw 1.06 too with large_ebrax.
i'm ok setting it to 1.06 or 1.07 and waiting for the next incident.
or we need to change the test to get a good distance from that threshold (more data, or different bins)

oranagra (Member):

note that if you're certain the 1.06 you posted is with the up to date code, then setting it to 1.06 is insufficient.

sundb (Collaborator, Author):

the reason is #14344 (comment)

sundb (Collaborator, Author):

@oranagra There were no failures across 20 full CI runs with just this test, but I still lowered the threshold to 1.07.

sundb (Collaborator, Author) commented Sep 10, 2025

i saw another failed CI where it was 1.06, so i increased the threshold to 1.08.

}

-foreach {eb_container fields n} {eblist 16 3000 ebrax 30 1600 large_ebrax 1600 30} {
+foreach {eb_container fields n} {eblist 16 3000 ebrax 30 1600 large_ebrax 500 100} {
sundb (Collaborator, Author):

@oranagra Since I modified this test, I'm still verifying whether CI still fails.
I reduced the length of each dictionary and increased the number of dicts.
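The rehashing savings can be estimated with a back-of-the-envelope model (my own sketch, assuming a Redis-like dict that starts at 4 buckets and doubles when full, moving every entry at each resize; it ignores the test's deletes and the incremental nature of Redis rehashing, and only illustrates the scale):

```python
# Toy estimate of total dict-entry moves caused by rehashing while the
# test populates its hashes, before vs. after this PR's change.
def rehash_moves(n_entries, initial_size=4):
    """Total entry moves while growing one dict to n_entries."""
    size, moves = initial_size, 0
    while size < n_entries:
        moves += size  # all current entries are moved into the new table
        size *= 2
    return moves

old = 30 * rehash_moves(1600)   # before: 30 keys x 1600 fields
new = 100 * rehash_moves(500)   # after: 100 keys x 500 fields
print(old, new)  # 61320 50800
```

Beyond the lower total, the largest single resize drops from moving 1024 entries to moving 256, so the rehash-driven allocation churn in bin 24 comes in much smaller bursts.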

oranagra (Member):

ok, i'm not following this closely.. just merge when you feel you're ready.. and if needed we can always add further adjustments later.

sundb (Collaborator, Author):

i just reduced the length of the dict to reduce rehashing time, and increased the allocations of bin 400 (2000 times).
feels like it's fine.

@sundb sundb merged commit fb32174 into redis:unstable Sep 11, 2025
18 checks passed