[Performance] Revert tryOffloadFreeObjToIOThreads to offload entire robj#3760
[Performance] Revert tryOffloadFreeObjToIOThreads to offload entire robj#3760roshkhatri wants to merge 3 commits into
tryOffloadFreeObjToIOThreads to offload entire robj#3760Conversation
Commit fdd9039 changed tryOffloadFreeObjToIOThreads to only offload the SDS buffer free to IO threads, while freeing the robj shell on the main thread. This caused a ~20% SET regression with IO threads active (128B values, pipeline=10, 9 IO threads, 1600 clients). SPE profiling shows tryOffloadFreeObjToIOThreads became the #1 hotspot on the main thread, and IO threads shifted from doing real jemalloc work (je_edata_heap_remove_first, sdsfree) to spinning idle on spmcDequeue. The original design correctly offloads the entire decrRefCount to the IO thread. Cross-thread zfree is handled efficiently by jemalloc's tcache mechanism and does not require same-thread affinity for correctness. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
📝 WalkthroughWalkthroughThis PR refactors object freeing offloaded to IO threads. The enqueuing path in ChangesIO Thread Object Freeing
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/io_threads.c (1)
346-346: ⚡ Quick winDocument why the whole
robjstays on the I/O thread.This contract is easy to “optimize” back into the regressed version. A short rationale here would make it clear that only last-reference RAW strings take this path, and that the full
decrRefCount()intentionally remains on the I/O thread so allocator work does not move back to the main thread.Suggested comment update
case JOB_REQ_FREE_OBJ: - decrRefCount(data); + /* Only last-reference RAW string objects are enqueued here. + * Keep the full decrRefCount() on the I/O thread so both the + * SDS buffer and the robj shell are freed off the main thread. + */ + decrRefCount((robj *)data); break; @@ - void *job = tagJob(obj, JOB_REQ_FREE_OBJ); + /* Offload the whole object, not just obj->ptr, to keep allocator work + * off the main thread on this hot path. + */ + void *job = tagJob(obj, JOB_REQ_FREE_OBJ);As per coding guidelines, "Use comments for non-obvious behavior and rationale, not for restating code."
Also applies to: 692-693
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/io_threads.c` at line 346, Add a short explanatory comment next to the decrRefCount(data) call describing why the entire robj (the full object) is intentionally dropped on the I/O thread rather than moved back to the main thread: explain that this path is only taken for last-reference RAW strings and that keeping decrRefCount() on the I/O thread avoids moving allocator work back to the main thread (preventing the regression), and duplicate this rationale at the other occurrence around the decrRefCount() at the 692-693 site; reference the robj and decrRefCount symbols in the comment so future editors understand the contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/io_threads.c`:
- Line 346: Add a short explanatory comment next to the decrRefCount(data) call
describing why the entire robj (the full object) is intentionally dropped on the
I/O thread rather than moved back to the main thread: explain that this path is
only taken for last-reference RAW strings and that keeping decrRefCount() on the
I/O thread avoids moving allocator work back to the main thread (preventing the
regression), and duplicate this rationale at the other occurrence around the
decrRefCount() at the 692-693 site; reference the robj and decrRefCount symbols
in the comment so future editors understand the contract.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 0fa592b3-8dcc-409e-b41d-0c9c69053ddc
📒 Files selected for processing (1)
src/io_threads.c
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #3760 +/- ##
============================================
- Coverage 76.84% 76.65% -0.19%
============================================
Files 162 162
Lines 80689 80693 +4
============================================
- Hits 62009 61859 -150
- Misses 18680 18834 +154
🚀 New features to boost your workflow:
|
This regression is still present in 9.1 GA as the cherrypick of revert commit was missed during release. Re-applies #3544 (reverted in #3756 due to ~20% SET regression) with the performance fix from #3760. **Root Cause:** The original #3544 changed `tryOffloadFreeObjToIOThreads` to only offload the SDS buffer free to IO threads, freeing the `robj` shell on the main thread. I carried out profiling for the change and it showed that freeing the `robj` shell on the main thread became the prime main-thread hotspot (~10% CPU), while IO threads shifted from doing real `jemalloc` work to spinning idle on `spmcDequeue`. **Fix**: Keep `tryOffloadFreeObjToIOThreads` offloading the entire robj (`decrRefCount`) to the IO thread. Cross-thread `zfree` is safe with `jemalloc`. This PR includes all cleanup work from #3544 so - - `trySendWriteToIOThreads`: defer clearing `last_header` until after successful enqueue - `evictClients`: simplified bookkeeping - Queue sizes as runtime parameters instead of compile-time macros - IO ignition policy using `stat_active_time` instead of `getrusage` - Function renames (`IOThreadFreeArgv` --> `ioThreadFreeArgv`, etc.) and doc comments **Benchmark** on (Graviton4 c8gb.metal-48xl): Config: SET, 128B values, 9 IO threads, pipeline=10, 1600 clients - Same as Valkey official method | Version | Throughput | |---------|-----------| | Unstable + original #3544 | ~1,554K rps | | Unstable + this PR | ~2,116K rps | <details> <summary>Diff vs original #3544 (perf fix)</summary> ```diff diff --git a/src/io_threads.c b/src/io_threads.c --- a/src/io_threads.c +++ b/src/io_threads.c @@ // IO thread handler case JOB_REQ_FREE_OBJ: - zfree(data); + decrRefCount(data); break; @@ // tryOffloadFreeObjToIOThreads - /* We offload only the free of the ptr that may be allocated by the I/O thread. - * The object itself was allocated by the main thread and will be freed by the main thread. */ - void *job = tagJob(sdsAllocPtr(objectGetVal(obj)), JOB_REQ_FREE_OBJ); + void *job = tagJob(obj, JOB_REQ_FREE_OBJ); if (unlikely(spmcEnqueue(&io_shared_inbox, job) == false)) return C_ERR; - objectSetVal(obj, NULL); - decrRefCount(obj); io_jobs_submitted++; ``` </details> --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Problem:
PR #3544 changed
tryOffloadFreeObjToIOThreadsto only offload the SDS buffer free to IO threads, while freeing the robj shell on the main thread viadecrRefCount(obj) → zfree(). This moved allocator work from 8 IO threads back to the single main thread, causing ~20% SET throughput regression when IO threads are active (128B+ values, pipelined workloads).Fix:
Restore the original behavior: offload the entire
decrRefCount(obj)to the IO thread. Cross-threadzfreeis safe with jemalloc's tcache mechanism and the original design has been running without correctness issues since the IO threads redesign.Resolves #3750
Full length benchmark:
Benchmark Comparison: unstable vs fix
Configuration: