Skip to content

[Performance] Revert tryOffloadFreeObjToIOThreads to offload entire robj#3760

Closed
roshkhatri wants to merge 3 commits into
valkey-io:unstablefrom
roshkhatri:fix-io-thread-free-regression
Closed

[Performance] Revert tryOffloadFreeObjToIOThreads to offload entire robj#3760
roshkhatri wants to merge 3 commits into
valkey-io:unstablefrom
roshkhatri:fix-io-thread-free-regression

Conversation

@roshkhatri

@roshkhatri roshkhatri commented May 18, 2026

Copy link
Copy Markdown
Member

Problem:

PR #3544 changed tryOffloadFreeObjToIOThreads to only offload the SDS buffer free to IO threads, while freeing the robj shell on the main thread via decrRefCount(obj) → zfree(). This moved allocator work from 8 IO threads back to the single main thread, causing ~20% SET throughput regression when IO threads are active (128B+ values, pipelined workloads).

Fix:

Restore the original behavior: offload the entire decrRefCount(obj) to the IO thread. Cross-thread zfree is safe with jemalloc's tcache mechanism and the original design has been running without correctness issues since the IO threads redesign.

Resolves #3750

Full length benchmark:

Benchmark Comparison: unstable vs fix

% Change Test Unstable Fixed
+28.2% SET rps P10 T9 1.53M 1.96M
-22.0% SET avg_latency P10 T9 10.4ms 8.13ms
-22.0% SET p50_latency P10 T9 10.4ms 8.12ms
-21.9% SET p95_latency P10 T9 10.6ms 8.32ms
-21.8% SET p99_latency P10 T9 10.7ms 8.37ms

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration (180s)
  • clients: 1600
  • data_size: 128B
  • pipeline: 10
  • io-threads: 9
  • benchmark-threads: 90
  • server_cpu_range: 0-8
  • client_cpu_range: 96-191
  • warmup: 30s

Commit fdd9039 changed tryOffloadFreeObjToIOThreads to only offload
the SDS buffer free to IO threads, while freeing the robj shell on
the main thread. This caused a ~20% SET regression with IO threads
active (128B values, pipeline=10, 9 IO threads, 1600 clients).

SPE profiling shows tryOffloadFreeObjToIOThreads became the #1
hotspot on the main thread, and IO threads shifted from doing real
jemalloc work (je_edata_heap_remove_first, sdsfree) to spinning
idle on spmcDequeue.

The original design correctly offloads the entire decrRefCount to
the IO thread. Cross-thread zfree is handled efficiently by
jemalloc's tcache mechanism and does not require same-thread
affinity for correctness.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors object freeing offloaded to IO threads. The enqueuing path in tryOffloadFreeObjToIOThreads() now passes the robj pointer itself as the job payload, while the IO thread's job handler for JOB_REQ_FREE_OBJ switches from direct zfree() to decrRefCount() to respect reference counting semantics.

Changes

IO Thread Object Freeing

Layer / File(s) Summary
Reference-counted object freeing in IO threads
src/io_threads.c
JOB_REQ_FREE_OBJ handling changed from zfree(data) to decrRefCount(data), and the job enqueuing path now passes the robj *obj pointer directly instead of an allocated pointer derived from the object's value.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: reverting tryOffloadFreeObjToIOThreads to offload the entire robj, which directly addresses the performance regression issue.
Linked Issues check ✅ Passed The PR directly addresses issue #3750 by reverting the problematic change in PR #3544 that moved allocator work from IO threads to main thread, restoring the original behavior to fix the ~20% SET throughput regression.
Out of Scope Changes check ✅ Passed The changes to src/io_threads.c (modifying JOB_REQ_FREE_OBJ handling to call decrRefCount instead of zfree directly) are directly scoped to fixing the performance regression identified in issue #3750.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The pull request description clearly explains the problem (performance regression caused by PR #3544), the proposed fix (restoring original offloading behavior), and provides comprehensive benchmark data demonstrating the improvement.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@roshkhatri roshkhatri added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label May 18, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/io_threads.c (1)

346-346: ⚡ Quick win

Document why the whole robj stays on the I/O thread.

This contract is easy to “optimize” back into the regressed version. A short rationale here would make it clear that only last-reference RAW strings take this path, and that the full decrRefCount() intentionally remains on the I/O thread so allocator work does not move back to the main thread.

Suggested comment update
             case JOB_REQ_FREE_OBJ:
-                decrRefCount(data);
+                /* Only last-reference RAW string objects are enqueued here.
+                 * Keep the full decrRefCount() on the I/O thread so both the
+                 * SDS buffer and the robj shell are freed off the main thread.
+                 */
+                decrRefCount((robj *)data);
                 break;
@@
-    void *job = tagJob(obj, JOB_REQ_FREE_OBJ);
+    /* Offload the whole object, not just obj->ptr, to keep allocator work
+     * off the main thread on this hot path.
+     */
+    void *job = tagJob(obj, JOB_REQ_FREE_OBJ);

As per coding guidelines, "Use comments for non-obvious behavior and rationale, not for restating code."

Also applies to: 692-693

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/io_threads.c` at line 346, Add a short explanatory comment next to the
decrRefCount(data) call describing why the entire robj (the full object) is
intentionally dropped on the I/O thread rather than moved back to the main
thread: explain that this path is only taken for last-reference RAW strings and
that keeping decrRefCount() on the I/O thread avoids moving allocator work back
to the main thread (preventing the regression), and duplicate this rationale at
the other occurrence around the decrRefCount() at the 692-693 site; reference
the robj and decrRefCount symbols in the comment so future editors understand
the contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/io_threads.c`:
- Line 346: Add a short explanatory comment next to the decrRefCount(data) call
describing why the entire robj (the full object) is intentionally dropped on the
I/O thread rather than moved back to the main thread: explain that this path is
only taken for last-reference RAW strings and that keeping decrRefCount() on the
I/O thread avoids moving allocator work back to the main thread (preventing the
regression), and duplicate this rationale at the other occurrence around the
decrRefCount() at the 692-693 site; reference the robj and decrRefCount symbols
in the comment so future editors understand the contract.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 0fa592b3-8dcc-409e-b41d-0c9c69053ddc

📥 Commits

Reviewing files that changed from the base of the PR and between 2946807 and 5096255.

📒 Files selected for processing (1)
  • src/io_threads.c

@codecov

codecov Bot commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.65%. Comparing base (69bd02b) to head (5096255).
⚠️ Report is 2 commits behind head on unstable.

Files with missing lines Patch % Lines
src/io_threads.c 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3760      +/-   ##
============================================
- Coverage     76.84%   76.65%   -0.19%     
============================================
  Files           162      162              
  Lines         80689    80693       +4     
============================================
- Hits          62009    61859     -150     
- Misses        18680    18834     +154     
Files with missing lines Coverage Δ
src/io_threads.c 76.04% <0.00%> (+0.33%) ⬆️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@rainsupreme

rainsupreme commented May 19, 2026

Copy link
Copy Markdown
Contributor

this is an empty PR after #3756 which reverted #3756

@roshkhatri roshkhatri closed this May 20, 2026
madolson pushed a commit that referenced this pull request Jun 9, 2026
This regression is still present in 9.1 GA as the cherrypick of revert
commit was missed during release.

Re-applies #3544 (reverted in #3756 due to ~20% SET regression) with the
performance fix from #3760.

**Root Cause:** The original #3544 changed
`tryOffloadFreeObjToIOThreads` to only offload the SDS buffer free to IO
threads, freeing the `robj` shell on the main thread. I carried out
profiling for the change and it showed that freeing the `robj` shell on
the main thread became the prime main-thread hotspot (~10% CPU), while
IO threads shifted from doing real `jemalloc` work to spinning idle on
`spmcDequeue`.

**Fix**: Keep `tryOffloadFreeObjToIOThreads` offloading the entire robj
(`decrRefCount`) to the IO thread. Cross-thread `zfree` is safe with
`jemalloc`.

This PR includes all cleanup work from #3544 so - 
- `trySendWriteToIOThreads`: defer clearing `last_header` until after
successful enqueue
- `evictClients`: simplified bookkeeping
- Queue sizes as runtime parameters instead of compile-time macros
- IO ignition policy using `stat_active_time` instead of `getrusage`
- Function renames (`IOThreadFreeArgv` --> `ioThreadFreeArgv`, etc.) and
doc comments

**Benchmark** on (Graviton4 c8gb.metal-48xl):
Config: SET, 128B values, 9 IO threads, pipeline=10, 1600 clients - Same
as Valkey official method
| Version | Throughput |
|---------|-----------|
| Unstable + original #3544 | ~1,554K rps |
| Unstable + this PR | ~2,116K rps |

<details>
<summary>Diff vs original #3544 (perf fix)</summary>

```diff
diff --git a/src/io_threads.c b/src/io_threads.c
--- a/src/io_threads.c
+++ b/src/io_threads.c
@@ // IO thread handler
 case JOB_REQ_FREE_OBJ:
-    zfree(data);
+    decrRefCount(data);
     break;

@@ // tryOffloadFreeObjToIOThreads
-    /* We offload only the free of the ptr that may be allocated by the I/O thread.
-     * The object itself was allocated by the main thread and will be freed by the main thread. */
-    void *job = tagJob(sdsAllocPtr(objectGetVal(obj)), JOB_REQ_FREE_OBJ);
+    void *job = tagJob(obj, JOB_REQ_FREE_OBJ);
     if (unlikely(spmcEnqueue(&io_shared_inbox, job) == false)) return C_ERR;
-    objectSetVal(obj, NULL);
-    decrRefCount(obj);
     io_jobs_submitted++;
```
</details>

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri roshkhatri deleted the fix-io-thread-free-regression branch June 10, 2026 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance] 20% SET command regression due to IO-Threads redesign cleanup work

2 participants