perf(core): cut runtime sync I/O on tool hot path by 91%#3581
Conversation
Every recorded chat event (user message, assistant turn, tool call, tool result, slash command, etc.) was issuing 4 sync fs syscalls on the main event loop: existsSync(dir) + mkdirSync(dir) + existsSync(file) + appendFileSync(file). For a tool-heavy prompt this added ~88 sync I/O calls per session, blocking the UI render and keypress handler during each one. - chatRecordingService.appendRecord: cache ensure-flags so dir/file creation runs once per session, then enqueue the actual write on a per-instance promise chain (writeChain). lastRecordUuid is updated synchronously so chained createBaseRecord still sees the right parentUuid without waiting for the previous write. - chatRecordingService.flush: drains the chain — wired into Config.shutdown so no records are lost on exit. - jsonl-utils.writeLine: now actually async (fs.promises.mkdir + fs.promises.appendFile) with per-dir mkdir cache. The existing per-file mutex still serializes writes correctly. - Tests updated to await flush() before assertions. Trace measurement on a single tool-heavy prompt: 110 → 20 sync I/O calls (-82%), with chatRecordingService dropping from 88 to 0.
Each tool invocation went through validatePath → isPathWithinWorkspace → fullyResolvedPath, plus its own existence/dir checks. The same paths got re-resolved across back-to-back tool calls, and ripGrep re- discovered .qwenignore on every Grep. - workspaceContext.fullyResolvedPath: bounded LRU on input path (1024, FIFO). Failed resolutions are NOT cached so retries work. - paths.validatePath: cache positive isDirectory results; ENOENT falls through every time so a freshly created file is picked up immediately. - ripGrep: module-level caches for searchPath-is-dir and per-dir .qwenignore presence (256 each, FIFO). - fileUtils.processSingleFileContent: drop the existsSync gate; let fs.promises.stat throw ENOENT and convert to FILE_NOT_FOUND in catch. Trace: 20 → 10 sync I/O calls. Cumulative reduction since the chat-recording change: 110 → 10, -91%. All 6057 core tests pass.
Self-review pass on the previous two perf commits surfaced a few
follow-ups worth pinning down before they bite:
- Module-level caches (paths.isDirectoryCache, ripGrep dirIsDir/qwen-
Ignore, jsonl-utils.ensuredDirs) persisted across vitest cases
silently. Added underscore-prefixed `_reset*ForTest` exports and
wired one into the validatePath describe block so future cases
mutating the same absolute paths can't pass by accident.
- Documented the parentUuid-chain tradeoff on chatRecordingService
.appendRecord: when the async write rejects, lastRecordUuid was
already set sync, so subsequent records reference an absent
ancestor — readers like sessionService.reconstructHistory then
silently drop those descendants. Same observable failure mode as
the prior sync code's caught-and-logged throw.
- Documented the dir<->file mutation and mid-session .qwenignore
staleness windows for the validatePath / ripGrep caches.
- Added regression tests:
* validatePath does NOT cache ENOENT (Edit-then-Read works)
* validatePath skips re-stat on cache hit (perf assertion)
* flush() resolves immediately on a fresh service
* a rejected writeLine does not block the next record
Full core suite: 6061 pass, 2 skipped — no regressions.
de0a9ed to
d648596
Compare
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
This PR removes repeated synchronous I/O from the chat-recording hot path by moving JSONL appends onto an async write chain and adding bounded caches for repeated path checks. I did not find actionable correctness issues in the changed queue handling.
Verdict
APPROVE — The async queue preserves write ordering, normal shutdown flushes pending records, and the targeted changed-file tests pass.
zhangxy-zju
left a comment
There was a problem hiding this comment.
Really nice work — the measurement methodology (tracer attached, before/after numbers, per-PID handling for spawned helpers) is the most rigorous perf PR I've seen in this repo, and the LRU bounds + ENOENT-not-cached invariant + reset hooks show real care.
Leaving two inline comments on what I think are real correctness regressions introduced by this PR (one definite, one edge-case-but-silent-data-loss).
Known gap (not blocking, follow-up worth filing): after this PR, Config.shutdown() awaits flush() correctly on the graceful path, but runExitCleanup itself has no failsafe — it's an unbounded series of awaits. If any cleanup hangs (slow disk, MCP disconnect on a dead socket, telemetry HTTP stall) the process now hangs forever instead of exiting. Sync writes were inherently bounded by syscall return; async writes need explicit bounds. Claude Code's gracefulShutdown.ts wraps this in Math.max(5s, hookBudget+3.5s) failsafe + per-cleanup Promise.race(2s) — they had to add it after a real incident (per their inline comments). Worth a follow-up PR to add the same shape here; not blocking this one.
| // Ignore errors - directory will be created if it doesn't exist | ||
| } | ||
|
|
||
| this.chatsDirEnsured = true; |
There was a problem hiding this comment.
chatsDirEnsured is set to true even when mkdirSync throws.
Pre-PR, every call retried mkdirSync, so a transient failure was self-healing. Now one failure poisons the cache for the entire process: every subsequent appendRecord short-circuits to a chatsDir whose parent does not exist, the eventual appendFile silently fails to create the file — and no record is ever written for the rest of the session, with no error surfaced to the user.
Trigger case: running the CLI in a sandbox/container with a delayed mount of ~/.qwen, on an NFS mount with intermittent EACCES, or alongside another process that briefly removes the chats parent dir. First mkdir fails → entire session's transcript silently lost.
Fix: move this.chatsDirEnsured = true; inside the try block so it only caches success.
There was a problem hiding this comment.
Fixed in 341b90c — moved this.chatsDirEnsured = true inside the try so a transient mkdir failure stays self-healing. Added a regression test that fails both mkdirSync and writeFileSync(wx) together (otherwise ensureConversationFile's own cache short-circuits and masks the bug), and asserts the next appendRecord retries mkdir.
| // the async write queue so no records are lost on exit. | ||
| try { | ||
| this.chatRecordingService?.finalize(); | ||
| await this.chatRecordingService?.flush(); |
There was a problem hiding this comment.
flush() only runs on the graceful path; the EPIPE exit path now silently drops queued records.
This line correctly drains the queue on Config.shutdown(), but nonInteractiveCli.ts:181 still calls process.exit(0) directly when stdout gets EPIPE — bypassing runExitCleanup entirely, so flush() never runs and queued records are lost.
Under sync writes this was harmless: bytes were on disk before the syscall returned, regardless of how the process exited. After this PR, the same code path becomes silent data loss.
Trigger case: qwen -p "list every file in this repo" | head -1. head exits → EPIPE on stdout → process.exit(0) → ~5 queued records (the most recent assistant turn + tool results) silently lost. This is a routine CLI usage pattern — cli | head, cli | less, cli | grep -m 1 — not an edge case.
Fix: at minimum, change nonInteractiveCli.ts:181 from process.exit(0) to process.stdout.destroy(), so the natural finally → runExitCleanup chain can complete the flush. Claude Code's process.ts uses exactly this pattern — stream.destroy() on EPIPE without exiting — for the same reason.
There was a problem hiding this comment.
Fixed in bf24fff at nonInteractiveCli.ts:181 — process.exit(0) → process.stdout.destroy(), with a regression test that emits EPIPE mid-stream and asserts stdout.destroy is called (mocked process.exit would throw if the bypass came back).
One subtlety from implementing it: I tried adding abortController.abort() alongside the destroy, but the abort path runs handleCancellationError which itself calls process.exit(130) — same bypass on a different code path. So the handler now only destroys stdout and lets the natural function return drive cleanup. Token waste on EPIPE is bounded by the in-flight LLM response and seems like the right trade vs. data loss.
Pre-fix, the flag flipped to true even when mkdirSync threw, so a single transient failure (NFS EACCES, sandbox mount race, parent dir briefly missing) would short-circuit every subsequent appendRecord and silently drop the rest of the session's transcript with no error surfaced. Reported by zhangxy-zju on #3581.
Routine CLI patterns like `qwen -p ... | head -1` / `| less` / `| grep -m1` close the downstream pipe and trigger EPIPE. The previous handler called process.exit(0), which bypassed the caller's runExitCleanup -> Config .shutdown -> chat-recording flush() chain and silently dropped queued JSONL writes (most recent assistant turn + tool results). Destroying stdout instead lets writes fail fast and the natural function return drive cleanup. We deliberately do not also abortController.abort() here: the abort path runs handleCancellationError which itself calls process.exit(130), re-introducing the same bypass. Reported by zhangxy-zju on #3581.
Pre-fix, runExitCleanup was an unbounded series of awaits. After the async-jsonl change moved chat-recording writes off the calling thread (Config.shutdown now `await flush()`s the queue), any hung syscall (slow disk, dead NFS mount, stuck MCP socket, telemetry HTTP stall) would hang process exit indefinitely — sync writes were inherently bounded by syscall return; async writes are not. Adds per-cleanup 2s + overall 5s wall-clock failsafes on the same shape as Claude Code's gracefulShutdown.ts. Also replaces dead test-isolation code (`global['cleanupFunctions']` was never on global, the array is module-private) with a `_resetCleanupFunctionsForTest` hook matching the convention from d648596. Follow-up flagged by zhangxy-zju on #3581.
|
Thanks for the thorough review — addressed all three points in this push:
Note: All changes went through 4 rounds of self-audit + reverse-audit before commit. Tests: 53 passing across the touched files (including 8 new regression guards). Per-cleanup test ran 5× locally with no flakes. |
zhangxy-zju
left a comment
There was a problem hiding this comment.
Verified all three follow-ups — LGTM, approving.
-
341b90cd9(chatsDirEnsured) —this.chatsDirEnsured = truecorrectly moved inside thetry. The regression test is well-designed: it intentionally fails bothmkdirSyncAND thewx-create, with a comment explaining why (otherwisecachedConversationFile's own short-circuit hides the bug). That's the kind of test that only gets written when you've actually traced the failure path. -
bf24fff1f(EPIPE) —process.exit(0)→process.stdout.destroy()with idempotency guard. Especially appreciated the commit message calling out whyabortController.abort()is also wrong (handleCancellationError→process.exit(130)re-introduces the same bypass) — proves the alternative was considered, not just settled for. -
869a4e3b1(shutdown failsafe) — went beyond the scope I flagged. The per-fn 2s + overall 5s structure mirrors the Claude CodegracefulShutdownpattern;wallClockTimer.unref?.()is the correct touch so the timer can't keep the loop alive on its own. Bonus: the deadglobal['cleanupFunctions'] = []test-isolation code is now properly replaced with_resetCleanupFunctionsForTest, matching the convention fromd6485964c. Solid cleanup work I wasn't expecting.
The perf wins from the original three commits + the safety nets from the follow-ups together make this a much stronger PR than it started as. Nice turnaround.
|
Addressed the SIGINT bypass too in 718ebb3 — the What changed: Test coverage: 5 new regression guards in errors.test.ts — cleanup-before-exit ordering for each handler (text + JSON modes), plus a concurrent-handler race that proves the latch parks the second caller. Out of scope still: there are a few other After this PR there are no remaining |
|
Thanks for the careful re-review and the approval @zhangxy-zju 🙏 Heads-up that 718ebb3 landed an hour after your approval — same theme (process.exit bypassing runExitCleanup), this time on the SIGINT / max-turn / fatal-error paths via the three handlers in If you'd like a fresh pass on that commit before this gets merged, happy to wait. If you'd rather the approval cover it (it's structurally the same fix shape — async handler + drain cleanup + exit-once latch for the concurrent case), say the word and I'll proceed to merge. |
* perf(core): make chat recording writes async
Every recorded chat event (user message, assistant turn, tool call,
tool result, slash command, etc.) was issuing 4 sync fs syscalls on
the main event loop: existsSync(dir) + mkdirSync(dir) + existsSync(file)
+ appendFileSync(file). For a tool-heavy prompt this added ~88 sync
I/O calls per session, blocking the UI render and keypress handler
during each one.
- chatRecordingService.appendRecord: cache ensure-flags so dir/file
creation runs once per session, then enqueue the actual write on a
per-instance promise chain (writeChain). lastRecordUuid is updated
synchronously so chained createBaseRecord still sees the right
parentUuid without waiting for the previous write.
- chatRecordingService.flush: drains the chain — wired into
Config.shutdown so no records are lost on exit.
- jsonl-utils.writeLine: now actually async (fs.promises.mkdir +
fs.promises.appendFile) with per-dir mkdir cache. The existing
per-file mutex still serializes writes correctly.
- Tests updated to await flush() before assertions.
Trace measurement on a single tool-heavy prompt: 110 → 20 sync I/O
calls (-82%), with chatRecordingService dropping from 88 to 0.
* perf(core): cache repeated fs lookups on tool hot path
Each tool invocation went through validatePath → isPathWithinWorkspace
→ fullyResolvedPath, plus its own existence/dir checks. The same paths
got re-resolved across back-to-back tool calls, and ripGrep re-
discovered .qwenignore on every Grep.
- workspaceContext.fullyResolvedPath: bounded LRU on input path
(1024, FIFO). Failed resolutions are NOT cached so retries work.
- paths.validatePath: cache positive isDirectory results; ENOENT
falls through every time so a freshly created file is picked up
immediately.
- ripGrep: module-level caches for searchPath-is-dir and per-dir
.qwenignore presence (256 each, FIFO).
- fileUtils.processSingleFileContent: drop the existsSync gate;
let fs.promises.stat throw ENOENT and convert to FILE_NOT_FOUND
in catch.
Trace: 20 → 10 sync I/O calls. Cumulative reduction since the
chat-recording change: 110 → 10, -91%. All 6057 core tests pass.
* test(core): cache reset hooks + regression-guards from audit
Self-review pass on the previous two perf commits surfaced a few
follow-ups worth pinning down before they bite:
- Module-level caches (paths.isDirectoryCache, ripGrep dirIsDir/qwen-
Ignore, jsonl-utils.ensuredDirs) persisted across vitest cases
silently. Added underscore-prefixed `_reset*ForTest` exports and
wired one into the validatePath describe block so future cases
mutating the same absolute paths can't pass by accident.
- Documented the parentUuid-chain tradeoff on chatRecordingService
.appendRecord: when the async write rejects, lastRecordUuid was
already set sync, so subsequent records reference an absent
ancestor — readers like sessionService.reconstructHistory then
silently drop those descendants. Same observable failure mode as
the prior sync code's caught-and-logged throw.
- Documented the dir<->file mutation and mid-session .qwenignore
staleness windows for the validatePath / ripGrep caches.
- Added regression tests:
* validatePath does NOT cache ENOENT (Edit-then-Read works)
* validatePath skips re-stat on cache hit (perf assertion)
* flush() resolves immediately on a fresh service
* a rejected writeLine does not block the next record
Full core suite: 6061 pass, 2 skipped — no regressions.
* fix(core): cache chatsDirEnsured only on mkdir success
Pre-fix, the flag flipped to true even when mkdirSync threw, so a single
transient failure (NFS EACCES, sandbox mount race, parent dir briefly
missing) would short-circuit every subsequent appendRecord and silently
drop the rest of the session's transcript with no error surfaced.
Reported by zhangxy-zju on QwenLM#3581.
* fix(cli): destroy stdout instead of process.exit on EPIPE
Routine CLI patterns like `qwen -p ... | head -1` / `| less` / `| grep -m1`
close the downstream pipe and trigger EPIPE. The previous handler called
process.exit(0), which bypassed the caller's runExitCleanup -> Config
.shutdown -> chat-recording flush() chain and silently dropped queued
JSONL writes (most recent assistant turn + tool results).
Destroying stdout instead lets writes fail fast and the natural function
return drive cleanup. We deliberately do not also abortController.abort()
here: the abort path runs handleCancellationError which itself calls
process.exit(130), re-introducing the same bypass.
Reported by zhangxy-zju on QwenLM#3581.
* fix(cli): bound runExitCleanup with per-fn + wall-clock timeouts
Pre-fix, runExitCleanup was an unbounded series of awaits. After the
async-jsonl change moved chat-recording writes off the calling thread
(Config.shutdown now `await flush()`s the queue), any hung syscall
(slow disk, dead NFS mount, stuck MCP socket, telemetry HTTP stall)
would hang process exit indefinitely — sync writes were inherently
bounded by syscall return; async writes are not.
Adds per-cleanup 2s + overall 5s wall-clock failsafes on the same
shape as Claude Code's gracefulShutdown.ts. Also replaces dead
test-isolation code (`global['cleanupFunctions']` was never on global,
the array is module-private) with a `_resetCleanupFunctionsForTest`
hook matching the convention from 9dce715f3.
Follow-up flagged by zhangxy-zju on QwenLM#3581.
---------
Co-authored-by: wenshao <wenshao@U-K7F6PQY3-2157.local>
Summary
Tracing a single tool-heavy prompt revealed that
chatRecordingService.appendRecordwas issuing 4 synchronous fs syscalls (existsSync+mkdirSync+existsSync+appendFileSync) on every recorded event — user message, assistant turn, tool call, tool result, slash command, etc. Combined with redundantrealpathSync/statSync/existsSynccalls scattered acrossvalidatePath/workspaceContext/ripGrep/fileUtils, a typical session was paying ~110 sync I/O calls per prompt on the main event loop. On slow disks or large config files this directly stalls Ink rendering and keypress handling.This branch reduces that to ~10 calls per prompt (-91%) through three commits:
perf(core): make chat recording writes async—appendRecordbecomes fire-and-forget on a per-instance promise chain (writeChain);lastRecordUuidupdated synchronously to preserveparentUuidchain;Config.shutdown()awaits a newflush()so no records are lost on exit.jsonl.writeLinenow usesfs.promises.mkdir/appendFile. (110 → 20)perf(core): cache repeated fs lookups on tool hot path— bounded LRU caches forworkspaceContext.fullyResolvedPath,paths.validatePath(positive only — ENOENT not cached),ripGrep.qwenignorediscovery;fileUtilsdrops itsexistsSyncpre-check and convertsfs.promises.statENOENT toFILE_NOT_FOUND. (20 → 10)test(core): cache reset hooks + regression-guards from audit— followups from a multi-round reverse audit:_reset*ForTestexports for module-level caches, regression tests covering ENOENT-not-cached /flush()early-resolve / write failure not blocking the chain, and inline notes on the deliberate staleness windows (dir↔file mutation, mid-session.qwenignore,parentUuidchain on write rejection).Reproducing the measurement
/tmp/qwen-trace/trace-sync-io.cjs.npm install && npm run bundle.NODE_OPTIONSpointed at the tracer. The same prompt was used for every measurement in this PR:NODE_OPTIONS, so each process writes its own PID-suffixed summary — pick the largestcalls.<pid>.log(that's the CLI process, ~30s runtime; helpers exit in ~1ms):Before vs after (same prompt, same tracer)
Baseline (commit 3182500,
x2HEAD before this branch):After (this PR head):
Every remaining site fires exactly once per unique input path — which is the theoretical lower bound short of rewriting the semantics.
Tracer script (
trace-sync-io.cjs, ~160 lines)Key design points worth noting:
fs.appendFileSyncinternally callsfs.writeFileSync; without this guard the wrappers recurse and pollute the data with phantomwriteFileSyncattributions.NODE_OPTIONS; each writes its own summary so the UI process's data isn't overwritten by a 1ms helper's empty one.setTimeout(() => armed = true, 4000)excludes module-init syscalls, so only interaction-time I/O is counted.Early exploration used tmux to drive the interactive TUI through slash commands (where API calls weren't wanted), but the numbers above come from the non-interactive path — a real prompt with real tool calls, deterministic and easy to diff across stages.
Test plan
npx vitest run(packages/core) — 6061 pass, 2 skipped, 246 filesparentUuidchain after the async switchNotes for review
appendRecordkeeps itsvoidsignature so the 11+ internal callers don't need to propagateawaits; the chain +flush()covers correctness._reset*ForTestexports.chatRecordingService.appendRecord,paths.isDirectoryCache, and the ripGrep cache header.