Skip to content

fix(core): recover from }{ glued records on session JSONL load (#3606)#3656

Merged
wenshao merged 1 commit into
QwenLM:mainfrom
qqqys:fix/3606-jsonl-recovery
Apr 27, 2026
Merged

fix(core): recover from }{ glued records on session JSONL load (#3606)#3656
wenshao merged 1 commit into
QwenLM:mainfrom
qqqys:fix/3606-jsonl-recovery

Conversation

@qqqys

@qqqys qqqys commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • What changed: jsonl.read() and jsonl.readLines() now tolerate per-line parse errors and recover concatenated records (}{ glued onto a single physical line) via a brace-depth scanner that respects string boundaries and \ escapes. New helper _recoverObjectsFromLine() does the splitting; parseLineTolerant() routes per-line failures through it.
  • Why it changed: Fixes No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606. A single corrupted line (two records glued without a newline separator — caused by an interrupted append losing its trailing \n) made JSON.parse throw, so read() returned [] and loadSession() returned undefined ("No saved session found"). Worse, simply skipping the bad line orphans every subsequent record via parentUuid chain breakage in reconstructHistory, so recovery — not just tolerance — is required to actually restore the session.
  • Reviewer focus:
    1. _recoverObjectsFromLine correctness — string/escape handling, depth < 0 reset, and the documented limitation (top-level objects only)
    2. parseLineTolerant 0-vs-N branch in debugLogger.warn (no false "Recovered 0" noise)
    3. No write-side change — read-side recovery is sufficient and avoids accumulating empty separator lines on every resume

Validation

  • Commands run:

    # Full unit suite for the touched code paths
    npx vitest run \
      packages/core/src/utils/jsonl-utils.test.ts \
      packages/core/src/services/chatRecordingService.test.ts \
      packages/core/src/services/sessionService.test.ts \
      packages/core/src/services/sessionService.rename.test.ts \
      packages/core/src/services/chatRecordingService.customTitle.test.ts \
      packages/core/src/services/chatRecordingService.autoTitle.test.ts
    
    # Type + lint
    npx tsc --noEmit -p packages/core
    npx eslint packages/core/src/utils/jsonl-utils.ts \
                packages/core/src/utils/jsonl-utils.test.ts
    
    # End-to-end against the JSONL attached to #3606
    npx tsx <<'EOF'
    const { SessionService } = await import(
      './packages/core/src/services/sessionService.ts'
    );
    const svc = new SessionService('/path/to/project');
    const loaded = await svc.loadSession('5b0c6411-5f63-4457-a33f-f4d3af300dd5');
    console.log('messages:', loaded?.conversation.messages.length);
    EOF
  • Inputs used: the JSONL file megapro17 attached to No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606 (5b0c6411-...jsonl), 480 lines, 1 malformed line at line 63 of the form {record_A}}}{record_B} (an api_response ui_telemetry record glued to a user message record across a 5-minute session-restart gap).

  • Expected result:

    • Unit suite green (existing + 13 new cases)
    • loadSession returns 481 records (480 lines, line 63 yields 2)
    • All UUIDs unique
    • The line-64 record's parentUuid = ce2f0622... resolves (recovered from the second half of line 63), so reconstructHistory walks the full chain with no orphans
  • Observed result:

    Test Files  6 passed (6)
         Tests  98 passed (98)
    
    --- end-to-end loadSession against #3606 attachment ---
    Sessions visible to picker: ... (target listed; messageCount: 144)
    loadSession result:
      messages reconstructed: 481
      lastCompletedUuid: 4702d357-1eca-42ab-bb67-21d7683f5e2d
      Glued pair recovered: true true (3ff76a26 + ce2f0622)
      Post-corruption record present: true (63eefedc — line 64)
    
  • Quickest reviewer verification path:

    1. Download the JSONL from megapro17's comment on No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606
    2. Place under any <runtime>/projects/<sanitized-cwd>/chats/<id>.jsonl (rewrite records[0].cwd to your project root so getProjectHash matches)
    3. npm start -- --resume 5b0c6411-5f63-4457-a33f-f4d3af300dd5
    4. Before this PR: "No saved session found". After: full session loads, 481 messages reconstructed; stderr/debug log shows Recovered 2 record(s) from malformed line in ...
  • Evidence — before/after on the real attachment:

    Metric Before fix After fix
    read() returns [] 481 records
    loadSession() returns undefined full conversation
    reconstructHistory() linear messages 0 481
    Glued line halves recovered 0 / 2 2 / 2
    Orphaned records via parentUuid break 417 0

Scope / Risk

  • Main risk / tradeoff:
    • Scanner only recognizes top-level {...} records, not [...] arrays (documented in the helper's JSDoc). All existing JSONL writers in this codebase emit object records, so this matches the only corruption shape we observe in the wild — but a future writer that emits arrays would not benefit from the recovery path.
    • On a parse failure, the scanner attempts recovery by walking the line; on a deliberately-crafted huge single line this is O(n) but unbounded by the upstream readline buffer. In practice session JSONL is small (megapro17's worst case: 162 KB single record, < 5 ms scan).
  • Not covered / not validated:
    • No fix on the write side. The trailing-newline loss that triggered No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606 is OS/process-level (Windows console close, abnormal exit, antivirus interception of un-fsync'd page cache) — out of scope to repair preemptively. Read-side recovery is enough to keep loadSession correct even if the same corruption recurs, and avoids accumulating extra empty lines on every resume.
    • Not exercised on Windows or Linux locally — see matrix below.
  • Breaking changes / migration notes: None. read() / readLines() signatures and return types are unchanged; behavior change is purely additive (was throwing → now recovers + logs).

Testing Matrix

🍏 🪟 🐧
npm run ⚠️ ⚠️
npx ⚠️ ⚠️
Docker N/A N/A N/A
Podman N/A N/A N/A
Seatbelt N/A N/A N/A

Testing matrix notes:

  • macOS (🍏): unit suite + end-to-end loadSession against the No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606 attachment, both via npm run and direct npx tsx. ✅
  • Windows (🪟) / Linux (🐧): not exercised locally. The change is pure JS string scanning + a try/catch around JSON.parse; no platform-specific syscalls or path handling. The original write-side \n-loss that produced the bug is reportedly Windows-specific, but the read-side recovery is platform-agnostic.
  • Docker / Podman / Seatbelt: not applicable — change is in packages/core JSONL utilities, no sandbox or container surface.

Linked Issues / Bugs

Closes #3606

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review

@wenshao wenshao merged commit 8a27876 into QwenLM:main Apr 27, 2026
23 of 25 checks passed
B-A-M-N pushed a commit to B-A-M-N/qwen-code that referenced this pull request Apr 28, 2026
doudouOUC added a commit that referenced this pull request Jun 2, 2026
…NL (closes #3681, #4095 Phase 2) (#4333)

* feat(core): add atomicWriteFileSync + forceMode option

Sync mirror of atomicWriteFile for code paths that can't await
(settings persistence on exit, sync config writers). Same semantics:
symlink chain resolution, permission preservation, fsync via flush:true,
EPERM/EACCES rename retry, EXDEV fallback to direct write.

Add forceMode option on AtomicWriteFileOptions — when true, ignore the
existing target's permission bits and apply options.mode regardless.
Needed for credential files that must heal historically over-permissive
files (e.g. a 0o644 token restored from backup must be forced to 0o600).
Honored by both async and sync paths. Default false preserves existing
behavior.

Reuses Atomics.wait for true blocking sleep in renameWithRetrySync —
no busy-wait, no extra dep.

Refs: #4095 Phase 2

* refactor(core): migrate credential writes to atomicWriteFile (#4095 Tier 1)

Route all OAuth credential persistence through atomicWriteFile with
forceMode: true, so a process crash mid-write cannot leave the user
with a half-written token file, and historically over-permissive files
(e.g. 0o644 from a manual restore) are healed to 0o600 on the next
write.

- oauth-token-storage.ts: setCredentials, deleteCredentials
- file-token-storage.ts: saveTokens (encrypted MCP token storage)
- qwenOAuth2.ts: cacheQwenCredentials (also fixes missing mode — was
  inheriting 0o644 from umask, now forced to 0o600)
- sharedTokenManager.ts: saveCredentialsToFile — drops ~15 lines of
  hand-rolled tmp + rename in favor of the shared helper

Lock-file writes using flag: 'wx' (sharedTokenManager.ts:720) are
intentionally left untouched — they rely on exclusive-create semantics
that atomic write does not preserve.

Tests updated to mock atomicWriteFile instead of fs.writeFile.

Refs: #4095 Phase 2

* refactor(core): migrate memory state writes to atomicWriteFile (#4095 Tier 2)

Route all auto-memory state persistence through atomicWriteFile so a
process crash during a dream/extract/forget cycle cannot corrupt the
metadata sidecar, extraction cursor, or topic body files.

Touched: manager (writeDreamMetadata), extract (writeExtractCursor +
bumpMetadata), indexer (rebuild), dream (bumpDreamMetadata), forget
(bumpMetadata + topic body rewrite).

manager.ts:362 acquireDreamLock uses flag: 'wx' for exclusive create —
left untouched, atomic write does not preserve that semantic.

Uses atomicWriteFile (not atomicWriteJSON) to preserve the trailing
newline these files have always had.

Refs: #4095 Phase 2

* refactor: migrate config + logger + state writes to atomic helpers (#4095 Tier 3a)

Route the remaining state-file write paths through atomic helpers so a
crash mid-write cannot corrupt config, log, or session-scoped state:

- trustedFolders.ts (sync): atomicWriteFileSync — sole path that flips
  workspace trust, must not half-write
- logger.ts (4 sites): atomicWriteFile — full-file JSON rewrites for
  logs.json and per-checkpoint files
- tipHistory, installationManager, projectSummary, todoWrite,
  trustedHooks: bonus sites with the same shape (state JSON written
  multiple times per session)

todoWrite is on the hot path — writes every time the todo list mutates
— so the added rename + fsync cost is measurable (a few ms per write
on SSD). Trade-off accepted to avoid a half-written todos file
silently breaking the next session's resume.

Export atomicWriteFile / atomicWriteFileSync from the core public
index so CLI-side callers (trustedFolders, tipHistory) can reach them.

Tests updated:
- logger.test.ts uses vi.importActual to re-export the real helper and
  override per-test via vi.mocked(atomicWriteFile).mockRejectedValueOnce
- trustedFolders.test.ts and todoWrite.test.ts mock the helper directly

Refs: #4095 Phase 2

* fix(core): flush JSONL appends to disk (#4095 Tier 3b, closes #3681)

#3656 fixed the read side of glued '}{' JSONL records — when a process
was killed mid-appendFile, the trailing '\n' was lost and the next
record was concatenated. The write side was left for a follow-up
(#3681).

This adds flush:true (fsync) to every per-line append:
- jsonl-utils.ts writeLine / writeLineSync (session transcripts,
  auto-titles, prompt history)
- debugLogger.ts appendFile (per-session debug log)

jsonl-utils.ts write() (full-file replace) now goes through
atomicWriteFileSync so a crash during overwrite cannot corrupt the
session transcript either.

Trade-off: fsync on every append adds disk-sync latency (single-digit
ms on SSD, more on spinning disk / network FS). Acceptable for a few
writes per turn; the alternative is silently losing the last record
of every interrupted session, which #3681 explicitly flagged.

Refs: #4095 Phase 2 Tier 3b
Closes: #3681

* refactor(core): migrate extension config + LSP edit to atomic write

Catch up two sites where Claude Code's equivalent path is atomic but
qwen-code's isn't (verified against
/Users/jinye.djy/Projects/claude-code on 2026-05-19):

- extension/extensionManager.ts:533, :1073 — enablement config and
  install metadata writes. Claude Code's plugin install-counts and zip
  cache use atomic temp+rename via writeFileSyncAndFlush_DEPRECATED.
- lsp/NativeLspService.ts:1351 — applying an LSP edit to a user file.
  Claude Code's FileWriteTool/FileEditTool both route through atomic
  writeTextContent → writeFileSyncAndFlush_DEPRECATED. A bare
  writeFileSync here could half-write the user's source file if the
  process is killed during an LSP-driven rename or quick-fix.

Also clean up stale fs.rename mock setups in sharedTokenManager.test.ts
that became no-ops after Tier 1 migration (rename is no longer called
by saveCredentialsToFile). The fs.writeFile mocks stay because the
wx-flag lock path still uses them.

Refs: #4095 Phase 2

* chore: cosmetic cleanups from PR review

- packages/core/src/index.ts: move atomicFileWrite export to its
  alphabetical position (before browser.js)
- tipHistory.ts: add forceMode: true to atomicWriteFileSync for
  consistency with other 0o600 sites — heals legacy 0o644 files even
  though tips are non-critical

Refs: #4095 Phase 2

* fix(core): address Codex review findings on Phase 2 PR

Three issues caught by post-merge Codex review of the #4095 Phase 2
branch — none had user-visible symptoms yet but all were latent bugs.

1. atomicFileWrite: forceMode without mode silently downgraded perms
   `if (!options?.forceMode)` skipped the existing-mode stat whenever
   forceMode was true, regardless of whether `mode` was also supplied.
   Calling `atomicWriteFile(p, data, { forceMode: true })` (no mode) on
   an existing 0o600 file produced 0o644 (umask default) instead of
   preserving 0o600. Tightened the guard to also require `options.mode`
   to be defined; mirrored fix in atomicWriteFileSync. Added two
   regression tests (async + sync) that assert mode preservation.

2. logger.test.ts: vi.resetAllMocks() blanked the atomicWriteFile shim
   The vi.fn(actual.atomicWriteFile) factory implementation gets reset
   to a no-op by `vi.resetAllMocks()` in beforeEach, which would make
   `logger.initialize()` silently skip creating logs.json on disk.
   Tests passed by coincidence (file pre-existence from prior runs).
   Captured the real implementation at module load and re-attach it via
   `mockImplementation` after each reset.

3. NativeLspService.applyTextEdits: atomic write bypassed file unwritability
   The read catch swallowed every error and treated it as "new empty
   file". With atomic write (tmp + rename), an unreadable target on a
   writable parent could be replaced with edits applied to an empty
   buffer — the old fs.writeFileSync would have errored on the target
   permission. Now only ENOENT is treated as new-file; other read
   errors (EACCES, EISDIR, etc.) propagate.

Refs: #4095 Phase 2

* fix(lsp): refuse LSP edits to chmod 0444 files (Codex round 2)

The previous fix only handled "read failed → propagate the error".
Codex round 2 caught the remaining gap: a file that's readable but
chmod 0444 (read-only) would still be replaced by the atomic rename,
because rename only needs parent-directory write access.

Add an explicit fs.accessSync(W_OK) check before the atomic write.
ENOENT is allowed through so LSP can still create new files via edits.

Refs: #4095 Phase 2

* fix(core): drop withTimeout around atomic credential write (Codex round 3)

`saveCredentialsToFile` wrapped `atomicWriteFile` in `withTimeout(5000)`.
If the call hits the 5s budget (e.g. slow NFS home, network-backed
storage, fsync added by Phase 2), withTimeout rejects but the
atomicWriteFile internal write+rename keeps running unobserved:

  1. withTimeout rejects → saveCredentialsToFile throws
  2. performTokenRefresh `finally` releases the refresh lock
  3. Another process acquires the lock and writes newer credentials
  4. The original atomicWriteFile finally completes its rename and
     overwrites the newer credentials — silent token rollback

Pre-migration the code awaited the tmp write and the rename in two
separate withTimeout calls; a timed-out tmp write never reached the
rename so there was no race against the target file. The migration
collapsed both into one inseparable atomicWriteFile, which made the
timeout actively unsafe (the work cannot be cancelled after the
timeout fires — fs.rename is not abortable).

Atomic write is durable by design — accept the I/O latency. The
mkdir and stat timeouts are kept (idempotent and read-only
respectively, no corruption risk on late completion).

Refs: #4095 Phase 2

* test(core): add rename-retry + EXDEV-fallback coverage (#4333 review)

Address PR review suggestions from wenshao (via qwen-latest /review):
neither renameWithRetry/Sync nor the EXDEV cross-device fallback had
direct test coverage. Both paths are critical (Windows AV contention,
Docker tmpfs /tmp) and a regression would degrade silently.

Vitest can't spy on ESM exports of `node:fs` (`Cannot redefine property:
renameSync`), so add narrow internal test seams instead:
- renameWithRetry / renameWithRetrySync take an optional `_renameImpl`
  parameter, defaulting to fs.rename / fs.renameSync.
- atomicWriteFile / atomicWriteFileSync take an optional `_testFs`
  parameter with `rename` and `writeFile` overrides, forwarded to the
  retry helper and used in the EXDEV fallback branch.

The seams are underscore-prefixed and JSDoc-tagged as "Internal test
seam — production callers never pass this", which keeps the public API
clean while making the behavior testable.

New coverage (+9 tests, 36 → 45):
- renameWithRetry: retry-EPERM-then-succeed, give-up after retries,
  no-retry on non-retryable (ENOSPC)
- renameWithRetrySync: same 3 patterns (EACCES, EPERM exhausted, EINVAL)
- EXDEV fallback: async direct write + tmp cleanup, sync ditto,
  non-EXDEV failure propagates without fallback (rejects EIO + tmp cleanup)

Refs: #4333, #4095 Phase 2

* fix(test): update telemetry sdk.test.ts appendFile assertions for flush:true

CI failure on all 3 OSes (macos / ubuntu / windows): sdk.test.ts asserted
`fs.appendFile` was called with `'utf8'` as the 3rd positional argument,
but commit b7badc7 (#4095 Tier 3b — JSONL fsync) changed the
`debugLogger.ts` appendFile call from string-form to options-form
`{ encoding: 'utf8', flush: true }` to enable per-line fsync. Update the
3 assertions in the telemetry diagnostics test to match the new shape.

No behavior change — debugLogger still flushes per append; only the
assertion in this previously-unrelated suite needed updating.

Refs: #4333, #4095 Phase 2 Tier 3b

* test: cover LSP error branches + sync EXDEV cleanup + JSONL writes (#4333 review)

Address three review suggestions from wenshao (via qwen-latest /review),
each pointing at a real coverage gap introduced by this PR:

1. NativeLspService.applyTextEdits error branches (round-2 LSP fix):
   the ENOENT-only read guard and the fs.accessSync(W_OK) refusal had
   no automated coverage. Added 3 tests accessing applyTextEdits via a
   typed cast (the method is private; making it public for one
   verification inflates API surface). Tests use chmod 0000 / chmod 0444
   reproducers and assert (a) read failure propagates EACCES without
   silently overwriting with empty content, (b) W_OK rejects with
   EACCES/EPERM before the atomic rename touches the target,
   (c) nonexistent files are still accepted so LSP can create via edits.

2. atomicWriteFileSync non-EXDEV rename failure cleanup: the async
   counterpart had an explicit EIO-rename test asserting tmp cleanup;
   the sync variant did not. Added the mirror — injects a sync rename
   throwing EIO via the existing _testFs seam and asserts
   `readdirSync(tmpDir).length === 0`.

3. jsonl-utils writeLine / writeLineSync / write smoke tests: the three
   write paths are the core fix for #3681 (the PR's headline goal) but
   downstream callers (chatRecordingService, sessionService) mock them
   entirely. Without direct unit tests, a regression that dropped
   `flush: true` or reverted `write()` to bare writeFileSync would go
   undetected. Added 3 real-fs roundtrip tests.

Test count delta:
- NativeLspService.test.ts: 15 → 18
- atomicFileWrite.test.ts: 45 → 46
- jsonl-utils.test.ts: 22 → 25

Refs: #4333, #4095 Phase 2

* fix(core,test): wrap atomic write errors + guard root user + cover save failure (#4333 review)

Three review fold-ins from wenshao (via qwen-latest /review):

1. atomicFileWrite: error messages reference the random `.tmp.<hex>`
   path, not the logical target — many callers (memory subsystem,
   extension manager) don't wrap the error, making debug logs unhelpful.
   Add `annotateWriteError(error, targetPath)` that mutates the error
   message in-place to prefix `atomicWriteFile("<targetPath>"): ` while
   preserving `code` / `errno` / `syscall` / `stack` / the prototype
   chain so downstream `err.code === 'ENOENT'` checks and `instanceof`
   narrowing keep working. Applied to both async and sync variants;
   only the final propagated throw (not the EXDEV fallback path) is
   annotated.

2. NativeLspService.test.ts: the chmod 0444 and chmod 0000 tests rely
   on `accessSync(W_OK)` and `readFileSync` failing — but on POSIX with
   UID 0 (root, including most Docker CI runners), permission bits are
   bypassed and `accessSync` always succeeds. The tests would silently
   pass even with the W_OK guard removed entirely. Add
   `process.getuid?.() === 0` to the skip guard on both tests.

3. sharedTokenManager.test.ts: the catch block in saveCredentialsToFile
   that maps disk-full / permission-denied to
   `TokenManagerError(FILE_ACCESS_ERROR)` was never exercised — every
   prior test mocked atomicWriteFile as always-successful. Added a
   regression test that rejects atomicWriteFile with ENOSPC and asserts
   the wrapped TokenManagerError surfaces with the right type and
   carries the original message.

Refs: #4333, #4095 Phase 2

* fix(core,lsp): annotate sync errors correctly + cover EXDEV fallback + async LSP IO (#4333 review)

Three review fold-ins from wenshao (via qwen-latest /review), all real
correctness/consistency issues:

1. annotateWriteError hardcoded "atomicWriteFile" prefix regardless of
   caller. Sync write failures produced misleading
   `atomicWriteFile("/path"):` prefixes in incident logs. Add optional
   `fnName` parameter (defaults to async name) and have sync call sites
   pass "atomicWriteFileSync".

2. EXDEV fallback path in both async and sync variants did NOT route
   the inner writeFileImpl/tryChmod failures through annotateWriteError.
   On a cross-device write that subsequently hit ENOSPC, the propagated
   error had a bare syscall message without the target-path prefix —
   breaking the function's documented error-shape contract. Wrap both
   fallback branches in try/catch + annotate.

3. NativeLspService.applyTextEdits is declared `async` and all callers
   `await` it, but the round-2 fix mixed in sync IO: readFileSync,
   accessSync, atomicWriteFileSync. The sync helper's renameWithRetrySync
   blocks the event loop up to ~350ms under Atomics.wait EPERM backoff
   — particularly bad for LSP workspace edits that loop over many files.
   Switch to async throughout: fsp.readFile, fsp.access, atomicWriteFile.
   Behavior preserved (same ENOENT-vs-other distinction, same W_OK gate).
   Existing tests pass unchanged (they already use the async typed-cast
   entry point).

Refs: #4333, #4095 Phase 2

* test(core): cover EXDEV-fallback-write-failure annotation path (#4333 review)

The previous fold-in added try/catch around the EXDEV fallback write in
both async and sync variants, routing fallback failures through
`annotateWriteError(err, target, fnName)`. The sync-specific
`fnName='atomicWriteFileSync'` differentiation is ONLY exercised on
that path, so a regression that dropped or misapplied the annotation
on sync would otherwise go undetected.

Two new tests inject both a failing rename (EXDEV) and a failing
writeFile (ENOSPC) via the `_testFs` seam, then assert (a) the
original `code === 'ENOSPC'` propagates intact, and (b) the message
matches `/atomicWriteFile(Sync)?\(<target>\):.*ENOSPC/` — verifying
target-path prefix AND correct fn-name differentiation.

Refs: #4333, #4095 Phase 2

* fix(core): annotate guard, drop debug fsync, add noFollow for creds (#4333 review)

Four review fold-ins from wenshao (via qwen-latest /review):

1. annotateWriteError guard `!error.message.includes(targetPath)` was
   always false in production. tmpPath is `${targetPath}.<hex>.tmp`, so
   any real fs error like `ENOSPC ... '/path/creds.json.a1b2c3.tmp'`
   *contains* targetPath as substring → guard returned false → the
   annotation prefix was silently skipped on every real failure path.
   Tests passed only because mock errors used bare `new Error('ENOSPC')`
   messages without paths. Change guard to idempotency check on our
   own prefix (`startsWith(`${fnName}(`)`), and update the two existing
   EXDEV-fallback tests to use realistic path-embedding fs messages.

2. debugLogger.appendFile dropped `flush: true`. 1050+ call sites,
   default-enabled, fire-and-forget — per-line fsync creates sustained
   I/O pressure / SSD wear with no user benefit (debug logs are
   best-effort, the module already tracks `hasWriteFailure` for the
   degraded-mode UI). Kept the kernel page-cache flush; revert
   debugLogger.test.ts and telemetry/sdk.test.ts assertions back to
   plain `'utf8'`. The #3681 closure target is jsonl-utils writeLine,
   not debug logs.

3. Symlink security regression: the old
   `fs.writeFile(tmp) + fs.rename(tmp, filePath)` pattern atomically
   *replaced* a pre-placed symlink at `filePath`. atomicWriteFile's
   default `resolveSymlinkChain(filePath)` follows the link and writes
   through, redirecting tokens to wherever the link points (real
   concern on shared hosts with weaker-than-expected dir perms). Add
   `noFollow?: boolean` option that skips chain resolution; apply
   `noFollow: true` to all 4 credential write sites
   (oauth-token-storage [2 sites], file-token-storage, sharedTokenManager)
   to match the pre-migration replace-symlink semantics.

4a. Test seam `_testFs` was a 4th positional arg → considered moving
    into options. Punted: positional with underscore + JSDoc is
    materially the same surface as options field with underscore,
    and the only realistic collision (future production option as
    5th arg) is bounded by review.
4b. Sync/async code duplication (~110 lines mirror) → DECLINED.
    Refactoring to a sync/async-polymorphic helper introduces a new
    abstraction layer with worse type ergonomics; the duplication is
    mechanical and lined up for easy diffing. Tracked as Phase 2.5
    candidate if divergence actually accumulates.

Refs: #4333, #4095 Phase 2

* fix(core): EXDEV fallback honors noFollow + fix test correctness (#4333 review)

Three review fold-ins from wenshao (via qwen-latest /review):

1. **[CRITICAL]** EXDEV fallback path silently bypassed `noFollow`
   protection. The fallback `writeFileImpl(targetPath, ...)` is
   `fs.writeFile` / `fs.writeFileSync`, both of which follow symlinks.
   When a credential write site set `noFollow: true` and rename threw
   EXDEV (realistic on Docker OverlayFS / union mounts), the fallback
   would write credentials *through* a pre-placed symlink to an
   attacker-controlled target — the exact attack noFollow was meant
   to prevent. Fix: when `noFollow` is set, the EXDEV fallback now
   `unlink`s any existing entry, then opens with `O_WRONLY|O_CREAT|O_EXCL`
   to refuse to write through a symlink that races back. Applied to
   both async and sync variants.

2. **Test correctness**: the two EXDEV-fallback-write-failure tests
   added in the previous round had a bug — the `failingWrite` mock
   threw on every call, including the first call which is the
   tmp-file write. The tmp write failed → outer catch caught
   ENOSPC → EXDEV check returned false (code is ENOSPC) → fell
   through to the outer annotateWriteError. The inner
   EXDEV-fallback annotation was never exercised. The assertions
   passed only because annotateWriteError produces the same format
   in both catch blocks. Fix: selective-failure mock that succeeds
   on the first call (tmp write) and fails on the second (fallback
   write), genuinely reaching the EXDEV branch.

3. **Behavioral noFollow tests**: previously `noFollow: true` was
   only verified at the "option is passed to mock" level. Added 4
   real-fs tests (async + sync × happy-path + EXDEV-fallback) that
   pre-place a symlink, call atomicWriteFile with noFollow, and
   assert: (a) the symlink is replaced by a regular file, (b) the
   new file holds the new data, (c) the real file behind the
   symlink is untouched. A regression flipping the noFollow ternary
   or skipping the noFollow-aware EXDEV fallback now fails directly.

Refs: #4333, #4095 Phase 2

* fix(core): close TOCTOU window on noFollow EXDEV fallback + cover ENOENT path

Two PR #4333 round-7 review items folded in:

1. TOCTOU race: the path-based `tryChmod(targetPath)` after the EXDEV
   noFollow branch ran AFTER `fd.close()` released the inode reference.
   Between close and chmod, an attacker with parent-directory write access
   could replace the regular file with a symlink, redirecting the
   `chmod 0o600` onto an attacker-chosen target — silently defeating the
   `noFollow` protection that the `unlink + O_EXCL` pattern was added to
   provide. Fix: switch to `fd.chmod`/`fchmodSync` on the open fd before
   close (operates on the inode, immune to symlink swap), and skip the
   path-based chmod for the noFollow branch (path-based chmod remains
   for the non-noFollow direct-write branch, where following symlinks
   was already in scope).

2. Missing test coverage: all 4 existing noFollow EXDEV tests pre-place
   a symlink at the target, so `unlink(targetPath)` always succeeded —
   the ENOENT-swallow branch (first-write scenarios, e.g. initial
   credential provisioning on a cross-device mount) had no coverage.
   Added 2 tests (async + sync) verifying the fallback creates a new
   file with the requested mode when the target never existed.

Test results: 54/54 atomicFileWrite tests pass (was 52). Async + sync
parity preserved.

Refs: #4333, #4095 Phase 2

* test(core): skip new noFollow EXDEV-fallback tests on Windows (NTFS perm bits)

The two new-file noFollow EXDEV tests added in 7c47664 assert the file
ends up at 0o600. NTFS reports 0o666 for any file that isn't read-only
regardless of chmod/fchmod, so the assertion fails on Windows runners
(`AssertionError: expected 438 to be 384`). Match the existing
`it.skipIf(process.platform === 'win32')` pattern already used by all
mode-asserting tests in this file. Linux/macOS coverage of the
ENOENT-swallow + new-file path is unchanged.

Refs: #4333, #4095 Phase 2

* fix(core): narrow fchmod catch + verify mechanism on noFollow EXDEV

Two PR #4333 round-8 review items:

1. The `catch {}` around `fd.chmod(desiredMode)` (and `fchmodSync`) was
   intended to tolerate filesystems without POSIX permissions (FAT/exFAT)
   but silently swallowed every error code: EPERM under a hardened
   sandbox, EIO on flaky NFS/CIFS, EROFS if the filesystem is remounted
   read-only mid-write. Because this path operates on the *final
   credential file* — not a tmp file — a silent fchmod failure leaves
   the target at the umask-masked open() mode with no diagnostic trail.
   Narrowed the catch to ENOSYS/ENOTSUP so the FAT/exFAT case still
   tolerates failure but security-relevant errors propagate.

2. The round-7 noFollow-EXDEV-new-file tests asserted the final mode
   (0o600) but didn't verify the *mechanism*. Under typical umask 0o022,
   `open(O_EXCL, 0o600)` already creates the file at 0o600, so a
   regression that swapped `fd.chmod()` back to a path-based
   `tryChmod(targetPath)` (the pre-fix TOCTOU-vulnerable form) would
   leave the mode assertion passing — defeating the round-7 fix
   undetected. Added a `chmod` test seam to `atomicWriteFile` /
   atomicWriteFileSync` (mirroring the existing `rename` / `writeFile`
   seams) and asserted that path-based chmod is never invoked against
   the credential target on this code path.

54/54 atomicFileWrite tests pass.

Refs: #4333, #4095 Phase 2

* fix(core): clean up O_EXCL orphan on noFollow EXDEV fchmod failure + cover both branches

Two PR #4333 round-9 review items, one critical correctness bug
introduced by the round-8 catch narrowing:

1. **[Critical]** When `fd.chmod(desiredMode)` (or `fchmodSync`) on the
   noFollow EXDEV fallback throws a propagating error (EPERM under
   seccomp, EIO on flaky NFS, EROFS after a remount), the file created
   by `open(targetPath, O_CREAT|O_EXCL)` remained on disk after the
   error rethrew. Every subsequent credential write retry then hit
   EEXIST from O_EXCL and failed permanently — turning a transient
   sandbox EPERM into a permanent OAuth-refresh deadlock that requires
   manual file removal. All three credential write sites
   (`sharedTokenManager`, `oauth-token-storage` save+delete,
   `file-token-storage`) hit this code path. Fix: a `writeOk` flag plus
   nested try/catch — fd is closed in the inner finally, then if write/
   sync/fchmod failed, the orphan is unlinked best-effort before the
   error rethrows.

2. The fchmod catch-narrowing (the headline behavior of round 8) had
   zero test coverage on either branch — `_testFs` exposed `rename`,
   `writeFile`, and path-based `chmod`, but no hook for the open-fd
   `fchmod`. A one-line revert from the narrowed catch back to
   `catch {}` would pass every existing test. Added `fchmod` field to
   `_testFs` (async + sync) and four tests:
   - ENOSYS swallowed → write succeeds (FAT/exFAT happy path)
   - EPERM propagates AND `targetPath` is absent (regression-tests #1)
   for both async and sync.

58/58 atomicFileWrite tests pass (was 54). Async + sync parity preserved.

Refs: #4333, #4095 Phase 2

* fix(core): annotate symlink errors + cover ENOTSUP/EACCES + log orphan unlink

Four PR #4333 round-10 review items, all small bounded fixes:

1. Parameterized the FAT/exFAT fchmod-swallow tests over both ENOSYS
   (Linux) and ENOTSUP (macOS). Round-9 only covered ENOSYS — a
   one-token regression dropping `ENOTSUP` from the catch condition
   would have passed every existing test.

2. The orphan-unlink catch block was empty (`/* best effort */`),
   leaving no diagnostic trail when the cleanup itself fails (EROFS,
   immutable flag, sandboxed container). Added a `createDebugLogger`
   import (`'ATOMIC_WRITE'` category) and a debug log so incident
   response can correlate the original write error with a subsequent
   EEXIST loop. Async + sync.

3. The pre-open `unlink(targetPath)` correctly propagates non-ENOENT
   errors (EACCES on parent dir, EROFS), but no test exercised that
   path. Added an `unlink` field to the `_testFs` seam (async + sync,
   matching the existing `rename` / `writeFile` / `chmod` / `fchmod`
   pattern) and two tests verifying EACCES propagates instead of
   getting hidden behind a downstream EEXIST from O_EXCL.

4. `resolveSymlinkChain(filePath)` ran before the function's main
   try-block, so symlink-resolution errors (EACCES on intermediate
   dir, ELOOP from circular chain) bypassed the
   `atomicWriteFile("path"): ...` annotation that every other failure
   path applies — leaving `err.path` referencing an internal
   intermediate directory the caller never asked about. Wrapped with
   `.catch(err => throw annotateWriteError(err, filePath))` (async)
   and the equivalent try/catch for sync. Added a real-fs ELOOP
   regression test for both variants (skipped on Windows).

64/64 atomicFileWrite tests pass (was 58).

Refs: #4333, #4095 Phase 2

* fix(core): widen atomicWriteJSON options + tighten trustedHooks perms + cover backoff/mkdir branches

Five PR #4333 round-11 review items, all small and bounded:

1. **atomicWriteJSON option-type widening**: Was typed as the narrower
   `AtomicWriteOptions` (retries / delayMs only), so credential-grade
   options added in this PR (`mode`, `forceMode`, `noFollow`) and
   pre-existing ones (`flush`, `encoding`) were silently dropped at the
   type level even though the body spread them at runtime. A future
   maintainer calling `atomicWriteJSON(credPath, creds, {noFollow:
   true, mode: 0o600, forceMode: true})` would have typechecked but
   silently lost noFollow + forceMode + mode. Widened to
   `AtomicWriteFileOptions`.

2. **trustedHooks 0o600 + forceMode**: This file lists user-approved
   *executable hook commands* — strictly more sensitive than the
   sibling state files (`trustedFolders.json`, `tipHistory.json`) that
   already use `{mode: 0o600, forceMode: true}`. Was dropping to the
   process umask (0o644 by default), and a backup-restored looser mode
   was never healed. Now matches the sibling pattern.

3. **renameWithRetry exponential-backoff coverage**: Existing tests
   covered retry count and error propagation but not the
   `delayMs * 2 ** attempt` curve itself. A regression to linear,
   constant, or — worst — regressive backoff (which intensifies under
   Windows AV-scan stress) would have passed every existing test.
   Added a test using `vi.useFakeTimers()` that records gaps between
   mock-rename invocations and asserts `[delayMs, 2*delayMs,
   4*delayMs]` for `(retries=3, delayMs=50)`.

4. **jsonl-utils write() parent-dir creation**: The other `write()`
   test targets a path inside the pre-created `tmpRoot`, so the
   `!existsSync(dir) → mkdirSync(dir, {recursive: true})` branch was
   never exercised. Added a one-liner test that targets a deeply
   nested non-existent path.

5. **writeLineSync docstring accuracy**: The docstring claimed "uses
   a simple flag-based locking mechanism (less robust than async
   version)" but there is no flag-based locking — and `writeLine`
   serializes via per-file `Mutex` that this function bypasses. Now
   accurately documents the lack of locking, the bypass, and the
   `flush: true` rationale (closes #3681).

Test results: 65/65 atomicFileWrite tests pass (was 64), 26/26
jsonl-utils tests pass (was 25). Typecheck clean.

Refs: #4333, #4095 Phase 2

* fix(core): force JS slow path for appendFile flush + correct debugLogger format

Two PR #4333 round-12 review items, both real bugs:

1. **flush:true was silently a no-op for string payloads on appendFile**.
   Node's C++ fast path (binding.writeFileUtf8) for string + utf8 +
   appendFile bypasses the JS-side flush/fsync logic entirely —
   empirically string+flush:true takes ~0.05ms/op (identical to no
   flush) while Buffer+flush:true takes ~4.9ms/op (91× slower,
   proving fsync only runs for Buffer payloads). The data still
   reaches the kernel page cache (the syscall is synchronous), so
   `kill -9` is fine, but power-loss durability — the actual #3681
   guarantee — was silently absent.

   Fix: pass `Buffer.from(line, 'utf8')` to both writeLine (async)
   and writeLineSync. This forces the JS slow path that honors
   `flush: true` and actually fsyncs the file. Updated the JSDoc
   on both functions to document the C++ fast-path bypass so a
   future maintainer doesn't revert to the simpler string form.

2. **`debugLogger.debug` doesn't do printf substitution**.
   `debugLogger`'s `formatArgs` (debugLogger.ts:67-77) just joins
   args with spaces — no `util.format()`. The round-10 calls used
   `'orphan unlink failed for %s: %s'` which rendered the literal
   `%s` markers in the log:
       orphan unlink failed for %s: %s /path/to/target Error: EACCES
   instead of:
       orphan unlink failed for /path/to/target: Error: EACCES
   Switched both async and sync sites to template literals, matching
   every other `debugLogger` call site in the codebase.

Refs: #4333, #4095 Phase 2

* fix(core): narrow tryChmod catch + writeFileSync(fd) + cover credential write failures

Three PR #4333 round-13 review items + one comment-wording softening:

1. **`tryChmod`/`tryChmodSync` catch narrowed (wenshao)** — was bare
   `catch {}` swallowing all errors, while the round-8 `fchmod` catch
   was already narrowed to ENOSYS/ENOTSUP. Same security rationale
   applies — and *specifically* on the EXDEV non-noFollow fallback,
   `tryChmod(targetPath)` is the *sole* mode-setting mechanism for an
   existing target (writeFile ignores `mode` when the target exists),
   so a silent EPERM/EROFS would leave credentials at the old mode.
   Non-credential callers don't pass `mode` → `desiredMode === undefined`
   short-circuits, so they're unaffected.

2. **Sync EXDEV `writeSync` → `writeFileSync(fd)` (yiliang114)** —
   `fsSync.writeSync(fd, buf)` returns bytes-actually-written and can
   short-write; the current code ignored the return, so a partial write
   would silently truncate the credential file with the call still
   returning success after fsync+fchmod. Switched to
   `fsSync.writeFileSync(fd, buf)` which loops internally per Node spec.
   The async sibling (`fd.writeFile`) already handles short-writes;
   this brings sync parity.

3. **`file-token-storage` failure-path coverage (wenshao)** — both
   `setCredentials` and `deleteCredentials` propagate `atomicWriteFile`
   rejections (no try/catch around the call), but no test exercised
   that path. Added two tests mirroring the round-1 sharedTokenManager
   precedent: ENOSPC on `setCredentials` and EROFS on `deleteCredentials`
   both rethrow.

4. **Round-12 comment wording softened (wenshao verification report)** —
   strace on Node v22/v24 confirms string + utf8 + flush:true does
   fsync correctly today, counter to my round-12 "silent no-op"
   framing. Buffer is still the safer documented form (forward-compat
   insurance against any future fast-path optimization), but the
   commit's claim that it was *fixing* a confirmed bug overstated what
   reproduces. Reframed both writeLine and writeLineSync comments
   accordingly without changing the code behavior.

Test results: 109/109 affected suites pass (atomicFileWrite 65,
jsonl-utils 26, file-token-storage 18). Broader credential/state
suites also green: 216/216 across sharedTokenManager + oauth-token-storage
+ qwenOAuth2 + logger + trustedFolders. Typecheck clean.

Refs: #4333, #4095 Phase 2

* fix(core): route orphan unlink through seam + cover tryChmod + trustedHooks tests

Three PR #4333 round-14 review items:

1. **Path-level tryChmod / tryChmodSync catch narrowing had zero direct
   coverage**. Round-13 narrowed the bare `catch {}` to ENOSYS/ENOTSUP
   only (matching the round-8 fd-level fchmod narrowing), but every
   existing EXDEV test passed `options: undefined`, so `desiredMode ===
   undefined` short-circuited before any chmod was attempted. A
   regression that inverted the catch condition (missing `!` prefix)
   would silently swallow EPERM on every sync credential write
   (trustedHooks, tipHistory, trustedFolders, etc.) with zero
   diagnostic signal. Added 6 tests via the existing `_testFs.chmod`
   seam: parameterized ENOSYS/ENOTSUP swallow + EPERM propagation, for
   both async (`atomicWriteFile`) and sync (`atomicWriteFileSync`).

2. **Orphan-cleanup unlink on the noFollow EXDEV failure path was
   using raw `fs.unlink` / `fsSync.unlinkSync` instead of the
   injected `unlinkImpl` seam**. The pre-open unlink correctly used
   the seam, but the round-9 orphan cleanup added later bypassed it,
   making it the only fs operation in `atomicWriteFile` not flowing
   through the test seam. Routed both async and sync orphan cleanup
   through `unlinkImpl`, and added 2 tests that inject a spy and
   assert orphan cleanup is invoked against targetPath after a
   simulated fchmod EPERM.

3. **`trustedHooks.ts` had no test coverage**. Round-11 migrated it to
   `atomicWriteFileSync` with `{ mode: 0o600, forceMode: true }` —
   strictly the most security-sensitive write in the PR since the
   file stores user-approved executable hook commands — but unlike
   the sibling files (trustedFolders, tipHistory) it had no test
   file. A regression that dropped `forceMode: true` or weakened
   the mode would have passed all existing tests. Created
   `trustedHooks.test.ts` covering: write goes through
   atomicWriteFileSync with `{ mode: 0o600, forceMode: true }`,
   write targets the global qwen dir path, the persisted content
   matches the hook key derived from the hook config, and round-trip
   trust/untrust behavior.

Test results: 73/73 atomicFileWrite tests (was 65) + 5/5
trustedHooks (new). Typecheck clean.

Refs: #4333, #4095 Phase 2

* fix(core): add noFollow to trustedHooks/trustedFolders + route tmp cleanup through seam

Two PR #4333 round-15 review items:

1. **`trustedHooks.ts` and `trustedFolders.ts` were missing `noFollow:
   true`**. The credential write sites
   (`sharedTokenManager`/`oauth-token-storage`/`file-token-storage`)
   all pass `{ mode: 0o600, forceMode: true, noFollow: true }` to
   prevent pre-placed symlink attacks. The trustedHooks comment
   already called it "strictly more sensitive than trustedFolders /
   tipHistory" — yet credential paths got symlink protection it
   didn't. A pre-placed symlink at `~/.qwen/trusted_hooks.json` (or
   `trustedFolders.json`) could redirect the atomic write to an
   attacker-controlled path, either leaking the executable-trust
   list / trusted-folder list, or leaving the user's real config
   silently stale. Added `noFollow: true` to both write sites and
   updated the assertions in `trustedHooks.test.ts` and
   `trustedFolders.test.ts`.

2. **Tmp-file cleanup at L240 (async) and L568 (sync) used raw
   `fs.unlink` / `fsSync.unlinkSync` instead of the injected
   `unlinkImpl` seam**. Pre-open unlink and orphan cleanup correctly
   routed through `unlinkImpl`, making the tmp-cleanup branch the
   only outlier — `_testFs.unlink`-injecting tests couldn't intercept
   this path, weakening the seam abstraction. Behavioral impact is
   nil (cleanup is best-effort, errors swallowed), but consistency
   matters for future test authors. Routed both async and sync
   variants through `unlinkImpl`.

Test results: 99/99 affected suites pass (atomicFileWrite 73,
trustedHooks 5, trustedFolders 21). Typecheck clean on core.

Refs: #4333, #4095 Phase 2

* fix(core): use path.join in trustedHooks test to fix Windows CI

PR #4333 round-15 review item (Critical, from claude-opus-4-8 /qreview):

trustedHooks.test.ts:68 hardcoded a POSIX path
('/mock/home/.qwen/trusted_hooks.json'), but production builds the path
with path.join(Storage.getGlobalQwenDir(), 'trusted_hooks.json')
(trustedHooks.ts:29-31). On Windows path.join emits '\' separators, so
the mocked atomicWriteFileSync receives
'\mock\home\.qwen\trusted_hooks.json' and the toBe assertion fails — the
cause of the red Test (windows-latest, Node 22.x) check. macOS/Linux
runs are green because forward slashes match.

Build the expected path with path.join so it matches the platform
separator, and add the node:path import.

Refs: #4333, #4095 Phase 2

* fix(core,cli): round-16 review — tipHistory noFollow + atomicWrite test coverage + LSP concurrency note

Four PR #4333 follow-up review items from wenshao:

1. tipHistory.ts was missing noFollow: true while the sibling
   0o600+forceMode config sites (trustedFolders.ts:192,
   trustedHooks.ts:63) both set it. The cleanup commit added forceMode
   "for consistency with other 0o600 sites" but stopped short of
   noFollow. All three store user-trusted paths and should share the
   same pre-placed-symlink protection. Added noFollow: true.

2. annotateWriteError idempotency guard had no direct coverage. The
   guard switched from includes(targetPath) to startsWith(fnName+"(")
   because real syscall errors embed the *tmp* path (which contains the
   target as a substring), so the old guard silently skipped annotation
   on every real failure. Added a test where the rename error message
   embeds a tmp-style path containing the target; the correct startsWith
   guard still annotates it, and reverting to includes would fail it.

3. non-EXDEV rename failure tests only asserted /EIO/, not the
   atomicWriteFile(...): / atomicWriteFileSync(...): annotation prefix
   that the production re-throw applies on that path. Tightened both
   async and sync assertions to match the prefix.

4. applyTextEdits concurrency constraint was undocumented. The async
   read-modify-write has await points between read and write;
   atomicWriteFile prevents torn files but does not serialize writers,
   so concurrent same-path edits can lose updates. Latent today (no
   production caller / no workspace/applyEdit handler), so documented
   the per-file-serialization requirement instead of adding a lock.

Not taking: extending the _testFs seam with open/openSync to fault-inject
O_EXCL EEXIST — the noFollow/O_EXCL security behavior is already covered
by the real-fs behavioral tests (noFollow EXDEV fallback refuses to
follow symlinks); a seam-injected EEXIST would only exercise error
propagation, not the guarantee.

Test results: atomicFileWrite 74/74 pass, eslint clean.

Refs: #4333, #4095 Phase 2

* test(core): route atomicWriteFile open through _testFs seam + assert O_EXCL (PR #4333 review)

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions.

2 participants