fix(core): recover from `}{` glued records on session JSONL load (#3606) by qqqys · Pull Request #3656 · QwenLM/qwen-code

qqqys · 2026-04-27T03:51:13Z

Summary

What changed: jsonl.read() and jsonl.readLines() now tolerate per-line parse errors and recover concatenated records (}{ glued onto a single physical line) via a brace-depth scanner that respects string boundaries and \ escapes. New helper _recoverObjectsFromLine() does the splitting; parseLineTolerant() routes per-line failures through it.
Why it changed: Fixes No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606. A single corrupted line (two records glued without a newline separator — caused by an interrupted append losing its trailing \n) made JSON.parse throw, so read() returned [] and loadSession() returned undefined ("No saved session found"). Worse, simply skipping the bad line orphans every subsequent record via parentUuid chain breakage in reconstructHistory, so recovery — not just tolerance — is required to actually restore the session.
Reviewer focus:
1. _recoverObjectsFromLine correctness — string/escape handling, depth < 0 reset, and the documented limitation (top-level objects only)
2. parseLineTolerant 0-vs-N branch in debugLogger.warn (no false "Recovered 0" noise)
3. No write-side change — read-side recovery is sufficient and avoids accumulating empty separator lines on every resume

Validation

Commands run:

# Full unit suite for the touched code paths
npx vitest run \
  packages/core/src/utils/jsonl-utils.test.ts \
  packages/core/src/services/chatRecordingService.test.ts \
  packages/core/src/services/sessionService.test.ts \
  packages/core/src/services/sessionService.rename.test.ts \
  packages/core/src/services/chatRecordingService.customTitle.test.ts \
  packages/core/src/services/chatRecordingService.autoTitle.test.ts

# Type + lint
npx tsc --noEmit -p packages/core
npx eslint packages/core/src/utils/jsonl-utils.ts \
            packages/core/src/utils/jsonl-utils.test.ts

# End-to-end against the JSONL attached to #3606
npx tsx <<'EOF'
const { SessionService } = await import(
  './packages/core/src/services/sessionService.ts'
);
const svc = new SessionService('/path/to/project');
const loaded = await svc.loadSession('5b0c6411-5f63-4457-a33f-f4d3af300dd5');
console.log('messages:', loaded?.conversation.messages.length);
EOF

Inputs used: the JSONL file megapro17 attached to No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606 (5b0c6411-...jsonl), 480 lines, 1 malformed line at line 63 of the form {record_A}}}{record_B} (an api_response ui_telemetry record glued to a user message record across a 5-minute session-restart gap).
Expected result:
- Unit suite green (existing + 13 new cases)
- loadSession returns 481 records (480 lines, line 63 yields 2)
- All UUIDs unique
- The line-64 record's parentUuid = ce2f0622... resolves (recovered from the second half of line 63), so reconstructHistory walks the full chain with no orphans

Observed result:

Test Files  6 passed (6)
     Tests  98 passed (98)

--- end-to-end loadSession against #3606 attachment ---
Sessions visible to picker: ... (target listed; messageCount: 144)
loadSession result:
  messages reconstructed: 481
  lastCompletedUuid: 4702d357-1eca-42ab-bb67-21d7683f5e2d
  Glued pair recovered: true true (3ff76a26 + ce2f0622)
  Post-corruption record present: true (63eefedc — line 64)

Quickest reviewer verification path:
1. Download the JSONL from megapro17's comment on No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606
2. Place under any <runtime>/projects/<sanitized-cwd>/chats/<id>.jsonl (rewrite records[0].cwd to your project root so getProjectHash matches)
3. npm start -- --resume 5b0c6411-5f63-4457-a33f-f4d3af300dd5
4. Before this PR: "No saved session found". After: full session loads, 481 messages reconstructed; stderr/debug log shows Recovered 2 record(s) from malformed line in ...

Evidence — before/after on the real attachment:

Metric	Before fix	After fix
`read()` returns	`[]`	481 records
`loadSession()` returns	`undefined`	full conversation
`reconstructHistory()` linear messages	0	481
Glued line halves recovered	0 / 2	2 / 2
Orphaned records via parentUuid break	417	0

Scope / Risk

Main risk / tradeoff:
- Scanner only recognizes top-level {...} records, not [...] arrays (documented in the helper's JSDoc). All existing JSONL writers in this codebase emit object records, so this matches the only corruption shape we observe in the wild — but a future writer that emits arrays would not benefit from the recovery path.
- On a parse failure, the scanner attempts recovery by walking the line; on a deliberately-crafted huge single line this is O(n) but unbounded by the upstream readline buffer. In practice session JSONL is small (megapro17's worst case: 162 KB single record, < 5 ms scan).
Not covered / not validated:
- No fix on the write side. The trailing-newline loss that triggered No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606 is OS/process-level (Windows console close, abnormal exit, antivirus interception of un-fsync'd page cache) — out of scope to repair preemptively. Read-side recovery is enough to keep loadSession correct even if the same corruption recurs, and avoids accumulating extra empty lines on every resume.
- Not exercised on Windows or Linux locally — see matrix below.
Breaking changes / migration notes: None. read() / readLines() signatures and return types are unchanged; behavior change is purely additive (was throwing → now recovers + logs).

Testing Matrix

	🍏	🪟	🐧
npm run	✅	⚠️	⚠️
npx	✅	⚠️	⚠️
Docker	N/A	N/A	N/A
Podman	N/A	N/A	N/A
Seatbelt	N/A	N/A	N/A

Testing matrix notes:

macOS (🍏): unit suite + end-to-end loadSession against the No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606 attachment, both via npm run and direct npx tsx. ✅
Windows (🪟) / Linux (🐧): not exercised locally. The change is pure JS string scanning + a try/catch around JSON.parse; no platform-specific syscalls or path handling. The original write-side \n-loss that produced the bug is reportedly Windows-specific, but the read-side recovery is platform-agnostic.
Docker / Podman / Seatbelt: not applicable — change is in packages/core JSONL utilities, no sandbox or container surface.

Linked Issues / Bugs

Closes #3606

…nLM#3606)

wenshao

No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review

…nLM#3606) (QwenLM#3656)

…NL (closes #3681, #4095 Phase 2) (#4333) * feat(core): add atomicWriteFileSync + forceMode option Sync mirror of atomicWriteFile for code paths that can't await (settings persistence on exit, sync config writers). Same semantics: symlink chain resolution, permission preservation, fsync via flush:true, EPERM/EACCES rename retry, EXDEV fallback to direct write. Add forceMode option on AtomicWriteFileOptions — when true, ignore the existing target's permission bits and apply options.mode regardless. Needed for credential files that must heal historically over-permissive files (e.g. a 0o644 token restored from backup must be forced to 0o600). Honored by both async and sync paths. Default false preserves existing behavior. Reuses Atomics.wait for true blocking sleep in renameWithRetrySync — no busy-wait, no extra dep. Refs: #4095 Phase 2 * refactor(core): migrate credential writes to atomicWriteFile (#4095 Tier 1) Route all OAuth credential persistence through atomicWriteFile with forceMode: true, so a process crash mid-write cannot leave the user with a half-written token file, and historically over-permissive files (e.g. 0o644 from a manual restore) are healed to 0o600 on the next write. - oauth-token-storage.ts: setCredentials, deleteCredentials - file-token-storage.ts: saveTokens (encrypted MCP token storage) - qwenOAuth2.ts: cacheQwenCredentials (also fixes missing mode — was inheriting 0o644 from umask, now forced to 0o600) - sharedTokenManager.ts: saveCredentialsToFile — drops ~15 lines of hand-rolled tmp + rename in favor of the shared helper Lock-file writes using flag: 'wx' (sharedTokenManager.ts:720) are intentionally left untouched — they rely on exclusive-create semantics that atomic write does not preserve. Tests updated to mock atomicWriteFile instead of fs.writeFile. Refs: #4095 Phase 2 * refactor(core): migrate memory state writes to atomicWriteFile (#4095 Tier 2) Route all auto-memory state persistence through atomicWriteFile so a process crash during a dream/extract/forget cycle cannot corrupt the metadata sidecar, extraction cursor, or topic body files. Touched: manager (writeDreamMetadata), extract (writeExtractCursor + bumpMetadata), indexer (rebuild), dream (bumpDreamMetadata), forget (bumpMetadata + topic body rewrite). manager.ts:362 acquireDreamLock uses flag: 'wx' for exclusive create — left untouched, atomic write does not preserve that semantic. Uses atomicWriteFile (not atomicWriteJSON) to preserve the trailing newline these files have always had. Refs: #4095 Phase 2 * refactor: migrate config + logger + state writes to atomic helpers (#4095 Tier 3a) Route the remaining state-file write paths through atomic helpers so a crash mid-write cannot corrupt config, log, or session-scoped state: - trustedFolders.ts (sync): atomicWriteFileSync — sole path that flips workspace trust, must not half-write - logger.ts (4 sites): atomicWriteFile — full-file JSON rewrites for logs.json and per-checkpoint files - tipHistory, installationManager, projectSummary, todoWrite, trustedHooks: bonus sites with the same shape (state JSON written multiple times per session) todoWrite is on the hot path — writes every time the todo list mutates — so the added rename + fsync cost is measurable (a few ms per write on SSD). Trade-off accepted to avoid a half-written todos file silently breaking the next session's resume. Export atomicWriteFile / atomicWriteFileSync from the core public index so CLI-side callers (trustedFolders, tipHistory) can reach them. Tests updated: - logger.test.ts uses vi.importActual to re-export the real helper and override per-test via vi.mocked(atomicWriteFile).mockRejectedValueOnce - trustedFolders.test.ts and todoWrite.test.ts mock the helper directly Refs: #4095 Phase 2 * fix(core): flush JSONL appends to disk (#4095 Tier 3b, closes #3681) #3656 fixed the read side of glued '}{' JSONL records — when a process was killed mid-appendFile, the trailing '\n' was lost and the next record was concatenated. The write side was left for a follow-up (#3681). This adds flush:true (fsync) to every per-line append: - jsonl-utils.ts writeLine / writeLineSync (session transcripts, auto-titles, prompt history) - debugLogger.ts appendFile (per-session debug log) jsonl-utils.ts write() (full-file replace) now goes through atomicWriteFileSync so a crash during overwrite cannot corrupt the session transcript either. Trade-off: fsync on every append adds disk-sync latency (single-digit ms on SSD, more on spinning disk / network FS). Acceptable for a few writes per turn; the alternative is silently losing the last record of every interrupted session, which #3681 explicitly flagged. Refs: #4095 Phase 2 Tier 3b Closes: #3681 * refactor(core): migrate extension config + LSP edit to atomic write Catch up two sites where Claude Code's equivalent path is atomic but qwen-code's isn't (verified against /Users/jinye.djy/Projects/claude-code on 2026-05-19): - extension/extensionManager.ts:533, :1073 — enablement config and install metadata writes. Claude Code's plugin install-counts and zip cache use atomic temp+rename via writeFileSyncAndFlush_DEPRECATED. - lsp/NativeLspService.ts:1351 — applying an LSP edit to a user file. Claude Code's FileWriteTool/FileEditTool both route through atomic writeTextContent → writeFileSyncAndFlush_DEPRECATED. A bare writeFileSync here could half-write the user's source file if the process is killed during an LSP-driven rename or quick-fix. Also clean up stale fs.rename mock setups in sharedTokenManager.test.ts that became no-ops after Tier 1 migration (rename is no longer called by saveCredentialsToFile). The fs.writeFile mocks stay because the wx-flag lock path still uses them. Refs: #4095 Phase 2 * chore: cosmetic cleanups from PR review - packages/core/src/index.ts: move atomicFileWrite export to its alphabetical position (before browser.js) - tipHistory.ts: add forceMode: true to atomicWriteFileSync for consistency with other 0o600 sites — heals legacy 0o644 files even though tips are non-critical Refs: #4095 Phase 2 * fix(core): address Codex review findings on Phase 2 PR Three issues caught by post-merge Codex review of the #4095 Phase 2 branch — none had user-visible symptoms yet but all were latent bugs. 1. atomicFileWrite: forceMode without mode silently downgraded perms `if (!options?.forceMode)` skipped the existing-mode stat whenever forceMode was true, regardless of whether `mode` was also supplied. Calling `atomicWriteFile(p, data, { forceMode: true })` (no mode) on an existing 0o600 file produced 0o644 (umask default) instead of preserving 0o600. Tightened the guard to also require `options.mode` to be defined; mirrored fix in atomicWriteFileSync. Added two regression tests (async + sync) that assert mode preservation. 2. logger.test.ts: vi.resetAllMocks() blanked the atomicWriteFile shim The vi.fn(actual.atomicWriteFile) factory implementation gets reset to a no-op by `vi.resetAllMocks()` in beforeEach, which would make `logger.initialize()` silently skip creating logs.json on disk. Tests passed by coincidence (file pre-existence from prior runs). Captured the real implementation at module load and re-attach it via `mockImplementation` after each reset. 3. NativeLspService.applyTextEdits: atomic write bypassed file unwritability The read catch swallowed every error and treated it as "new empty file". With atomic write (tmp + rename), an unreadable target on a writable parent could be replaced with edits applied to an empty buffer — the old fs.writeFileSync would have errored on the target permission. Now only ENOENT is treated as new-file; other read errors (EACCES, EISDIR, etc.) propagate. Refs: #4095 Phase 2 * fix(lsp): refuse LSP edits to chmod 0444 files (Codex round 2) The previous fix only handled "read failed → propagate the error". Codex round 2 caught the remaining gap: a file that's readable but chmod 0444 (read-only) would still be replaced by the atomic rename, because rename only needs parent-directory write access. Add an explicit fs.accessSync(W_OK) check before the atomic write. ENOENT is allowed through so LSP can still create new files via edits. Refs: #4095 Phase 2 * fix(core): drop withTimeout around atomic credential write (Codex round 3) `saveCredentialsToFile` wrapped `atomicWriteFile` in `withTimeout(5000)`. If the call hits the 5s budget (e.g. slow NFS home, network-backed storage, fsync added by Phase 2), withTimeout rejects but the atomicWriteFile internal write+rename keeps running unobserved: 1. withTimeout rejects → saveCredentialsToFile throws 2. performTokenRefresh `finally` releases the refresh lock 3. Another process acquires the lock and writes newer credentials 4. The original atomicWriteFile finally completes its rename and overwrites the newer credentials — silent token rollback Pre-migration the code awaited the tmp write and the rename in two separate withTimeout calls; a timed-out tmp write never reached the rename so there was no race against the target file. The migration collapsed both into one inseparable atomicWriteFile, which made the timeout actively unsafe (the work cannot be cancelled after the timeout fires — fs.rename is not abortable). Atomic write is durable by design — accept the I/O latency. The mkdir and stat timeouts are kept (idempotent and read-only respectively, no corruption risk on late completion). Refs: #4095 Phase 2 * test(core): add rename-retry + EXDEV-fallback coverage (#4333 review) Address PR review suggestions from wenshao (via qwen-latest /review): neither renameWithRetry/Sync nor the EXDEV cross-device fallback had direct test coverage. Both paths are critical (Windows AV contention, Docker tmpfs /tmp) and a regression would degrade silently. Vitest can't spy on ESM exports of `node:fs` (`Cannot redefine property: renameSync`), so add narrow internal test seams instead: - renameWithRetry / renameWithRetrySync take an optional `_renameImpl` parameter, defaulting to fs.rename / fs.renameSync. - atomicWriteFile / atomicWriteFileSync take an optional `_testFs` parameter with `rename` and `writeFile` overrides, forwarded to the retry helper and used in the EXDEV fallback branch. The seams are underscore-prefixed and JSDoc-tagged as "Internal test seam — production callers never pass this", which keeps the public API clean while making the behavior testable. New coverage (+9 tests, 36 → 45): - renameWithRetry: retry-EPERM-then-succeed, give-up after retries, no-retry on non-retryable (ENOSPC) - renameWithRetrySync: same 3 patterns (EACCES, EPERM exhausted, EINVAL) - EXDEV fallback: async direct write + tmp cleanup, sync ditto, non-EXDEV failure propagates without fallback (rejects EIO + tmp cleanup) Refs: #4333, #4095 Phase 2 * fix(test): update telemetry sdk.test.ts appendFile assertions for flush:true CI failure on all 3 OSes (macos / ubuntu / windows): sdk.test.ts asserted `fs.appendFile` was called with `'utf8'` as the 3rd positional argument, but commit b7badc7 (#4095 Tier 3b — JSONL fsync) changed the `debugLogger.ts` appendFile call from string-form to options-form `{ encoding: 'utf8', flush: true }` to enable per-line fsync. Update the 3 assertions in the telemetry diagnostics test to match the new shape. No behavior change — debugLogger still flushes per append; only the assertion in this previously-unrelated suite needed updating. Refs: #4333, #4095 Phase 2 Tier 3b * test: cover LSP error branches + sync EXDEV cleanup + JSONL writes (#4333 review) Address three review suggestions from wenshao (via qwen-latest /review), each pointing at a real coverage gap introduced by this PR: 1. NativeLspService.applyTextEdits error branches (round-2 LSP fix): the ENOENT-only read guard and the fs.accessSync(W_OK) refusal had no automated coverage. Added 3 tests accessing applyTextEdits via a typed cast (the method is private; making it public for one verification inflates API surface). Tests use chmod 0000 / chmod 0444 reproducers and assert (a) read failure propagates EACCES without silently overwriting with empty content, (b) W_OK rejects with EACCES/EPERM before the atomic rename touches the target, (c) nonexistent files are still accepted so LSP can create via edits. 2. atomicWriteFileSync non-EXDEV rename failure cleanup: the async counterpart had an explicit EIO-rename test asserting tmp cleanup; the sync variant did not. Added the mirror — injects a sync rename throwing EIO via the existing _testFs seam and asserts `readdirSync(tmpDir).length === 0`. 3. jsonl-utils writeLine / writeLineSync / write smoke tests: the three write paths are the core fix for #3681 (the PR's headline goal) but downstream callers (chatRecordingService, sessionService) mock them entirely. Without direct unit tests, a regression that dropped `flush: true` or reverted `write()` to bare writeFileSync would go undetected. Added 3 real-fs roundtrip tests. Test count delta: - NativeLspService.test.ts: 15 → 18 - atomicFileWrite.test.ts: 45 → 46 - jsonl-utils.test.ts: 22 → 25 Refs: #4333, #4095 Phase 2 * fix(core,test): wrap atomic write errors + guard root user + cover save failure (#4333 review) Three review fold-ins from wenshao (via qwen-latest /review): 1. atomicFileWrite: error messages reference the random `.tmp.<hex>` path, not the logical target — many callers (memory subsystem, extension manager) don't wrap the error, making debug logs unhelpful. Add `annotateWriteError(error, targetPath)` that mutates the error message in-place to prefix `atomicWriteFile("<targetPath>"): ` while preserving `code` / `errno` / `syscall` / `stack` / the prototype chain so downstream `err.code === 'ENOENT'` checks and `instanceof` narrowing keep working. Applied to both async and sync variants; only the final propagated throw (not the EXDEV fallback path) is annotated. 2. NativeLspService.test.ts: the chmod 0444 and chmod 0000 tests rely on `accessSync(W_OK)` and `readFileSync` failing — but on POSIX with UID 0 (root, including most Docker CI runners), permission bits are bypassed and `accessSync` always succeeds. The tests would silently pass even with the W_OK guard removed entirely. Add `process.getuid?.() === 0` to the skip guard on both tests. 3. sharedTokenManager.test.ts: the catch block in saveCredentialsToFile that maps disk-full / permission-denied to `TokenManagerError(FILE_ACCESS_ERROR)` was never exercised — every prior test mocked atomicWriteFile as always-successful. Added a regression test that rejects atomicWriteFile with ENOSPC and asserts the wrapped TokenManagerError surfaces with the right type and carries the original message. Refs: #4333, #4095 Phase 2 * fix(core,lsp): annotate sync errors correctly + cover EXDEV fallback + async LSP IO (#4333 review) Three review fold-ins from wenshao (via qwen-latest /review), all real correctness/consistency issues: 1. annotateWriteError hardcoded "atomicWriteFile" prefix regardless of caller. Sync write failures produced misleading `atomicWriteFile("/path"):` prefixes in incident logs. Add optional `fnName` parameter (defaults to async name) and have sync call sites pass "atomicWriteFileSync". 2. EXDEV fallback path in both async and sync variants did NOT route the inner writeFileImpl/tryChmod failures through annotateWriteError. On a cross-device write that subsequently hit ENOSPC, the propagated error had a bare syscall message without the target-path prefix — breaking the function's documented error-shape contract. Wrap both fallback branches in try/catch + annotate. 3. NativeLspService.applyTextEdits is declared `async` and all callers `await` it, but the round-2 fix mixed in sync IO: readFileSync, accessSync, atomicWriteFileSync. The sync helper's renameWithRetrySync blocks the event loop up to ~350ms under Atomics.wait EPERM backoff — particularly bad for LSP workspace edits that loop over many files. Switch to async throughout: fsp.readFile, fsp.access, atomicWriteFile. Behavior preserved (same ENOENT-vs-other distinction, same W_OK gate). Existing tests pass unchanged (they already use the async typed-cast entry point). Refs: #4333, #4095 Phase 2 * test(core): cover EXDEV-fallback-write-failure annotation path (#4333 review) The previous fold-in added try/catch around the EXDEV fallback write in both async and sync variants, routing fallback failures through `annotateWriteError(err, target, fnName)`. The sync-specific `fnName='atomicWriteFileSync'` differentiation is ONLY exercised on that path, so a regression that dropped or misapplied the annotation on sync would otherwise go undetected. Two new tests inject both a failing rename (EXDEV) and a failing writeFile (ENOSPC) via the `_testFs` seam, then assert (a) the original `code === 'ENOSPC'` propagates intact, and (b) the message matches `/atomicWriteFile(Sync)?$<target>$:.*ENOSPC/` — verifying target-path prefix AND correct fn-name differentiation. Refs: #4333, #4095 Phase 2 * fix(core): annotate guard, drop debug fsync, add noFollow for creds (#4333 review) Four review fold-ins from wenshao (via qwen-latest /review): 1. annotateWriteError guard `!error.message.includes(targetPath)` was always false in production. tmpPath is `${targetPath}.<hex>.tmp`, so any real fs error like `ENOSPC ... '/path/creds.json.a1b2c3.tmp'` *contains* targetPath as substring → guard returned false → the annotation prefix was silently skipped on every real failure path. Tests passed only because mock errors used bare `new Error('ENOSPC')` messages without paths. Change guard to idempotency check on our own prefix (`startsWith(`${fnName}(`)`), and update the two existing EXDEV-fallback tests to use realistic path-embedding fs messages. 2. debugLogger.appendFile dropped `flush: true`. 1050+ call sites, default-enabled, fire-and-forget — per-line fsync creates sustained I/O pressure / SSD wear with no user benefit (debug logs are best-effort, the module already tracks `hasWriteFailure` for the degraded-mode UI). Kept the kernel page-cache flush; revert debugLogger.test.ts and telemetry/sdk.test.ts assertions back to plain `'utf8'`. The #3681 closure target is jsonl-utils writeLine, not debug logs. 3. Symlink security regression: the old `fs.writeFile(tmp) + fs.rename(tmp, filePath)` pattern atomically *replaced* a pre-placed symlink at `filePath`. atomicWriteFile's default `resolveSymlinkChain(filePath)` follows the link and writes through, redirecting tokens to wherever the link points (real concern on shared hosts with weaker-than-expected dir perms). Add `noFollow?: boolean` option that skips chain resolution; apply `noFollow: true` to all 4 credential write sites (oauth-token-storage [2 sites], file-token-storage, sharedTokenManager) to match the pre-migration replace-symlink semantics. 4a. Test seam `_testFs` was a 4th positional arg → considered moving into options. Punted: positional with underscore + JSDoc is materially the same surface as options field with underscore, and the only realistic collision (future production option as 5th arg) is bounded by review. 4b. Sync/async code duplication (~110 lines mirror) → DECLINED. Refactoring to a sync/async-polymorphic helper introduces a new abstraction layer with worse type ergonomics; the duplication is mechanical and lined up for easy diffing. Tracked as Phase 2.5 candidate if divergence actually accumulates. Refs: #4333, #4095 Phase 2 * fix(core): EXDEV fallback honors noFollow + fix test correctness (#4333 review) Three review fold-ins from wenshao (via qwen-latest /review): 1. **[CRITICAL]** EXDEV fallback path silently bypassed `noFollow` protection. The fallback `writeFileImpl(targetPath, ...)` is `fs.writeFile` / `fs.writeFileSync`, both of which follow symlinks. When a credential write site set `noFollow: true` and rename threw EXDEV (realistic on Docker OverlayFS / union mounts), the fallback would write credentials *through* a pre-placed symlink to an attacker-controlled target — the exact attack noFollow was meant to prevent. Fix: when `noFollow` is set, the EXDEV fallback now `unlink`s any existing entry, then opens with `O_WRONLY|O_CREAT|O_EXCL` to refuse to write through a symlink that races back. Applied to both async and sync variants. 2. **Test correctness**: the two EXDEV-fallback-write-failure tests added in the previous round had a bug — the `failingWrite` mock threw on every call, including the first call which is the tmp-file write. The tmp write failed → outer catch caught ENOSPC → EXDEV check returned false (code is ENOSPC) → fell through to the outer annotateWriteError. The inner EXDEV-fallback annotation was never exercised. The assertions passed only because annotateWriteError produces the same format in both catch blocks. Fix: selective-failure mock that succeeds on the first call (tmp write) and fails on the second (fallback write), genuinely reaching the EXDEV branch. 3. **Behavioral noFollow tests**: previously `noFollow: true` was only verified at the "option is passed to mock" level. Added 4 real-fs tests (async + sync × happy-path + EXDEV-fallback) that pre-place a symlink, call atomicWriteFile with noFollow, and assert: (a) the symlink is replaced by a regular file, (b) the new file holds the new data, (c) the real file behind the symlink is untouched. A regression flipping the noFollow ternary or skipping the noFollow-aware EXDEV fallback now fails directly. Refs: #4333, #4095 Phase 2 * fix(core): close TOCTOU window on noFollow EXDEV fallback + cover ENOENT path Two PR #4333 round-7 review items folded in: 1. TOCTOU race: the path-based `tryChmod(targetPath)` after the EXDEV noFollow branch ran AFTER `fd.close()` released the inode reference. Between close and chmod, an attacker with parent-directory write access could replace the regular file with a symlink, redirecting the `chmod 0o600` onto an attacker-chosen target — silently defeating the `noFollow` protection that the `unlink + O_EXCL` pattern was added to provide. Fix: switch to `fd.chmod`/`fchmodSync` on the open fd before close (operates on the inode, immune to symlink swap), and skip the path-based chmod for the noFollow branch (path-based chmod remains for the non-noFollow direct-write branch, where following symlinks was already in scope). 2. Missing test coverage: all 4 existing noFollow EXDEV tests pre-place a symlink at the target, so `unlink(targetPath)` always succeeded — the ENOENT-swallow branch (first-write scenarios, e.g. initial credential provisioning on a cross-device mount) had no coverage. Added 2 tests (async + sync) verifying the fallback creates a new file with the requested mode when the target never existed. Test results: 54/54 atomicFileWrite tests pass (was 52). Async + sync parity preserved. Refs: #4333, #4095 Phase 2 * test(core): skip new noFollow EXDEV-fallback tests on Windows (NTFS perm bits) The two new-file noFollow EXDEV tests added in 7c47664 assert the file ends up at 0o600. NTFS reports 0o666 for any file that isn't read-only regardless of chmod/fchmod, so the assertion fails on Windows runners (`AssertionError: expected 438 to be 384`). Match the existing `it.skipIf(process.platform === 'win32')` pattern already used by all mode-asserting tests in this file. Linux/macOS coverage of the ENOENT-swallow + new-file path is unchanged. Refs: #4333, #4095 Phase 2 * fix(core): narrow fchmod catch + verify mechanism on noFollow EXDEV Two PR #4333 round-8 review items: 1. The `catch {}` around `fd.chmod(desiredMode)` (and `fchmodSync`) was intended to tolerate filesystems without POSIX permissions (FAT/exFAT) but silently swallowed every error code: EPERM under a hardened sandbox, EIO on flaky NFS/CIFS, EROFS if the filesystem is remounted read-only mid-write. Because this path operates on the *final credential file* — not a tmp file — a silent fchmod failure leaves the target at the umask-masked open() mode with no diagnostic trail. Narrowed the catch to ENOSYS/ENOTSUP so the FAT/exFAT case still tolerates failure but security-relevant errors propagate. 2. The round-7 noFollow-EXDEV-new-file tests asserted the final mode (0o600) but didn't verify the *mechanism*. Under typical umask 0o022, `open(O_EXCL, 0o600)` already creates the file at 0o600, so a regression that swapped `fd.chmod()` back to a path-based `tryChmod(targetPath)` (the pre-fix TOCTOU-vulnerable form) would leave the mode assertion passing — defeating the round-7 fix undetected. Added a `chmod` test seam to `atomicWriteFile` / atomicWriteFileSync` (mirroring the existing `rename` / `writeFile` seams) and asserted that path-based chmod is never invoked against the credential target on this code path. 54/54 atomicFileWrite tests pass. Refs: #4333, #4095 Phase 2 * fix(core): clean up O_EXCL orphan on noFollow EXDEV fchmod failure + cover both branches Two PR #4333 round-9 review items, one critical correctness bug introduced by the round-8 catch narrowing: 1. **[Critical]** When `fd.chmod(desiredMode)` (or `fchmodSync`) on the noFollow EXDEV fallback throws a propagating error (EPERM under seccomp, EIO on flaky NFS, EROFS after a remount), the file created by `open(targetPath, O_CREAT|O_EXCL)` remained on disk after the error rethrew. Every subsequent credential write retry then hit EEXIST from O_EXCL and failed permanently — turning a transient sandbox EPERM into a permanent OAuth-refresh deadlock that requires manual file removal. All three credential write sites (`sharedTokenManager`, `oauth-token-storage` save+delete, `file-token-storage`) hit this code path. Fix: a `writeOk` flag plus nested try/catch — fd is closed in the inner finally, then if write/ sync/fchmod failed, the orphan is unlinked best-effort before the error rethrows. 2. The fchmod catch-narrowing (the headline behavior of round 8) had zero test coverage on either branch — `_testFs` exposed `rename`, `writeFile`, and path-based `chmod`, but no hook for the open-fd `fchmod`. A one-line revert from the narrowed catch back to `catch {}` would pass every existing test. Added `fchmod` field to `_testFs` (async + sync) and four tests: - ENOSYS swallowed → write succeeds (FAT/exFAT happy path) - EPERM propagates AND `targetPath` is absent (regression-tests #1) for both async and sync. 58/58 atomicFileWrite tests pass (was 54). Async + sync parity preserved. Refs: #4333, #4095 Phase 2 * fix(core): annotate symlink errors + cover ENOTSUP/EACCES + log orphan unlink Four PR #4333 round-10 review items, all small bounded fixes: 1. Parameterized the FAT/exFAT fchmod-swallow tests over both ENOSYS (Linux) and ENOTSUP (macOS). Round-9 only covered ENOSYS — a one-token regression dropping `ENOTSUP` from the catch condition would have passed every existing test. 2. The orphan-unlink catch block was empty (`/* best effort */`), leaving no diagnostic trail when the cleanup itself fails (EROFS, immutable flag, sandboxed container). Added a `createDebugLogger` import (`'ATOMIC_WRITE'` category) and a debug log so incident response can correlate the original write error with a subsequent EEXIST loop. Async + sync. 3. The pre-open `unlink(targetPath)` correctly propagates non-ENOENT errors (EACCES on parent dir, EROFS), but no test exercised that path. Added an `unlink` field to the `_testFs` seam (async + sync, matching the existing `rename` / `writeFile` / `chmod` / `fchmod` pattern) and two tests verifying EACCES propagates instead of getting hidden behind a downstream EEXIST from O_EXCL. 4. `resolveSymlinkChain(filePath)` ran before the function's main try-block, so symlink-resolution errors (EACCES on intermediate dir, ELOOP from circular chain) bypassed the `atomicWriteFile("path"): ...` annotation that every other failure path applies — leaving `err.path` referencing an internal intermediate directory the caller never asked about. Wrapped with `.catch(err => throw annotateWriteError(err, filePath))` (async) and the equivalent try/catch for sync. Added a real-fs ELOOP regression test for both variants (skipped on Windows). 64/64 atomicFileWrite tests pass (was 58). Refs: #4333, #4095 Phase 2 * fix(core): widen atomicWriteJSON options + tighten trustedHooks perms + cover backoff/mkdir branches Five PR #4333 round-11 review items, all small and bounded: 1. **atomicWriteJSON option-type widening**: Was typed as the narrower `AtomicWriteOptions` (retries / delayMs only), so credential-grade options added in this PR (`mode`, `forceMode`, `noFollow`) and pre-existing ones (`flush`, `encoding`) were silently dropped at the type level even though the body spread them at runtime. A future maintainer calling `atomicWriteJSON(credPath, creds, {noFollow: true, mode: 0o600, forceMode: true})` would have typechecked but silently lost noFollow + forceMode + mode. Widened to `AtomicWriteFileOptions`. 2. **trustedHooks 0o600 + forceMode**: This file lists user-approved *executable hook commands* — strictly more sensitive than the sibling state files (`trustedFolders.json`, `tipHistory.json`) that already use `{mode: 0o600, forceMode: true}`. Was dropping to the process umask (0o644 by default), and a backup-restored looser mode was never healed. Now matches the sibling pattern. 3. **renameWithRetry exponential-backoff coverage**: Existing tests covered retry count and error propagation but not the `delayMs * 2 ** attempt` curve itself. A regression to linear, constant, or — worst — regressive backoff (which intensifies under Windows AV-scan stress) would have passed every existing test. Added a test using `vi.useFakeTimers()` that records gaps between mock-rename invocations and asserts `[delayMs, 2*delayMs, 4*delayMs]` for `(retries=3, delayMs=50)`. 4. **jsonl-utils write() parent-dir creation**: The other `write()` test targets a path inside the pre-created `tmpRoot`, so the `!existsSync(dir) → mkdirSync(dir, {recursive: true})` branch was never exercised. Added a one-liner test that targets a deeply nested non-existent path. 5. **writeLineSync docstring accuracy**: The docstring claimed "uses a simple flag-based locking mechanism (less robust than async version)" but there is no flag-based locking — and `writeLine` serializes via per-file `Mutex` that this function bypasses. Now accurately documents the lack of locking, the bypass, and the `flush: true` rationale (closes #3681). Test results: 65/65 atomicFileWrite tests pass (was 64), 26/26 jsonl-utils tests pass (was 25). Typecheck clean. Refs: #4333, #4095 Phase 2 * fix(core): force JS slow path for appendFile flush + correct debugLogger format Two PR #4333 round-12 review items, both real bugs: 1. **flush:true was silently a no-op for string payloads on appendFile**. Node's C++ fast path (binding.writeFileUtf8) for string + utf8 + appendFile bypasses the JS-side flush/fsync logic entirely — empirically string+flush:true takes ~0.05ms/op (identical to no flush) while Buffer+flush:true takes ~4.9ms/op (91× slower, proving fsync only runs for Buffer payloads). The data still reaches the kernel page cache (the syscall is synchronous), so `kill -9` is fine, but power-loss durability — the actual #3681 guarantee — was silently absent. Fix: pass `Buffer.from(line, 'utf8')` to both writeLine (async) and writeLineSync. This forces the JS slow path that honors `flush: true` and actually fsyncs the file. Updated the JSDoc on both functions to document the C++ fast-path bypass so a future maintainer doesn't revert to the simpler string form. 2. **`debugLogger.debug` doesn't do printf substitution**. `debugLogger`'s `formatArgs` (debugLogger.ts:67-77) just joins args with spaces — no `util.format()`. The round-10 calls used `'orphan unlink failed for %s: %s'` which rendered the literal `%s` markers in the log: orphan unlink failed for %s: %s /path/to/target Error: EACCES instead of: orphan unlink failed for /path/to/target: Error: EACCES Switched both async and sync sites to template literals, matching every other `debugLogger` call site in the codebase. Refs: #4333, #4095 Phase 2 * fix(core): narrow tryChmod catch + writeFileSync(fd) + cover credential write failures Three PR #4333 round-13 review items + one comment-wording softening: 1. **`tryChmod`/`tryChmodSync` catch narrowed (wenshao)** — was bare `catch {}` swallowing all errors, while the round-8 `fchmod` catch was already narrowed to ENOSYS/ENOTSUP. Same security rationale applies — and *specifically* on the EXDEV non-noFollow fallback, `tryChmod(targetPath)` is the *sole* mode-setting mechanism for an existing target (writeFile ignores `mode` when the target exists), so a silent EPERM/EROFS would leave credentials at the old mode. Non-credential callers don't pass `mode` → `desiredMode === undefined` short-circuits, so they're unaffected. 2. **Sync EXDEV `writeSync` → `writeFileSync(fd)` (yiliang114)** — `fsSync.writeSync(fd, buf)` returns bytes-actually-written and can short-write; the current code ignored the return, so a partial write would silently truncate the credential file with the call still returning success after fsync+fchmod. Switched to `fsSync.writeFileSync(fd, buf)` which loops internally per Node spec. The async sibling (`fd.writeFile`) already handles short-writes; this brings sync parity. 3. **`file-token-storage` failure-path coverage (wenshao)** — both `setCredentials` and `deleteCredentials` propagate `atomicWriteFile` rejections (no try/catch around the call), but no test exercised that path. Added two tests mirroring the round-1 sharedTokenManager precedent: ENOSPC on `setCredentials` and EROFS on `deleteCredentials` both rethrow. 4. **Round-12 comment wording softened (wenshao verification report)** — strace on Node v22/v24 confirms string + utf8 + flush:true does fsync correctly today, counter to my round-12 "silent no-op" framing. Buffer is still the safer documented form (forward-compat insurance against any future fast-path optimization), but the commit's claim that it was *fixing* a confirmed bug overstated what reproduces. Reframed both writeLine and writeLineSync comments accordingly without changing the code behavior. Test results: 109/109 affected suites pass (atomicFileWrite 65, jsonl-utils 26, file-token-storage 18). Broader credential/state suites also green: 216/216 across sharedTokenManager + oauth-token-storage + qwenOAuth2 + logger + trustedFolders. Typecheck clean. Refs: #4333, #4095 Phase 2 * fix(core): route orphan unlink through seam + cover tryChmod + trustedHooks tests Three PR #4333 round-14 review items: 1. **Path-level tryChmod / tryChmodSync catch narrowing had zero direct coverage**. Round-13 narrowed the bare `catch {}` to ENOSYS/ENOTSUP only (matching the round-8 fd-level fchmod narrowing), but every existing EXDEV test passed `options: undefined`, so `desiredMode === undefined` short-circuited before any chmod was attempted. A regression that inverted the catch condition (missing `!` prefix) would silently swallow EPERM on every sync credential write (trustedHooks, tipHistory, trustedFolders, etc.) with zero diagnostic signal. Added 6 tests via the existing `_testFs.chmod` seam: parameterized ENOSYS/ENOTSUP swallow + EPERM propagation, for both async (`atomicWriteFile`) and sync (`atomicWriteFileSync`). 2. **Orphan-cleanup unlink on the noFollow EXDEV failure path was using raw `fs.unlink` / `fsSync.unlinkSync` instead of the injected `unlinkImpl` seam**. The pre-open unlink correctly used the seam, but the round-9 orphan cleanup added later bypassed it, making it the only fs operation in `atomicWriteFile` not flowing through the test seam. Routed both async and sync orphan cleanup through `unlinkImpl`, and added 2 tests that inject a spy and assert orphan cleanup is invoked against targetPath after a simulated fchmod EPERM. 3. **`trustedHooks.ts` had no test coverage**. Round-11 migrated it to `atomicWriteFileSync` with `{ mode: 0o600, forceMode: true }` — strictly the most security-sensitive write in the PR since the file stores user-approved executable hook commands — but unlike the sibling files (trustedFolders, tipHistory) it had no test file. A regression that dropped `forceMode: true` or weakened the mode would have passed all existing tests. Created `trustedHooks.test.ts` covering: write goes through atomicWriteFileSync with `{ mode: 0o600, forceMode: true }`, write targets the global qwen dir path, the persisted content matches the hook key derived from the hook config, and round-trip trust/untrust behavior. Test results: 73/73 atomicFileWrite tests (was 65) + 5/5 trustedHooks (new). Typecheck clean. Refs: #4333, #4095 Phase 2 * fix(core): add noFollow to trustedHooks/trustedFolders + route tmp cleanup through seam Two PR #4333 round-15 review items: 1. **`trustedHooks.ts` and `trustedFolders.ts` were missing `noFollow: true`**. The credential write sites (`sharedTokenManager`/`oauth-token-storage`/`file-token-storage`) all pass `{ mode: 0o600, forceMode: true, noFollow: true }` to prevent pre-placed symlink attacks. The trustedHooks comment already called it "strictly more sensitive than trustedFolders / tipHistory" — yet credential paths got symlink protection it didn't. A pre-placed symlink at `~/.qwen/trusted_hooks.json` (or `trustedFolders.json`) could redirect the atomic write to an attacker-controlled path, either leaking the executable-trust list / trusted-folder list, or leaving the user's real config silently stale. Added `noFollow: true` to both write sites and updated the assertions in `trustedHooks.test.ts` and `trustedFolders.test.ts`. 2. **Tmp-file cleanup at L240 (async) and L568 (sync) used raw `fs.unlink` / `fsSync.unlinkSync` instead of the injected `unlinkImpl` seam**. Pre-open unlink and orphan cleanup correctly routed through `unlinkImpl`, making the tmp-cleanup branch the only outlier — `_testFs.unlink`-injecting tests couldn't intercept this path, weakening the seam abstraction. Behavioral impact is nil (cleanup is best-effort, errors swallowed), but consistency matters for future test authors. Routed both async and sync variants through `unlinkImpl`. Test results: 99/99 affected suites pass (atomicFileWrite 73, trustedHooks 5, trustedFolders 21). Typecheck clean on core. Refs: #4333, #4095 Phase 2 * fix(core): use path.join in trustedHooks test to fix Windows CI PR #4333 round-15 review item (Critical, from claude-opus-4-8 /qreview): trustedHooks.test.ts:68 hardcoded a POSIX path ('/mock/home/.qwen/trusted_hooks.json'), but production builds the path with path.join(Storage.getGlobalQwenDir(), 'trusted_hooks.json') (trustedHooks.ts:29-31). On Windows path.join emits '\' separators, so the mocked atomicWriteFileSync receives '\mock\home\.qwen\trusted_hooks.json' and the toBe assertion fails — the cause of the red Test (windows-latest, Node 22.x) check. macOS/Linux runs are green because forward slashes match. Build the expected path with path.join so it matches the platform separator, and add the node:path import. Refs: #4333, #4095 Phase 2 * fix(core,cli): round-16 review — tipHistory noFollow + atomicWrite test coverage + LSP concurrency note Four PR #4333 follow-up review items from wenshao: 1. tipHistory.ts was missing noFollow: true while the sibling 0o600+forceMode config sites (trustedFolders.ts:192, trustedHooks.ts:63) both set it. The cleanup commit added forceMode "for consistency with other 0o600 sites" but stopped short of noFollow. All three store user-trusted paths and should share the same pre-placed-symlink protection. Added noFollow: true. 2. annotateWriteError idempotency guard had no direct coverage. The guard switched from includes(targetPath) to startsWith(fnName+"(") because real syscall errors embed the *tmp* path (which contains the target as a substring), so the old guard silently skipped annotation on every real failure. Added a test where the rename error message embeds a tmp-style path containing the target; the correct startsWith guard still annotates it, and reverting to includes would fail it. 3. non-EXDEV rename failure tests only asserted /EIO/, not the atomicWriteFile(...): / atomicWriteFileSync(...): annotation prefix that the production re-throw applies on that path. Tightened both async and sync assertions to match the prefix. 4. applyTextEdits concurrency constraint was undocumented. The async read-modify-write has await points between read and write; atomicWriteFile prevents torn files but does not serialize writers, so concurrent same-path edits can lose updates. Latent today (no production caller / no workspace/applyEdit handler), so documented the per-file-serialization requirement instead of adding a lock. Not taking: extending the _testFs seam with open/openSync to fault-inject O_EXCL EEXIST — the noFollow/O_EXCL security behavior is already covered by the real-fs behavioral tests (noFollow EXDEV fallback refuses to follow symlinks); a seam-injected EEXIST would only exercise error propagation, not the guarantee. Test results: atomicFileWrite 74/74 pass, eslint clean. Refs: #4333, #4095 Phase 2 * test(core): route atomicWriteFile open through _testFs seam + assert O_EXCL (PR #4333 review) 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

…nLM#3606) (QwenLM#3656)

fix(core): recover from }{ glued records on session JSONL load (Qwe…

aac7107

…nLM#3606)

BingqingLyu mentioned this pull request Apr 27, 2026

fix(core): recover from }{ glued records on session JSONL load (#3606) BingqingLyu/qwen-code#132

Open

wenshao approved these changes Apr 27, 2026

View reviewed changes

wenshao merged commit 8a27876 into QwenLM:main Apr 27, 2026
23 of 25 checks passed

wenshao mentioned this pull request Apr 27, 2026

chore(core): jsonl reader/writer follow-ups from #3656 #3681

Closed

B-A-M-N pushed a commit to B-A-M-N/qwen-code that referenced this pull request Apr 28, 2026

fix(core): recover from }{ glued records on session JSONL load (Qwe…

9516b54

…nLM#3606) (QwenLM#3656)

qqqys mentioned this pull request Apr 29, 2026

No saved session found <guid>. Run qwen --resume without an ID to choose from existing sessions. #3606

Closed

qwen-code-ci-bot mentioned this pull request May 12, 2026

feat: atomic file write & transaction rollback #4095

Closed

xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026

fix(core): recover from }{ glued records on session JSONL load (Qwe…

6eb7332

…nLM#3606) (QwenLM#3656)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(core): recover from `}{` glued records on session JSONL load (#3606)#3656

fix(core): recover from `}{` glued records on session JSONL load (#3606)#3656
wenshao merged 1 commit into
QwenLM:mainfrom
qqqys:fix/3606-jsonl-recovery

qqqys commented Apr 27, 2026

Uh oh!

wenshao left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

qqqys commented Apr 27, 2026

Summary

Validation

Scope / Risk

Testing Matrix

Linked Issues / Bugs

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants