Skip to content

perf(filesearch): move @-picker crawl and fzf index to worker_threads#100

Open
BingqingLyu wants to merge 15 commits into
mainfrom
fork-pr-3455-feat-filesearch-worker-p1
Open

perf(filesearch): move @-picker crawl and fzf index to worker_threads#100
BingqingLyu wants to merge 15 commits into
mainfrom
fork-pr-3455-feat-filesearch-worker-p1

Conversation

@BingqingLyu

@BingqingLyu BingqingLyu commented Apr 27, 2026

Copy link
Copy Markdown
Owner

TLDR

Moves the @-picker's recursive file crawl and fzf index construction off the main thread into a worker_threads worker, so pressing @ no longer freezes the Ink render loop. On large workspaces (home directory, 100k-file monorepos) the CLI stays fully responsive where it previously stalled for 1–9 seconds.

Also wires up a bundled ripgrep fast path for the crawl (3–4× faster than fdir past ~50k files) and pre-warms the index at CLI boot so the first @ press usually finds a ready snapshot.

Screenshots / Video Demo

N/A — no new UI surface; the visible change is "nothing freezes anymore" on large trees, which is hard to capture in a still. Happy to record a before/after clip against ~ if reviewers want one.

Dive Deeper

Root cause of the original freeze was two synchronous pieces of work on the main thread:

  1. fdir recursive crawl in packages/core/src/utils/filesearch/crawler.ts.
  2. new AsyncFzf(allFiles, …) in packages/core/src/utils/filesearch/fileSearch.ts — despite the name the constructor is synchronous and dominates for large file counts.

This branch introduces:

  • FileIndexCore — pure class holding the crawl, fzf index, and result cache. Worker-safe and unit-testable without a real worker.

  • fileIndexWorker.ts — thin parentPort pump that wraps FileIndexCore.

  • FileIndexService — main-thread singleton keyed by (projectRoot + ignore-fingerprint + options) that owns the worker, multiplexes reqId-based search calls, fans out streaming partial updates, and disposes stale-key instances when .gitignore edits or cwd changes produce a new key.

  • RecursiveFileSearch is now a thin proxy over the service, so the public FileSearch interface (and vscode-ide-companion's use of it) is unchanged.

  • ripgrepCrawler.tsrg --files backend using the bundled ripgrep binary. Empirically on macOS:

    Tree size fdir rg via Node Winner
    ~2.7k files (qwen-code) ~25 ms ~140 ms fdir
    ~48k (node_modules) ~640 ms ~1.8 s fdir
    ~100k (home dir cap) ~9 s ~2.5 s rg 3–4×

    At small sizes Node's spawn+IPC overhead beats rg's native parallel walker; past ~50k files rg's Rust walker wins decisively. Since both are well under the 200 ms loading threshold on small repos, rg is the default; QWEN_FILESEARCH_USE_RG=0 forces fdir.

  • AppContainer.tsx pre-warms FileIndexService.for(...) right after config.initialize() resolves, so the worker crawl starts in parallel with Config/MCP boot instead of waiting for the first @ keypress.

  • useAtCompletion.ts no longer flips isLoading=true on the INITIALIZE dispatch; the 200 ms slow-load timer now also covers INITIALIZING, so the common pre-warmed path opens silently and only slow cold-starts surface a spinner.

  • Composer.tsx no longer renders ConfigInitDisplay; the "Initializing…" banner at top of the prompt is gone.

The branch also lands five rounds of self-audit fixes on top of the P1 baseline:

  • crawlError now disposes the service (no zombie worker on transient failures).
  • dispose() rejects whenReady() waiters.
  • REFRESH uses a monotonic token so partial-driven re-searches fire even mid-SEARCHING.
  • worker.on('error') is wired and converted to a synthetic exit.
  • Worker dispose uses parentPort.close() (drains queued sends) instead of process.exit(0).
  • optionsKey() includes the .gitignore/.qwenignore fingerprint so edits invalidate the cached singleton; FileIndexService.for() also evicts stale-key instances sharing the same projectRoot.
  • FileSearch.dispose?() added to the interface; vscode-ide-companion's clearFileSearchCache now disposes instead of only dropping the Map entry.
  • Worker module-init is wrapped in try/catch and search payloads are type-validated; picomatch compile errors degrade to empty results instead of throwing to the UI.
  • scripts/prepare-package.js whitelists fileIndexWorker.js so the published npm tarball actually ships it.

Reviewer Test Plan

  1. npm ci && npm run build && npm run bundle
  2. Run on a small repo: npm start then press @ — should open instantly, no loading flash.
  3. Run on a large tree: QWEN_WORKING_DIR=~ npm start, press @ within the first second — prompt should stay responsive (cursor keeps blinking, you can still type). Compare to main: freeze.
  4. Force fdir: QWEN_FILESEARCH_USE_RG=0 QWEN_WORKING_DIR=~ npm start — slower crawl but still non-blocking.
  5. .gitignore invalidation: open an @-picker, edit .gitignore to add a dir, press @ again — newly-ignored files disappear without a CLI restart.
  6. Unit + hook tests: npm run test --workspace=packages/core and npm run test --workspace=packages/cli.

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Personally validated on macOS 15.1.1 (arm64) with Node 22.17. Worker spawn latency on Windows is known to be higher (30–80 ms) — pre-warm absorbs it, but independent validation there would be appreciated. Sandbox profiles may need to permit worker thread creation; I haven't repro'd a sandbox issue locally but flagged it as a risk.

Before:

qwencode@before.mp4

after:

qwencodeafter.mp4

Linked issues / bugs

Fixes QwenLM#3454

callmeYe and others added 15 commits April 20, 2026 10:29
Before P1 the recursive @-picker did the fdir crawl and synchronous
`new AsyncFzf(...)` construction on the main thread. On 100k-file
workspaces that blocked the Ink render loop for 500ms–2s, so pressing
`@` felt unresponsive.

This change introduces `FileIndexCore` (pure class), `FileIndexWorker`
(worker_threads entry), and `FileIndexService` (main-thread singleton
that multiplexes search requests and exposes onPartial for future
streaming). `RecursiveFileSearch` becomes a thin proxy, preserving the
public `FileSearch` contract (and vscode-ide-companion's usage)
untouched. The esbuild bundle now emits a self-contained
`dist/fileIndexWorker.js` alongside `dist/cli.js`.
…EFRESH trigger

Changes driven by an open-ended correctness audit:

- crawlError now tears the service down (dispose worker, delete
  from INSTANCES, reject pending searches + whenReady waiters), so a
  transient crawl failure no longer leaks a worker per attempt.
- dispose() now rejects whenReady() waiters with AbortError; previously
  callers awaiting whenReady() through a dispose could hang forever.
  Also reorders unsubscribe before posting 'dispose' so the in-process
  transport's synchronous exit cleanup can't beat the AbortError.
- worker.on('error') is wired, converting uncaught worker errors into
  a normal exit-cleanup path instead of re-throwing on the main process
  and crashing the CLI.
- useAtCompletion gains a monotonic refreshToken in its reducer state
  and effect deps, so REFRESH re-triggers the worker effect even when
  the status was already SEARCHING (previously a partial chunk arriving
  mid-search could not cause a re-run).
- Worker 'dispose' now closes parentPort instead of process.exit(0) so
  in-flight searchResult/searchError messages drain before shutdown.
- FileIndexCore.startCrawl preserves the allFiles reference on the
  cache-hit fallback path to keep concurrent search() iterations stable.
- dispose() races transport.terminate() against a 2s timeout so a
  faulted worker cannot hang shutdown.
- handleExit is now idempotent: if dispose() already set disposed=true,
  a subsequent worker exit won't double-reject pending/readyWaiters.

Added regression tests for whenReady-on-dispose rejection and for
post-dispose FileIndexService.for() creating a fresh instance.
…TANCES hygiene

- `optionsKey()` now hashes the .gitignore / .qwenignore contents via
  loadIgnoreRules().getFingerprint(). Editing those files produces a new
  key so the next `FileIndexService.for()` spawns a fresh worker instead
  of serving a stale cached snapshot with outdated ignore rules.
- `FileIndexService.for()` only memoises an instance that survived
  construction. If the Worker spawn errored synchronously and handleExit
  ran before INSTANCES ever had the key, a permanently-disposed instance
  would otherwise be cached for the lifetime of the process.
- `FileSearch.dispose?()` is now a (optional) part of the public
  interface; the recursive proxy delegates to `FileIndexService.dispose()`.
- `FileMessageHandler.clearFileSearchCache` calls `dispose()` when a
  workspace file create/delete fires so the worker's fzf index is
  rebuilt from disk instead of continuing to serve stale results.

Added regression tests: editing .gitignore mid-session invalidates the
singleton; disposed services don't pollute INSTANCES.
…ttern errors

- Wrap FileIndexCore construction at worker module-init in try/catch.
  A bad projectRoot (e.g. containing NUL) used to crash the worker
  before its message handler attached, leaving the main thread waiting
  forever. We now surface the failure as a normal crawlError / searchError
  so the service sees it on the next message.
- Validate reqId / pattern types on 'search' messages so a malformed
  IPC shape fails just the one request instead of the whole worker.
- filter() now catches picomatch compile errors and returns [] — typing
  an interim `foo[` (common mid-keystroke state) no longer surfaces as
  a TypeError to the UI; it's simply "no matches" until the pattern is
  well-formed.
- Worker error log trims to name+message instead of the full Error
  object, so a CLI transcript/wrapper doesn't capture absolute paths
  from the user's machine by default.

Added regression test: malformed glob pattern in core.search returns [].
…y eviction

- scripts/prepare-package.js now lists `fileIndexWorker.js` in the
  published tarball's `files` whitelist. Without this, `npm i -g` would
  produce an install missing the worker, and the first `@`-picker use
  would throw ERR_MODULE_NOT_FOUND at runtime. Release blocker.
- FileIndexService.for() now disposes any stale instance under a
  different key but the same projectRoot before creating a new one.
  When `.gitignore` is edited mid-session, the fingerprint change would
  otherwise leave the old worker pinned in INSTANCES forever — the
  auditor flagged this as a real leak even though the commit message
  for round 2 implied it was fixed.
- Dropped `__setIndexTransportFactory` from the public barrel; it's a
  test-only hook that shouldn't be part of the external API. Tests
  (and nothing else) can still import it from the module directly.
- `search()` on a disposed service now throws AbortError instead of a
  plain Error, matching how in-flight searches are rejected inside
  dispose().

Added regression tests: stale-key service is disposed and its `search()`
throws AbortError after .gitignore-driven eviction.
…nc throw

Round 4 switched `FileIndexService.search()` on a disposed instance
from `Error` to `AbortError` for notional consistency with how in-flight
searches are rejected inside `dispose()`. The audit correctly called
out that this silently breaks useAtCompletion: the hook's catch block
swallows `AbortError` as "user pressed ESC" and never dispatches the
ERROR state, so a crawlError-driven cascade (where dispose() is called
and the next search hits the sync guard) would leave the UI stuck in
SEARCHING forever.

In-flight rejections inside dispose() remain AbortError — those are
genuine cancellations. The post-dispose sync guard is caller misuse
and must surface as a plain Error to drive the ERROR branch.
…cators

Addresses post-audit UX feedback:

- AppContainer kicks FileIndexService.for(...) as soon as Config.initialize
  resolves. The worker crawl starts in the background before the user types
  `@`, so the first picker open usually finds a ready (or nearly-ready)
  singleton and returns results without any visible wait.
- useAtCompletion no longer flips isLoading on the INITIALIZE dispatch.
  A 200ms timer now arms during INITIALIZING too, so the "Loading
  suggestions..." placeholder only appears if the crawl/search is
  genuinely slow — the common pre-warmed path opens silently.
- Remove the global ConfigInitDisplay render in Composer. The top-of-screen
  "Initializing..." banner is no longer shown while Config/MCP finish
  setting up; the prompt renders without any boot-time placeholder.

Updated the two useAtCompletion tests that asserted the old flash-loading
behaviour to assert the steady-state (isLoading=false) instead.
Initially planned as P2 to replace fdir with ripgrep for the @-picker
crawl. Empirical benchmarking on two representative trees refuted the
premise:

  qwen-code (~2700 files)   fdir ~25ms  vs  rg (spawn+IPC) ~140ms
  node_modules (~48k files) fdir ~640ms vs  rg (spawn+IPC) ~1800ms

Even though `rg --files` is ~70ms in a shell, the child_process spawn
and stdout-pipe overhead from Node dominates for every tree size we
tested. Claude Code's speed almost certainly comes from being a native
binary with an in-process walker, not from shelling out to ripgrep.

The ripgrepCrawler implementation is kept for future re-evaluation on
very large trees or different platforms, and is reachable via the
`QWEN_FILESEARCH_USE_RG=1` env var. fdir remains the default.

Also: useAtCompletion tests now sort before asserting on suggestion
order. The exact ranking is an fzf tiebreak artifact that depends on
crawler emission order — not a behavioural contract worth coupling
tests to.
Earlier commit (21764b5) gated ripgrep off by default based on a
benchmark that only covered small-to-medium trees (<48k files), where
Node's spawn+IPC overhead does beat fdir. User-reported feedback
prompted a re-test on a home-directory-scale target:

  qwen-code repo (~2700 files)     fdir ~25ms   rg ~140ms   fdir wins
  project/node_modules (~48k)      fdir ~640ms  rg ~1800ms  fdir wins
  ~/ home dir (100k-file cap)      fdir ~9s     rg ~2.5s    rg 3-4× wins

The slow case is the painful one — a user typing @ at $HOME shouldn't
wait 9 seconds while Node's single-threaded walker catches up. On small
repos both backends are well below the 200ms loading threshold, so the
perceptual cost of rg's spawn overhead is zero. Flip the default; keep
`QWEN_FILESEARCH_USE_RG=0` as the escape hatch to force fdir.
Windows CI was failing the full filesearch suite (useAtCompletion,
crawler, fileSearch, fileIndexCore, fileIndexService) because ripgrep
on Windows emits `.\file.txt`; the stdout pump stripped `./` before
`toPosixPath`, so the leading `./` survived and the ancestor-directory
walk in `buildRipgrepFileFilter` asked the ignore lib to test `"./"` —
which throws RangeError and poisoned the data handler. Normalize to
posix first, then strip; guard the filter for `.`/`./`/leading-`./`
inputs as defence in depth.

Also addresses review feedback:

- fileIndexService: an early worker exit left `_state` at `'crawling'`,
  so a `whenReady()` call arriving after the exit parked in readyWaiters
  and never settled. `handleExit` now transitions to `'error'` so the
  state check rejects synchronously.
- AppContainer prewarm: guarded on `enableRecursiveFileSearch` so
  opted-out users don't pay for a full crawl + worker at startup.
- FileMessageHandler: added a per-rootPath generation token so an
  in-flight `initialize()` disposes the stale index instead of
  re-caching it when `clearFileSearchCache` fires mid-crawl.

Regression tests added for all four.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ata payloads

- Remove unused `buildCaseInsensitiveGlob` helper and its companion
  `globSpecialChars` Set. Both were carried over from an earlier
  case-insensitive-search attempt and have no callers anywhere in the
  tree.
- Switch `Record<string, unknown>` reads in the message handler
  (`data?.query`, `data.path`, `data.content`, etc.) to bracket syntax
  so the file stays clean under stricter
  `noPropertyAccessFromIndexSignature` variants of the TS config.

Pure cleanup; behaviour unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng tmp

After the ripgrep path fix landed, the only remaining Windows CI
failures were `EBUSY: resource busy or locked, rmdir` on test temp
directories. Root cause: `ripgrep --files` is spawned with
`cwd: crawlDirectory`, and on Windows the OS holds a handle on that
working directory until the child fully exits. If `afterEach` calls
`cleanupTmpDir` before the FileIndexService (and its rg subprocess)
is disposed, rmdir races the handle release and fails.

Three-pronged fix:

- `cleanupTmpDir` now passes `maxRetries: 5, retryDelay: 100` to
  `fs.rm`. Node retries EBUSY/EPERM internally on Windows; half a
  second is plenty for tens-of-ms handle-release races. This is a
  blanket safety net for every caller.
- `fileIndexService.test.ts` afterEach: `__resetForTests()` runs
  before `cleanupTmpDir` so the transport is torn down first.
- `useAtCompletion.test.ts` afterEach: added the same
  `FileIndexService.__resetForTests()` call so the hook's in-process
  worker is disposed before the dir is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- fileIndexCore: drop the redundant second crawl() on cache hits; use
  the first call's return value when onProgress never fires.
- fileIndexProtocol: new module holding WorkerRequest/WorkerResponse so
  fileIndexService and fileIndexWorker can't silently diverge.
- fileIndexService: LRU cap (8) on the INSTANCES singleton Map; hits
  re-insert to refresh recency. Dispose timeout warns + retries
  terminate() as a force-kill fallback.
- fileIndexService: drop `process.env['VITEST']` transport sniffing in
  favour of an explicit `installInProcessIndexTransport()` DI hook
  wired from each package's test-setup. Survives bundling.
- crawler: `ripgrepDisabled` is now a timestamp with a 5-minute
  cooldown (was permanent-for-process-lifetime). A single spawn failure
  no longer downgrades long-lived hosts like the VSCode extension for
  the whole session.
- crawler: collectRipgrepExcludeDirs now forwards plain directory
  patterns (.git/, build/, user ignoreDirs) to rg as `--glob '!dir'`
  args so rg prunes subtrees at the walker instead of streaming every
  path under them for the Node post-filter to reject.
- useAtCompletion: fileSearchOptions is useMemo'd on [config, cwd] and
  both effects have it in their deps — the eslint-disables are gone.
  Extracted `buildFileSearchOptions(config, cwd)` helper that
  AppContainer uses too, so the prewarm and search paths can't drift
  from the same FileIndexService singleton key.

Tests:
- fileIndexService.test: new "evicts the oldest instance when the LRU
  cap is exceeded" regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict in packages/cli/src/ui/components/Composer.tsx: upstream added
`HistoryItemToolGroup` type import to support per-subagent token
attribution, while this branch removed the `ConfigInitDisplay` banner
as part of the pre-warm UX change. Resolved by keeping the upstream
type import and dropping `ConfigInitDisplay` (import + JSX usage),
which is the intended behaviour of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…imers

Addressing @wenshao's two findings on QwenLM#3455:

1. `--bare` mode tests failing in `packages/cli/src/config/config.test.ts`

   Root cause (correctly diagnosed): the top-level
   `installInProcessIndexTransport()` call in `test-setup.ts` eagerly
   evaluated `@qwen-code/qwen-code-core`'s module tree, pulling
   `workspaceContext.ts` in with a real `node:fs` binding. Later
   `vi.mock('fs', …)` declarations in individual test files no longer
   took effect, so `fs.existsSync()` fell through to the real FS and
   silently dropped mock directories.

   Fixed by removing the install from both `test-setup.ts` files and
   opt-ing in per-test-file (`beforeAll`/`afterAll`) in the three
   suites that actually exercise the worker-backed filesearch:

   - `packages/core/src/utils/filesearch/fileIndexService.test.ts`
   - `packages/core/src/utils/filesearch/fileSearch.test.ts`
     (also drops live singletons between tests so Windows rmdir
     doesn't race an open ripgrep child)
   - `packages/cli/src/ui/hooks/useAtCompletion.test.ts`

   The two `test-setup.ts` files now carry an explanatory comment so
   this doesn't regress.

2. CI matrix stuck for ~10h on "Run tests and generate reports"

   Likely cause: pending `setTimeout` handles keeping the event loop
   alive past test completion.

   - `FileIndexService.dispose()` raced `terminate()` against a 2 s
     `setTimeout` but never cleared the timer on the healthy-exit
     path — so after every disposal the process held a timer for up
     to 2 s. Now the terminate() arm clears the timer, and the
     timer is also `.unref()`'d as a belt-and-suspenders.
   - `crawlCache` TTL timers (default 30 s) are now `.unref()`'d
     too. They'd keep vitest workers waiting on exit even after
     all assertions passed.

   `clear()` still drops both synchronously between tests, so
   correctness is unchanged.

Verified:
- `npx vitest run src/config/config.test.ts -t "should ignore implicit startup"` now passes (was failing before).
- Full test suites — core 5925 / 5927, cli 4275 / 4282 (skipped tests pre-existing), vscode-ide-companion 204 / 205 — all green.
- Typecheck clean on core + cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@BingqingLyu BingqingLyu added conflicting-group-1 conflicting-group-1 Conflicting PR group 1 — review as a batch conflicting-pr Shares at least one cross-PR dependency with other PRs and removed conflicting-group-1 labels May 7, 2026
@BingqingLyu

BingqingLyu commented May 7, 2026

Copy link
Copy Markdown
Owner Author

Conflict Group 1

This PR shares modified functions with 9 other PR(s): #107, #109, #113, #114, #117, #52, #88, #93, #96.

These PRs should be reviewed as a batch — merging one may affect the others.

Function File Also modified by
crawl crawler.ts #93
isToolExecuting AppContainer.tsx #107, #109, #113, #114, #117, #52, #88, #96
graph LR
    PR100["PR #100"]
    Fcrawl_3215["crawl<br>crawler.ts"]
    PR100 -->|modifies| Fcrawl_3215
    PR93["PR #93"]
    PR93 -->|modifies| Fcrawl_3215
    FisToolExecuting_7261["isToolExecuting<br>AppContainer.tsx"]
    PR100 -->|modifies| FisToolExecuting_7261
    PR107["PR #107"]
    PR107 -->|modifies| FisToolExecuting_7261
    PR109["PR #109"]
    PR109 -->|modifies| FisToolExecuting_7261
    PR113["PR #113"]
    PR113 -->|modifies| FisToolExecuting_7261
    PR114["PR #114"]
    PR114 -->|modifies| FisToolExecuting_7261
    PR117["PR #117"]
    PR117 -->|modifies| FisToolExecuting_7261
    PR52["PR #52"]
    PR52 -->|modifies| FisToolExecuting_7261
    PR88["PR #88"]
    PR88 -->|modifies| FisToolExecuting_7261
    PR96["PR #96"]
    PR96 -->|modifies| FisToolExecuting_7261
Loading

Posted by codegraph-ai conflict detection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

conflicting-group-1 Conflicting PR group 1 — review as a batch conflicting-pr Shares at least one cross-PR dependency with other PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

@-picker freezes the CLI for on large workspaces

2 participants