Skip to content

perf(filesearch): move AsyncFzf index construction to a worker thread#4621

Merged
yiliang114 merged 6 commits into
QwenLM:mainfrom
callmeYe:perf/fzf-worker-async
Jun 12, 2026
Merged

perf(filesearch): move AsyncFzf index construction to a worker thread#4621
yiliang114 merged 6 commits into
QwenLM:mainfrom
callmeYe:perf/fzf-worker-async

Conversation

@callmeYe

Copy link
Copy Markdown
Collaborator

TLDR

Hosts the @-picker's AsyncFzf instance in a worker_threads worker so
the freeze on workspaces with >20k files is removed. Narrow follow-up to
#3455 (which I'm closing); leaves the crawler entirely to upstream's
design from #3214.

Why

The original freeze had two root causes:

  1. Synchronous fdir crawl on the main thread.
  2. Synchronous new AsyncFzf(allFiles, ...) constructor — misleadingly
    named; the constructor itself runs sync and dominates main-thread cost
    on big trees, especially after the v2→v1 algorithm switch at 20k files.

Cause 1 was already addressed by #3214 (feat(core): replace fdir crawler with git ls-files + ripgrep fallback) — the new 3-tier cascade is
spawn-based and async. Cause 2 still freezes the Ink render loop on
large workspaces. This PR moves that construction off the main thread.

I closed #3455 because its surface area (worker, ripgrep crawler, prewarm,
IDE companion changes, UX tweaks; ~16 files) drifted too far from
upstream after 451 commits — the merge had 11 file conflicts with 4
double-rewrites. This PR is intentionally narrow: 7 files, ~575 net
insertions.

Architecture

  • packages/core/src/utils/filesearch/fzfWorker.ts — worker entry. Owns
    the AsyncFzf<string[]> instance, exposes a four-message RPC
    (initready / init-error; findresult / find-error;
    dispose).
  • packages/core/src/utils/filesearch/fzfWorkerHandle.ts — main-thread
    proxy. Same find() shape as AsyncFzf<string[]>, swap is one line in
    fileSearch.ts. Handles transport selection: in-thread fallback below
    ~5k files (where worker spawn + IPC outweighs construction cost),
    worker above. Awaits ready inside create() so init errors surface
    early rather than as a first-find surprise. Worker is unref()-ed so a
    mid-find @-picker doesn't pin process exit.
  • packages/core/src/utils/filesearch/fileSearch.ts — one-line swap of
    the new AsyncFzf call site, plus an optional dispose?(): Promise<void>
    on the FileSearch interface so consumers can release worker
    resources.
  • packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts
    clearFileSearchCache now calls dispose?() on the previous instance
    so the long-running VS Code extension host doesn't accumulate workers
    across watcher-triggered evictions.
  • esbuild.config.js — adds a second esbuild build that emits a
    self-contained dist/fzfWorker.js (fzf inlined). Main bundle remains
    the splitting build introduced upstream.
  • scripts/prepare-package.js — whitelists fzfWorker.js in the dist
    tarball.

Why no whenReady() / no in-worker crawl

These were the two open critical reviews on #3455 (commit 44fcc11dc,
@wenshao 04-24 / 05-09). Both become non-issues by design:

  • whenReady() post-dispose hang → there is no whenReady(). The
    handle awaits ready inside its create() factory, so callers
    observe init success/failure at construction time. Post-dispose
    find() rejects via the per-call pending map's failAll.
  • AbortSignal not threaded through in-process crawl → there is no
    in-worker crawl. The crawl stays on the main thread (already async via
    upstream's spawn-based cascade); only the AsyncFzf construction
    moves. Search-time abort is unchanged from main.

Test / dev mode

installInProcessFzfTransport() forces the in-thread path for tests
and sandboxes that can't spawn workers. It returns a restorer the
caller MUST run in afterAll — installing globally in test-setup.ts
caused the eager-fs-mock issue @wenshao caught on #3455's earlier
attempt, so this PR does NOT do that. __setWorkerThresholdForTests
is a separate hook for forcing worker selection on tiny inputs.

npm run dev (tsx) doesn't have the bundled dist/fzfWorker.js on
disk; the handle's existsSync check silently falls back to the
in-thread path. Acceptable trade-off — devs running tsx against huge
workspaces will see the same freeze the bundled CLI avoids.

Reviewer Test Plan

  1. npm ci && npm run build && npm run bundle — confirm
    dist/fzfWorker.js is emitted alongside dist/cli.js.
  2. npm run test --workspace=packages/core — 104/104 filesearch tests
    pass, including 5 new fzfWorkerHandle.test.ts cases.
  3. Manual: QWEN_WORKING_DIR=~ npm start (or any tree with >20k files);
    press @ — Ink input cursor stays responsive during fzf init.
    Compare to main: visible freeze.
  4. Manual: small repo (<5k files); press @ — should still open
    instantly (in-thread path, no worker spawn cost).

Diff size

7 files, +650 / −74. Roughly half of the +650 is the new
fzfWorkerHandle.ts and the worker entry; the rest is the esbuild
config restructure and the test file.

Linked

* Optional because the directory-only path holds no such resources.
* Implementations must be safe to call multiple times.
*/
dispose?(): Promise<void>;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Worker thread leak in the CLI's @-picker path. useAtCompletion (in packages/cli/src/ui/hooks/useAtCompletion.ts) creates a FileSearch via FileSearchFactory.create() and stores it in fileSearch.current. When cwd/config changes trigger re-initialization or the component unmounts, the old FileSearch is silently overwritten without calling dispose(). The VSCode path was correctly wired up in FileMessageHandler.clearFileSearchCache(), but the CLI hook — by far the most common consumer — was missed.

On workspaces with ≥5 000 files, each RecursiveFileSearch holds a FzfWorkerHandle that owns a worker_threads Worker (~5–10 MB + full file list). In a long-running CLI session, every directory switch or config reload leaks another worker. worker.unref() prevents blocking process.exit(), but the workers accumulate for the lifetime of the process.

Fix in useAtCompletion.ts:

Suggested change
dispose?(): Promise<void>;
// In the useEffect cleanup:
return () => {
searchAbortController.current?.abort();
if (slowSearchTimer.current) {
clearTimeout(slowSearchTimer.current);
}
void fileSearch.current?.dispose?.();
fileSearch.current = null;
};

Also dispose the previous instance before overwriting in initialize().

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81a513d. Added dispose() calls in both the initialize() path (dispose old instance before creating new one) and the effect cleanup function. Worker threads are now properly released on re-init and unmount.

const inst = transport.spawn(files, options);
// Surface init errors at create() time so callers don't get a "fzf not
// initialized" surprise on the first find().
await inst.ready();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Two robustness gaps in create():

  1. Worker leak on init failure: If inst.ready() rejects (worker sends init-error or crashes during construction), create() propagates the error without calling inst.dispose(). The spawned Worker is never terminated. In daemon mode, repeated init failures accumulate idle workers.

  2. No timeout on readyPromise: If the worker hangs during new AsyncFzf(...) (e.g., OOM on a huge file list, OS scheduling stall), create() blocks the caller indefinitely with no diagnostic output. The @-picker UI freezes on "loading..." with no error, no stack, no way out.

Suggested change
await inst.ready();
const inst = transport.spawn(files, options);
try {
const timeoutMs = Math.max(10_000, files.length / 10);
await Promise.race([
inst.ready(),
new Promise<never>((_, rej) =>
setTimeout(() => rej(new Error(
`fzf worker init timed out after ${timeoutMs}ms (${files.length} files)`
)), timeoutMs),
),
]);
} catch (err) {
await inst.dispose();
throw err;
}
return new FzfWorkerHandle(inst);

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81a513d. create() now wraps inst.ready() in a try/catch that calls dispose() on failure. Also added a Promise.race timeout scaled to file count (Math.max(10_000, files.length / 10) ms) so a hung worker doesn't block the @-picker indefinitely.

// Strip the heavy `positions` Set from each item before sending —
// structuredClone serialises Sets but the @-picker only needs the
// ranked `item` strings. Keeps IPC payloads small on big result sets.
const trimmed = items.map((entry) => ({ item: entry.item }));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] fzf.find() returns ALL matching items, which are all serialized via postMessage structured clone. On a large workspace with a broad single-character pattern, this can produce 50k+ results (~2.5 MB of IPC payload). The consumer in fileSearch.ts truncates to maxResults (~72 items) only after the full result set crosses the message channel.

Consider adding a limit field to the FindMessage protocol and slicing results in the worker before serialization:

// In FindMessage:
interface FindMessage {
  type: 'find';
  reqId: number;
  pattern: string;
  limit?: number;
}

// In the .then() handler:
const limit = msg.limit ?? items.length;
const trimmed = items.slice(0, limit).map((entry) => ({ item: entry.item }));

A generous limit (e.g., 200) caps IPC at ~10 KB while giving the main thread enough candidates for downstream filtering.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81a513d. Added limit?: number to FindMessage protocol. Worker now calls .slice(0, limit) before postMessage. The caller in fileSearch.ts passes Math.max(200, maxResults * 3) — generous enough for downstream ignore-filter to drop entries without starving results, while capping IPC at ~10 KB instead of multi-MB payloads on broad patterns.

installInProcessFzfTransport,
} from './fzfWorkerHandle.js';

describe('FzfWorkerHandle', () => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The worker transport code path (138 lines in fzfWorkerHandle.ts — the core feature of this PR) has zero test coverage. All 5 tests here use either the default in-process fallback (tiny inputs below the 5 000 threshold) or explicitly call installInProcessFzfTransport(). No test spawns a real worker_threads Worker.

Untested paths include:

  • Worker spawn → init → find → dispose lifecycle
  • init-error / find-error message handling
  • Worker unexpected exit → failAll() rejection of pending promises
  • Dispose-then-find race condition (pending finds should reject)

The in-process tests verify the interface contract, but bugs in IPC message handling, result reconstruction (positions: new Set<number>() reassembly), or worker lifecycle would go undetected.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81a513d. Added a worker transport (real worker_threads) describe block with 5 tests that use __setWorkerThresholdForTests(1) without installInProcessFzfTransport() to exercise the real spawn/IPC path:

  • spawn → init → find → dispose lifecycle
  • find() rejects after dispose()
  • limit parameter across IPC
  • create() rejects and disposes on init failure
  • concurrent find() calls

The tests build fzfWorker.js on-the-fly via esbuild in beforeAll and clean up in afterAll. Also discovered and fixed a bug: find() after dispose() would hang forever — now rejects immediately.

xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Verdict: COMMENT (not approving — critical lifecycle issues need resolution; CI also failing on Windows)

Overall Assessment

The architecture is well-designed: the worker/main-thread transport abstraction is clean, the threshold-based fallback is pragmatic, and the protocol between worker and handle is simple and correct. The esbuild config restructure to emit a standalone dist/fzfWorker.js is the right approach. Code quality is high throughout — good comments, clear separation of concerns, and the installInProcessFzfTransport() test hook avoids the pitfalls of global test setup.

However, the existing inline comments from @wenshao identify genuine blockers that must be addressed before merge:

  1. Worker thread leak in useAtCompletion (Critical, fileSearch.ts:106) — The CLI's @-picker hook creates FileSearch instances but never calls the new dispose() method, neither on cwd/config change nor on component unmount. With the worker transport active on large workspaces, this accumulates orphaned worker threads over a session.

  2. Worker leak on init failure in create() (fzfWorkerHandle.ts:309) — If inst.ready() rejects, the spawned worker is never disposed. A try/catch around await inst.ready() that calls inst.dispose() on failure would close this gap.

  3. No test coverage for the worker transport path (fzfWorkerHandle.test.ts:14) — All 5 tests exercise the in-process fallback. The worker transport (the actual feature of this PR) is untested. At minimum, a test using __setWorkerThresholdForTests(1) without installInProcessFzfTransport() would verify the spawn/message/dispose lifecycle.

Independent Findings

I independently verified the following and found them correct:

  • The message protocol between worker and handle is sound — init/find/result/dispose sequencing is clean, failAll handles error and unexpected exit correctly.
  • worker.unref() correctly prevents the worker from blocking process exit.
  • RecursiveFileSearch.dispose() is properly idempotent (interface requirement satisfied).
  • The esbuild parallel build via Promise.all is correct — outputs go to separate paths, and the worker build's metafile omission is intentional.
  • getWorkerScriptPath() caching is acceptable for production; the existsSync fallback for npm run dev (tsx) is a documented and reasonable trade-off.

CI Note

The Windows test failure (Test (windows-latest, Node 22.x)) should also be investigated before merge, as it may relate to worker_threads spawn latency differences on Windows.


Findings: 0 new inline comments (all critical issues already covered by existing review). 3 areas need attention before merge.

— qwen-code via Qwen Code /review

@wenshao

wenshao commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

@qwen-code /review

@callmeYe callmeYe force-pushed the perf/fzf-worker-async branch from 5b4a301 to 81a513d Compare June 8, 2026 09:13
if (slowSearchTimer.current) {
clearTimeout(slowSearchTimer.current);
}
void fileSearch.current?.dispose?.();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Effect cleanup disposes the worker on every dependency change, silently breaking @-completion search.

The effect's dependency array is [state.status, state.pattern, config, cwd]. Every status transition (e.g., INITIALIZING→READY after initialize() completes, or READY→SEARCHING on keystroke) triggers this cleanup, which disposes the just-created FzfWorkerHandle and sets fileSearch.current = null.

Trace: initialize() creates a searcher, assigns fileSearch.current, dispatches INITIALIZE_SUCCESS → status becomes READY → cleanup fires → dispose() kills the worker → fileSearch.current = null. Then SEARCHING triggers search() which early-returns because !fileSearch.current.

The base version's cleanup only aborted the search controller and cleared the timer — the searcher ref was preserved across re-renders.

Suggested change
void fileSearch.current?.dispose?.();
searchAbortController.current?.abort();
if (slowSearchTimer.current) {
clearTimeout(slowSearchTimer.current);
}

Move the dispose() + null assignment to a dedicated lifecycle effect with [cwd, config] deps (matching the existing RESET dispatch), so it only fires on workspace/config changes and unmount — not on every keystroke:

useEffect(() => {
  return () => {
    void fileSearch.current?.dispose?.();
    fileSearch.current = null;
  };
}, [cwd, config]);

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f3b9666. Moved dispose() out of the worker effect cleanup (which fires on every [state.status, state.pattern] change) into a separate effect with [cwd, config] deps — now only fires on workspace/config change and unmount.

await Promise.race([
inst.ready(),
new Promise<never>((_, rej) =>
setTimeout(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] setTimeout inside Promise.race is never cleared when inst.ready() resolves first (the happy path). The timer stays alive for Math.max(10_000, files.length / 10) ms (10 s for ≤100k files, up to 100 s at MAX_CRAWL_FILES), holding a closure over the files array and preventing GC.

Suggested change
setTimeout(
const timeoutMs = Math.max(10_000, files.length / 10);
let timerId: ReturnType<typeof setTimeout> | undefined;
try {
await Promise.race([
inst.ready(),
new Promise<never>((_, rej) => {
timerId = setTimeout(
() =>
rej(
new Error(
`fzf worker init timed out after ${timeoutMs}ms (${files.length} files)`,
),
),
timeoutMs,
);
}),
]);
} catch (err) {
await inst.dispose();
throw err;
} finally {
if (timerId !== undefined) clearTimeout(timerId);
}

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f3b9666. Added timerId variable and clearTimeout(timerId) in a finally block so the timer is cleaned up on both success and failure paths.

const previous = this.fileSearchInstances.get(rootPath);
this.fileSearchInstances.delete(rootPath);
this.fileSearchInitializing.delete(rootPath);
void previous?.dispose?.().catch(() => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] setupFileWatchers() returns a Disposable that disposes file watchers and the workspace-folders change listener, but never iterates this.fileSearchInstances to call dispose() on the cached FileSearch instances. When the webview panel closes, any RecursiveFileSearch instances that promoted to the worker path (≥5k files) retain live worker threads.

Consider extending the Disposable returned by setupFileWatchers():

return {
  dispose: () => {
    for (const watcher of this.fileWatchers.values()) watcher.dispose();
    this.fileWatchers.clear();
    foldersChangeListener.dispose();
    for (const instance of this.fileSearchInstances.values()) {
      void instance.dispose?.().catch(() => {});
    }
    this.fileSearchInstances.clear();
    this.fileSearchInitializing.clear();
  },
};

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f3b9666. Extended the Disposable returned by setupFileWatchers() to iterate fileSearchInstances and call dispose() on each, plus clear the initializing map.

Comment thread scripts/prepare-package.js Outdated
files: [
'cli.js',
// Worker thread entry loaded by FzfWorkerHandle at runtime via
// `new Worker(new URL('./fzfWorker.js', import.meta.url))`. Must ship

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The comment says new Worker(new URL('./fzfWorker.js', import.meta.url)) but the actual code in fzfWorkerHandle.ts uses resolveBundleDir(import.meta.url) + path.join(dir, 'fzfWorker.js') — a filesystem-based path resolution, not a new URL() constructor pattern.

Suggested change
// `new Worker(new URL('./fzfWorker.js', import.meta.url))`. Must ship
// Worker thread entry loaded by FzfWorkerHandle at runtime via
// `resolveBundleDir(import.meta.url)` + `path.join(dir, 'fzfWorker.js')`.
// Must ship in the tarball or the @-picker silently falls back to the
// in-thread AsyncFzf path on big workspaces in npm-installed CLIs.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f3b9666. Updated comment to reference resolveBundleDir(import.meta.url) + path.join(dir, 'fzfWorker.js') instead of the new URL() pattern.

pending.clear();
};

worker.on('error', (err) => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] worker.on('error') and worker.on('exit') handlers that call failAll() have no test coverage. If failAll() has a bug, pending find() calls could hang indefinitely in production when the worker crashes.

Consider adding a test that forces a worker crash (e.g., by passing a script that throws at the top level) and verifying that pending find() calls reject with an appropriate error message.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f3b9666. Added a test that overwrites fzfWorker.js with a script that calls process.exit(1) on find, then verifies handle.find() rejects with /exited unexpectedly/. This exercises the worker.on('exit')failAll() path.

maxFiles: MAX_CRAWL_FILES,
});
this.buildResultCache();
await this.buildResultCache();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] RecursiveFileSearch.dispose() is a new public method but fileSearch.test.ts has no tests for it. Callers (useAtCompletion.ts, FileMessageHandler.ts) rely on dispose() for worker thread cleanup — a regression here would silently leak workers.

Consider adding a test:

it('should release fzf handle on dispose', async () => {
  const fileSearch = FileSearchFactory.create({ ... });
  await fileSearch.initialize();
  await expect(fileSearch.dispose()).resolves.toBeUndefined();
  // Idempotent
  await expect(fileSearch.dispose()).resolves.toBeUndefined();
});

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f3b9666. Added two tests in fileSearch.test.ts: one verifying RecursiveFileSearch.dispose() resolves and is idempotent, another confirming DirectoryFileSearch has no dispose method (expected — it holds no worker resources).

callmeYe and others added 3 commits June 8, 2026 21:14
`new AsyncFzf(allFiles, ...)` is misleadingly named — its constructor is
synchronous and dominates the @-picker's main-thread cost on workspaces
with >20k files (where the v2→v1 algorithm switch kicks in). It's the
remaining freeze source after PR QwenLM#3214 moved the crawl onto async spawn-
based git ls-files / ripgrep / fdir.

This change hosts the AsyncFzf instance in a worker_threads worker so the
construction happens off the main thread; only completed find() results
cross the message channel. Below ~5k files the in-thread path is kept
because worker spawn + IPC overhead exceeds the construction cost there.

Notes:
- `FzfWorkerHandle.create()` awaits the worker `ready` event before
  returning, so init errors surface at create() time rather than as a
  surprise on the first find().
- Worker is `unref()`-ed so a mid-find @-picker doesn't pin process exit.
- `FileSearch` interface gains an optional `dispose?(): Promise<void>`
  to release worker resources. `FileMessageHandler.clearFileSearchCache`
  calls it so the long-running VS Code extension host doesn't accumulate
  workers across watcher-triggered evictions.
- `installInProcessFzfTransport()` is provided for tests / sandboxed
  environments that need to skip worker spawn entirely.
- Bundle gains a second esbuild build that emits `dist/fzfWorker.js`
  alongside `dist/cli.js`; `prepare-package.js` ships it in the tarball.
- In `npm run dev` (tsx) the bundled worker file isn't on disk, so the
  handle silently falls back to in-thread — acceptable for a dev script.

Closes the door on QwenLM#3455's two open `FileIndexService` review items by
design: there is no `whenReady()` (handle awaits ready inside `create()`)
and there is no in-worker crawl (crawl stays on main, only fzf moves).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix worker thread leak in useAtCompletion: dispose previous FileSearch
  instance on re-init and component cleanup
- Fix worker leak on init failure: wrap inst.ready() in try/catch with
  timeout, dispose on failure
- Fix find()-after-dispose hang: reject immediately when worker is disposed
- Add limit field to FindMessage protocol to cap IPC payload before
  postMessage serialization
- Add real worker_threads test coverage (spawn/find/dispose lifecycle,
  concurrent finds, init failure, post-dispose rejection)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix useAtCompletion cleanup: move dispose() to [cwd,config] effect so
  it only fires on workspace change, not every keystroke/status transition
- Clear setTimeout in Promise.race on happy path to prevent timer leak
- Dispose FileSearch instances in FileMessageHandler on panel close
- Fix incorrect comment in prepare-package.js (resolveBundleDir, not URL)
- Add worker crash test: verify failAll rejects pending find() on exit
- Add RecursiveFileSearch.dispose() test in fileSearch.test.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@callmeYe callmeYe force-pushed the perf/fzf-worker-async branch from f3b9666 to 6be81d3 Compare June 8, 2026 13:15
const initialize = async () => {
try {
// Dispose previous instance to prevent worker thread leaks on
// re-initialization (cwd/config change triggers RESET → re-init).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The async initialize() is fire-and-forget with no cancellation guard. If the component unmounts or cwd/config changes while await searcher.initialize() is in flight (worker spawn + ready handshake takes several ms), the cleanup effect runs fileSearch.current?.dispose?.() — but fileSearch.current is still null from line 164, so nothing is disposed. When the pending initialize() resolves, fileSearch.current = searcher writes a live worker into the ref with no owner to dispose it.

In React 18 Strict Mode (dev), the double-invocation makes this deterministic: invocation-1's async init completes after cleanup already ran, leaking a worker thread.

Consider adding a cancelled flag or AbortController that the effect cleanup signals:

const initialize = async (signal: AbortSignal) => {
  try {
    await fileSearch.current?.dispose?.();
    fileSearch.current = null;
    const searcher = FileSearchFactory.create({ ... });
    await searcher.initialize();
    if (signal.aborted) { await searcher.dispose?.(); return; }
    fileSearch.current = searcher;
    // ...
  } catch (err) { /* ... */ }
};

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0dae073. Added a cancelled flag inside the effect closure — set to true in cleanup. After await searcher.initialize(), if cancelled is true the searcher is disposed immediately instead of being written to fileSearch.current. This covers both the unmount-during-init and the Strict Mode double-invocation scenarios.


describe('worker transport (real worker_threads)', () => {
let workerExists = false;
const outfile = path.resolve(import.meta.dirname, 'fzfWorker.js');

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] beforeAll builds fzfWorker.js into the source tree (packages/core/src/utils/filesearch/fzfWorker.js), but .gitignore has no entry for it. If the test process crashes (OOM, Ctrl-C, segfault) before afterAll runs fs.unlinkSync(outfile), the compiled artifact persists and could be accidentally committed.

A stale fzfWorker.js in the source tree also causes getWorkerScriptPath() to return non-null in tsx/dev mode, making the dev path unexpectedly use the worker transport with a potentially outdated build.

Add packages/core/src/utils/filesearch/fzfWorker.js to .gitignore, or write the test artifact to os.tmpdir() instead.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0dae073. Added packages/core/src/utils/filesearch/fzfWorker.js to .gitignore so a crash-orphaned test artifact won't be accidentally committed or cause dev-mode to use a stale worker build.

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary: perf(filesearch): move AsyncFzf index construction to a worker thread

Verdict: COMMENT (1 finding)

This is a well-engineered PR. The transport abstraction (in-process vs worker_threads) with threshold-based selection is clean, the IPC protocol is simple and correct, and disposal is handled properly at all integration points (useAtCompletion, FileMessageHandler, RecursiveFileSearch). The test coverage is thorough — covering both transports, concurrent find(), crash recovery, and dispose idempotency.

Findings: 1 medium-severity

MEDIUM — Concurrent initialize() can leak worker threads (useAtCompletion.ts:164)

When cwd/config changes rapidly, overlapping initialize() calls can leak a worker thread. The await fileSearch.current?.dispose?.() only disposes the assigned reference. If two inits overlap, the first init's in-flight searcher (with its worker) is never disposed when the second init overwrites fileSearch.current.

This race pre-dates the PR, but the cost is materially higher with worker_threads (~1–4 MB stack + V8 isolate per leaked thread vs bounded heap for in-process AsyncFzf). A generation counter or AbortController in the initialize effect would close the gap.

What looks good

  • worker.unref() prevents the worker from blocking process exit
  • Init timeout scaled by file count (Math.max(10_000, files.length / 10)) is pragmatic
  • Stripping positions Set before IPC serialization keeps payloads small
  • installInProcessFzfTransport() with mandatory restorer avoids test pollution
  • Dual esbuild builds (main + worker) with Promise.all is clean
  • prepare-package.js whitelisting fzfWorker.js ensures the tarball includes the worker

try {
// Dispose previous instance to prevent worker thread leaks on
// re-initialization (cwd/config change triggers RESET → re-init).
await fileSearch.current?.dispose?.();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MEDIUM — Concurrent initialize() calls can leak worker threads

The await fileSearch.current?.dispose?.() here only disposes the previously assigned reference. When cwd/config changes rapidly, two initialize() invocations can overlap:

  1. Init A starts — disposes old searcher, creates searcherA (spawns worker A), awaits searcherA.initialize().
  2. cwd changes again — Init B starts — fileSearch.current is still null (Init A hasn't set it yet) — dispose is a no-op.
  3. Init A completes — sets fileSearch.current = searcherA.
  4. Init B completes — sets fileSearch.current = searcherBsearcherA and its worker thread are leaked.

This race pre-dates the PR, but each leaked worker_threads worker holds ~1–4 MB stack + a V8 isolate, making the leak materially more expensive than the old in-process AsyncFzf.

Suggested change
await fileSearch.current?.dispose?.();
const initialize = async () => {
const myGeneration = ++initGeneration;
try {
// Dispose previous instance to prevent worker thread leaks on
// re-initialization (cwd/config change triggers RESET → re-init).
await fileSearch.current?.dispose?.();
fileSearch.current = null;
const searcher = FileSearchFactory.create({

A module-level let initGeneration = 0; plus a stale-check after await searcher.initialize() would cancel superseded initializations:

await searcher.initialize();
if (myGeneration !== initGeneration) {
  // A newer initialize() already started — dispose ours and bail.
  await searcher.dispose?.();
  return;
}
fileSearch.current = searcher;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0dae073. Added a cancelled flag inside the effect — the cleanup sets it to true, and after await searcher.initialize() the code checks it before assigning to fileSearch.current. If cancelled (because a newer init already started or the component unmounted), the newly-created searcher is disposed immediately. This prevents the overlap scenario you described where Init A's searcher would be orphaned when Init B overwrites the ref.

callmeYe and others added 2 commits June 9, 2026 13:46
The second test case in the formatDateForContext describe block was
missing its closing `});`, causing a TS1005 parse error that broke CI
on all platforms. This was introduced upstream in 6a99ed1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add cancelled flag to initialize() effect so a superseded or
  unmounted init disposes the newly-created searcher instead of
  writing it into fileSearch.current with no owner
- Add fzfWorker.js test artifact to .gitignore to prevent accidental
  commits if the test process crashes before afterAll cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
const failAll = (err: Error) => {
if (readyState === 'pending') {
readyState = 'failed';
readyReject(err);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] failAll() only transitions readyState to 'failed' when the current state is 'pending'. If the worker crashes after a successful init (state 'ready'), readyState remains 'ready' permanently. Subsequent find() calls pass the 'disposed' guard, await readyPromise resolves instantly, and worker.postMessage() throws on a dead worker. Combined with the .catch(() => { return []; }) in fileSearch.ts:178, the @-picker silently returns zero results for every keystroke with no self-healing until cwd/config change.

Consider making failAll unconditionally mark the handle as failed when not disposed:

const failAll = (err: Error) => {
  if (readyState === 'pending') {
    readyReject(err);
  }
  if (readyState !== 'disposed') {
    readyState = 'failed';
  }
  for (const slot of pending.values()) {
    slot.reject(err);
  }
  pending.clear();
};

And guard find() against the 'failed' state:

async find(pattern, limit?) {
  if (readyState === 'disposed' || readyState === 'failed') {
    throw new Error(`fzf worker not available (${readyState})`);
  }
  // ...
}

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 769a236. failAll() now unconditionally sets readyState = 'failed' when the state is not 'disposed', so a post-init worker crash correctly marks the handle as dead. find() now guards against both 'disposed' and 'failed' states, rejecting immediately instead of silently returning empty results via the .catch(() => []) in fileSearch.ts. Added a test that verifies: crash worker on first find(), then assert subsequent find() rejects with /not available.*failed/.

failAll() previously only set readyState to 'failed' when the worker
crashed during init (state 'pending'). A crash after successful init
left the handle in 'ready' state, causing subsequent find() calls to
silently return empty results via the catch handler. Now failAll()
unconditionally marks the handle as 'failed' (unless already disposed),
and find() rejects immediately on a failed handle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary — PR #4621

perf(filesearch): move AsyncFzf index construction to a worker thread

This PR moves the synchronous new AsyncFzf(allFiles, ...) constructor off the main thread into a worker_threads worker, eliminating the Ink render-loop freeze on workspaces with >20k files. The architecture is clean: a 4-message RPC protocol between worker and main thread, with an in-process fallback for small file counts (<5k) where worker spawn overhead would exceed the construction cost.

Verification Results

  • CI: All checks pass (Lint, CodeQL, Tests on macOS/Ubuntu/Windows)
  • Tests: 42/42 filesearch tests pass (27 existing + 2 new dispose tests + 13 FzfWorkerHandle tests including 7 real worker_threads scenarios)
  • Lint: Clean on all 6 changed source files
  • Existing review comments: 14 issues raised by prior reviewers (wenshao, DragonnZhang) across 3 rounds of feedback — all addressed in commits 81a513d, f3b9666, 0dae073, 769a236

Analysis

The prior reviewers were extremely thorough, covering: worker thread lifecycle leaks, init failure handling, IPC payload optimization, timer cleanup, race conditions in concurrent initialize() calls, React effect cleanup semantics, test coverage gaps, and .gitignore hygiene. The author addressed every finding with appropriate fixes.

My independent analysis of the current codebase confirmed:

  • cancelled flag correctly guards against stale async completions in useAtCompletion.ts
  • failAll() now unconditionally marks the handle as 'failed' on post-init crashes
  • FzfWorkerHandle.create() properly disposes on init failure and cleans up the timeout timer
  • RecursiveFileSearch.dispose() is idempotent and safely handles undefined fzf
  • The [cwd, config] effect correctly separates resource disposal from per-keystroke effect cleanup
  • setupFileWatchers() Disposable now iterates and disposes all cached FileSearch instances

No new high-confidence issues found beyond what prior reviewers already identified and the author fixed.

Verdict: APPROVE

@wenshao

wenshao commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Local runtime verification (tmux-driven, bundled CLI)

Ran the PR's actual surface — the bundled CLI's @-picker — under a tmux harness that types one character every ~120 ms after pressing @ and polls capture-pane every 25 ms, measuring per-keystroke echo latency (a frozen Ink loop shows up as multi-hundred-ms echo gaps). Synthetic trees at /tmp; PR built from 769a2365c, baseline built from current main (53349a0b2-era), both via npm ci && npm run build && npm run bundle.

Environment: Apple M5 Pro, macOS 26.5, Node v22.22.2.

Verdict

Mechanically: everything checks out. Worker is spawned exactly when promised, results are correct, the fallback is robust, lifecycle is clean, packaging is right.

Performance premise: not reproducible on this machine. Current main shows no observable freeze at the crawler's 100k-file cap, so there was nothing for the worker to remove here. Details below — this is the part worth discussing before merge.

Scenario matrix (echo latency while typing through init + search)

scenario build tree max echo latency stalls >300ms worker used?
pr-big PR 80k files 46 ms 0 ✅ proven (marker, +1 thread)
pr-small PR 300 files 61 ms 0 ❌ in-thread (no marker) — correct cutover
pr-nofile PR, dist/fzfWorker.js deleted 80k 70 ms 0 ❌ silent in-thread fallback, no crash
main-big main 80k 39 ms 0 n/a
pr-gitrepo PR qwen-code repo itself (git crawl path) 112 ms 0 in-thread (~2k files)

"Worker used" was proven by instrumenting dist/fzfWorker.js with a write-marker-file-on-load line: marker appears on the 80k run, absent on the 300-file run — the ≥5k threshold cutover works exactly as designed. All runs rendered the same 24 matches for the same query; visible tie-order varies run-to-run with crawl enumeration order (PR-in-thread output matched main exactly), so no ranking regression.

Edge probes (all on PR build, 80k tree)

  • 🔍 garbage pattern @zzqqxxjjvv → graceful empty suggestion state, process alive
  • 🔍 backspace garbage, retype worker → results return (repeated IPC find OK)
  • 🔍 Esc → picker closes, input text retained
  • 🔍 /quit with worker proven alive (load marker present in that session) → process exited in 0.15 sunref() claim holds, no exit pinning
  • 🔍 dist/package.json after prepare-package.jsfiles includes "fzfWorker.js"
  • 🔍 worker bundle is self-contained (28 KB, fzf inlined, loads node:worker_threads only)

Final pane, PR build, 80k tree (typing stayed responsive throughout):

* @workerhandle ​
──────────────────────────────────────────────
  charlie2/lima11/worker_handler_70.md
  charlie2/lima11/worker_handler_40.md
  ...
  (1/24)

⚠️ The performance premise doesn't hold up locally

The PR's stated root cause is "new AsyncFzf(...) constructor runs sync and dominates main-thread cost on big trees, especially after the v2→v1 switch at 20k files". Two problems with that, from measurement and from fzf's source:

1. Constructor cost at the crawler cap is tens of milliseconds. crawl() caps input at MAX_CRAWL_FILES = 100_000, so this bounds the worst case. Measured with the repo's own bundled fzf (averaged, this machine):

input ctor time
80k × 35-char paths 26 ms
100k × 35-char 28 ms
100k × 120-char 61 ms
100k × node_modules-style deep paths 57 ms
100k × CJK filenames 50 ms

The PR cites "~30 ms at ~5k files" from fzf-bench, which would extrapolate to ~600 ms at 100k — my measurements come in ~20× under that. Even granting several-fold slower hardware, the ctor stays sub-frame-budget-×-a-few at the cap.

2. The v1/v2 switch can't affect ctor cost, and find() is already non-blocking by design. In fzf's BaseFinder the constructor work (strToRunes over all items) is identical for v1/v2 — the option only selects algoFn. And AsyncFzf.find() runs through asyncMatcher, which processes 1000 items per chunk with setImmediate yields between chunks — at 80k files that's ~1.6 ms of main-thread work per chunk. That's why main measures flat ≤39 ms echo latency: the "Async" in AsyncFzf is doing its job. (Total find() cost is real — 86–184 ms per keystroke at 15k–80k files in isolation — but it's sliced, not a freeze.)

The genuinely sync freeze (the original issue) was the fdir crawl, which #3214 already fixed. After #3214, on this hardware, I could not produce any input stall on main with the biggest tree the crawler will accept.

Other findings

  • environmentContext.test.ts in this PR is now byte-identical to current main (main fixed the same missing-braces issue independently) — that commit becomes a no-op on merge, no conflict.
  • PR body says "7 files, +650/−74"; the branch is actually 11 files, +958/−76. Stale description, worth refreshing.
  • Not exercised here: the vscode-ide-companion dispose-on-evict path (needs a VS Code harness), worker-crash-mid-find at the surface (covered by the 5 new unit tests), Windows, and genuinely slow hardware.

Suggestion for the merge decision

The implementation quality is good — fallback, threshold, dispose, unref, packaging all verified working at the real surface. What's missing is evidence the disease exists post-#3214: @callmeYe, can you reproduce a measurable input freeze on current main (exact machine, OS, Node version, file count, and how it was measured)? If yes, this merges on solid ground. If the freeze only reproduces on the pre-#3214 build the original issue was filed against, then this PR buys a second build artifact + worker lifecycle complexity for headroom rather than a fix, and that's a different (still discussable) trade.

@yiliang114 yiliang114 merged commit be31e84 into QwenLM:main Jun 12, 2026
33 checks passed
@yiliang114

Copy link
Copy Markdown
Collaborator

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants