Skip to content

feat(core): replace fdir crawler with git ls-files + ripgrep fallback#3214

Merged
wenshao merged 47 commits into
QwenLM:mainfrom
scrollDynasty:feat/git-ls-files-file-search
May 12, 2026
Merged

feat(core): replace fdir crawler with git ls-files + ripgrep fallback#3214
wenshao merged 47 commits into
QwenLM:mainfrom
scrollDynasty:feat/git-ls-files-file-search

Conversation

@scrollDynasty

Copy link
Copy Markdown
Contributor

Summary

Replaces the fdir-based filesystem crawler with a two-tier strategy for the @ file mention autocomplete feature.

Closes #3137

Problem

The current fdir crawler re-scans the entire directory tree on every keystroke, which is slow on large repos and doesn't respect .gitignore natively.

Solution

Three-level fallback strategy:

  1. Primary: git ls-files — for git repos. Fast, .gitignore-aware, reads directly from git index
  2. Fallback 1: ripgrep --files — for non-git directories. 20s timeout, 20MB buffer cap, respects .gitignore automatically
  3. Fallback 2: fdir — original behavior preserved as last resort

Supporting improvements:

  • Mtime-based change detection — watches .git/index mtime, skips re-crawl when nothing changed
  • 5s refresh throttling — prevents rebuilding index on every keystroke
  • Async chunked indexing — yields to event loop every 1000 entries, keeps UI responsive with 200k+ files
  • Background untracked file merge — untracked files fetched async with 10s timeout
  • Windows path normalization — POSIX-style paths throughout
  • Bug fix — handle empty string from path.posix.relative when crawlDirectory equals gitRoot

Testing

  • 76/76 filesearch tests pass
  • 4 new tests added: git repo path, non-git fallback, throttling, mtime detection
  • npm run preflight

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Great idea — git ls-files is a big perf win over fdir, and the three-level fallback with execFile (no shell injection risk) is well-designed.

Issues

  1. maxDepth silently ignored on git and rg paths. Neither crawlWithGitLsFiles nor crawlWithRipgrep applies depth filtering. Callers using crawl({ maxDepth: 0 }) for one-level autocomplete will get full recursive results — a behavioral regression.

  2. Subprocess ignore rules override caller options. git ls-files --others --exclude-standard drops .gitignore-ignored files, and rg --files applies its own filtering, regardless of the caller's useGitignore setting. The fdir path respects it; the new paths don't.

  3. Throttle broken for non-git paths. crawlWithRipgrep calls recordRebuild but never updateChangeState, so the 5s throttle has no effect outside git repos. Same issue for the fdir fallback.

  4. Tests don't isolate strategies. The test suite runs inside a git repo, so findGitRoot(tmpDir) always resolves to qwen-code's .git and the git path is always taken — including the "fdir fallback" test. Consider git init in temp dirs for git tests, and mocking findGitRoot to return null for non-git tests.

Verdict

COMMENT — The core approach is solid but these gaps need to be addressed before merge. Items 1-3 are behavioral regressions; item 4 means the new code paths have no real test coverage.

@scrollDynasty

scrollDynasty commented Apr 15, 2026

Copy link
Copy Markdown
Contributor Author

Everything seems to be fixed @tanzhenxin if you can point out any errors or shortcomings, I will try to correct them.

@tanzhenxin

Copy link
Copy Markdown
Collaborator

Thanks for the thorough rework — maxDepth, useGitignore, and the test coverage all look good now.

One scenario I'd like you to verify before merging: the non-git hasFileListChanged branch returns false unconditionally when state exists, so non-git directories end up cached for the entire process lifetime — new files added after the first crawl won't show in @ autocomplete until restart.

Trace:

  • Call 1 (T=0): no state → crawl, populate state
  • Call 2 (T=2s, within throttle): isThrottled=true → return cache ✓
  • Call 3 (T=6s, past throttle): isThrottled=false, hasFileListChanged=false → return cache
  • Call 4 (T=5min): same → return cache

The new should throttle re-crawl on non-git fallback paths test only covers the within-window case, so this slipped through. Could you add a test that advances time past the throttle window (and creates a new file) to confirm the intended behavior? If non-git paths really are meant to cache forever, that should be documented; otherwise something like return !isThrottled(stateKey) in that branch would force a fresh crawl once the throttle window expires.

Not blocking — most users are in git repos and the git path is correct — but worth closing the loop before merge.

@scrollDynasty

Copy link
Copy Markdown
Contributor Author

Thanks — I verified this and updated the non-git fallback so it re-crawls once the throttle window expires instead of caching forever. I also added a regression test that advances time past the 5s window, creates a new file, and confirms the fresh crawl picks it @tanzhenxin

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Good progress — the non-git re-crawl, useGitignore, and test isolation concerns from the previous review are all addressed. One remaining issue:

maxDepth computed relative to cwd, not crawlDirectory

applyMaxDepthLimit filters on cwd-relative paths (already prefixed with relativeToCrawlDir). When crawlDirectory is a subdirectory of cwd, the prefix inflates the depth count.

Example: crawl({ crawlDirectory: '/project/src', cwd: '/project', maxDepth: 0 })

  • Git path produces src/file.ts (cwd-relative)
  • getEntryDepth('src/file.ts') = 1 → filtered out
  • But file.ts is depth 0 relative to the crawl root

The fdir fallback handles this correctly because it applies maxDepth natively before prefixing. fileSearch.ts calls exactly this pattern (maxDepth: 0 with a subdirectory), so this breaks file completion.

Suggestion: strip the relativeToCrawlDir prefix before depth filtering (and re-add after), or subtract the prefix depth from the limit.

@scrollDynasty

scrollDynasty commented Apr 16, 2026

Copy link
Copy Markdown
Contributor Author

I fixed the issue by making maxDepth relative to crawlDirectory instead of cwd when filtering git and ripgrep results. This now matches the fdir behavior for subdirectory crawls, so maxDepth: 0 works correctly for file completion. I also added a regression test for the nested crawlDirectory case, and crawler.test.ts plus fileSearch.test.ts are passing locally. @tanzhenxin

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The critical maxDepth bug from the previous review is fixed — stripCrawlDirectoryPrefix correctly computes depth relative to the crawl directory, and the new test validates subdirectory crawls with maxDepth: 0. The fallback chain is well-structured and the test infrastructure is solid.

One thing worth addressing: git ls-files --others (line 313) runs without --exclude-standard, so gitignored directories like node_modules/ get fully enumerated into the 20MB buffer before applyFilters removes them. The old fdir path skipped ignored directories during traversal, so this is a performance regression for repos with large gitignored trees. The fix is one line — add --exclude-standard to the --others args.

Verdict

COMMENT — No blockers remaining. The --exclude-standard omission is the most actionable suggestion; everything else is minor.

@scrollDynasty

Copy link
Copy Markdown
Contributor Author

Fixed: Added --exclude-standard to git ls-files --others . Now gitignored and untracked files are not listed, eliminating a performance regression. @tanzhenxin

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the --exclude-standard fix in 8156924 — clean addition with the test updated to document the new semantics on the git path. Approving.

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retracting my earlier approval — the CI is failing on macOS and Windows. Both failing tests are the ones added for the fixes I approved:

  1. should treat maxDepth as relative to the crawl directory — fails with expected [ 'level1', …(2) ] to deeply equal ArrayContaining{…}, meaning the maxDepth fix from stripCrawlDirectoryPrefix / applyMaxDepthLimit doesn't behave the same way outside Linux.

  2. should avoid enumerating gitignored untracked files on git path — fails with expected [ '.', '.gitignore', 'keep.log', …(1) ] to not include 'keep.log', meaning --exclude-standard isn't actually filtering out the gitignored file in the macOS/Windows test fixtures.

Linux passes, macOS 20.x/22.x/24.x and Windows 20.x/22.x/24.x all fail. Likely candidates:

  • Path separator handling (/ vs \ on Windows) in stripCrawlDirectoryPrefix or in how depths are computed from the relative path
  • The git fixture setup — maybe the repo init isn't complete on macOS/Windows before git ls-files --others --exclude-standard is invoked, so git treats files as untracked-not-ignored

Please reproduce locally on macOS/Windows (or in CI logs) and address before merging.

@scrollDynasty

Copy link
Copy Markdown
Contributor Author

Okay, I'll fix it right now. @tanzhenxin

Comment thread packages/core/src/utils/filesearch/crawler.ts
Comment thread packages/core/src/utils/filesearch/crawler.ts
Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

@scrollDynasty scrollDynasty left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update

@scrollDynasty scrollDynasty requested a review from wenshao April 18, 2026 06:39
@scrollDynasty

Copy link
Copy Markdown
Contributor Author

Fixed the path separator issue on Windows/macOS — stripCrawlDirectoryPrefix now uses path.posix normalization throughout. @tanzhenxin @wenshao could you take another look?

@wenshao

wenshao commented May 10, 2026

Copy link
Copy Markdown
Collaborator

@scrollDynasty d90feaf9d is red on the same 3 tests (third attempt with the same failure pattern):

Run: https://github.com/QwenLM/qwen-code/actions/runs/25635817286

FAIL  src/utils/filesearch/crawler.test.ts > crawler > with maxDepth > should treat maxDepth as relative to the crawl directory
  expected [ 'level1', …(2) ] to deeply equal ArrayContaining{…}
FAIL  src/utils/filesearch/crawler.test.ts > crawler > two-tier strategy: git ls-files + ripgrep fallback > should resolve the git root from a subdirectory crawl
  expected [ 'src', 'src/file2.js' ] to include 'src/'
FAIL  src/utils/filesearch/crawler.test.ts > crawler > two-tier strategy: git ls-files + ripgrep fallback > should not drop files after directory expansion when maxFiles is small
  expected [ '.' ] to include 'nested/deep.txt'

I just re-ran cd packages/core && npx vitest run src/utils/filesearch/crawler.test.ts against d90feaf9d locally → 43/43 ✅ on macOS host. Three rewrites of posixPathUnderGitRoot haven't shifted the CI failure shape, which is a strong signal the root cause is not in that function.

Look at the expected [ 'src', 'src/file2.js' ] to include 'src/' line: the 'src/file2.js' file row is correct, the '.' synthetic row is missing entirely, and the 'src' row in actual results is suspicious — buildResultsFromFileSet only ever emits 'src/' (with trailing slash) for directory rows. Where does the unsuffixed 'src' come from?

Two strong candidates that would behave differently on CI than on your dev box:

  1. git ls-files -z --cached -t output format. -t is documented as deprecated/ad-hoc, and combining it with -z is not well-specified across git versions. On CI the line buffer may yield a record whose first record after a \0 is just 'src' (the tag column?) instead of 'H src/file2.js'. The new recordDelimiter: '\0' code path in runCommand (commit d90feaf9d) splits on \0 only — but -t interleaves a leading status char + space before the path, so the per-record content shape can drift from your assumption.
  2. CI git version differences. macOS / Ubuntu / Windows runners ship different git versions than your local. git --version on the runner vs your laptop might be enough to swing ls-files -t -z output. (Lint and CodeQL pass on the same runners, so the box is fine — only the git CLI behavior differs.)

Concrete debug step

Temporarily add a console.error dump inside processTrackedFile:

console.error('[debug-3214] processTrackedFile', JSON.stringify({
  rawLine: file,
  parsed,
  normalizedFile,
  relativeToGitRoot,
  relativeToCrawlDir,
  fullPath,
}));

…push, re-run CI, the failed-job log will tell you exactly what git ls-files is feeding the parser on CI. (Revert before merge, obviously.)

Alternatively: drop -z for the -t path and keep newline framing — -t paths can't contain newlines anyway because they're tracked-files-only and git rejects newline filenames in the index. Then -z is reserved for --others/--deleted where actual unusual filenames matter.

I'll re-review the moment CI flips green.

- Enhanced the logic for handling git commands in the `crawlWithGitLsFiles` function by removing the NUL record delimiter for better compatibility with various Git versions.
- Updated the `equalsRepoRelativeCaseAware` function to utilize a new helper for case-folding path comparisons, improving consistency across platforms.
- Refactored the return logic in the `crawl` function to ensure proper handling of relative paths, enhancing the overall file crawling process.
@scrollDynasty

Copy link
Copy Markdown
Contributor Author

@wenshao fixed

@scrollDynasty scrollDynasty requested a review from wenshao May 10, 2026 18:31
@wenshao

wenshao commented May 10, 2026

Copy link
Copy Markdown
Collaborator

@scrollDynasty a12db81a2 made it worse — the -z removal fixed the core git ls-files shape ✅, but introducing buildResultsFromFileSet(new Set(cwdRelative)) into crawlWithFdir regressed 6 brand-new CLI tests in src/ui/hooks/useAtCompletion.test.ts plus several more in crawler.test.ts:

Run: https://github.com/QwenLM/qwen-code/actions/runs/25636266712

All new failures share the same shape — duplicated directory entries:

expected [ 'src/', 'src/', …(6) ]    to deeply equal [ 'src/', 'src/components/', …(4) ]
expected [ 'dir/', 'dir/', 'file.txt' ] to deeply equal [ 'dir/', 'file.txt' ]
expected [ 'src/', 'src/', '.gitignore' ] to deeply equal [ 'src/', '.gitignore' ]
expected [ 'src/', 'src/', 'src/main.js' ] to deeply equal [ 'src/', 'src/main.js' ]
expected [ 'build/', 'build/', …(7) ]  to deeply equal [ 'build/', 'build/public/', …(4) ]

Root cause

In the new crawlWithFdir tail:

const cwdRelative = results               // fdir already emits 'src/' (withDirs() + trailing slash)
  .filter((p) => p.length > 0 && p !== '.')
  .map((p) => path.posix.join(relativeToCrawlDir, p));
return buildResultsFromFileSet(new Set(cwdRelative));

fdir().withDirs() already returns directory rows with trailing slashes ('src/'). buildResultsFromFileSet also derives synthetic directory rows from every file path's parent chain. So 'src/' ends up in both dirSet (derived) and files (passed through from fdir), and the final [..., ...dirSet, ...files] concatenation emits it twice. new Set(cwdRelative) only dedupes the input list, not the output union.

Fix

Either:

  • Don't pass directory rows into buildResultsFromFileSet — filter cwdRelative to files only before constructing the set, and let buildResultsFromFileSet reconstruct the directory parents:
    const fileOnly = results
      .filter((p) => p.length > 0 && p !== '.' && !p.endsWith('/'))
      .map((p) => path.posix.join(relativeToCrawlDir, p));
    return buildResultsFromFileSet(new Set(fileOnly));
  • Or, skip buildResultsFromFileSet for the fdir path entirely — fdir's native withDirs() output is already in the expected shape; the previous return results.map(...) was correct, the regression came from over-normalizing. The git/rg paths need buildResultsFromFileSet because git only emits files; fdir does not.

Core crawler.test.ts status

Three of the original failures are gone (good — -z removal fixed them). One remaining:

FAIL src/utils/filesearch/crawler.test.ts > should not drop files after directory expansion when maxFiles is small
  expected [ '.' ] to include 'nested/deep.txt'

Plus several new dup-dir failures in crawler.test.ts from the same fdir issue above.

This iteration is whack-a-mole because the posixPathUnderGitRoot rewrites all touch path normalization rules that the rest of the pipeline (buildResultsFromFileSet, applyFilters, withDirs) implicitly depends on. Suggest stepping back: revert crawlWithFdir to its pre-56c16d12a form (it was passing), then re-attack the original 3 core failures with focused, smaller commits — ideally with one new test case per fix that pins the exact behavior.

- Modified tests to expect two files instead of one in the results of the `crawl` function, ensuring accuracy in file counting.
- Enhanced the validation of the returned file array to include specific expected file names, improving test reliability.
- Adjusted the logic in the `crawlWithFdir` function to ensure proper handling of directory entries and file paths during crawling.
@scrollDynasty

Copy link
Copy Markdown
Contributor Author

@wenshao fixed

options.ignore.getFingerprint(),
options.maxDepth,
options.maxFiles,
function normalizeGitRefPath(p: string): string {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] normalizeGitRefPath and normalizeGitLsOutputLine are nearly identical — both do toPosixPath + strip ./ prefix. The only difference is that normalizeGitRefPath additionally strips trailing slashes. Consider merging into a single function with an optional stripTrailingSlash parameter to reduce maintenance burden.

Suggested change
function normalizeGitRefPath(p: string): string {
function normalizeGitPath(p: string, stripTrailingSlash = false): string {
let s = toPosixPath(p);
if (s.startsWith('./')) s = s.slice(2);
return stripTrailingSlash ? s.replace(/\/+$/, '') : s;
}

— deepseek-v4-pro via Qwen Code /review

}

const before = fileSet.size;
fileSet.add(fullPath);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] processTrackedFile unconditionally calls fileSet.add(fullPath) for every tracked file, even when the file exceeds maxDepth. The depth check at shouldCountTowardBudget only controls whether the file counts against maxFiles — it does not prevent the file from being added to fileSet. When maxDepth is restrictive but maxFiles is unset (or generous), fileSet grows to O(all files) and most are later discarded by applyFilters. Consider early-skipping fileSet.add for out-of-depth files, or documenting that callers must set maxFiles when using maxDepth.

— deepseek-v4-pro via Qwen Code /review

console.warn('[crawler] falling back to fdir (ripgrep unavailable)');
}

let fdirResults = await crawlWithFdir(options);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The fdir fallback path in crawl() does not call applyFilters() on fdirResults before applyMaxFilesLimit. The git/rg paths run applyFilters via buildResultsFromFileSet (which does depth + ignore filtering). The fdir path relies solely on its .exclude/.filter callbacks, with no secondary filtering pass. If a future change introduces a gap in those callbacks, there is no safety net. Consider running applyFilters on fdir results for consistency.

Suggested change
let fdirResults = await crawlWithFdir(options);
fdirResults = applyFilters(fdirResults, options, relativeToCrawlDir);
const limitedResults = applyMaxFilesLimit(fdirResults, options.maxFiles);

— deepseek-v4-pro via Qwen Code /review

return results;
}

if (gitResult.gitRepoListingFailed) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The fdir fallback warning ([crawler] falling back to fdir) is gated on gitResult.gitRepoListingFailed. In non-git directories this flag is false, so the fdir degradation is silent — even though ripgrep failed and the user is now on the slowest path. The git path correctly shows both degradation warnings. Consider tracking whether ripgrep failed independently so the fdir warning also fires in non-git directories.

— deepseek-v4-pro via Qwen Code /review

…zation

- Updated tests to expect specific warning messages when falling back to fdir due to ripgrep unavailability, ensuring accurate feedback during file crawling.
- Refactored the `normalizeGitRefPath` function to `normalizeGitPath`, adding an option to strip trailing slashes for improved path handling.
- Enhanced comments in the `applyFilters` function to clarify its purpose and behavior regarding depth limits and ignore rules.
@scrollDynasty

Copy link
Copy Markdown
Contributor Author

@wenshao fixed

@wenshao

wenshao commented May 11, 2026

Copy link
Copy Markdown
Collaborator

CI failure analysis (macOS + Windows)

Both failing jobs (Test (macos-latest, Node 22.x), Test (windows-latest, Node 22.x)) fail on the same two new tests in packages/core/src/utils/filesearch/crawler.test.ts. Ubuntu passes.

Failing assertions

  1. crawler > with maxDepth > should treat maxDepth as relative to the crawl directory (test ~line 707)
    • Expected to contain: 'level1/'
    • Actual results: ['.', 'level1/file-level1.txt', 'level1/level2/']
  2. crawler > two-tier strategy: git ls-files + ripgrep fallback > should resolve the git root from a subdirectory crawl (test ~line 872)
    • Expected to contain: 'src/'
    • Actual results: ['.', 'src/file2.js']

In both cases the synthetic crawl-root directory entry is missing.

Root cause

The two tests use crawlDirectory = path.join(tmpDir, 'src' | 'level1') against a real initGitRepo(tmpDir). On macOS, os.tmpdir() returns the symlinked form (/var/folders/...) while git rev-parse --show-toplevel resolves the symlink and returns the canonical form (/private/var/folders/...). Concretely on a macOS runner:

tmp           : /var/folders/.../crawl-XXXX
gitRoot       : /private/var/folders/.../crawl-XXXX   ← canonicalized by git
crawlDir      : /var/folders/.../crawl-XXXX/src
path.relative : "../../../../../../../var/folders/.../crawl-XXXX/src"   ← garbage pathspec

crawlWithGitLsFiles then computes relativeToGitRoot = getPosixRelative(gitRoot, crawlDirectory) and passes that garbage as a git ls-files pathspec. git matches nothing, the git tier reports zero tracked files, the ripgrep fallback is unavailable (not preinstalled on macOS/Windows GitHub runners by default), and execution lands in the fdir branch.

The fdir branch crawls from inside crawlDirectory, so it never emits the crawl-root directory row itself. Its output for the two tests matches the failing CI results byte for byte:

  • Test 1 — fdir from tmpDir/level1 → maps via relativeToCrawlDir='level1' → applies maxDepth=0 filter → prepends .['.', 'level1/file-level1.txt', 'level1/level2/'] ✅ matches failure.
  • Test 2 — fdir from tmpDir/src → maps via relativeToCrawlDir='src' → prepends .['.', 'src/file2.js'] ✅ matches failure.

Windows has the analogous problem (short-name / drive-letter path normalization between git rev-parse output and Node's idea of crawlDirectory).

Suggested fix directions

  1. Canonicalize crawlDirectory before computing relativeToGitRoot — call fs.realpathSync (or async equivalent) on crawlDirectory in crawlWithGitLsFiles so its base matches the symlink-resolved gitRoot. This is the smallest, most targeted change.
  2. Make tests robust to tmpdir realpathtmpDir = await fs.realpath(tmpDir) right after createTmpDir(...) in the new test cases. Cheap, but it papers over the actual production-path bug surfaced by (1).
  3. Verify the crawl-root row is also emitted by the fdir fallback — if the contract is "every successful crawl includes the crawl-root directory row", the fdir branch needs to be made consistent with the git/rg branches (or have explicit reasoning for the divergence).

I'd recommend (1) + (2) together: (1) fixes the real bug (this would also affect production users on macOS who run the agent against any symlinked path); (2) keeps the new tests deterministic against tmpdir quirks.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Request changes to Comment: CI failing: Test (macos-latest, Node 22.x), Test (windows-latest, Node 22.x).

Found 2 Critical and 14 Suggestion issues after multi-dimensional review (9 agents + verification + 2 reverse-audit rounds). Posting the two Critical findings as inline comments. Full finding list in the review report.

— glm-5.1 via Qwen Code /review

return toPosixPath(p);
}

function getPosixRelative(from: string, to: string): string {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] getPosixRelative symlink regression — symlinked projects return empty file listings

When a user opens a project via a symlinked path (e.g. ~/projects/link~/projects/real), findGitRoot returns the real path via git rev-parse --show-toplevel, but crawlDirectory remains the symlinked path. getPosixRelative computes a ../-prefixed relativeToGitRoot (e.g. ../link), which causes all three git ls-files calls to return zero matching entries (the pathspec resolves outside the repo). crawlWithGitLsFiles returns { success: true, files: ['.'] } — no ripgrep/fdir fallback is triggered.

The PR removed fs.realpathSync.native (the old canonicalizePath helper) that previously resolved symlinks before computing relative paths.

Suggested change
function getPosixRelative(from: string, to: string): string {
function getPosixRelative(from: string, to: string): string {
const fromAbs = fs.realpathSync(path.resolve(from));
const toAbs = fs.realpathSync(path.resolve(to));
const relative = path.relative(fromAbs, toAbs);
return toPosixPath(relative) || '.';
}

— glm-5.1 via Qwen Code /review

trackedArgs.push(relativeToGitRoot);
}

const deletedSet = new Set(deletedFiles ?? []);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Partial git listing failures return { success: true } — deleted files shown as present, new files silently omitted

When listDeletedTrackedFiles returns null (command failure), deletedSet is empty, so files deleted from the working tree are NOT filtered from results. When listUntrackedFiles returns null, newly created files are silently skipped. Both cases return { success: true }, preventing ripgrep fallback.

Suggested change
const deletedSet = new Set(deletedFiles ?? []);
if (untrackedFiles === null || deletedFiles === null) {
return { success: false, files: [], gitRepoListingFailed: true };
}
const deletedSet = new Set(deletedFiles);

— glm-5.1 via Qwen Code /review

clearTimeout(timeout);
}

if (timedOut) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] runCommand close handler checks timedOut before killedByLimit — race loses valid partial data

When a budget-limited kill occurs close to the timeout deadline, both flags can be set. Since timedOut is checked first, valid partial data collected before the budget limit is discarded as a timeout failure. The budget-limited kill is a deliberate successful operation — it should take priority.

Suggested change
if (timedOut) {
if (killedByLimit) {
finalize(true, lines);
return;
}
if (timedOut) {

— glm-5.1 via Qwen Code /review


let workingTreePrefetch: GitWorkingTreePrefetch | undefined;

if (!options.cache) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Change detection code is dead code in production

The hasFileListChanged/scanWorkingTreeForChange logic (~60 lines of mtime + fingerprint change detection) is gated behind if (!options.cache), but all production callers (RecursiveFileSearch, DirectoryFileSearch) pass cache: true. This sophisticated change-detection subsystem is never exercised in production — only the TTL-based crawlCache is used. Either document this explicitly or unify the two caching paths.

— glm-5.1 via Qwen Code /review

let stderrBuf = '';
let child;
try {
child = spawn(command, args, {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] spawn() inherits the parent process environment without sanitization. If the parent process has GIT_DIR, GIT_WORK_TREE, or GIT_INDEX_FILE set (e.g., from a git hook, CI wrapper, or IDE integration), all git child processes silently operate on the wrong repository — autocomplete shows files from an unrelated project with no error indication.

Suggested change
child = spawn(command, args, {
const child = spawn(command, args, {
cwd,
windowsHide: true,
stdio: ['ignore', 'pipe', 'pipe'],
env: {
...process.env,
GIT_DIR: undefined,
GIT_WORK_TREE: undefined,
GIT_INDEX_FILE: undefined,
},
});

— DeepSeek/deepseek-v4-pro via Qwen Code /review

}

interface RunCommandOptions {
maxLines?: number;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The maxLines parameter in RunCommandOptions and all associated logic (early-return on maxLines <= 0, lines.length >= maxLines check, killedByLimit flag) are dead code — every production caller uses collectLines: false and never sets maxLines.

— DeepSeek/deepseek-v4-pro via Qwen Code /review

if (!state) return true;

if (currentMtime !== null && state.gitRootMtimeMs !== null) {
return currentMtime > state.gitRootMtimeMs || !isThrottled(stateKey);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] When currentMtime > state.gitRootMtimeMs, the throttle (5s) is bypassed and a full re-crawl always fires. But git index mtime changes on most git operations (git status, git add, branch switches), even when no tracked files changed. A git status in another terminal triggers an unnecessary triple-subprocess re-crawl on the next autocomplete keystroke.

Suggested change
return currentMtime > state.gitRootMtimeMs || !isThrottled(stateKey);
if (currentMtime > state.gitRootMtimeMs) {
// Index touched — verify tracked file list actually changed
const change = scanWorkingTreeForChange(stateKey, crawlDirectory);
return change.changed || !isThrottled(stateKey);
}

— DeepSeek/deepseek-v4-pro via Qwen Code /review

};
}

function posixPathUnderGitRoot(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] posixPathUnderGitRoot() recomputes normalizeGitPath(relativeToGitRoot, true) and normalizeGitPath(relativeToCrawlDir, true) on every file in the streaming callback. These two values are loop-invariant per crawl call.

Suggested change
function posixPathUnderGitRoot(
// Hoist normalized paths in crawlWithGitLsFiles before the streaming callback:
const normGitRoot = normalizeGitPath(relativeToGitRoot, true);
const normCrawlDir = normalizeGitPath(relativeToCrawlDir, true);

— DeepSeek/deepseek-v4-pro via Qwen Code /review

- Updated tests to ensure the `maxDepth` is treated correctly relative to the crawl directory by resolving the real path of the temporary directory.
- Refactored the `hasFileListChanged` function to improve logic for determining file list changes based on modification times.
- Introduced a new function, `refreshChangeStateGitMtime`, to update the state with the latest git root modification time, enhancing state management during crawling.
- Improved path normalization in `getPosixRelative` and `posixPathUnderGitRoot` to handle cases where paths may not exist, ensuring better relativity in path calculations.
@scrollDynasty

Copy link
Copy Markdown
Contributor Author

@wenshao fixed

@wenshao

wenshao commented May 12, 2026

Copy link
Copy Markdown
Collaborator

Local verification report

Pulled 22175d674 as a worktree and ran targeted unit tests + a real-world scenario suite. All clean.

Unit tests

  • crawler.test.ts + fileSearch.test.ts + crawlCache.test.ts: 81/81 pass
  • CI on this HEAD: Linux / macOS / Windows / CodeQL all green

Real-world scenarios (9/9)

# Scenario Result
S1 git repo: tracked + untracked included, .gitignore (incl. *.secret, node_modules/) filtered
S2 deleted tracked file is NOT returned within 5s throttle window (the regression flagged in earlier reviews)
S3 untracked file added after initial crawl appears on next call
S4 git worktree (.git as file pointing to gitdir) — both tracked-from-parent and untracked-in-worktree visible
S5 crawlDirectory=src/ maxDepth=0 returns only direct children, no leakage to deeper levels
S6 maxFiles=50 cap honored on git path
S7 non-git directory fallback (rg → fdir) still respects .gitignore
S8 end-to-end FileSearch.search() on this repo: init=44ms, sensible fuzzy matches
S9 10,000-file gitignored dist/ tree fully pruned by --exclude-standard, zero leakage

Performance (qwen-code repo, ~2,900 files)

Metric PR-3214 main
FileSearch.initialize() median 8.9 ms 22.9 ms
Cold crawl() median (cache: false) 21 ms 8.6 ms
Hot path within throttle window p50 5.2 ms 5.8 ms

On a synthetic 30k-tracked-file repo, main is actually faster on cold (35ms vs 103ms) because git subprocess startup overhead dominates fdir's raw walk on already-clean trees. Speed is not the headline win here — the real value is correct .gitignore semantics delegated to git, proper worktree support, and the mtime+throttle fast path on subsequent @ invocations.

Earlier Critical comments — confirmed addressed on 22175d674

  • child.on('error') now captures err: NodeJS.ErrnoException and logs via logCommandProblem (crawler.ts:559)
  • The "dead code" guard loop at crawler.ts:1127 has a comment explaining it's a test-double fallback
  • applyMaxFilesLimit is now applied on git, rg, and fdir paths (crawler.ts:1176 / 1264 / 1455)

LGTM. Merging.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after local verification — see comment above.

@wenshao wenshao merged commit 664208c into QwenLM:main May 12, 2026
8 checks passed
return { changed: true };
}

const changed =

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] scanWorkingTreeForChange() only compares the untracked/deleted fingerprints after the index mtime advances. If a file is staged or otherwise added to the tracked set, it disappears from --others and is not in --deleted, so those fingerprints can stay unchanged and this branch returns changed: false; crawl() then refreshes the stored mtime and returns the old state.fileList. That leaves file search missing newly tracked files (or retaining stale tracked entries) after index-changing operations.

Please treat a newer index mtime as requiring a tracked git ls-files --cached refresh, or store/compare a tracked/index fingerprint in ChangeState; the untracked/deleted fingerprint optimization should not skip refreshing tracked files when the index changed.

— gpt-5.5 via Qwen Code /review

untrackedArgs.push('--exclude-standard');
}
if (relativeToGitRoot && relativeToGitRoot !== '.') {
untrackedArgs.push(relativeToGitRoot);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] The pathspec derived from crawlDirectory is appended directly after git ls-files options without a -- separator. --literal-pathspecs only disables pathspec magic; it does not stop a path beginning with - from being parsed as another ls-files option. A repository subdirectory such as --stage/ can therefore make this command run with unintended options instead of limiting the listing to that path.

Please insert -- before every user/path-derived pathspec in the git commands, e.g. push -- before relativeToGitRoot here and in the deleted/tracked listings.

— gpt-5.5 via Qwen Code /review

): Promise<string[] | null> {
const deletedArgs = ['--literal-pathspecs', 'ls-files', '-z', '--deleted'];
if (relativeToGitRoot && relativeToGitRoot !== '.') {
deletedArgs.push(relativeToGitRoot);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Same pathspec parsing issue as above: this deleted-files listing appends relativeToGitRoot without a -- separator, so a crawl directory whose repo-relative path starts with - can be parsed as a git ls-files option instead of a literal pathspec.

Please add -- before appending the pathspec.

— gpt-5.5 via Qwen Code /review

const trackedArgs = ['--literal-pathspecs', 'ls-files', '--cached'];
trackedArgs.push('-t');
if (relativeToGitRoot && relativeToGitRoot !== '.') {
trackedArgs.push(relativeToGitRoot);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Same pathspec parsing issue as above: the tracked-files command appends relativeToGitRoot without a -- separator. If the requested crawl directory starts with -, Git can treat it as an option and list the wrong scope or emit option-specific output that this parser was not expecting.

Please add -- before appending the pathspec.

— gpt-5.5 via Qwen Code /review

wenshao pushed a commit that referenced this pull request May 17, 2026
…#3214)

Replaces the fdir filesystem crawler with a three-tier strategy for `@` file mention autocomplete.

Strategy:
1. `git ls-files --cached` + `git ls-files --others --exclude-standard` for git repos — reads from the git index, gitignore correctness delegated to git.
2. `ripgrep --files` for non-git directories when `rg` is available.
3. `fdir` as the original last-resort fallback.

Additional improvements:
- mtime-based change detection on `.git/index` skips re-crawl when nothing changed.
- 5s refresh throttling avoids rebuilding the index on every keystroke.
- Async chunked indexing yields to the event loop every 1000 entries to stay responsive on 200k+ file trees.
- Background untracked-file merge with a 10s timeout.
- `resolveGitDir()` correctly handles `.git` as a file (worktrees) including relative `gitdir:` pointers and walks up to parent repos.
- POSIX-style path normalization on Windows.
- Streaming `spawn()` with line-by-line `onLine` processing and an enforced `maxFiles` budget that terminates the child early, preventing the previous 20MB stdout buffer hazard on large repos.
- `--exclude-standard` on the untracked listing so gitignored directories (e.g. `node_modules/`, `dist/`) are not enumerated.

Closes #3137
callmeYe added a commit to callmeYe/qwen-code that referenced this pull request Jun 8, 2026
`new AsyncFzf(allFiles, ...)` is misleadingly named — its constructor is
synchronous and dominates the @-picker's main-thread cost on workspaces
with >20k files (where the v2→v1 algorithm switch kicks in). It's the
remaining freeze source after PR QwenLM#3214 moved the crawl onto async spawn-
based git ls-files / ripgrep / fdir.

This change hosts the AsyncFzf instance in a worker_threads worker so the
construction happens off the main thread; only completed find() results
cross the message channel. Below ~5k files the in-thread path is kept
because worker spawn + IPC overhead exceeds the construction cost there.

Notes:
- `FzfWorkerHandle.create()` awaits the worker `ready` event before
  returning, so init errors surface at create() time rather than as a
  surprise on the first find().
- Worker is `unref()`-ed so a mid-find @-picker doesn't pin process exit.
- `FileSearch` interface gains an optional `dispose?(): Promise<void>`
  to release worker resources. `FileMessageHandler.clearFileSearchCache`
  calls it so the long-running VS Code extension host doesn't accumulate
  workers across watcher-triggered evictions.
- `installInProcessFzfTransport()` is provided for tests / sandboxed
  environments that need to skip worker spawn entirely.
- Bundle gains a second esbuild build that emits `dist/fzfWorker.js`
  alongside `dist/cli.js`; `prepare-package.js` ships it in the tarball.
- In `npm run dev` (tsx) the bundled worker file isn't on disk, so the
  handle silently falls back to in-thread — acceptable for a dev script.

Closes the door on QwenLM#3455's two open `FileIndexService` review items by
design: there is no `whenReady()` (handle awaits ready inside `create()`)
and there is no in-worker crawl (crawl stays on main, only fzf moves).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
callmeYe added a commit to callmeYe/qwen-code that referenced this pull request Jun 8, 2026
`new AsyncFzf(allFiles, ...)` is misleadingly named — its constructor is
synchronous and dominates the @-picker's main-thread cost on workspaces
with >20k files (where the v2→v1 algorithm switch kicks in). It's the
remaining freeze source after PR QwenLM#3214 moved the crawl onto async spawn-
based git ls-files / ripgrep / fdir.

This change hosts the AsyncFzf instance in a worker_threads worker so the
construction happens off the main thread; only completed find() results
cross the message channel. Below ~5k files the in-thread path is kept
because worker spawn + IPC overhead exceeds the construction cost there.

Notes:
- `FzfWorkerHandle.create()` awaits the worker `ready` event before
  returning, so init errors surface at create() time rather than as a
  surprise on the first find().
- Worker is `unref()`-ed so a mid-find @-picker doesn't pin process exit.
- `FileSearch` interface gains an optional `dispose?(): Promise<void>`
  to release worker resources. `FileMessageHandler.clearFileSearchCache`
  calls it so the long-running VS Code extension host doesn't accumulate
  workers across watcher-triggered evictions.
- `installInProcessFzfTransport()` is provided for tests / sandboxed
  environments that need to skip worker spawn entirely.
- Bundle gains a second esbuild build that emits `dist/fzfWorker.js`
  alongside `dist/cli.js`; `prepare-package.js` ships it in the tarball.
- In `npm run dev` (tsx) the bundled worker file isn't on disk, so the
  handle silently falls back to in-thread — acceptable for a dev script.

Closes the door on QwenLM#3455's two open `FileIndexService` review items by
design: there is no `whenReady()` (handle awaits ready inside `create()`)
and there is no in-worker crawl (crawl stays on main, only fzf moves).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yiliang114 pushed a commit that referenced this pull request Jun 12, 2026
…#4621)

* perf(filesearch): move AsyncFzf index construction to a worker thread

`new AsyncFzf(allFiles, ...)` is misleadingly named — its constructor is
synchronous and dominates the @-picker's main-thread cost on workspaces
with >20k files (where the v2→v1 algorithm switch kicks in). It's the
remaining freeze source after PR #3214 moved the crawl onto async spawn-
based git ls-files / ripgrep / fdir.

This change hosts the AsyncFzf instance in a worker_threads worker so the
construction happens off the main thread; only completed find() results
cross the message channel. Below ~5k files the in-thread path is kept
because worker spawn + IPC overhead exceeds the construction cost there.

Notes:
- `FzfWorkerHandle.create()` awaits the worker `ready` event before
  returning, so init errors surface at create() time rather than as a
  surprise on the first find().
- Worker is `unref()`-ed so a mid-find @-picker doesn't pin process exit.
- `FileSearch` interface gains an optional `dispose?(): Promise<void>`
  to release worker resources. `FileMessageHandler.clearFileSearchCache`
  calls it so the long-running VS Code extension host doesn't accumulate
  workers across watcher-triggered evictions.
- `installInProcessFzfTransport()` is provided for tests / sandboxed
  environments that need to skip worker spawn entirely.
- Bundle gains a second esbuild build that emits `dist/fzfWorker.js`
  alongside `dist/cli.js`; `prepare-package.js` ships it in the tarball.
- In `npm run dev` (tsx) the bundled worker file isn't on disk, so the
  handle silently falls back to in-thread — acceptable for a dev script.

Closes the door on #3455's two open `FileIndexService` review items by
design: there is no `whenReady()` (handle awaits ready inside `create()`)
and there is no in-worker crawl (crawl stays on main, only fzf moves).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(filesearch): address review feedback on fzf worker PR

- Fix worker thread leak in useAtCompletion: dispose previous FileSearch
  instance on re-init and component cleanup
- Fix worker leak on init failure: wrap inst.ready() in try/catch with
  timeout, dispose on failure
- Fix find()-after-dispose hang: reject immediately when worker is disposed
- Add limit field to FindMessage protocol to cap IPC payload before
  postMessage serialization
- Add real worker_threads test coverage (spawn/find/dispose lifecycle,
  concurrent finds, init failure, post-dispose rejection)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(filesearch): address second round of review feedback

- Fix useAtCompletion cleanup: move dispose() to [cwd,config] effect so
  it only fires on workspace change, not every keystroke/status transition
- Clear setTimeout in Promise.race on happy path to prevent timer leak
- Dispose FileSearch instances in FileMessageHandler on panel close
- Fix incorrect comment in prepare-package.js (resolveBundleDir, not URL)
- Add worker crash test: verify failAll rejects pending find() on exit
- Add RecursiveFileSearch.dispose() test in fileSearch.test.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(core): close missing braces in formatDateForContext test

The second test case in the formatDateForContext describe block was
missing its closing `});`, causing a TS1005 parse error that broke CI
on all platforms. This was introduced upstream in 6a99ed1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(filesearch): guard initialize() against races and stale completions

- Add cancelled flag to initialize() effect so a superseded or
  unmounted init disposes the newly-created searcher instead of
  writing it into fileSearch.current with no owner
- Add fzfWorker.js test artifact to .gitignore to prevent accidental
  commits if the test process crashes before afterAll cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(filesearch): mark worker handle as failed on post-init crash

failAll() previously only set readyState to 'failed' when the worker
crashed during init (state 'pending'). A crash after successful init
left the handle in 'ready' state, causing subsequent find() calls to
silently return empty results via the catch handler. Now failAll()
unconditionally marks the handle as 'failed' (unless already disposed),
and find() rejects immediately on a failed handle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace fdir filesystem crawler with git ls-files + ripgrep for file search

3 participants