tools: search_content has no per-file cap, high-frequency hits drown the result #489

@esengine

Problem

searchContent (src/tools/fs/search.ts:70-201) has exactly one truncation knob: a byte budget (ctx.maxListBytes). When the budget runs out it appends [… truncated at N bytes — refine pattern or path …] and returns.

Failure shape: a moderately popular symbol matches 200+ lines in a single big file (think App.tsx, loop.ts). The first file's hits eat the entire byte budget, every other file in the tree gets zero coverage, and the caller sees a wall of App.tsx:NNN: ... lines with no signal that other files also have hits. Worse, with context: N set, each hit expands to as many as 2N+1 lines, so the byte budget runs out even faster.

The user-visible bug: search_content "useState" in a React app returns 90% App.tsx, 10% truncation marker, and the caller has no idea the symbol is also defined / used in useChatState.ts.

Proposal

1. Per-file match cap.

Cap matches at MAX_PER_FILE = 30 (rough; tunable). When exceeded:

src/cli/ui/App.tsx:312: const [foo, setFoo] = useState(...)
src/cli/ui/App.tsx:418: const [bar, setBar] = useState(...)
... (28 more)
[src/cli/ui/App.tsx: 47 more matches in this file — re-grep with a tighter pattern or use read_file to see them]
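
A minimal sketch of the cap, assuming matches are already grouped per file in walk order. Hit, FileHits, and formatWithPerFileCap are hypothetical names for illustration, not the current shape of searchContent:

```typescript
// Hypothetical types; the real searchContent internals may look different.
interface Hit {
  line: number;
  text: string;
}

interface FileHits {
  file: string;
  hits: Hit[];
}

const MAX_PER_FILE = 30; // rough default from the proposal; tunable

// Emits at most `maxPerFile` match lines per file, then one marker line
// reporting how many matches in that file were left out.
function formatWithPerFileCap(
  files: FileHits[],
  maxPerFile: number = MAX_PER_FILE,
): string[] {
  const out: string[] = [];
  for (const { file, hits } of files) {
    for (const hit of hits.slice(0, maxPerFile)) {
      out.push(`${file}:${hit.line}: ${hit.text}`);
    }
    const hidden = hits.length - maxPerFile;
    if (hidden > 0) {
      out.push(
        `[${file}: ${hidden} more matches in this file; ` +
          `re-grep with a tighter pattern or use read_file to see them]`,
      );
    }
  }
  return out;
}
```

With this shape, a file with 77 hits costs 31 output lines instead of 77, and the files after it still get their turn before the byte cap fires.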

2. Histogram fallback when the byte cap would otherwise fire.

When totalBytes > 0.8 * maxListBytes and there are still files left to scan, switch to summary mode for the rest:

[switching to summary mode — byte budget at 80%]
src/loop.ts: 14 matches
src/cli/ui/App.tsx: 77 matches  (30 already shown above)
src/server/handlers/chat.ts: 6 matches
... (5 more files)

Caller sees the full shape (which files, how many) instead of one file's noise + an opaque truncation marker.
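
One way the switch could look, as a sketch under stated assumptions: the threshold is only checked at file boundaries, string length stands in for byte length, and Hit, FileHits, SUMMARY_THRESHOLD, and formatWithSummaryFallback are hypothetical names. The existing byte cap would still sit behind this as the safety net:

```typescript
// Hypothetical types; the real searchContent internals may look different.
interface Hit {
  line: number;
  text: string;
}

interface FileHits {
  file: string;
  hits: Hit[];
}

const SUMMARY_THRESHOLD = 0.8; // fraction of maxListBytes from the proposal

// Prints full match lines until the running size passes 80% of the budget,
// then prints one "file: N matches" line for every remaining file.
function formatWithSummaryFallback(
  files: FileHits[],
  maxListBytes: number,
): string[] {
  const out: string[] = [];
  let totalBytes = 0;
  let summaryMode = false;

  for (const { file, hits } of files) {
    // Assumption: the check runs once per file, before its hits are emitted.
    if (!summaryMode && totalBytes > SUMMARY_THRESHOLD * maxListBytes) {
      summaryMode = true;
      out.push("[switching to summary mode: byte budget at 80%]");
    }
    if (summaryMode) {
      out.push(`${file}: ${hits.length} matches`);
      continue;
    }
    for (const hit of hits) {
      const line = `${file}:${hit.line}: ${hit.text}`;
      // Approximation: UTF-16 length plus one for the newline, not true bytes.
      totalBytes += line.length + 1;
      out.push(line);
    }
  }
  return out;
}
```

Checking only at file boundaries keeps a file's hits contiguous; combined with a per-file cap, no single file can push the total far past the threshold before the check runs again.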

3. Optional summary_only: true arg.

For "where does this exist at all" questions, skip line content entirely and return only the histogram. One round-trip to map distribution, then targeted reads.
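
A sketch of how the flag might short-circuit rendering. SearchArgs and renderSearch are hypothetical; the real argument schema for search_content is not shown in this issue, so only pattern and summary_only are modeled here:

```typescript
// Hypothetical types; the real tool arguments may carry more fields.
interface Hit {
  line: number;
  text: string;
}

interface FileHits {
  file: string;
  hits: Hit[];
}

interface SearchArgs {
  pattern: string;
  summary_only?: boolean; // proposed flag; defaults to full output
}

// With summary_only set, returns one "file: N matches" line per matching
// file and no line content. Otherwise returns every match line.
function renderSearch(files: FileHits[], args: SearchArgs): string[] {
  const out: string[] = [];
  for (const { file, hits } of files) {
    if (hits.length === 0) continue; // files without hits are never listed
    if (args.summary_only) {
      out.push(`${file}: ${hits.length} matches`);
      continue;
    }
    for (const hit of hits) {
      out.push(`${file}:${hit.line}: ${hit.text}`);
    }
  }
  return out;
}
```

The summary costs one line per file regardless of hit count, so it should fit the byte budget on trees where the full listing does not.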

Why this matters

search_content is the load-bearing exploration tool — it's what we tell agents and humans to reach for when "where is X?" comes up. The current truncation behavior actively hides distribution information, and that's the one signal the caller most needs.

Out of scope

  • Replacing the byte cap. It's still the safety net; the proposals above just give better behavior before we hit it.
  • Ranking / scoring matches. Order stays file-walk order.

Labels: enhancement (New feature or request)
