Session search truncation ignores FTS5 match positions, uses naive text search #4239

@Rizk-Taker

Bug Description

_truncate_around_matches() in tools/session_search_tool.py discards the message-level match positions returned by FTS5 and instead re-searches the formatted conversation text with a naive str.find(). For long sessions, the 100K-character truncation window can therefore miss the matched content entirely.

Detail

The session search pipeline works in two stages:

  1. FTS5 search (db.search_messages) — returns matched message IDs, session IDs, snippets, and timestamps. This correctly identifies where in the conversation the match occurs.

  2. Summarization — loads the full session conversation, truncates to MAX_SESSION_CHARS (100K) via _truncate_around_matches(), then sends to an LLM for summarization.

The problem is in step 2. _truncate_around_matches() receives only (full_text, query, max_chars) — it has no access to the FTS5 match positions. It re-searches for query terms using str.find() on the formatted text, which may find different (earlier) occurrences than FTS5 matched.

For a real-world session with 402 messages and 280K chars of content, the relevant content (a resume review at 2:44 PM) starts at ~177K chars into the transcript. The naive text search finds a query term match much earlier, centers the 100K window there, and the summarizer never sees the relevant content.
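The failure mode can be reproduced on synthetic data. The sketch below is not the project's code: naive_window() is a hypothetical stand-in that assumes the behavior described above (center the window on the first str.find() hit of any query term), and the transcript is invented to roughly mirror the sizes in this report.

```python
def naive_window(full_text: str, query: str, max_chars: int) -> str:
    """Hypothetical stand-in: center a window on the first str.find() hit."""
    if len(full_text) <= max_chars:
        return full_text
    lowered = full_text.lower()
    # First occurrence of each query term, ignoring terms that never appear.
    hits = [pos for term in query.lower().split()
            if (pos := lowered.find(term)) != -1]
    first = min(hits) if hits else 0
    # Center on the earliest hit, clamped to the bounds of the text.
    start = max(0, first - max_chars // 2)
    start = min(start, len(full_text) - max_chars)
    return full_text[start:start + max_chars]


# Synthetic transcript: an incidental early mention of "resume", then the
# real discussion far beyond the first 100K characters.
early = "[09:05] user: remind me to resume the download later\n"
filler = "[12:00] assistant: unrelated chatter\n" * 5000
target = "[14:44] assistant: RESUME REVIEW: tighten the summary section\n"
transcript = early + filler + target + filler[:80_000]

window = naive_window(transcript, "resume review", 100_000)
print("RESUME REVIEW" in window)  # False: the window sits on the early mention
```

The early, unrelated hit on "resume" pins the window to the start of the transcript, so the passage that FTS5 would have matched never reaches the summarizer.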

Steps to Reproduce

  1. Have a long session (>100K chars) with diverse topics throughout the day
  2. The relevant content is in the latter portion (e.g., message ~370 of 402)
  3. Session search finds the correct session via FTS5
  4. _truncate_around_matches centers on an earlier text occurrence
  5. Summarizer reports "no discussion about [topic] found" despite FTS5 having matched it

Suggested Fix

Pass the FTS5 match metadata (message timestamps or character positions) through to _truncate_around_matches() so it can center the window on the actual matched content rather than re-searching naively:

def _truncate_around_matches(
    full_text: str,
    query: str,
    max_chars: int = MAX_SESSION_CHARS,
    match_positions: list[int] | None = None,  # character offsets of the FTS5-matched messages
) -> str:
    ...
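A minimal sketch of how a position-aware truncation could behave, assuming the caller can translate each FTS5-matched message into a character offset in the formatted text. The function name, the quarter-window lead, and the "cover the most matches" heuristic are illustrative choices, not the project's implementation.

```python
from __future__ import annotations

MAX_SESSION_CHARS = 100_000  # limit stated in this report


def truncate_around_positions(
    full_text: str,
    max_chars: int = MAX_SESSION_CHARS,
    match_positions: list[int] | None = None,
) -> str:
    """Return a max_chars window covering as many match offsets as possible."""
    if len(full_text) <= max_chars:
        return full_text
    positions = sorted(
        p for p in (match_positions or []) if 0 <= p < len(full_text)
    )
    if not positions:
        # No usable offsets: fall back to the head of the transcript.
        return full_text[:max_chars]

    lead = max_chars // 4                      # context kept before a match
    last_start = len(full_text) - max_chars    # latest valid window start
    best_start, best_hits = 0, -1
    for pos in positions:
        start = min(max(0, pos - lead), last_start)
        hits = sum(start <= p < start + max_chars for p in positions)
        if hits > best_hits:                   # ties keep the earliest window
            best_start, best_hits = start, hits
    return full_text[best_start:best_start + max_chars]


# A match about 177K characters into a ~280K transcript stays in the window.
text = "a" * 177_000 + "MATCH" + "b" * 103_000
out = truncate_around_positions(text, 100_000, [177_000])
print("MATCH" in out)  # True
```

Keeping some lead context before the match, rather than centering exactly, is a judgment call: the messages just before a match usually explain what is being discussed.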

Alternatively, since session_search() already has the matched message IDs and timestamps, the conversation could be loaded with a message-range filter centered on the match rather than loading the entire session and truncating after formatting.
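The message-range alternative could look roughly like the following. This is a sketch under stated assumptions: the messages table and its columns (id, session_id, role, content) are a hypothetical schema, ids are assumed to increase chronologically within a session, and load_messages_around() is an invented helper, not an existing function in the codebase.

```python
import sqlite3


def load_messages_around(conn, session_id, match_id, max_chars):
    """Load a contiguous run of messages around match_id within max_chars."""
    # Fetch only ids and sizes first, so the full content is not loaded.
    rows = conn.execute(
        "SELECT id, length(content) FROM messages "
        "WHERE session_id = ? ORDER BY id",
        (session_id,),
    ).fetchall()
    ids = [r[0] for r in rows]
    sizes = [r[1] for r in rows]
    center = ids.index(match_id)  # ValueError if the id is not in the session

    lo = hi = center
    used = sizes[center]
    # Grow the range outward one message at a time while the budget allows.
    while True:
        grew = False
        if lo > 0 and used + sizes[lo - 1] <= max_chars:
            lo -= 1
            used += sizes[lo]
            grew = True
        if hi < len(ids) - 1 and used + sizes[hi + 1] <= max_chars:
            hi += 1
            used += sizes[hi]
            grew = True
        if not grew:
            break

    return conn.execute(
        "SELECT id, role, content FROM messages "
        "WHERE session_id = ? AND id BETWEEN ? AND ? ORDER BY id",
        (session_id, ids[lo], ids[hi]),
    ).fetchall()


# Demo on an in-memory database shaped like the session in this report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages "
    "(id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [(i, "s1", "user", f"message {i} " + "x" * 700) for i in range(1, 403)],
)
window = load_messages_around(conn, "s1", match_id=370, max_chars=100_000)
print(window[0][0], window[-1][0])  # first and last message ids in the window
```

Because the range is chosen before formatting, the budget here counts raw message content only; a real implementation would need to leave headroom for whatever the formatter adds per message.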

Impact

Any session longer than 100K chars in which the matched content is not near the first textual occurrence of a query term will fail to surface that content. This is common for full-day Telegram sessions where the agent handles multiple topics. The data is in the DB and FTS5 finds it correctly, but the summarization window misses it.

Related: #4238 (FTS5 operators treated as search terms), which compounds this issue.

Metadata

Assignees

No one assigned

    Labels

    P2: Medium, degraded but workaround exists
    comp/tools: Tool registry, model_tools, toolsets
    sweeper:implemented-on-main: Sweeper, behavior already present on current main
    type/bug: Something isn't working
