Bug Description
_truncate_around_matches() in tools/session_search_tool.py discards the message-level match positions returned by FTS5 and instead re-searches the formatted conversation text naively using str.find(). For long sessions, this causes the 100K char truncation window to miss the actual matched content entirely.
Detail
The session search pipeline works in two stages:
-
FTS5 search (db.search_messages) — returns matched message IDs, session IDs, snippets, and timestamps. This correctly identifies where in the conversation the match occurs.
-
Summarization — loads the full session conversation, truncates to MAX_SESSION_CHARS (100K) via _truncate_around_matches(), then sends to an LLM for summarization.
The problem is in step 2. _truncate_around_matches() receives only (full_text, query, max_chars) — it has no access to the FTS5 match positions. It re-searches for query terms using str.find() on the formatted text, which may find different (earlier) occurrences than FTS5 matched.
For a real-world session with 402 messages and 280K chars of content, the relevant content (a resume review at 2:44 PM) starts at ~177K chars into the transcript. The naive text search finds a query term match much earlier, centers the 100K window there, and the summarizer never sees the relevant content.
Steps to Reproduce
- Have a long session (>100K chars) with diverse topics throughout the day
- The relevant content is in the latter portion (e.g., message ~370 of 402)
- Session search finds the correct session via FTS5
_truncate_around_matches centers on an earlier text occurrence
- Summarizer reports "no discussion about [topic] found" despite FTS5 having matched it
Suggested Fix
Pass the FTS5 match metadata (message timestamps or character positions) through to _truncate_around_matches() so it can center the window on the actual matched content rather than re-searching naively:
def _truncate_around_matches(
full_text: str, query: str, max_chars: int = MAX_SESSION_CHARS,
match_positions: list[int] = None, # character offsets from FTS5 matches
) -> str:
Alternatively, since session_search() already has the matched message IDs and timestamps, the conversation could be loaded with a message-range filter centered on the match rather than loading the entire session and truncating after formatting.
Impact
Any session longer than 100K chars where the matched content is not near the first text occurrence of a query term will fail to recall. This is common for full-day Telegram sessions where the agent handles multiple topics. The data is in the DB, FTS5 finds it correctly, but the summarization window misses it.
Related: #4238 (FTS5 operators treated as search terms compounds this issue)
Bug Description
_truncate_around_matches()intools/session_search_tool.pydiscards the message-level match positions returned by FTS5 and instead re-searches the formatted conversation text naively usingstr.find(). For long sessions, this causes the 100K char truncation window to miss the actual matched content entirely.Detail
The session search pipeline works in two stages:
FTS5 search (
db.search_messages) — returns matched message IDs, session IDs, snippets, and timestamps. This correctly identifies where in the conversation the match occurs.Summarization — loads the full session conversation, truncates to
MAX_SESSION_CHARS(100K) via_truncate_around_matches(), then sends to an LLM for summarization.The problem is in step 2.
_truncate_around_matches()receives only(full_text, query, max_chars)— it has no access to the FTS5 match positions. It re-searches for query terms usingstr.find()on the formatted text, which may find different (earlier) occurrences than FTS5 matched.For a real-world session with 402 messages and 280K chars of content, the relevant content (a resume review at 2:44 PM) starts at ~177K chars into the transcript. The naive text search finds a query term match much earlier, centers the 100K window there, and the summarizer never sees the relevant content.
Steps to Reproduce
_truncate_around_matchescenters on an earlier text occurrenceSuggested Fix
Pass the FTS5 match metadata (message timestamps or character positions) through to
_truncate_around_matches()so it can center the window on the actual matched content rather than re-searching naively:Alternatively, since
session_search()already has the matched message IDs and timestamps, the conversation could be loaded with a message-range filter centered on the match rather than loading the entire session and truncating after formatting.Impact
Any session longer than 100K chars where the matched content is not near the first text occurrence of a query term will fail to recall. This is common for full-day Telegram sessions where the agent handles multiple topics. The data is in the DB, FTS5 finds it correctly, but the summarization window misses it.
Related: #4238 (FTS5 operators treated as search terms compounds this issue)