fix(session-search): index tool_calls and tool_name columns in FTS5 (#16751)#16770
fix(session-search): index tool_calls and tool_name columns in FTS5 (#16751)#16770briandevans wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Extends SessionDB.search_messages() coverage so session search can find tokens that appear only in messages.tool_calls (serialized tool-call JSON) or messages.tool_name, by widening both FTS5 tables and aligning the CJK short-query fallback.
Changes:
- Bump state DB schema to v11 and recreate/backfill
messages_fts+messages_fts_trigramwithcontent,tool_calls, andtool_namecolumns. - Update FTS triggers to keep the widened index in sync and adjust snippet selection to center on the matched column (
iCol = -1). - Add regression tests covering tool-call/tool-name indexing plus an in-place v10→v11 upgrade scenario; update schema-version assertions to 11.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
hermes_state.py |
Implements schema v11 migration, expands FTS schemas/triggers to index tool_calls/tool_name, updates snippet column selection, and broadens the short-CJK LIKE fallback to search the same surface. |
tests/test_hermes_state.py |
Adds regression tests for tool-call/tool-name search and v11 migration backfill; updates expected schema version to 11. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| like_pattern = f"%{escaped}%" | ||
| like_where = [ | ||
| "(m.content LIKE ? ESCAPE '\\' " | ||
| "OR m.tool_calls LIKE ? ESCAPE '\\' " | ||
| "OR m.tool_name LIKE ? ESCAPE '\\')" | ||
| ] | ||
| like_params: list = [like_pattern, like_pattern, like_pattern] |
There was a problem hiding this comment.
The short-CJK LIKE fallback now matches against m.tool_calls/m.tool_name, but the selected snippet is still derived solely from m.content (and uses instr(m.content, ...)). For tool-only matches (common for assistant tool-call rows where content may be NULL/empty), this yields an empty/NULL snippet even though the row matched, and can propagate NULL into the returned snippet field. Consider deriving the snippet from the column that matched (or at least COALESCE(m.content,'') in instr/substr, and/or a COALESCE(m.content, m.tool_calls, m.tool_name, '') search text) so tool-call matches produce a meaningful, non-NULL snippet in this code path.
|
@copilot Finding addressed in commit Finding ( Added |
…ousResearch#16751) `messages_fts` and `messages_fts_trigram` had a single `content` column, so tokens that only appeared in `messages.tool_calls` (serialized JSON args) or `messages.tool_name` were never indexed. Searching for a function name or tool argument returned 0 hits even though the row was in the DB — this is not the CJK tokenizer issue tracked in NousResearch#14829, the gap is at the schema and trigger layer. Why the existing fix didn't already cover this: * External-content FTS5 columns are fixed at CREATE time. With `content=messages, content_rowid=id`, the FTS table's column list has to match column names in `messages` — adding indexed columns means a drop-and-recreate plus backfill, not an ALTER. * The triggers only inserted `new.content`, so even rows with non-empty `tool_calls` / `tool_name` produced empty FTS entries. Fix: * Extend both FTS5 tables (unicode61 and trigram) with `tool_calls` and `tool_name` columns. * Update insert/delete/update triggers to populate all three columns. * Add v11 migration that drops + recreates both tables, then backfills every message (no `WHERE content IS NOT NULL` filter so tool-only assistant messages with empty content are also indexed). * Auto-pick the matching column for `snippet()` via `iCol = -1` so the snippet centers on the actual hit instead of always showing `content`. * For the short-CJK (1–2 char) LIKE fallback, broaden the `WHERE` to match content / tool_calls / tool_name so the short-query path stays consistent with the FTS path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l_name Address Copilot review on PR NousResearch#16770: The short-CJK LIKE fallback (1–2 char queries that the trigram FTS5 table can't match) was extended in NousResearch#16751 to also match ``m.tool_calls`` / ``m.tool_name``, but the ``snippet`` column was still derived solely from ``substr(m.content, instr(m.content, ?), …)``. For tool-only matches — common on assistant tool-call rows where ``content`` is NULL/empty and only ``tool_name`` carries the query token — the snippet came back empty, leaving session_search results without any usable preview even though the row matched. Replace the single-column ``substr`` with a ``CASE`` expression that picks the snippet from whichever of content / tool_calls / tool_name actually contains the match. The condition order mirrors how the existing match-source preference cascades (content first, then tool_calls, then tool_name, with an empty-string fallback that mirrors the existing FTS path's behavior on degenerate input). Add ``test_cjk_like_fallback_snippet_pulls_from_tool_name_when_content_empty`` to ``TestCJKSearchFallback`` covering the realistic scenario: a 1-char CJK ``tool_name`` (``tool_name`` is a plain TEXT column, not JSON- escaped, so CJK survives the round-trip unlike serialized ``tool_calls`` JSON which goes through ``json.dumps(ensure_ascii=True)``). Regression guard verified: without the snippet fix the new test fails with ``assert ''``; with the fix it returns a snippet containing the matched character. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a349755 to
0006781
Compare
|
Closing — superseded by @teknium1's #16914 (salvaging @yes999zc's #16866), which landed first and is the better fix here. Coverage comparison:
No remaining gap that warrants a follow-up. Thanks @yes999zc and @teknium1. |
Summary
messages_ftsandmessages_fts_trigramwith thetool_callsandtool_namecolumns frommessagessosession_searchfinds tokens that only appear in serialized tool-call args or tool names.messages.iCol = -1) to keep the snippet centered on the actual hit.The bug (#16751)
messages_ftsis an external-content FTS5 table with a singlecontentcolumn. The triggers only insertnew.content, so any token that lives inmessages.tool_calls(TEXT JSON of function name + arguments) ormessages.tool_nameis never indexed.messages_fts_trigram(added in v10 for CJK substring search) has the same single-column shape.The reporter showed that even with ASCII tokens,
db.search_messages("FUNCNAMEMARKER")returns 0 hits when the marker only appears inside a tool call — although the row is in the DB. This is not the CJK tokenizer issue tracked in #14829 / #15500; the gap is at the schema/trigger layer.The fix
External-content FTS5 columns are fixed at CREATE time, so this requires drop-and-recreate plus a backfill, not an
ALTER TABLE.FTS_SQLandFTS_TRIGRAM_SQLnow declare three columns (content,tool_calls,tool_name) matching the source columns inmessages.INSERT/DELETE/UPDATEtriggers populate all three._init_schemagains a v11 step that drops the old triggers + tables, recreates them via the updated SQL, and backfills withINSERT INTO messages_fts(rowid, content, tool_calls, tool_name) SELECT id, content, tool_calls, tool_name FROM messages(nocontent IS NOT NULLfilter so tool-only assistant messages with empty content are also indexed).snippet(messages_fts, 0, …)→snippet(messages_fts, -1, …)(and the same for the trigram path). Per FTS5 docs,iCol = -1lets the function pick the column with the highest score, so a hit insidetool_callsproduces a snippet of the JSON args instead of an empty/wrong slice ofcontent. Content-only hits still pick column 0 (no behavior change for the common path).LIKEfallback nowORs the same three columns so the short-query path stays consistent with the FTS path.Test plan
TestFTS5SearchToolCallsAndToolNameregression suite (tests/test_hermes_state.py):tool_name→ foundrole_filterstill applies to tool-call matchestool_calls/tool_namerows is searchable afterSessionDB(...)opens ittests/test_hermes_state.py(230 tests):230 passedSessionDB:tests/tools/test_session_search.py,tests/hermes_state/,tests/gateway/test_resume_command.py,tests/plugins/memory/test_hindsight_provider.py—328 passedtotalhermes_state.py, reran the new tests, 5/6 fail with the expected "0 hits" symptom; restored, 6/6 passContract protected
messages.content,messages.tool_calls, ormessages.tool_nameis reachable throughSessionDB.search_messages(...)for the matching row.contentis empty but whose tool-call function name or argument JSON contains the search token; tool-result messages whosetool_nameis the only place the token appears.messagesschema, so adding new tool-related text columns in the future only requires extending the FTS column list + triggers (the migration pattern is already in place).test_role_filter_applies_to_tool_call_matches— even with the broader index,role_filter=["user"]still excludes assistant tool-call rows. (Existingtest_search_special_chars_do_not_crash,test_search_quoted_phrase_preserved, etc. still pass — the FTS5 query syntax surface is unchanged.)Related
session_searchdoes not indextool_callsortool_name#16751content)