Summary: messages_fts is a virtual FTS5 table with a single content column whose external-content source is messages.content. The triggers populating it read only new.content / old.content. The tool_calls (TEXT, serialized JSON args) and tool_name columns on messages are not reached by the search_messages path, so tokens that only appear in those columns aren't found by session_search even when the row is in the DB. On main, messages_fts_trigram (added in schema v10) has the same single-column, content-only-trigger pattern.
This is not the CJK tokenizer issue tracked in #14829 (and #15500, closed as a duplicate of #14829). The repro below uses ASCII tokens, so the gap is at the schema and trigger layer, not in tokenization.
Reproduction (clean clone of main HEAD 46b4cf8d, schema v10; same behavior reproduces on v0.11.0 bf196a3fc, schema v8):
import sys, tempfile
from pathlib import Path
sys.path.insert(0, "/path/to/hermes-agent") # repo root of clean clone
import hermes_state
with tempfile.TemporaryDirectory() as td:
db = hermes_state.SessionDB(db_path=Path(td) / "state.db")
db.create_session(session_id="s1", source="cli")
db.create_session(session_id="s2", source="cli")
db.append_message("s1", role="assistant",
content="Uploading to BUCKETMARKER_CONTENT.")
db.append_message("s2", role="assistant", content="",
tool_calls=[{"id": "c1", "type": "function",
"function": {"name": "FUNCNAMEMARKER",
"arguments": '{"cmd": "aws s3 cp x s3://BUCKETMARKER_TOOLCALL/"}'}}],
tool_name="TOOLNAMEMARKER")
print("via search_messages:")
for tok in ("BUCKETMARKER_CONTENT", "BUCKETMARKER_TOOLCALL",
"FUNCNAMEMARKER", "TOOLNAMEMARKER"):
print(f" {tok}: {len(db.search_messages(tok))} hit(s)")
# Direct SQL check of persisted columns, not a public API path.
print("via direct LIKE on messages table:")
for tok in ("BUCKETMARKER_CONTENT", "BUCKETMARKER_TOOLCALL",
"FUNCNAMEMARKER", "TOOLNAMEMARKER"):
n = db._conn.execute(
"SELECT COUNT(*) FROM messages "
"WHERE content LIKE ? OR tool_calls LIKE ? OR tool_name LIKE ?",
(f"%{tok}%", f"%{tok}%", f"%{tok}%"),
).fetchone()[0]
print(f" {tok}: {n} hit(s)")
Expected (assuming session_search is meant to cover searchable message context stored in the messages table): all four tokens find the relevant row.
Actual (identical on v0.11.0 bf196a3fc and main HEAD 46b4cf8d):
via search_messages:
BUCKETMARKER_CONTENT: 1 hit(s)
BUCKETMARKER_TOOLCALL: 0 hit(s)
FUNCNAMEMARKER: 0 hit(s)
TOOLNAMEMARKER: 0 hit(s)
via direct LIKE on messages table:
BUCKETMARKER_CONTENT: 1 hit(s)
BUCKETMARKER_TOOLCALL: 1 hit(s)
FUNCNAMEMARKER: 1 hit(s)
TOOLNAMEMARKER: 1 hit(s)
On main the trigram table was probed separately with SELECT COUNT(*) FROM messages_fts_trigram WHERE messages_fts_trigram MATCH ? for each token, with the same result (1, 0, 0, 0). search_messages routes to the trigram table only for CJK queries with cjk_count >= 3, so non-CJK tokens like the ASCII markers above never hit either FTS table.
Where the gap is (hermes_state.py, identical on both refs; main also has messages_fts_trigram with the same single-column, content-only-trigger pattern):
CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts USING fts5(
content,
content=messages,
content_rowid=id
);
CREATE TRIGGER IF NOT EXISTS messages_fts_insert AFTER INSERT ON messages BEGIN
INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
-- _delete and _update triggers also read only old.content / new.content.
User-facing symptom: Incomplete recall. session_search returns nothing for a token the user knows was in a prior session, because the token only ever appeared in tool_calls or tool_name.
Possible implementation directions. A naive "concat tool_calls and tool_name into the existing FTS content column" approach breaks the external-content invariant: content=messages ties the FTS column to messages.content, so any divergence between indexed text and the source column desyncs FTS rebuild and snippet() / highlight(). Three sound alternatives:
- Add
tool_calls and tool_name as additional FTS5 columns alongside content, with matching source columns so the external-content contract holds. Enables column-scoped queries.
- Add a denormalized
messages.search_text column maintained from content, tool_calls, and tool_name, and have FTS mirror that column. Requires a backfill plus INSERT INTO messages_fts(messages_fts) VALUES('rebuild').
- Switch to an internal-content or contentless FTS5 table whose indexed text the triggers manage directly. Trades off some
snippet() / highlight() ergonomics.
I searched for prior work and didn't find an existing report on tool_calls / tool_name indexing. #15500 (closed as duplicate of open #14829) is about CJK tokenization and is unrelated to this gap.
Summary:
messages_ftsis a virtual FTS5 table with a singlecontentcolumn whose external-content source ismessages.content. The triggers populating it read onlynew.content/old.content. Thetool_calls(TEXT, serialized JSON args) andtool_namecolumns onmessagesare not reached by thesearch_messagespath, so tokens that only appear in those columns aren't found bysession_searcheven when the row is in the DB. Onmain,messages_fts_trigram(added in schema v10) has the same single-column, content-only-trigger pattern.This is not the CJK tokenizer issue tracked in #14829 (and #15500, closed as a duplicate of #14829). The repro below uses ASCII tokens, so the gap is at the schema and trigger layer, not in tokenization.
Reproduction (clean clone of
mainHEAD46b4cf8d, schema v10; same behavior reproduces onv0.11.0bf196a3fc, schema v8):Expected (assuming
session_searchis meant to cover searchable message context stored in themessagestable): all four tokens find the relevant row.Actual (identical on
v0.11.0bf196a3fcandmainHEAD46b4cf8d):On
mainthe trigram table was probed separately withSELECT COUNT(*) FROM messages_fts_trigram WHERE messages_fts_trigram MATCH ?for each token, with the same result(1, 0, 0, 0).search_messagesroutes to the trigram table only for CJK queries withcjk_count >= 3, so non-CJK tokens like the ASCII markers above never hit either FTS table.Where the gap is (
hermes_state.py, identical on both refs;mainalso hasmessages_fts_trigramwith the same single-column, content-only-trigger pattern):CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts USING fts5( content, content=messages, content_rowid=id ); CREATE TRIGGER IF NOT EXISTS messages_fts_insert AFTER INSERT ON messages BEGIN INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content); END; -- _delete and _update triggers also read only old.content / new.content.User-facing symptom: Incomplete recall.
session_searchreturns nothing for a token the user knows was in a prior session, because the token only ever appeared intool_callsortool_name.Possible implementation directions. A naive "concat
tool_callsandtool_nameinto the existing FTScontentcolumn" approach breaks the external-content invariant:content=messagesties the FTS column tomessages.content, so any divergence between indexed text and the source column desyncs FTSrebuildandsnippet()/highlight(). Three sound alternatives:tool_callsandtool_nameas additional FTS5 columns alongsidecontent, with matching source columns so the external-content contract holds. Enables column-scoped queries.messages.search_textcolumn maintained fromcontent,tool_calls, andtool_name, and have FTS mirror that column. Requires a backfill plusINSERT INTO messages_fts(messages_fts) VALUES('rebuild').snippet()/highlight()ergonomics.I searched for prior work and didn't find an existing report on
tool_calls/tool_nameindexing. #15500 (closed as duplicate of open #14829) is about CJK tokenization and is unrelated to this gap.