fix: use FTS5 highlight() for stemming-aware snippet extraction #18
Merged
Replace the indexOf-only approach in `extractSnippet` with positions derived from FTS5 `highlight()` markers. FTS5 is the source of truth for which tokens matched (it uses the exact same porter tokenizer that produced the BM25 ranking), so stemmed matches like query "configure" hitting content "configuration" are found correctly.

The approach uses `highlight(chunks, 1, char(2), char(3))` in the search SQL, which wraps matched tokens in STX/ETX control characters. `extractSnippet` scans for STX markers to find match positions in the original text, then feeds them into the existing multi-window merging logic. When `highlighted` is absent (non-FTS codepath), falls back to indexOf on raw query terms.

Changes:
- `SearchResult` gains optional `highlighted` field
- `search()` and `searchTrigram()` SQL includes `highlight()` column
- `extractSnippet` accepts optional `highlighted` parameter
- Both callers pass `r.highlighted` through
- Tests rewritten: marker parsing unit tests, indexOf fallback tests, and store integration tests for stemmed queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
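To make the marker-parsing step concrete, here is a minimal sketch of how STX positions in the highlighted string could be mapped back to offsets in the original text. The function bodies are a reconstruction for illustration only; the PR names `positionsFromHighlight` in its test plan, but its real signature and implementation may differ, and `stripMarkers` is a hypothetical helper.

```typescript
const STX = "\u0002"; // char(2): inserted by highlight() before each matched token
const ETX = "\u0003"; // char(3): inserted by highlight() after each matched token

// Illustrative reconstruction, not the PR's actual code.
// Returns the offset of every match start, expressed in the marker-free text.
function positionsFromHighlight(highlighted: string): number[] {
  const positions: number[] = [];
  let markersSeen = 0; // marker characters consumed so far
  for (let i = 0; i < highlighted.length; i++) {
    const ch = highlighted[i];
    if (ch === STX) {
      // Each earlier marker shifted this index right by one character.
      positions.push(i - markersSeen);
      markersSeen++;
    } else if (ch === ETX) {
      markersSeen++;
    }
  }
  return positions;
}

// Hypothetical helper: recover the original text from the highlighted copy.
function stripMarkers(highlighted: string): string {
  return highlighted.split(STX).join("").split(ETX).join("");
}
```

Because the offsets are computed against the marker-free text, they can be used directly on the stored content without re-tokenizing the query.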
mksglu added a commit that referenced this pull request on Jun 2, 2026
Closes Issue #9 in v1.0.162 PRD: extractUserPromptFeatures stub.

Adds aggregate-only feature extraction on UserPromptSubmit messages. The raw prompt text is NEVER stored on the emitted event.

Features:
- length: xs/s/m/l/xl (chars)
- lang: latin / non-latin / mixed (algorithmic Unicode block scan)
- shape: question / imperative
- codeFence: boolean (``` present)
- url: boolean (http(s):// present)

Algorithmic language classifier, no regex. Treats Latin script (incl. Latin-1, Extended-A/B for Turkish, German, Vietnamese) as "latin"; Greek/Cyrillic/CJK/Arabic/Hebrew/Devanagari and emoji as "non-latin"; mix as "mixed". Digits/punctuation-only resolves to "latin" as a safe default for downstream aggregation.

Wires alongside extractUserPlan so /plan and prompt_features both fire on a single UserPromptSubmit envelope.

- src/session/extract.ts:1591-1594: extractUserEvents wiring
- src/session/extract.ts:1611-1672: classifyLanguage helper
- src/session/extract.ts:1674-1714: extractUserPromptFeatures
- tests/session/extract-prompt-features.test.ts: 11 tracers (RED->GREEN)

Privacy: no raw prompt text in event.data, only aggregate features. Closes B5 prod data-quality flag #5 on the emit side; the storage-side redaction of legacy session_data.user_prompt is Issue #18.

Platform-coordination: new event type `prompt_features` must round-trip the EventEnvelopeSchema Zod gate. Flag EM if strict mode rejects.

Test plan:
- npx vitest run tests/session/extract-prompt-features.test.ts -> 11/11 GREEN
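As an illustration of the regex-free classifier described in that commit message, the following sketch scans code points and checks them against Unicode block ranges. The exact ranges, the helper names `isLatinLetter` and `isNonLatinScript`, and the emoji block boundaries are assumptions made for this example; the real `classifyLanguage` in src/session/extract.ts may draw the lines differently.

```typescript
type Lang = "latin" | "non-latin" | "mixed";

// Assumed ranges: ASCII letters, Latin-1 Supplement letters, Latin Extended-A/B.
function isLatinLetter(cp: number): boolean {
  return (
    (cp >= 0x41 && cp <= 0x5a) || // A-Z
    (cp >= 0x61 && cp <= 0x7a) || // a-z
    (cp >= 0xc0 && cp <= 0x24f && cp !== 0xd7 && cp !== 0xf7) // skip the × and ÷ signs
  );
}

// Assumed ranges for the scripts the commit message lists, plus common emoji blocks.
function isNonLatinScript(cp: number): boolean {
  return (
    (cp >= 0x370 && cp <= 0x3ff) || // Greek
    (cp >= 0x400 && cp <= 0x4ff) || // Cyrillic
    (cp >= 0x590 && cp <= 0x5ff) || // Hebrew
    (cp >= 0x600 && cp <= 0x6ff) || // Arabic
    (cp >= 0x900 && cp <= 0x97f) || // Devanagari
    (cp >= 0x3040 && cp <= 0x30ff) || // Hiragana and Katakana
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0x1f300 && cp <= 0x1faff) // emoji and pictograph blocks
  );
}

// Sketch only: digits, punctuation and whitespace are ignored, so input
// without any script characters falls through to the "latin" default.
function classifyLanguage(text: string): Lang {
  let sawLatin = false;
  let sawNonLatin = false;
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) ?? 0;
    i += cp > 0xffff ? 2 : 1; // astral code points occupy two UTF-16 units
    if (isLatinLetter(cp)) sawLatin = true;
    else if (isNonLatinScript(cp)) sawNonLatin = true;
  }
  if (sawLatin && sawNonLatin) return "mixed";
  if (sawNonLatin) return "non-latin";
  return "latin";
}
```

Only the resulting label would be attached to the event, which is consistent with the stated goal of never storing the raw prompt text.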
Summary
- Replace the indexOf-only approach in `extractSnippet` with positions derived from FTS5 `highlight()` markers, so stemmed matches (e.g., query "configure" matching content "configuration") produce correct snippet windows
- Add `highlight(chunks, 1, char(2), char(3))` to both `search()` and `searchTrigram()` SQL queries, propagating a new `highlighted` field on `SearchResult`
- `extractSnippet` parses STX/ETX markers to find match positions, falling back to indexOf when `highlighted` is absent

Motivation
`extractSnippet` used `indexOf` to locate query terms in content. When BM25 matched via porter stemming (e.g., query "configure" matching "configuration"), `indexOf` failed and the function fell back to a blind prefix truncation. FTS5 `highlight()` is the authoritative source (it uses the exact same tokenizer that produced the match), so we parse its marker positions instead.

Why `highlight()` not `snippet()`
FTS5 `snippet()` returns a single best-match window with a fixed token count. The existing `extractSnippet` merges multiple windows for multi-term queries, showing several relevant regions. `highlight()` returns the full content with markers around ALL matched tokens, which we parse and feed into the existing windowing logic.

Test plan
- `positionsFromHighlight` (single marker, multiple markers, no markers, adjacent markers)
- `extractSnippet` tests (highlight preference over indexOf, multi-term windows, indexOf fallback, prefix fallback, short term filtering, short content passthrough)
- Store integration tests for stemmed queries ([…] `highlighted`)

🤖 Generated with Claude Code
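The multi-window merging and the indexOf fallback described in this PR can be sketched roughly as follows. Everything here is an assumption for illustration: `mergeWindows`, `fallbackPositions`, and `buildSnippet` are hypothetical names, and the window radius, the three-character short-term cutoff, and the separator are invented values, not taken from the PR.

```typescript
type Span = { start: number; end: number };

// Build a window of `radius` characters around each match and merge overlaps.
function mergeWindows(positions: number[], textLength: number, radius: number): Span[] {
  const sorted = positions.slice().sort((a, b) => a - b);
  const merged: Span[] = [];
  for (const pos of sorted) {
    const start = Math.max(0, pos - radius);
    const end = Math.min(textLength, pos + radius);
    const last = merged[merged.length - 1];
    if (last !== undefined && start <= last.end) {
      last.end = Math.max(last.end, end); // overlaps the previous window: extend it
    } else {
      merged.push({ start, end });
    }
  }
  return merged;
}

// Fallback for results without `highlighted`: literal indexOf on raw query terms.
// Stemmed matches are missed here, which is the gap highlight() closes.
function fallbackPositions(content: string, query: string): number[] {
  const lower = content.toLowerCase();
  const positions: number[] = [];
  for (const term of query.toLowerCase().split(" ")) {
    if (term.length < 3) continue; // assumed cutoff for "short term filtering"
    const at = lower.indexOf(term);
    if (at !== -1) positions.push(at);
  }
  return positions;
}

// Join the merged windows; with no positions, fall back to a plain prefix.
function buildSnippet(content: string, positions: number[], radius = 40): string {
  if (positions.length === 0) return content.slice(0, radius * 2);
  return mergeWindows(positions, content.length, radius)
    .map((w) => content.slice(w.start, w.end))
    .join(" … ");
}
```

In this sketch, a query such as "configure file" against "Edit the configuration file" yields a fallback position only for "file", which shows why positions taken from `highlight()` are preferred when they are available.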