fix: use FTS5 highlight() for stemming-aware snippet extraction by rjkaes · Pull Request #18 · mksglu/context-mode

rjkaes · 2026-03-01T16:43:06Z

Summary

- Replace indexOf-only snippet extraction with positions derived from FTS5 `highlight()` markers, so stemmed matches (e.g., query "configure" matching content "configuration") produce correct snippet windows
- Add `highlight(chunks, 1, char(2), char(3))` to both `search()` and `searchTrigram()` SQL queries, propagating a new `highlighted` field on `SearchResult`
- `extractSnippet` parses STX/ETX markers to find match positions, falling back to indexOf when `highlighted` is absent
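
To make the marker approach concrete, here is a minimal sketch of the query shape and the marker scan. It is an illustration under assumptions, not the code from this PR: the `highlight(chunks, 1, char(2), char(3))` call and the `positionsFromHighlight` name come from the description, while the surrounding SQL (the `content` column name, `ORDER BY bm25(chunks)`, `LIMIT`) and the function body are hypothetical.

```typescript
// Assumed schema: an FTS5 table named `chunks` whose column at index 1
// holds the searchable text. Only the highlight() call comes from the PR;
// the rest of this query is illustrative.
const SEARCH_SQL = `
  SELECT
    content,
    highlight(chunks, 1, char(2), char(3)) AS highlighted
  FROM chunks
  WHERE chunks MATCH ?
  ORDER BY bm25(chunks)
  LIMIT ?
`;

const STX = "\u0002"; // char(2), emitted before each matched token
const ETX = "\u0003"; // char(3), emitted after each matched token

// Sketch of a marker scan: returns the offset, in the marker-free text,
// at which each highlighted token starts.
function positionsFromHighlight(highlighted: string): number[] {
  const positions: number[] = [];
  let plainOffset = 0; // index into the text with markers stripped
  for (let i = 0; i < highlighted.length; i++) {
    const ch = highlighted[i];
    if (ch === STX) {
      positions.push(plainOffset); // a match begins at the current offset
    } else if (ch !== ETX) {
      plainOffset++; // ordinary character: advance in the original text
    }
  }
  return positions;
}
```

Because the markers are single control characters that do not normally occur in indexed text, subtracting them while scanning yields offsets that line up with the original `content` string.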

Motivation

extractSnippet used indexOf to locate query terms in content. When BM25 matched via porter stemming (e.g., query "configure" matching "configuration"), indexOf failed and the function fell back to a blind prefix truncation. FTS5 highlight() is the authoritative source — it uses the exact same tokenizer that produced the match — so we parse its marker positions instead.
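
The failure mode is easy to reproduce in isolation. The snippet below is a small illustration, not code from the repository; `naivePosition` and the sample sentence are hypothetical.

```typescript
// Hypothetical stand-in for the old lookup: a case-insensitive indexOf.
function naivePosition(content: string, term: string): number {
  return content.toLowerCase().indexOf(term.toLowerCase());
}

const content = "Check the configuration before deploying.";

// The literal word is found at its offset in the text.
const exact = naivePosition(content, "configuration"); // 10

// The stemmed query term is not a substring of the content, so a plain
// substring search reports no match even though FTS5 ranked this row.
const stemmed = naivePosition(content, "configure"); // -1
```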

Why `highlight()` not `snippet()`

FTS5 snippet() returns a single best-match window with a fixed token count. The existing extractSnippet merges multiple windows for multi-term queries, showing several relevant regions. highlight() returns the full content with markers around ALL matched tokens, which we parse and feed into the existing windowing logic.
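
The existing windowing logic is not shown in this description. As a rough sketch of the general idea only, assuming a fixed character radius around each match position (the `mergeWindows` name, its parameters, and the radius are all hypothetical), overlapping windows can be merged like this:

```typescript
type Span = [number, number]; // [start, end) offsets into the content

// Build a window around each match position and merge any that overlap,
// so nearby matches share one snippet region.
function mergeWindows(
  positions: number[],
  textLength: number,
  radius: number,
): Span[] {
  const sorted = [...positions].sort((a, b) => a - b);
  const merged: Span[] = [];
  for (const pos of sorted) {
    const start = Math.max(0, pos - radius);
    const end = Math.min(textLength, pos + radius);
    const last = merged[merged.length - 1];
    if (last !== undefined && start <= last[1]) {
      last[1] = Math.max(last[1], end); // extend the previous window
    } else {
      merged.push([start, end]); // start a new, separate window
    }
  }
  return merged;
}
```

With positions from every matched token available, a multi-term query can yield several separate regions, which a single `snippet()` window cannot express.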

Test plan

- 4 unit tests for `positionsFromHighlight` (single marker, multiple markers, no markers, adjacent markers)
- 6 `extractSnippet` tests (highlight preference over indexOf, multi-term windows, indexOf fallback, prefix fallback, short term filtering, short content passthrough)
- 4 store integration tests (porter stemmed markers, trigram markers, end-to-end snippet via store-produced `highlighted`)
- Full test suite green (227 passed, 0 failed)

🤖 Generated with Claude Code

Replace the indexOf-only approach in `extractSnippet` with positions derived from FTS5 `highlight()` markers. FTS5 is the source of truth for which tokens matched (it uses the exact same porter tokenizer that produced the BM25 ranking), so stemmed matches like query "configure" hitting content "configuration" are found correctly.

The approach uses `highlight(chunks, 1, char(2), char(3))` in the search SQL, which wraps matched tokens in STX/ETX control characters. `extractSnippet` scans for STX markers to find match positions in the original text, then feeds them into the existing multi-window merging logic. When `highlighted` is absent (non-FTS codepath), falls back to indexOf on raw query terms.

Changes:

- `SearchResult` gains optional `highlighted` field
- `search()` and `searchTrigram()` SQL includes `highlight()` column
- `extractSnippet` accepts optional `highlighted` parameter
- Both callers pass `r.highlighted` through
- Tests rewritten: marker parsing unit tests, indexOf fallback tests, and store integration tests for stemmed queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
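
For the non-FTS codepath, the fallback chain described above (indexOf on raw query terms, then a prefix truncation when nothing matches) could look roughly like the following. This is a sketch under assumptions: `positionsFromQuery`, `prefixSnippet`, and the `MIN_TERM_LENGTH` threshold are hypothetical names and values, since the commit message mentions short-term filtering without stating the cutoff.

```typescript
// Hypothetical cutoff; the actual threshold is not stated in the PR.
const MIN_TERM_LENGTH = 3;

// Fallback when no `highlighted` string is available: locate each raw
// query term with a case-insensitive indexOf, skipping very short terms.
function positionsFromQuery(content: string, query: string): number[] {
  const haystack = content.toLowerCase();
  const terms = query
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length >= MIN_TERM_LENGTH);

  const positions: number[] = [];
  for (const term of terms) {
    const idx = haystack.indexOf(term);
    if (idx !== -1) positions.push(idx);
  }
  return positions.sort((a, b) => a - b);
}

// Last resort when no term was located: truncate to a plain prefix.
function prefixSnippet(content: string, maxLength: number): string {
  return content.length <= maxLength
    ? content
    : content.slice(0, maxLength) + "...";
}
```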

mksglu · 2026-03-01T17:26:58Z

@rjkaes We're live. https://github.com/mksglu/claude-context-mode/releases/tag/v0.9.16

Closes Issue #9 in v1.0.162 PRD (extractUserPromptFeatures stub).

Adds aggregate-only feature extraction on UserPromptSubmit messages. The raw prompt text is NEVER stored on the emitted event.

Features:

- length: xs/s/m/l/xl (chars)
- lang: latin / non-latin / mixed (algorithmic Unicode block scan)
- shape: question / imperative
- codeFence: boolean (``` present)
- url: boolean (http(s):// present)

Algorithmic language classifier, no regex. Treats Latin script (incl. Latin-1, Extended-A/B for Turkish, German, Vietnamese) as "latin"; Greek/Cyrillic/CJK/Arabic/Hebrew/Devanagari and emoji as "non-latin"; mix as "mixed". Digits/punctuation-only resolves to "latin" as a safe default for downstream aggregation.

Wires alongside extractUserPlan so /plan and prompt_features both fire on a single UserPromptSubmit envelope.

- src/session/extract.ts:1591-1594 (extractUserEvents wiring)
- src/session/extract.ts:1611-1672 (classifyLanguage helper)
- src/session/extract.ts:1674-1714 (extractUserPromptFeatures)
- tests/session/extract-prompt-features.test.ts (11 tracers, RED->GREEN)

Privacy: no raw prompt text in event.data, only aggregate features. Closes B5 prod data-quality flag #5 on the emit side; the storage-side redaction of legacy session_data.user_prompt is Issue #18.

Platform-coordination: new event type `prompt_features` must round-trip the EventEnvelopeSchema Zod gate. Flag EM if strict mode rejects.

Test plan:

- npx vitest run tests/session/extract-prompt-features.test.ts -> 11/11 GREEN
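
The classifier itself is not reproduced here. One way a regex-free Unicode block scan along these lines might be written is sketched below; the code point ranges, the treatment of neutral characters, and the inclusion of Latin Extended Additional are assumptions for illustration, not the contents of `classifyLanguage` in `src/session/extract.ts`.

```typescript
type Lang = "latin" | "non-latin" | "mixed";

// Assumed ranges: ASCII letters, Latin-1 letters, Latin Extended-A/B,
// plus Latin Extended Additional (an assumption; used by Vietnamese).
function isLatinLetter(cp: number): boolean {
  if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) return true;
  if (cp >= 0xc0 && cp <= 0x24f) return cp !== 0xd7 && cp !== 0xf7;
  return cp >= 0x1e00 && cp <= 0x1eff;
}

// Assumed neutral set: remaining ASCII and Latin-1 symbols, combining
// diacritics, and the General Punctuation block.
function isNeutral(cp: number): boolean {
  return (
    cp < 0xc0 ||
    (cp >= 0x300 && cp <= 0x36f) ||
    (cp >= 0x2000 && cp <= 0x206f)
  );
}

function classifyLanguage(text: string): Lang {
  let latin = 0;
  let nonLatin = 0;
  for (const ch of text) {
    // for...of iterates by code point, so emoji count once
    const cp = ch.codePointAt(0) ?? 0;
    if (isLatinLetter(cp)) latin++;
    else if (!isNeutral(cp)) nonLatin++;
  }
  if (latin > 0 && nonLatin > 0) return "mixed";
  if (nonLatin > 0) return "non-latin";
  return "latin"; // also covers digits/punctuation-only input
}
```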

mksglu merged commit 9823595 into mksglu:main Mar 1, 2026
3 checks passed

kianwoon mentioned this pull request Mar 16, 2026

refactor: modularize server, fix security issues, add tests #133

Closed

3 tasks

mksglu merged 1 commit into mksglu:main from rjkaes:fix/extract-snippet-stemmed-matches
