fix: use FTS5 highlight() for stemming-aware snippet extraction #18

Merged: mksglu merged 1 commit into mksglu:main from rjkaes:fix/extract-snippet-stemmed-matches on Mar 1, 2026

rjkaes commented Mar 1, 2026

Contributor

Summary

  • Replace indexOf-only snippet extraction with positions derived from FTS5 highlight() markers, so stemmed matches (e.g., query "configure" matching content "configuration") produce correct snippet windows
  • Add highlight(chunks, 1, char(2), char(3)) to both search() and searchTrigram() SQL queries, propagating a new highlighted field on SearchResult
  • extractSnippet parses STX/ETX markers to find match positions, falling back to indexOf when highlighted is absent
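The marker-parsing step can be sketched as follows. This is a minimal illustration, not the merged code: `positionsFromHighlight` is named in the test plan, but its signature and return shape here are assumptions. It also assumes the indexed content never contains the STX/ETX control characters itself.

```typescript
const STX = "\u0002"; // char(2): marks the start of a matched token
const ETX = "\u0003"; // char(3): marks the end of a matched token

// Assumed signature: returns the offset of each match start, measured in the
// marker-free text so the offsets line up with the original content.
function positionsFromHighlight(highlighted: string): number[] {
  const positions: number[] = [];
  let clean = 0; // offset in the original text (markers not counted)
  for (let i = 0; i < highlighted.length; i++) {
    const ch = highlighted[i];
    if (ch === STX) {
      positions.push(clean); // a match begins at the current clean offset
    } else if (ch !== ETX) {
      clean++; // ordinary character: advance the clean offset
    }
  }
  return positions;
}

console.log(positionsFromHighlight("see \u0002configuration\u0003 docs")); // [ 4 ]
```

Because the markers are single UTF-16 code units, subtracting them while scanning yields offsets compatible with `String.prototype.slice` on the original content.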

Motivation

extractSnippet used indexOf to locate query terms in content. When BM25 matched via porter stemming (e.g., query "configure" matching "configuration"), indexOf failed and the function fell back to a blind prefix truncation. FTS5 highlight() is the authoritative source — it uses the exact same tokenizer that produced the match — so we parse its marker positions instead.
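The failure mode is easy to reproduce in isolation. The snippet below is only an illustration; `literalMatchOffset` and the sample sentence are hypothetical and do not come from the repository.

```typescript
// Hypothetical helper: literal, case-insensitive substring search,
// similar in spirit to an indexOf-only lookup.
function literalMatchOffset(content: string, term: string): number {
  return content.toLowerCase().indexOf(term.toLowerCase());
}

const sample = "Edit the configuration file before starting the server.";

// The stemmer treats "configure" and "configuration" as the same token,
// but "configure" is not a literal substring of the content.
console.log(literalMatchOffset(sample, "configure")); // -1
console.log(literalMatchOffset(sample, "configuration")); // 9
```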

Why highlight() not snippet()

FTS5 snippet() returns a single best-match window with a fixed token count. The existing extractSnippet merges multiple windows for multi-term queries, showing several relevant regions. highlight() returns the full content with markers around ALL matched tokens, which we parse and feed into the existing windowing logic.
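To make the multi-window idea concrete, here is a rough sketch of how match positions could be turned into merged windows. The names `mergeWindows` and `renderSnippet`, the fixed `radius`, and the " ... " separator are all assumptions for illustration; the actual windowing logic in `extractSnippet` may differ.

```typescript
type Span = { start: number; end: number };

// Builds a window of `radius` characters around each position and merges
// windows that overlap or touch, so nearby matches share one region.
function mergeWindows(positions: number[], contentLength: number, radius: number): Span[] {
  const sorted = [...positions].sort((a, b) => a - b);
  const merged: Span[] = [];
  for (const pos of sorted) {
    const start = Math.max(0, pos - radius);
    const end = Math.min(contentLength, pos + radius);
    const last = merged[merged.length - 1];
    if (last !== undefined && start <= last.end) {
      last.end = Math.max(last.end, end); // extend the previous window
    } else {
      merged.push({ start, end }); // start a new window
    }
  }
  return merged;
}

// Joins the selected regions into one snippet string.
function renderSnippet(content: string, spans: Span[]): string {
  return spans.map((s) => content.slice(s.start, s.end)).join(" ... ");
}

console.log(mergeWindows([10, 15, 100], 200, 10));
// [ { start: 0, end: 25 }, { start: 90, end: 110 } ]
```

A single `snippet()` call would return only one such region, which is why the positions from `highlight()` are the more useful input here.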

Test plan

  • 4 unit tests for positionsFromHighlight (single marker, multiple markers, no markers, adjacent markers)
  • 6 extractSnippet tests (highlight preference over indexOf, multi-term windows, indexOf fallback, prefix fallback, short term filtering, short content passthrough)
  • 4 store integration tests (porter stemmed markers, trigram markers, end-to-end snippet via store-produced highlighted)
  • Full test suite green (227 passed, 0 failed)

🤖 Generated with Claude Code

Replace the indexOf-only approach in `extractSnippet` with positions
derived from FTS5 `highlight()` markers. FTS5 is the source of truth
for which tokens matched — it uses the exact same porter tokenizer
that produced the BM25 ranking — so stemmed matches like query
"configure" hitting content "configuration" are found correctly.

The approach uses `highlight(chunks, 1, char(2), char(3))` in the
search SQL, which wraps matched tokens in STX/ETX control characters.
`extractSnippet` scans for STX markers to find match positions in the
original text, then feeds them into the existing multi-window merging
logic. When `highlighted` is absent (non-FTS codepath), it falls back to
indexOf on raw query terms.
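A sketch of what the query and result shape might look like is shown below. Only the `highlight(chunks, 1, char(2), char(3))` expression and the optional `highlighted` field come from this PR; the other column names, the `bm25()` ranking column, and the `toSearchResult` helper are assumptions made for illustration. FTS5 numbers columns from zero, so index 1 is assumed here to be the content column.

```typescript
// Field names other than `highlighted` are assumptions for this sketch.
interface SearchResult {
  title: string;
  content: string;
  rank: number;
  highlighted?: string; // content with STX/ETX wrapped around matched tokens
}

// Assumed schema: an FTS5 table `chunks` whose column at index 1 holds the content.
const SEARCH_SQL = `
  SELECT
    title,
    content,
    bm25(chunks) AS rank,
    highlight(chunks, 1, char(2), char(3)) AS highlighted
  FROM chunks
  WHERE chunks MATCH ?
  ORDER BY rank
  LIMIT ?
`;

// Hypothetical row mapper: `highlighted` stays undefined when the column is
// missing, which is what lets a caller fall back to indexOf.
function toSearchResult(row: Record<string, unknown>): SearchResult {
  const h = row["highlighted"];
  return {
    title: String(row["title"] ?? ""),
    content: String(row["content"] ?? ""),
    rank: Number(row["rank"] ?? 0),
    highlighted: typeof h === "string" ? h : undefined,
  };
}
```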

Changes:
- `SearchResult` gains optional `highlighted` field
- `search()` and `searchTrigram()` SQL includes `highlight()` column
- `extractSnippet` accepts optional `highlighted` parameter
- Both callers pass `r.highlighted` through
- Tests rewritten: marker parsing unit tests, indexOf fallback tests,
  and store integration tests for stemmed queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mksglu merged commit 9823595 into mksglu:main on Mar 1, 2026
3 checks passed
mksglu commented Mar 1, 2026

Owner

@rjkaes We're live. https://github.com/mksglu/claude-context-mode/releases/tag/v0.9.16

mksglu added a commit that referenced this pull request Jun 2, 2026
Closes Issue #9 in v1.0.162 PRD — extractUserPromptFeatures stub.

Adds aggregate-only feature extraction on UserPromptSubmit messages.
The raw prompt text is NEVER stored on the emitted event.

Features:
  length: xs/s/m/l/xl (chars)
  lang:   latin / non-latin / mixed (algorithmic Unicode block scan)
  shape:  question / imperative
  codeFence: boolean (``` present)
  url:    boolean (http(s):// present)
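
A rough sketch of the non-language features follows. The commit message does not state the bucket thresholds or how `shape` is decided, so the character cutoffs and the trailing-question-mark rule below are purely hypothetical, as is the name `sketchPromptFeatures`. Note that only derived values are returned, never the prompt text.

```typescript
type LengthBucket = "xs" | "s" | "m" | "l" | "xl";
type Shape = "question" | "imperative";

interface PromptFeatures {
  length: LengthBucket;
  shape: Shape;
  codeFence: boolean;
  url: boolean;
}

// Three backticks, built at runtime so this example can sit inside a fenced block.
const FENCE = "`".repeat(3);

// Hypothetical thresholds, in characters.
function lengthBucket(chars: number): LengthBucket {
  if (chars < 20) return "xs";
  if (chars < 100) return "s";
  if (chars < 500) return "m";
  if (chars < 2000) return "l";
  return "xl";
}

function sketchPromptFeatures(prompt: string): PromptFeatures {
  return {
    length: lengthBucket(prompt.length),
    // Assumed rule: a trailing "?" means question, anything else is imperative.
    shape: prompt.trim().endsWith("?") ? "question" : "imperative",
    codeFence: prompt.includes(FENCE),
    url: prompt.includes("http://") || prompt.includes("https://"),
  };
}
```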

Algorithmic language classifier — no regex. Treats Latin script (incl.
Latin-1, Extended-A/B for Turkish, German, Vietnamese) as "latin";
Greek/Cyrillic/CJK/Arabic/Hebrew/Devanagari and emoji as "non-latin";
a mix of the two as "mixed". Input with only digits or punctuation
resolves to "latin" as a safe default for downstream aggregation.
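
The block scan described above could look roughly like this. It is a sketch under stated assumptions, not the `classifyLanguage` helper from the commit: the exact code point ranges, the inclusion of Latin Extended Additional (which precomposed Vietnamese letters need), and the rule that any presence of both scripts yields "mixed" are all guesses.

```typescript
type Lang = "latin" | "non-latin" | "mixed";

function inRange(cp: number, lo: number, hi: number): boolean {
  return cp >= lo && cp <= hi;
}

function isLatinLetter(cp: number): boolean {
  return (
    inRange(cp, 0x41, 0x5a) || // A-Z
    inRange(cp, 0x61, 0x7a) || // a-z
    (inRange(cp, 0xc0, 0xff) && cp !== 0xd7 && cp !== 0xf7) || // Latin-1 letters
    inRange(cp, 0x100, 0x24f) || // Latin Extended-A and Extended-B
    inRange(cp, 0x1e00, 0x1eff) // Latin Extended Additional (assumption)
  );
}

function isNonLatinScript(cp: number): boolean {
  return (
    inRange(cp, 0x370, 0x3ff) || // Greek
    inRange(cp, 0x400, 0x4ff) || // Cyrillic
    inRange(cp, 0x590, 0x5ff) || // Hebrew
    inRange(cp, 0x600, 0x6ff) || // Arabic
    inRange(cp, 0x900, 0x97f) || // Devanagari
    inRange(cp, 0x3040, 0x30ff) || // Hiragana and Katakana
    inRange(cp, 0x4e00, 0x9fff) || // CJK Unified Ideographs
    inRange(cp, 0xac00, 0xd7af) || // Hangul syllables
    inRange(cp, 0x1f300, 0x1faff) // common emoji blocks
  );
}

// Iterates by code point (not UTF-16 unit) so emoji are seen whole.
function classifyLanguageSketch(text: string): Lang {
  let latin = false;
  let nonLatin = false;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp === undefined) continue;
    if (isLatinLetter(cp)) latin = true;
    else if (isNonLatinScript(cp)) nonLatin = true;
    // Digits, punctuation, and whitespace match neither and are ignored.
  }
  if (latin && nonLatin) return "mixed";
  if (nonLatin) return "non-latin";
  return "latin"; // also the default when no letters were seen
}
```

Under these assumptions, `classifyLanguageSketch("Türkçe")` yields "latin", `classifyLanguageSketch("Привет")` yields "non-latin", and a digits-only string falls through to the "latin" default.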

Wires alongside extractUserPlan so /plan and prompt_features both fire
on a single UserPromptSubmit envelope.

src/session/extract.ts:1591-1594 — extractUserEvents wiring
src/session/extract.ts:1611-1672 — classifyLanguage helper
src/session/extract.ts:1674-1714 — extractUserPromptFeatures
tests/session/extract-prompt-features.test.ts — 11 tracers (RED->GREEN)

Privacy: no raw prompt text in event.data — only aggregate features.
Closes B5 prod data-quality flag #5 on the emit side; the storage-side
redaction of legacy session_data.user_prompt is Issue #18.

Platform-coordination: new event type `prompt_features` must round-trip
the EventEnvelopeSchema Zod gate. Flag EM if strict mode rejects.

Test plan:
- npx vitest run tests/session/extract-prompt-features.test.ts -> 11/11 GREEN

2 participants