feat: cross-modal search — text↔image retrieval #1127
Closed
garrytan-agents wants to merge 1 commit into
…modal-3 Phase 1: text→image search with intent detection + multimodal query embedding. Builds on PR garrytan#1106 embedding column registry. Includes modality backfill prereq, 6 test cases, and phasing plan for image→text and unified column. File: docs/issues/cross-modal-search.md
6 tasks
garrytan added a commit that referenced this pull request on May 20, 2026
…+ LLM intent) (#1165)

* feat(cross-modal/0): batched multimodal + query helpers + SSRF helper

Commit 0 of the cross-modal search wave. Foundation for Phase 1-3:

- embedMultimodal accepts MultimodalInput text variant + EmbedMultimodalOpts with inputType: 'document' | 'query' (D22-2). Default unchanged so importImageFile keeps document-side embedding.
- embedQueryMultimodal(text) + embedQueryMultimodalImage(input) wrappers for hybridSearch + searchByImage query paths.
- embedMultimodalSafe binary-search retry on transient batch failure + failed_indices surfacing. Phase 3 reindex uses this so a single bad chunk doesn't discard the 31 in-flight embeddings around it.
- Voyage path: text + image inputs in one batch via content arrays.
- openai-compat path: text + image inputs in one request per input.
- src/core/ssrf-validate.ts (D19): DNS-resolve-and-fetch-by-IP defense for redirect chains. Closes the DNS-rebinding gap that url-safety.ts' static check leaves open. Uses node:dns/promises with {all: true, family: 0} to inspect every A and AAAA record before connecting. fetchWithSSRFGuard helper validates per-redirect-hop and limits chain depth (default 3).
- Re-exports from src/core/embedding.ts public seam.

Tests:

- test/embed-multimodal-batching.test.ts (13 cases): text variant, query inputType discipline, mixed text+image batches, embedQueryMultimodal, embedQueryMultimodalImage, embedMultimodalSafe happy/empty/all-fail/mid-batch-recovery/permanent-misconfig.
- test/ssrf-validate.test.ts (20 cases): static rejections via isInternalUrl, scheme + credentials rejection, DNS rebinding defense (single-record + multi-record), public happy path, IPv6 literals, malformed URLs.

No regression in existing voyage-multimodal.test.ts or openai-compat-multimodal.test.ts (33 cases all pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/1): Phase 1 text→image routing + knobsHash + RRF + backfill

Phase 1 of the cross-modal search wave.
Wires the existing 1024d Voyage multimodal embedding space (already populated for image chunks via importImageFile) into the user-facing query path. Text queries that match cross-modal intent regex route through Voyage multimodal-3 instead of the text embedding model, then search content_chunks.embedding_image.

- query-intent.ts: new `suggestedModality: 'text' | 'image' | 'both'` axis on `QuerySuggestions`. Module-scope CROSS_MODAL_PATTERNS regex array (D15 — compiled once at module load). Conservative on purpose; LLM intent escalation (Commit 4) catches genuinely ambiguous phrasings.
- query-intent.ts: new `isAmbiguousModalityQuery(query)` pure heuristic for Commit 4's escalation gate. Returns true ONLY when regex misses AND a visual noun + reference marker both fire.
- types.ts: `SearchOpts.crossModal: 'text' | 'image' | 'both' | 'auto'` + `SearchResult.modality: 'text' | 'image'` for downstream renderers.
- mode.ts: 7 new knobs in ModeBundle (D2): cross_modal_both_text_weight, cross_modal_both_image_weight, image_query_text_refinement_weight, image_query_image_refinement_weight, unified_multimodal, unified_multimodal_only, cross_modal_llm_intent. All three mode bundles default to the same values (cross-modal is opt-in).
- mode.ts: D2 cache-key fix — KNOBS_HASH_VERSION bumped 2→3, all 7 new knobs participate in knobsHash so a text-mode cache hit can't be served to an image-mode caller.
- mode.ts: D3 registry — all 7 keys land in SEARCH_MODE_CONFIG_KEYS so `gbrain search modes` / `stats` / `tune` see them.
- hybrid.ts: routing branch at the embed step. Resolves effective modality from (per-call opts → suggestions → 'text'). Image route: embedQueryMultimodal + searchVector(embedding_image), skip expansion + keyword (D9 mode-bundle override). Both route: parallel text + image vector searches merged via weighted RRF (D6) with cross_modal_both_* weights. Fail-open: multimodal misconfigured → structured warn + text fallback.
  'auto' literal normalized to undefined (D22-1).
- operations.ts: thread `cross_modal` param through `query` op.
- backfill-registry.ts: new `modality` backfill kind. SQL filter requires `chunk_source='image_asset'` (D22-7 defensive guard). Idempotent.
- doctor.ts: `cross_modal_modality_backfill` check surfaces unflagged image-asset chunks with paste-ready `gbrain backfill modality` hint.

Tests:

- cross-modal-phase1.test.ts (45 cases): regex classification (positive + negative + plural-safe), isAmbiguousModalityQuery, D3 registry, D2 knobsHash diffs across all 7 new knobs, MODE_BUNDLES defaults, resolveSearchMode precedence chain.
- cross-modal-hybrid-integration.test.ts (7 cases): PGLite + stubbed gateway. Verifies image-modality calls Voyage and not OpenAI, text calls OpenAI and not Voyage, 'auto' literal normalizes, 'both' mode hits both endpoints, fail-open routes to text on multimodal misconfig.
- search-mode.test.ts: updated MODE_BUNDLES + KNOBS_HASH_VERSION assertions (148 cross-suite tests still pass; no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/2): Phase 2 image-as-query + D18 path ban + D23-#6 spend cap

Phase 2 of the cross-modal search wave. Adds the `search_by_image` MCP op, the SSRF-defended image loader, and the daily per-OAuth-client spend cap on paid Voyage multimodal calls.

D17 honest framing applied: Phase 2 ships image→similar-images + image-OCR-text retrieval. True image→full-text-knowledge requires Phase 3's unified column.

- src/core/search/image-loader.ts: loadImageInput accepts local path, data: URI, or http(s):// URL. Magic-byte sniff for PNG/JPEG/WebP (no other formats). Hard size cap (10MB local default, 2MB remote default). http(s) path uses fetchWithSSRFGuard from Commit 0: every redirect hop re-resolved via DNS lookup + every record checked against the internal IP deny list. Max 3 redirect hops. 5s total fetch timeout.
  Pre-flight Content-Length check + post-fetch size guard for lying servers.
- src/core/search/by-image.ts: searchByImage runs the image branch always; D13 hybrid intersect runs a parallel text branch when `query` is provided, merged via weighted RRF. Phase 3 will widen the column routing to embedding_multimodal once that lands.
- src/core/operations.ts: new search_by_image op (scope: read, NOT localOnly). D18 P0 — when ctx.remote === true AND image_path is set, rejects with permission_denied at handler entry (validateParams would catch it again at dispatch). D5 source-id thread via sourceScopeOpts. D12 per-param length cap enforced via remote-vs-local maxBytes config read at handler entry. D23-#6 pre-flight checkBudget + post-call recordSpend (best-effort; failures don't block response).
- src/core/spend-log.ts: BudgetExceededError + checkBudget + recordSpend + getTodaySpendCents. UTC day-aligned aggregation so the cap rolls over deterministically. Local CLI callers (no clientId) bypass the gate entirely. Pre-v0.36 brains without the mcp_spend_log table fail open to spend=0; the migration brings the table in on first start.
- src/core/migrate.ts: new migration v67 mcp_spend_log table + indexes for the (client_id, day) and (token_name, day) hot reads. PGLite parity via sqlFor.pglite.
- src/core/search/hybrid.ts: RRF_K constant exported so by-image.ts can share the same effective-K math as the main hybrid path.

Tests:

- cross-modal-phase2.test.ts (15 cases): magic-byte sniffing (PNG + JPEG + WebP positive, GIF rejection), oversized rejection (default + custom cap), data: URI happy path + malformed + decoded-non-image + oversized, invalid input shapes (empty + ftp), SSRF defense via DNS rebinding stub.
- search-by-image-op.test.ts (7 cases): D18 remote image_path rejection + local CLI accepts; input validation (missing all three / multiple together); D23-#6 budget block-at-cap + allow-under-cap + local-CLI-bypass; migration v67 mcp_spend_log table applied cleanly.
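
For readers unfamiliar with the magic-byte sniff mentioned for image-loader.ts, here is a minimal sketch of the idea. It is not the code from this commit: `sniffImageMime` and `hasBytesAt` are hypothetical names, and only the three file signatures themselves (PNG, JPEG, WebP) are taken from the formats' published layouts.

```typescript
// Hypothetical helper names; the real image-loader.ts may be structured differently.
type ImageMime = 'image/png' | 'image/jpeg' | 'image/webp';

/** True when `bytes` contains the signature `sig` starting at `offset`. */
function hasBytesAt(bytes: Uint8Array, offset: number, sig: number[]): boolean {
  if (bytes.length < offset + sig.length) return false;
  return sig.every((b, i) => bytes[offset + i] === b);
}

/** Returns the detected MIME type, or null for anything that is not PNG/JPEG/WebP. */
function sniffImageMime(bytes: Uint8Array): ImageMime | null {
  // PNG: fixed 8-byte file signature.
  if (hasBytesAt(bytes, 0, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) {
    return 'image/png';
  }
  // JPEG: start-of-image marker (FF D8) followed by the first byte of the next marker (FF).
  if (hasBytesAt(bytes, 0, [0xff, 0xd8, 0xff])) {
    return 'image/jpeg';
  }
  // WebP: "RIFF", a 4-byte chunk size, then "WEBP" at offset 8.
  if (
    hasBytesAt(bytes, 0, [0x52, 0x49, 0x46, 0x46]) &&
    hasBytesAt(bytes, 8, [0x57, 0x45, 0x42, 0x50])
  ) {
    return 'image/webp';
  }
  // Everything else (GIF, SVG, arbitrary bytes) is rejected.
  return null;
}
```

Checking the leading bytes rather than trusting a file extension or a Content-Type header is what lets a loader of this kind reject a GIF or a non-image payload regardless of how it is labelled.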

All 166 tests across the cross-modal suite pass; no regression in existing voyage-multimodal / openai-compat-multimodal / search-mode suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/3): Phase 3 unified column + reindex + D8 fail-open + D23-#2

Phase 3 of the cross-modal search wave. Adds the unified multimodal column on content_chunks + the `gbrain reindex --multimodal` sweep + the `search.unified_multimodal` routing flag with D8 source-aware coverage guard + fail-open behavior.

D17 honest framing: this is the phase that unlocks true image→full-text-knowledge — Phase 2's searchByImage transparently upgrades to the richer retrieval once the unified column has coverage.

D10 reindex-core extraction filed as a follow-up TODO. The existing markdown reindex walks pages and re-imports via importFromFile; this walks content_chunks and re-embeds via the gateway. Patterns rhyme but cores diverge enough that extraction balloons the diff. Both commands stand alone with their own checkpoint + cost-prompt logic.

- migrate.ts v68 (embedding_multimodal_column): column-only ALTER on content_chunks. HNSW partial index deferred to post-reindex build (D20: pgvector docs recommend post-load build for HNSW). Both engines.
- types.ts SearchOpts.embeddingColumn type widened to include 'embedding_multimodal'.
- postgres-engine.ts + pglite-engine.ts searchVector: route to embedding_multimodal column when opts.embeddingColumn set. NO modality filter (unified column carries both text + image content).
- hybrid.ts unified routing branch: when search.unified_multimodal=true, bypasses dual-column branching and runs embedQueryMultimodal + searchVector(embedding_multimodal). D8 fail-open: zero rows + not strict-mode → falls through to dual-column text path with structured warning. search.unified_multimodal_only=true bypasses the fallback.
- src/commands/reindex-multimodal.ts: `gbrain reindex --multimodal`.
  D7 lock via tryAcquireDbLock('gbrain-reindex-multimodal'); 6h TTL. Cost prompt + 10s Ctrl-C grace window in TTY; auto-proceeds non-TTY. GBRAIN_NO_REEMBED=1 bypass. Checkpoint at ~/.gbrain/reindex-multimodal-checkpoint.json for resume. D23-#2 auto-flip prompt at coverage=100% completion.
- cli.ts: `gbrain reindex --multimodal` dispatch with --limit, --dry-run, --cost-estimate, --no-embed, --yes, --json flags.
- doctor.ts: unified_multimodal_coverage check (D21 source-aware) + reports per-source % when search.unified_multimodal is on. Warns at <95% lowest source; fails when unified_multimodal_only=true AND lowest source <99%. Falls open to OK when column not yet present.

Tests:

- unified-multimodal.test.ts (8 cases): schema migration v68 applies, reindex --dry-run + --cost-estimate + GBRAIN_NO_REEMBED bypass + zero-pending fast-path, hybridSearch unified routing forces voyage endpoint, D8 fail-open routes to text on empty unified, D8 strict blocks text fallback.

All 211 tests across the cross-modal + related suite pass; no regression in voyage-multimodal / openai-compat-multimodal / search-mode / intent / search base suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/4): LLM intent escalation for ambiguous modality

Commit 4 of the cross-modal search wave (opt-in default off). When `search.cross_modal.llm_intent` is true AND the regex classifier returned 'text' AND `isAmbiguousModalityQuery(query)` fires, hybridSearch awaits a Haiku tie-break via gateway.chat() before routing.

The ambiguous-modality gate (introduced in Commit 1) ensures the LLM call only fires on the narrow band where regex misses but a visual noun + reference marker both fire — roughly <1% of queries with the flag on.

- src/core/search/llm-intent.ts: new module. `classifyModalityWithLLM` routes through gateway.chat() with a fixed system prompt ("Output exactly one word: text, image, or both"). 1s timeout via AbortController.
  `parseModality` is a pure exported helper that tolerates trailing punctuation + casing. Fail-open on every error path (gateway unavailable, timeout, parse failure, unrecognized output).
- src/core/search/hybrid.ts: escalation branch slots BEFORE the unified routing branch. Gated by: no explicit per-call crossModal opt, regex result == 'text', config flag on, ambiguity heuristic fires. Fail-open to regex result on any error from the LLM tie-break.

Tests:

- llm-intent-escalation.test.ts (14 cases): parseModality tolerance matrix (text / image / both / trailing punct / whitespace / unrecognized / empty), classifyModalityWithLLM happy paths for all 3 outputs, fail-open on throw / unrecognized output / gateway-not-configured, explicit-fallback-honored.
- llm-intent-hybrid-integration.test.ts (6 cases): hybridSearch escalation gate fires ONLY when flag-on + ambiguous; off when flag-off, unambiguous, regex-confident, or explicit per-call opt set; fail-open on LLM throw.

All 231 tests across the cross-modal + related suite pass; no regression in voyage-multimodal / openai-compat-multimodal / search-mode / intent / search base suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cross-modal/3): verify-gate fixes for full test suite

Three small fixes to pass the full unit + E2E sweep after the cross-modal wave commits land.

- migrate.ts v67: drop date_trunc('day', created_at) from mcp_spend_log indexes. TIMESTAMPTZ truncation depends on session timezone and isn't IMMUTABLE, so Postgres rejects the function in the index expression with SQLSTATE 42P17. BTREE on (client_id, created_at) covers the per-day rollup query via range scan on created_at — same performance, no IMMUTABLE constraint.
- pglite-schema.ts + src/schema.sql: shorten the embedding_multimodal column comment.
  The longer version contained a comma inside a SQL line comment ("...search.unified_multimodal=true, all queries..."), which broke parseBaseTableColumns in test/schema-bootstrap-coverage (the parser splits on commas at depth-0 before stripping comments, so the comma inside the comment shortened the column-definition part and an "all" token from "all queries" got picked up as the next column name — silently hiding embedding_multimodal from coverage).
- schema-embedded.ts: regenerated via `bun run build:schema`.
- test/e2e/v030_1-integration-pglite.test.ts: listBackfills assertion extended to include the new `modality` entry registered in src/core/backfill-registry.ts as part of Commit 1.
- test/search/knobs-hash-reranker.test.ts: KNOBS_HASH_VERSION assertion updated from 2→3 to match the cross-modal-wave hash-key extension (D2 cache contamination fix). Same shape as the prior v0.32→v0.35 bump.
- test/unified-multimodal.test.ts: migrated process.env mutation to withEnv() helper to satisfy the scripts/check-test-isolation R1 rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cross-modal): VERSION + CHANGELOG + CLAUDE.md + spec doc + llms regen

Final docs commit for the cross-modal wave (v0.36.0.0).

- VERSION + package.json: bump 0.35.5.1 → 0.36.0.0
- CHANGELOG.md: full Garry-voice release entry with five-commit breakdown, the-numbers-that-matter table, what-this-means-for-you, and the required to-take-advantage-of-v0.36.0.0 block
- docs/issues/cross-modal-search.md: cherry-picked from PR #1127 head (164 lines, the original spec doc preserved as historical reference for Phase 2 + 3 background)
- CLAUDE.md: Key Files entries for src/core/ssrf-validate.ts, src/core/search/image-loader.ts, src/core/search/by-image.ts, src/core/search/llm-intent.ts, src/core/spend-log.ts, src/commands/reindex-multimodal.ts, plus extension annotations on src/core/search/query-intent.ts, src/core/search/mode.ts, src/core/search/hybrid.ts, src/core/backfill-registry.ts, src/core/migrate.ts (v67 + v68)
- llms-full.txt + llms.txt: regenerated via `bun run build:llms`

`bun run verify` clean (privacy + proposal-pii + test-names + jsonb + source-id-projection + progress + test-isolation + wasm + admin-build + admin-scope-drift + cli-exec + system-of-record + eval-glossary + typecheck). `bun test test/build-llms.test.ts` clean (7/7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cross-modal): renumber migrations 67→69 + 68→70 post-master-merge

Master shipped its own v67 (`facts_typed_claim_columns`) during the cross-modal wave's review cycle. The merge picked up both side's v67 entries, breaking the migration-distinct-versions test. Renumbering moves cross-modal's table + column ALTER off the collision:

- v67 mcp_spend_log → v69 mcp_spend_log
- v68 embedding_multimodal_column → v70 embedding_multimodal_column

References updated in CHANGELOG, CLAUDE.md, pglite-schema.ts, schema.sql. schema-embedded.ts regenerated. llms-full.txt regenerated.

7006 unit tests pass, 0 fail. No test code touched — just version renumbering plus comment refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version 0.36.0.0 → 0.36.4.0

Bumping to v0.36.4.0 to land in the queue slot the user requested. No behavior change; pure version bump across VERSION, package.json, CHANGELOG.md header, llms-full.txt regen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
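
The commit message above says the 'both' route merges parallel text and image vector searches via weighted RRF. A minimal sketch of weighted reciprocal rank fusion follows, under stated assumptions: `weightedRrf`, `RankedList`, and `ASSUMED_RRF_K` are hypothetical names, the value 60 is a commonly used default rather than the repository's exported RRF_K, and each input list is assumed to contain unique ids.

```typescript
// Hypothetical types and names for illustration; not the hybrid.ts implementation.
interface RankedList {
  ids: string[];   // result ids, best first, assumed unique within the list
  weight: number;  // e.g. a per-modality weight such as the cross_modal_both_* knobs
}

// Assumption: 60 is a common RRF constant; the repository's RRF_K value is not shown here.
const ASSUMED_RRF_K = 60;

/** Fuses ranked lists: each list adds weight / (k + rank) to every id it contains. */
function weightedRrf(
  lists: RankedList[],
  k: number = ASSUMED_RRF_K,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { ids, weight } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: a chunk found by both the text and the image search outranks
// chunks found by only one of them.
const fused = weightedRrf([
  { ids: ['a', 'b'], weight: 1 }, // text results
  { ids: ['b', 'c'], weight: 1 }, // image results
]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'c' ]
```

Because the fusion uses ranks rather than raw similarity scores, the two searches can come from different embedding spaces without any score normalization; the weights only control how much each list's ordering counts.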
What
Cross-modal search spec — enables text queries to find images and (future) image queries to find text.
The Problem
gbrain has 11K image chunks embedded with Voyage multimodal-3 and a valid HNSW index, but no way to search them. Text queries search only text embeddings. The multimodal embedding space exists, but nothing routes queries through it.
Approach (Phased)
Phase 1 (this PR): Text → Image search
embedding_image column with the multimodal vector
Phase 2 (future): Image → Text search (upload a photo, find related text)
Phase 3 (future): Unified multimodal column (everything in one embedding space)
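
As a rough illustration of the Phase 1 idea (detect image intent in a text query, then search the image column with a multimodal query embedding), here is a minimal routing sketch. Everything in it is an assumption for illustration: the pattern list, `detectModality`, `routeQuery`, and the text column name `embedding` are hypothetical. Only `embedding_image` comes from this PR.

```typescript
type Modality = 'text' | 'image';

// Hypothetical, deliberately conservative patterns; the spec's real list may differ.
// No `g` flag, so RegExp.test() stays stateless across calls.
const IMAGE_INTENT_PATTERNS: RegExp[] = [
  /\b(photo|picture|image|screenshot|diagram)s?\b/i,
];

/** Classifies a query as image-seeking when any pattern matches, else text. */
function detectModality(query: string): Modality {
  return IMAGE_INTENT_PATTERNS.some((re) => re.test(query)) ? 'image' : 'text';
}

interface Route {
  embedder: 'multimodal' | 'text';
  // 'embedding' is an assumed name for the text-embedding column.
  column: 'embedding_image' | 'embedding';
}

/** An explicit per-call override wins; otherwise the detected intent decides. */
function routeQuery(query: string, override?: Modality): Route {
  const modality = override ?? detectModality(query);
  return modality === 'image'
    ? { embedder: 'multimodal', column: 'embedding_image' }
    : { embedder: 'text', column: 'embedding' };
}

console.log(routeQuery('show me photos of the whiteboard'));
// { embedder: 'multimodal', column: 'embedding_image' }
```

Defaulting to the text route when no pattern matches keeps existing text search behavior unchanged for queries that carry no visual intent.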
Key Evidence
image (should be ~11K)
embedding_multimodal_model
How to Implement
Pick up with Claude Code:
docs/issues/cross-modal-search.md for full spec
search/cross-modal.ts (new), embedding.ts, hybrid.ts, types.ts
File:
docs/issues/cross-modal-search.md