feat: cross-modal search — text↔image retrieval by garrytan-agents · Pull Request #1127 · garrytan/gbrain

garrytan-agents · 2026-05-17T21:29:13Z

What

Cross-modal search spec — enables text queries to find images and (future) image queries to find text.

The Problem

gbrain has 11K image chunks embedded with Voyage multimodal-3 and a valid HNSW index, but no way to actually search them. Text queries search only text embeddings. The multimodal embedding space exists, but nothing routes queries through it.

Approach (Phased)

Phase 1 (this PR): Text → Image search

Intent detection: detect "show me photos of..." / "find images from..." patterns
Multimodal query embedding: route text queries through Voyage multimodal-3 (same space as image embeddings)
Search embedding_image column with the multimodal vector
Prereq: fix modality column backfill (11K chunks have embeddings but wrong modality tag)
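
The Phase 1 routing described above could look roughly like the sketch below. It is illustrative only: the regex patterns, the `routeQuery` helper, the `Route` shape, the text-side model placeholder, and the text column name `embedding` are all assumptions rather than anything taken from the spec. Only `embedding_image` and Voyage multimodal-3 come from this PR.

```typescript
// Minimal sketch of regex-based intent detection plus route selection.
type Modality = "text" | "image";

// Hypothetical patterns for phrasings like "show me photos of..." or "find images from...".
// No `g` flag, so RegExp.test() stays stateless across calls.
const CROSS_MODAL_PATTERNS: RegExp[] = [
  /\b(show|find|get)\s+(me\s+)?(the\s+)?(photos?|images?|pictures?|screenshots?)\b/i,
  /\b(photos?|images?|pictures?)\s+(of|from)\b/i,
];

function detectModality(query: string): Modality {
  return CROSS_MODAL_PATTERNS.some((re) => re.test(query)) ? "image" : "text";
}

interface Route {
  model: string;  // embedding model used for the query vector
  column: string; // vector column searched with that vector
}

function routeQuery(query: string): Route {
  if (detectModality(query) === "image") {
    // Same embedding space as the stored image chunks.
    return { model: "voyage-multimodal-3", column: "embedding_image" };
  }
  // Placeholder names for the existing text path (assumed, not from the PR).
  return { model: "text-embedding-model", column: "embedding" };
}

console.log(routeQuery("show me photos of the offsite"));
console.log(routeQuery("what did we decide about pricing"));
```

Keeping the pattern list conservative means a miss falls back to the existing text path, which is the safer failure mode for a first phase.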

Phase 2 (future): Image → Text search (upload a photo, find related text)

Phase 3 (future): Unified multimodal column (everything in one embedding space)

Key Evidence

11,204 image chunks embedded, 83 MB HNSW index, valid ✅
Modality metadata broken: only 10 chunks marked image (should be ~11K)
Voyage multimodal-3 already configured as embedding_multimodal_model
The embedding space supports cross-modal: text and images in same 1024d space
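
Because text and images share one embedding space, a text query vector can be compared directly against stored image vectors. The sketch below shows that idea with brute-force cosine similarity on toy 3-dimensional vectors. The `ImageChunk` shape and `rankImages` helper are hypothetical, and a real deployment would use the HNSW index rather than a linear scan.

```typescript
// Cosine similarity between two vectors of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory stand-in for an image chunk row.
interface ImageChunk {
  id: string;
  embedding: number[];
}

// Rank image chunks against a query vector and keep the top k.
function rankImages(queryVec: number[], chunks: ImageChunk[], k: number): string[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(queryVec, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.id);
}

// Toy data: in practice both sides would be 1024-dimensional multimodal embeddings.
const chunks: ImageChunk[] = [
  { id: "a", embedding: [0, 1, 0] },
  { id: "b", embedding: [1, 0, 0] },
  { id: "c", embedding: [1, 1, 0] },
];
console.log(rankImages([1, 0, 0], chunks, 2));
```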

How to Implement

Pick up with Claude Code:

Read docs/issues/cross-modal-search.md for full spec
6 test cases specified (intent detection, multimodal embed routing, cross-modal search, fallback)
Key files: search/cross-modal.ts (new), embedding.ts, hybrid.ts, types.ts
Depends on PR #1106 (feat: dynamic embedding column selection for search), whose embedding column registry provides clean provider routing

File: docs/issues/cross-modal-search.md

…modal-3 Phase 1: text→image search with intent detection + multimodal query embedding. Builds on PR garrytan#1106 embedding column registry. Includes modality backfill prereq, 6 test cases, and phasing plan for image→text and unified column. File: docs/issues/cross-modal-search.md

…+ LLM intent) (#1165)

* feat(cross-modal/0): batched multimodal + query helpers + SSRF helper

Commit 0 of the cross-modal search wave. Foundation for Phase 1-3:

- embedMultimodal accepts MultimodalInput text variant + EmbedMultimodalOpts with inputType: 'document' | 'query' (D22-2). Default unchanged so importImageFile keeps document-side embedding.
- embedQueryMultimodal(text) + embedQueryMultimodalImage(input) wrappers for hybridSearch + searchByImage query paths.
- embedMultimodalSafe binary-search retry on transient batch failure + failed_indices surfacing. Phase 3 reindex uses this so a single bad chunk doesn't discard the 31 in-flight embeddings around it.
- Voyage path: text + image inputs in one batch via content arrays.
- openai-compat path: text + image inputs in one request per input.
- src/core/ssrf-validate.ts (D19): DNS-resolve-and-fetch-by-IP defense for redirect chains. Closes the DNS-rebinding gap that url-safety.ts' static check leaves open. Uses node:dns/promises with {all: true, family: 0} to inspect every A and AAAA record before connecting. fetchWithSSRFGuard helper validates per-redirect-hop and limits chain depth (default 3).
- Re-exports from src/core/embedding.ts public seam.

Tests:

- test/embed-multimodal-batching.test.ts (13 cases): text variant, query inputType discipline, mixed text+image batches, embedQueryMultimodal, embedQueryMultimodalImage, embedMultimodalSafe happy/empty/all-fail/mid-batch-recovery/permanent-misconfig.
- test/ssrf-validate.test.ts (20 cases): static rejections via isInternalUrl, scheme + credentials rejection, DNS rebinding defense (single-record + multi-record), public happy path, IPv6 literals, malformed URLs.

No regression in existing voyage-multimodal.test.ts or openai-compat-multimodal.test.ts (33 cases all pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/1): Phase 1 text→image routing + knobsHash + RRF + backfill

Phase 1 of the cross-modal search wave.
Wires the existing 1024d Voyage multimodal embedding space (already populated for image chunks via importImageFile) into the user-facing query path. Text queries that match cross-modal intent regex route through Voyage multimodal-3 instead of the text embedding model, then search content_chunks.embedding_image.

- query-intent.ts: new `suggestedModality: 'text' | 'image' | 'both'` axis on `QuerySuggestions`. Module-scope CROSS_MODAL_PATTERNS regex array (D15: compiled once at module load). Conservative on purpose; LLM intent escalation (Commit 4) catches genuinely ambiguous phrasings.
- query-intent.ts: new `isAmbiguousModalityQuery(query)` pure heuristic for Commit 4's escalation gate. Returns true ONLY when regex misses AND a visual noun + reference marker both fire.
- types.ts: `SearchOpts.crossModal: 'text' | 'image' | 'both' | 'auto'` + `SearchResult.modality: 'text' | 'image'` for downstream renderers.
- mode.ts: 7 new knobs in ModeBundle (D2): cross_modal_both_text_weight, cross_modal_both_image_weight, image_query_text_refinement_weight, image_query_image_refinement_weight, unified_multimodal, unified_multimodal_only, cross_modal_llm_intent. All three mode bundles default to the same values (cross-modal is opt-in).
- mode.ts: D2 cache-key fix: KNOBS_HASH_VERSION bumped 2→3, all 7 new knobs participate in knobsHash so a text-mode cache hit can't be served to an image-mode caller.
- mode.ts: D3 registry: all 7 keys land in SEARCH_MODE_CONFIG_KEYS so `gbrain search modes` / `stats` / `tune` see them.
- hybrid.ts: routing branch at the embed step. Resolves effective modality from (per-call opts → suggestions → 'text'). Image route: embedQueryMultimodal + searchVector(embedding_image), skip expansion + keyword (D9 mode-bundle override). Both route: parallel text + image vector searches merged via weighted RRF (D6) with cross_modal_both_* weights. Fail-open: multimodal misconfigured → structured warn + text fallback.
  'auto' literal normalized to undefined (D22-1).
- operations.ts: thread `cross_modal` param through `query` op.
- backfill-registry.ts: new `modality` backfill kind. SQL filter requires `chunk_source='image_asset'` (D22-7 defensive guard). Idempotent.
- doctor.ts: `cross_modal_modality_backfill` check surfaces unflagged image-asset chunks with paste-ready `gbrain backfill modality` hint.

Tests:

- cross-modal-phase1.test.ts (45 cases): regex classification (positive + negative + plural-safe), isAmbiguousModalityQuery, D3 registry, D2 knobsHash diffs across all 7 new knobs, MODE_BUNDLES defaults, resolveSearchMode precedence chain.
- cross-modal-hybrid-integration.test.ts (7 cases): PGLite + stubbed gateway. Verifies image-modality calls Voyage and not OpenAI, text calls OpenAI and not Voyage, 'auto' literal normalizes, 'both' mode hits both endpoints, fail-open routes to text on multimodal misconfig.
- search-mode.test.ts: updated MODE_BUNDLES + KNOBS_HASH_VERSION assertions (148 cross-suite tests still pass; no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/2): Phase 2 image-as-query + D18 path ban + D23-#6 spend cap

Phase 2 of the cross-modal search wave. Adds the `search_by_image` MCP op, the SSRF-defended image loader, and the daily per-OAuth-client spend cap on paid Voyage multimodal calls. D17 honest framing applied: Phase 2 ships image→similar-images + image-OCR-text retrieval. True image→full-text-knowledge requires Phase 3's unified column.

- src/core/search/image-loader.ts: loadImageInput accepts local path, data: URI, or http(s):// URL. Magic-byte sniff for PNG/JPEG/WebP (no other formats). Hard size cap (10MB local default, 2MB remote default). http(s) path uses fetchWithSSRFGuard from Commit 0: every redirect hop re-resolved via DNS lookup + every record checked against the internal IP deny list. Max 3 redirect hops. 5s total fetch timeout.
  Pre-flight Content-Length check + post-fetch size guard for lying servers.
- src/core/search/by-image.ts: searchByImage runs the image branch always; D13 hybrid intersect runs a parallel text branch when `query` is provided, merged via weighted RRF. Phase 3 will widen the column routing to embedding_multimodal once that lands.
- src/core/operations.ts: new search_by_image op (scope: read, NOT localOnly). D18 P0: when ctx.remote === true AND image_path is set, rejects with permission_denied at handler entry (validateParams would catch it again at dispatch). D5 source-id thread via sourceScopeOpts. D12 per-param length cap enforced via remote-vs-local maxBytes config read at handler entry. D23-#6 pre-flight checkBudget + post-call recordSpend (best-effort; failures don't block response).
- src/core/spend-log.ts: BudgetExceededError + checkBudget + recordSpend + getTodaySpendCents. UTC day-aligned aggregation so the cap rolls over deterministically. Local CLI callers (no clientId) bypass the gate entirely. Pre-v0.36 brains without the mcp_spend_log table fail open to spend=0; the migration brings the table in on first start.
- src/core/migrate.ts: new migration v67 mcp_spend_log table + indexes for the (client_id, day) and (token_name, day) hot reads. PGLite parity via sqlFor.pglite.
- src/core/search/hybrid.ts: RRF_K constant exported so by-image.ts can share the same effective-K math as the main hybrid path.

Tests:

- cross-modal-phase2.test.ts (15 cases): magic-byte sniffing (PNG + JPEG + WebP positive, GIF rejection), oversized rejection (default + custom cap), data: URI happy path + malformed + decoded-non-image + oversized, invalid input shapes (empty + ftp), SSRF defense via DNS rebinding stub.
- search-by-image-op.test.ts (7 cases): D18 remote image_path rejection + local CLI accepts; input validation (missing all three / multiple together); D23-#6 budget block-at-cap + allow-under-cap + local-CLI-bypass; migration v67 mcp_spend_log table applied cleanly.
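
The magic-byte sniff and size cap that the Phase 2 image loader is described as performing could be sketched as follows. This is not the code in image-loader.ts: `sniffImage`, `hasSignature`, and the error messages are hypothetical names, and only the accepted formats (PNG, JPEG, WebP) and the 10MB local default come from the commit message. The signatures themselves are the standard file-format headers.

```typescript
type ImageKind = "png" | "jpeg" | "webp";

// Standard file signatures.
const PNG_SIG = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const JPEG_SIG = [0xff, 0xd8, 0xff];

// Convert an ASCII tag such as "RIFF" to its byte values.
function ascii(tag: string): number[] {
  return Array.from(tag, (ch) => ch.charCodeAt(0));
}

// True when `bytes` contains `sig` starting at `offset`.
function hasSignature(bytes: Uint8Array, sig: number[], offset = 0): boolean {
  if (bytes.length < offset + sig.length) return false;
  return sig.every((b, i) => bytes[offset + i] === b);
}

// Enforce the size cap first, then classify by leading bytes (hypothetical helper).
function sniffImage(bytes: Uint8Array, maxBytes = 10 * 1024 * 1024): ImageKind {
  if (bytes.length > maxBytes) {
    throw new Error(`image exceeds ${maxBytes} bytes`);
  }
  if (hasSignature(bytes, PNG_SIG)) return "png";
  if (hasSignature(bytes, JPEG_SIG)) return "jpeg";
  // WebP is a RIFF container: "RIFF" at offset 0, "WEBP" at offset 8.
  if (hasSignature(bytes, ascii("RIFF")) && hasSignature(bytes, ascii("WEBP"), 8)) {
    return "webp";
  }
  throw new Error("unsupported image format");
}

const pngHeader = new Uint8Array([...PNG_SIG, 0, 0, 0, 0]);
console.log(sniffImage(pngHeader));
```

Sniffing the bytes rather than trusting a file extension or Content-Type header means a mislabelled or hostile upload is rejected before it reaches the paid embedding call.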
All 166 tests across the cross-modal suite pass; no regression in existing voyage-multimodal / openai-compat-multimodal / search-mode suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/3): Phase 3 unified column + reindex + D8 fail-open + D23-#2

Phase 3 of the cross-modal search wave. Adds the unified multimodal column on content_chunks + the `gbrain reindex --multimodal` sweep + the `search.unified_multimodal` routing flag with D8 source-aware coverage guard + fail-open behavior. D17 honest framing: this is the phase that unlocks true image→full-text-knowledge; Phase 2's searchByImage transparently upgrades to the richer retrieval once the unified column has coverage.

D10 reindex-core extraction filed as a follow-up TODO. The existing markdown reindex walks pages and re-imports via importFromFile; this walks content_chunks and re-embeds via the gateway. Patterns rhyme but cores diverge enough that extraction balloons the diff. Both commands stand alone with their own checkpoint + cost-prompt logic.

- migrate.ts v68 (embedding_multimodal_column): column-only ALTER on content_chunks. HNSW partial index deferred to post-reindex build (D20: pgvector docs recommend post-load build for HNSW). Both engines.
- types.ts SearchOpts.embeddingColumn type widened to include 'embedding_multimodal'.
- postgres-engine.ts + pglite-engine.ts searchVector: route to embedding_multimodal column when opts.embeddingColumn set. NO modality filter (unified column carries both text + image content).
- hybrid.ts unified routing branch: when search.unified_multimodal=true, bypasses dual-column branching and runs embedQueryMultimodal + searchVector(embedding_multimodal). D8 fail-open: zero rows + not strict-mode → falls through to dual-column text path with structured warning. search.unified_multimodal_only=true bypasses the fallback.
- src/commands/reindex-multimodal.ts: `gbrain reindex --multimodal`.
  D7 lock via tryAcquireDbLock('gbrain-reindex-multimodal'); 6h TTL. Cost prompt + 10s Ctrl-C grace window in TTY; auto-proceeds non-TTY. GBRAIN_NO_REEMBED=1 bypass. Checkpoint at ~/.gbrain/reindex-multimodal-checkpoint.json for resume. D23-#2 auto-flip prompt at coverage=100% completion.
- cli.ts: `gbrain reindex --multimodal` dispatch with --limit, --dry-run, --cost-estimate, --no-embed, --yes, --json flags.
- doctor.ts: unified_multimodal_coverage check (D21 source-aware) + reports per-source % when search.unified_multimodal is on. Warns at <95% lowest source; fails when unified_multimodal_only=true AND lowest source <99%. Falls open to OK when column not yet present.

Tests:

- unified-multimodal.test.ts (8 cases): schema migration v68 applies, reindex --dry-run + --cost-estimate + GBRAIN_NO_REEMBED bypass + zero-pending fast-path, hybridSearch unified routing forces voyage endpoint, D8 fail-open routes to text on empty unified, D8 strict blocks text fallback.

All 211 tests across the cross-modal + related suite pass; no regression in voyage-multimodal / openai-compat-multimodal / search-mode / intent / search base suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/4): LLM intent escalation for ambiguous modality

Commit 4 of the cross-modal search wave (opt-in default off). When `search.cross_modal.llm_intent` is true AND the regex classifier returned 'text' AND `isAmbiguousModalityQuery(query)` fires, hybridSearch awaits a Haiku tie-break via gateway.chat() before routing. The ambiguous-modality gate (introduced in Commit 1) ensures the LLM call only fires on the narrow band where regex misses but a visual noun + reference marker both fire, roughly <1% of queries with the flag on.

- src/core/search/llm-intent.ts: new module. `classifyModalityWithLLM` routes through gateway.chat() with a fixed system prompt ("Output exactly one word: text, image, or both"). 1s timeout via AbortController.
  `parseModality` is a pure exported helper that tolerates trailing punctuation + casing. Fail-open on every error path (gateway unavailable, timeout, parse failure, unrecognized output).
- src/core/search/hybrid.ts: escalation branch slots BEFORE the unified routing branch. Gated by: no explicit per-call crossModal opt, regex result == 'text', config flag on, ambiguity heuristic fires. Fail-open to regex result on any error from the LLM tie-break.

Tests:

- llm-intent-escalation.test.ts (14 cases): parseModality tolerance matrix (text / image / both / trailing punct / whitespace / unrecognized / empty), classifyModalityWithLLM happy paths for all 3 outputs, fail-open on throw / unrecognized output / gateway-not-configured, explicit-fallback-honored.
- llm-intent-hybrid-integration.test.ts (6 cases): hybridSearch escalation gate fires ONLY when flag-on + ambiguous; off when flag-off, unambiguous, regex-confident, or explicit per-call opt set; fail-open on LLM throw.

All 231 tests across the cross-modal + related suite pass; no regression in voyage-multimodal / openai-compat-multimodal / search-mode / intent / search base suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cross-modal/3): verify-gate fixes for full test suite

Three small fixes to pass the full unit + E2E sweep after the cross-modal wave commits land.

- migrate.ts v67: drop date_trunc('day', created_at) from mcp_spend_log indexes. TIMESTAMPTZ truncation depends on session timezone and isn't IMMUTABLE, so Postgres rejects the function in the index expression with SQLSTATE 42P17. BTREE on (client_id, created_at) covers the per-day rollup query via range scan on created_at: same performance, no IMMUTABLE constraint.
- pglite-schema.ts + src/schema.sql: shorten the embedding_multimodal column comment.
  The longer version contained a comma inside a SQL line comment ("...search.unified_multimodal=true, all queries..."), which broke parseBaseTableColumns in test/schema-bootstrap-coverage (the parser splits on commas at depth-0 before stripping comments, so the comma inside the comment shortened the column-definition part and an "all" token from "all queries" got picked up as the next column name, silently hiding embedding_multimodal from coverage).
- schema-embedded.ts: regenerated via `bun run build:schema`.
- test/e2e/v030_1-integration-pglite.test.ts: listBackfills assertion extended to include the new `modality` entry registered in src/core/backfill-registry.ts as part of Commit 1.
- test/search/knobs-hash-reranker.test.ts: KNOBS_HASH_VERSION assertion updated from 2→3 to match the cross-modal-wave hash-key extension (D2 cache contamination fix). Same shape as the prior v0.32→v0.35 bump.
- test/unified-multimodal.test.ts: migrated process.env mutation to withEnv() helper to satisfy the scripts/check-test-isolation R1 rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cross-modal): VERSION + CHANGELOG + CLAUDE.md + spec doc + llms regen

Final docs commit for the cross-modal wave (v0.36.0.0).
- VERSION + package.json: bump 0.35.5.1 → 0.36.0.0
- CHANGELOG.md: full Garry-voice release entry with five-commit breakdown, the-numbers-that-matter table, what-this-means-for-you, and the required to-take-advantage-of-v0.36.0.0 block
- docs/issues/cross-modal-search.md: cherry-picked from PR #1127 head (164 lines, the original spec doc preserved as historical reference for Phase 2 + 3 background)
- CLAUDE.md: Key Files entries for src/core/ssrf-validate.ts, src/core/search/image-loader.ts, src/core/search/by-image.ts, src/core/search/llm-intent.ts, src/core/spend-log.ts, src/commands/reindex-multimodal.ts, plus extension annotations on src/core/search/query-intent.ts, src/core/search/mode.ts, src/core/search/hybrid.ts, src/core/backfill-registry.ts, src/core/migrate.ts (v67 + v68)
- llms-full.txt + llms.txt: regenerated via `bun run build:llms`

`bun run verify` clean (privacy + proposal-pii + test-names + jsonb + source-id-projection + progress + test-isolation + wasm + admin-build + admin-scope-drift + cli-exec + system-of-record + eval-glossary + typecheck). `bun test test/build-llms.test.ts` clean (7/7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cross-modal): renumber migrations 67→69 + 68→70 post-master-merge

Master shipped its own v67 (`facts_typed_claim_columns`) during the cross-modal wave's review cycle. The merge picked up both sides' v67 entries, breaking the migration-distinct-versions test. Renumbering moves cross-modal's table + column ALTER off the collision:

- v67 mcp_spend_log → v69 mcp_spend_log
- v68 embedding_multimodal_column → v70 embedding_multimodal_column

References updated in CHANGELOG, CLAUDE.md, pglite-schema.ts, schema.sql. schema-embedded.ts regenerated. llms-full.txt regenerated.

7006 unit tests pass, 0 fail. No test code touched, just version renumbering plus comment refs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version 0.36.0.0 → 0.36.4.0

Bumping to v0.36.4.0 to land in the queue slot the user requested. No behavior change; pure version bump across VERSION, package.json, CHANGELOG.md header, llms-full.txt regen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
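
Several commits in the log above merge a text branch and an image branch with weighted RRF (reciprocal rank fusion). A minimal sketch of that technique follows, assuming the common form score = Σ weight / (k + rank). The value 60 for `RRF_K` is a widely used default and an assumption here, since the commit message names the constant but not its value. `weightedRRF` and `RankedList` are hypothetical names, and the weights merely stand in for knobs such as cross_modal_both_text_weight.

```typescript
// Assumed value; the commit log exports RRF_K but does not state what it is.
const RRF_K = 60;

interface RankedList {
  ids: string[];  // result ids, best first
  weight: number; // per-branch weight
}

interface Fused {
  id: string;
  score: number;
}

// Weighted reciprocal rank fusion: each list contributes weight / (k + rank)
// for every id it contains, with rank starting at 1.
function weightedRRF(lists: RankedList[], k: number = RRF_K): Fused[] {
  const scores = new Map<string, number>();
  lists.forEach(({ ids, weight }) => {
    ids.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  });
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// A chunk found by both branches outranks chunks found by only one.
const fused = weightedRRF([
  { ids: ["t1", "shared"], weight: 1.0 }, // text branch
  { ids: ["shared", "i1"], weight: 1.0 }, // image branch
]);
console.log(fused.map((r) => r.id));
```

Because RRF uses ranks rather than raw similarity scores, it can combine results from two different embedding models whose distances are not directly comparable.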

garrytan mentioned this pull request May 18, 2026

v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) #1165

Merged

6 tasks

garrytan closed this in #1165 May 20, 2026

garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/cross-modal-search
