
feat: cross-modal search — text↔image retrieval#1127

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/cross-modal-search


@garrytan-agents

Contributor

What

Cross-modal search spec — enables text queries to find images and (future) image queries to find text.

The Problem

gbrain has 11K image chunks embedded with Voyage multimodal-3, a valid HNSW index, but no way to actually search them. Text queries only search text embeddings. The multimodal embedding space exists but nothing routes queries through it.

Approach (Phased)

Phase 1 (this PR): Text → Image search

  • Intent detection: detect "show me photos of..." / "find images from..." patterns
  • Multimodal query embedding: route text queries through Voyage multimodal-3 (same space as image embeddings)
  • Search embedding_image column with the multimodal vector
  • Prereq: fix modality column backfill (11K chunks have embeddings but wrong modality tag)
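
A minimal sketch of the intent-detection step listed above. The pattern list and the `detectModality` name are illustrative assumptions, not the regexes that ship in the PR; the point is only that a small, conservative pattern set can decide whether a text query should be routed to the image column.

```typescript
// Hypothetical patterns for illustration; the real list lives in the PR's code.
const IMAGE_INTENT_PATTERNS: RegExp[] = [
  /\bshow me (?:the |some )?(?:photos?|pictures?|images?)\b/i,
  /\bfind (?:the |some )?(?:photos?|pictures?|images?)\b/i,
];

type Modality = "text" | "image";

// Returns "image" when any pattern matches; everything else stays on the text path.
function detectModality(query: string): Modality {
  return IMAGE_INTENT_PATTERNS.some((pattern) => pattern.test(query))
    ? "image"
    : "text";
}

console.log(detectModality("show me photos of the offsite"));
console.log(detectModality("what did we decide about pricing"));
```

Keeping the patterns narrow means a false negative simply falls back to ordinary text search, which is the safer failure mode.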

Phase 2 (future): Image → Text search (upload a photo, find related text)

Phase 3 (future): Unified multimodal column (everything in one embedding space)

Key Evidence

  • 11,204 image chunks embedded, 83 MB HNSW index, valid ✅
  • Modality metadata broken: only 10 chunks marked image (should be ~11K)
  • Voyage multimodal-3 already configured as embedding_multimodal_model
  • The embedding space supports cross-modal: text and images in same 1024d space

How to Implement

Pick up with Claude Code:

  • Read docs/issues/cross-modal-search.md for full spec
  • 6 test cases specified (intent detection, multimodal embed routing, cross-modal search, fallback)
  • Key files: search/cross-modal.ts (new), embedding.ts, hybrid.ts, types.ts
  • Depends on PR #1106 (feat: dynamic embedding column selection for search), which adds the embedding column registry needed for clean provider routing

File: docs/issues/cross-modal-search.md

…modal-3

Phase 1: text→image search with intent detection + multimodal query embedding.
Builds on PR garrytan#1106 embedding column registry. Includes modality backfill prereq,
6 test cases, and phasing plan for image→text and unified column.

File: docs/issues/cross-modal-search.md
garrytan added a commit that referenced this pull request May 20, 2026
…+ LLM intent) (#1165)

* feat(cross-modal/0): batched multimodal + query helpers + SSRF helper

Commit 0 of the cross-modal search wave. Foundation for Phase 1-3:

- embedMultimodal accepts MultimodalInput text variant + EmbedMultimodalOpts
  with inputType: 'document' | 'query' (D22-2). Default unchanged so
  importImageFile keeps document-side embedding.
- embedQueryMultimodal(text) + embedQueryMultimodalImage(input) wrappers
  for hybridSearch + searchByImage query paths.
- embedMultimodalSafe binary-search retry on transient batch failure +
  failed_indices surfacing. Phase 3 reindex uses this so a single bad
  chunk doesn't discard the 31 in-flight embeddings around it.
- Voyage path: text + image inputs in one batch via content arrays.
- openai-compat path: text + image inputs in one request per input.
- src/core/ssrf-validate.ts (D19): DNS-resolve-and-fetch-by-IP defense
  for redirect chains. Closes the DNS-rebinding gap that url-safety.ts'
  static check leaves open. Uses node:dns/promises with {all: true,
  family: 0} to inspect every A and AAAA record before connecting.
  fetchWithSSRFGuard helper validates per-redirect-hop and limits chain
  depth (default 3).
- Re-exports from src/core/embedding.ts public seam.
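
The binary-search retry described for embedMultimodalSafe can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: `embedSafe`, `EmbedBatch`, and the result shape are hypothetical names, and the real helper may distinguish transient from permanent errors in ways this sketch does not.

```typescript
type Embedding = number[];
// Hypothetical batch embedder: rejects if the batch as a whole fails.
type EmbedBatch<T> = (inputs: T[]) => Promise<Embedding[]>;

interface SafeResult {
  embeddings: (Embedding | null)[]; // null at positions that could not be embedded
  failedIndices: number[];
}

async function embedSafe<T>(inputs: T[], embedBatch: EmbedBatch<T>): Promise<SafeResult> {
  const embeddings: (Embedding | null)[] = inputs.map(() => null);
  const failedIndices: number[] = [];

  // Try the half-open range [start, end); on failure, bisect and retry each half.
  async function attempt(start: number, end: number): Promise<void> {
    if (start >= end) return;
    try {
      const vectors = await embedBatch(inputs.slice(start, end));
      vectors.forEach((vector, i) => {
        embeddings[start + i] = vector;
      });
    } catch {
      if (end - start === 1) {
        failedIndices.push(start); // a single input that still fails is reported
        return;
      }
      const mid = start + Math.floor((end - start) / 2);
      await attempt(start, mid);
      await attempt(mid, end);
    }
  }

  await attempt(0, inputs.length);
  return { embeddings, failedIndices };
}
```

With one bad input in a batch of N, the bisection isolates it in roughly log2(N) extra calls and keeps every other embedding.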

Tests:
- test/embed-multimodal-batching.test.ts (13 cases): text variant, query
  inputType discipline, mixed text+image batches, embedQueryMultimodal,
  embedQueryMultimodalImage, embedMultimodalSafe happy/empty/all-fail/
  mid-batch-recovery/permanent-misconfig.
- test/ssrf-validate.test.ts (20 cases): static rejections via
  isInternalUrl, scheme + credentials rejection, DNS rebinding defense
  (single-record + multi-record), public happy path, IPv6 literals,
  malformed URLs.

No regression in existing voyage-multimodal.test.ts or
openai-compat-multimodal.test.ts (33 cases all pass).
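
For readers unfamiliar with the DNS-rebinding defense in this commit, here is a simplified sketch of the "inspect every record" idea. The function names and the deny list are assumptions made for illustration (the deny list is deliberately incomplete); only `lookup` from `node:dns/promises` with `{ all: true, family: 0 }` is taken from the commit text.

```typescript
import { lookup } from "node:dns/promises";

interface ResolvedAddress {
  address: string;
  family: number; // 4 or 6
}

// Simplified deny list (assumption): loopback, RFC 1918, link-local, unique-local.
function isInternalAddress({ address, family }: ResolvedAddress): boolean {
  if (family === 4) {
    const parts = address.split(".").map(Number);
    const a = parts[0] ?? -1;
    const b = parts[1] ?? -1;
    return (
      a === 0 || a === 10 || a === 127 ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168) ||
      (a === 169 && b === 254)
    );
  }
  const lower = address.toLowerCase();
  return (
    lower === "::" || lower === "::1" ||
    /^f[cd]/.test(lower) ||      // fc00::/7 unique-local
    /^fe[89ab]/.test(lower) ||   // fe80::/10 link-local
    lower.startsWith("::ffff:")  // IPv4-mapped; rejected outright in this sketch
  );
}

// A host is rejected if ANY of its records is internal, not just the first one.
function findInternalRecord(records: ResolvedAddress[]): ResolvedAddress | undefined {
  return records.find((record) => isInternalAddress(record));
}

// Resolve once, validate every A/AAAA record, and hand back the vetted IPs
// so the caller can connect by IP instead of re-resolving the hostname.
async function resolvePublicAddresses(host: string): Promise<ResolvedAddress[]> {
  const records = await lookup(host, { all: true, family: 0 });
  const bad = findInternalRecord(records);
  if (bad) throw new Error(`${host} resolves to internal address ${bad.address}`);
  return records;
}
```

A redirect-following fetch would repeat this check on every hop, which is what the per-hop validation in the commit describes.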

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/1): Phase 1 text→image routing + knobsHash + RRF + backfill

Phase 1 of the cross-modal search wave. Wires the existing 1024d Voyage
multimodal embedding space (already populated for image chunks via
importImageFile) into the user-facing query path. Text queries that match
cross-modal intent regex route through Voyage multimodal-3 instead of the
text embedding model, then search content_chunks.embedding_image.

- query-intent.ts: new `suggestedModality: 'text' | 'image' | 'both'`
  axis on `QuerySuggestions`. Module-scope CROSS_MODAL_PATTERNS regex
  array (D15 — compiled once at module load). Conservative on purpose;
  LLM intent escalation (Commit 4) catches genuinely ambiguous phrasings.
- query-intent.ts: new `isAmbiguousModalityQuery(query)` pure heuristic
  for Commit 4's escalation gate. Returns true ONLY when regex misses
  AND a visual noun + reference marker both fire.
- types.ts: `SearchOpts.crossModal: 'text' | 'image' | 'both' | 'auto'`
  + `SearchResult.modality: 'text' | 'image'` for downstream renderers.
- mode.ts: 7 new knobs in ModeBundle (D2): cross_modal_both_text_weight,
  cross_modal_both_image_weight, image_query_text_refinement_weight,
  image_query_image_refinement_weight, unified_multimodal,
  unified_multimodal_only, cross_modal_llm_intent. All three mode
  bundles default to the same values (cross-modal is opt-in).
- mode.ts: D2 cache-key fix — KNOBS_HASH_VERSION bumped 2→3, all 7 new
  knobs participate in knobsHash so a text-mode cache hit can't be
  served to an image-mode caller.
- mode.ts: D3 registry — all 7 keys land in SEARCH_MODE_CONFIG_KEYS so
  `gbrain search modes` / `stats` / `tune` see them.
- hybrid.ts: routing branch at the embed step. Resolves effective
  modality from (per-call opts → suggestions → 'text'). Image route:
  embedQueryMultimodal + searchVector(embedding_image), skip expansion
  + keyword (D9 mode-bundle override). Both route: parallel text + image
  vector searches merged via weighted RRF (D6) with cross_modal_both_*
  weights. Fail-open: multimodal misconfigured → structured warn + text
  fallback. 'auto' literal normalized to undefined (D22-1).
- operations.ts: thread `cross_modal` param through `query` op.
- backfill-registry.ts: new `modality` backfill kind. SQL filter requires
  `chunk_source='image_asset'` (D22-7 defensive guard). Idempotent.
- doctor.ts: `cross_modal_modality_backfill` check surfaces unflagged
  image-asset chunks with paste-ready `gbrain backfill modality` hint.
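
The weighted RRF merge used by the "both" route can be illustrated with a short sketch. The function name, the result shape, and the default `k = 60` (a conventional RRF constant) are assumptions; the commit does not state the project's RRF_K value or exact signature.

```typescript
interface Ranked {
  id: string;
}

interface WeightedList {
  results: Ranked[]; // already sorted best-first
  weight: number;    // e.g. the text or image weight for the "both" route
}

// Weighted reciprocal rank fusion: score(d) = sum over lists of weight / (k + rank).
function weightedRrf(lists: WeightedList[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const { results, weight } of lists) {
    results.forEach((result, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(result.id, (scores.get(result.id) ?? 0) + weight / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const merged = weightedRrf([
  { results: [{ id: "a" }, { id: "b" }], weight: 1 }, // text branch
  { results: [{ id: "b" }, { id: "c" }], weight: 1 }, // image branch
]);
console.log(merged.map((m) => m.id)); // "b" ranks first because both branches returned it
```

Because RRF works on ranks rather than raw similarity scores, the text and image branches can be merged even though their distance scales differ.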

Tests:
- cross-modal-phase1.test.ts (45 cases): regex classification (positive
  + negative + plural-safe), isAmbiguousModalityQuery, D3 registry, D2
  knobsHash diffs across all 7 new knobs, MODE_BUNDLES defaults,
  resolveSearchMode precedence chain.
- cross-modal-hybrid-integration.test.ts (7 cases): PGLite + stubbed
  gateway. Verifies image-modality calls Voyage and not OpenAI, text
  calls OpenAI and not Voyage, 'auto' literal normalizes, 'both' mode
  hits both endpoints, fail-open routes to text on multimodal misconfig.
- search-mode.test.ts: updated MODE_BUNDLES + KNOBS_HASH_VERSION
  assertions (148 cross-suite tests still pass; no regression).
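
To see why the new knobs must participate in the cache key, consider this sketch of a knob hash. It is a hypothetical reconstruction: the canonicalization, the SHA-256 choice, and the 16-character truncation are assumptions; only the idea of a versioned hash over every knob comes from the commit.

```typescript
import { createHash } from "node:crypto";

const KNOBS_HASH_VERSION = 3; // bumped whenever the set of hashed knobs changes

type Knobs = Record<string, string | number | boolean>;

// Canonicalize by sorting keys so that property order never changes the hash.
function knobsHash(knobs: Knobs): string {
  const canonical = Object.keys(knobs)
    .sort()
    .map((key) => `${key}=${String(knobs[key])}`)
    .join("&");
  return createHash("sha256")
    .update(`v${KNOBS_HASH_VERSION}|${canonical}`)
    .digest("hex")
    .slice(0, 16);
}

const textMode = knobsHash({ unified_multimodal: false, cross_modal_both_image_weight: 1 });
const imageMode = knobsHash({ unified_multimodal: true, cross_modal_both_image_weight: 1 });
console.log(textMode !== imageMode); // different knobs produce different cache keys
```

If a knob were left out of the hash, two callers with different routing settings would share a key and could be served each other's cached results.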

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/2): Phase 2 image-as-query + D18 path ban + D23-#6 spend cap

Phase 2 of the cross-modal search wave. Adds the `search_by_image` MCP op,
the SSRF-defended image loader, and the daily per-OAuth-client spend cap
on paid Voyage multimodal calls. D17 honest framing applied: Phase 2 ships
image→similar-images + image-OCR-text retrieval. True image→full-text-
knowledge requires Phase 3's unified column.

- src/core/search/image-loader.ts: loadImageInput accepts local path,
  data: URI, or http(s):// URL. Magic-byte sniff for PNG/JPEG/WebP (no
  other formats). Hard size cap (10MB local default, 2MB remote default).
  http(s) path uses fetchWithSSRFGuard from Commit 0: every redirect hop
  re-resolved via DNS lookup + every record checked against the internal
  IP deny list. Max 3 redirect hops. 5s total fetch timeout. Pre-flight
  Content-Length check + post-fetch size guard for lying servers.
- src/core/search/by-image.ts: searchByImage runs the image branch
  always; D13 hybrid intersect runs a parallel text branch when
  `query` is provided, merged via weighted RRF. Phase 3 will widen
  the column routing to embedding_multimodal once that lands.
- src/core/operations.ts: new search_by_image op (scope: read, NOT
  localOnly). D18 P0 — when ctx.remote === true AND image_path is set,
  rejects with permission_denied at handler entry (validateParams would
  catch it again at dispatch). D5 source-id thread via sourceScopeOpts.
  D12 per-param length cap enforced via remote-vs-local maxBytes config
  read at handler entry. D23-#6 pre-flight checkBudget + post-call
  recordSpend (best-effort; failures don't block response).
- src/core/spend-log.ts: BudgetExceededError + checkBudget + recordSpend
  + getTodaySpendCents. UTC day-aligned aggregation so the cap rolls
  over deterministically. Local CLI callers (no clientId) bypass the
  gate entirely. Pre-v0.36 brains without the mcp_spend_log table fail
  open to spend=0; the migration brings the table in on first start.
- src/core/migrate.ts: new migration v67 mcp_spend_log table + indexes
  for the (client_id, day) and (token_name, day) hot reads. PGLite
  parity via sqlFor.pglite.
- src/core/search/hybrid.ts: RRF_K constant exported so by-image.ts can
  share the same effective-K math as the main hybrid path.
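
The magic-byte sniff mentioned for image-loader.ts can be sketched as below. The function names are hypothetical; the signatures themselves are the standard PNG, JPEG, and WebP file headers.

```typescript
type ImageMime = "image/png" | "image/jpeg" | "image/webp";

// True when `buf` contains `signature` starting at `offset`.
function hasBytes(buf: Uint8Array, signature: number[], offset = 0): boolean {
  if (buf.length < offset + signature.length) return false;
  return signature.every((byte, i) => buf[offset + i] === byte);
}

// Returns the detected MIME type, or null for anything that is not PNG/JPEG/WebP.
function sniffImageMime(buf: Uint8Array): ImageMime | null {
  if (hasBytes(buf, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (hasBytes(buf, [0xff, 0xd8, 0xff])) return "image/jpeg";
  // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
  if (hasBytes(buf, [0x52, 0x49, 0x46, 0x46]) && hasBytes(buf, [0x57, 0x45, 0x42, 0x50], 8)) {
    return "image/webp";
  }
  return null;
}
```

Sniffing the bytes rather than trusting a file extension or Content-Type header means a mislabeled or hostile payload is rejected before it reaches the embedding provider.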

Tests:
- cross-modal-phase2.test.ts (15 cases): magic-byte sniffing (PNG +
  JPEG + WebP positive, GIF rejection), oversized rejection (default +
  custom cap), data: URI happy path + malformed + decoded-non-image
  + oversized, invalid input shapes (empty + ftp), SSRF defense via
  DNS rebinding stub.
- search-by-image-op.test.ts (7 cases): D18 remote image_path
  rejection + local CLI accepts; input validation (missing all three /
  multiple together); D23-#6 budget block-at-cap + allow-under-cap +
  local-CLI-bypass; migration v67 mcp_spend_log table applied cleanly.
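
A rough in-memory model of the daily spend cap in this commit, to make the UTC day alignment concrete. The row shape and function signatures are assumptions (the real functions read from the mcp_spend_log table); the block-at-cap and local-bypass behavior follow the test descriptions above.

```typescript
interface SpendRow {
  clientId: string;
  cents: number;
  createdAt: Date;
}

class BudgetExceededError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "BudgetExceededError";
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Midnight UTC of the day containing `now`, as epoch milliseconds.
function utcDayStart(now: Date): number {
  return Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
}

// Sum one client's spend inside the half-open window [UTC midnight, next UTC midnight).
function todaySpendCents(rows: SpendRow[], clientId: string, now: Date): number {
  const start = utcDayStart(now);
  const end = start + DAY_MS;
  return rows
    .filter((row) => {
      const t = row.createdAt.getTime();
      return row.clientId === clientId && t >= start && t < end;
    })
    .reduce((sum, row) => sum + row.cents, 0);
}

// Pre-flight gate: callers without a clientId (local CLI) bypass the cap entirely.
function checkBudget(rows: SpendRow[], clientId: string | undefined, capCents: number, now: Date): void {
  if (clientId === undefined) return;
  const spent = todaySpendCents(rows, clientId, now);
  if (spent >= capCents) {
    throw new BudgetExceededError(`client ${clientId} spent ${spent} of ${capCents} cents today`);
  }
}
```

Expressing "today" as a half-open timestamp range also maps directly onto a range scan over `created_at`, with no per-row date truncation.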

All 166 tests across the cross-modal suite pass; no regression in
existing voyage-multimodal / openai-compat-multimodal / search-mode suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/3): Phase 3 unified column + reindex + D8 fail-open + D23-#2

Phase 3 of the cross-modal search wave. Adds the unified multimodal column
on content_chunks + the `gbrain reindex --multimodal` sweep + the
`search.unified_multimodal` routing flag with D8 source-aware coverage
guard + fail-open behavior. D17 honest framing: this is the phase that
unlocks true image→full-text-knowledge — Phase 2's searchByImage
transparently upgrades to the richer retrieval once the unified column
has coverage.

D10 reindex-core extraction filed as a follow-up TODO. The existing
markdown reindex walks pages and re-imports via importFromFile; this
walks content_chunks and re-embeds via the gateway. Patterns rhyme but
cores diverge enough that extraction balloons the diff. Both commands
stand alone with their own checkpoint + cost-prompt logic.

- migrate.ts v68 (embedding_multimodal_column): column-only ALTER on
  content_chunks. HNSW partial index deferred to post-reindex build
  (D20: pgvector docs recommend post-load build for HNSW). Both engines.
- types.ts SearchOpts.embeddingColumn type widened to include
  'embedding_multimodal'.
- postgres-engine.ts + pglite-engine.ts searchVector: route to
  embedding_multimodal column when opts.embeddingColumn set. NO modality
  filter (unified column carries both text + image content).
- hybrid.ts unified routing branch: when search.unified_multimodal=true,
  bypasses dual-column branching and runs embedQueryMultimodal +
  searchVector(embedding_multimodal). D8 fail-open: zero rows + not
  strict-mode → falls through to dual-column text path with structured
  warning. search.unified_multimodal_only=true bypasses the fallback.
- src/commands/reindex-multimodal.ts: `gbrain reindex --multimodal`.
  D7 lock via tryAcquireDbLock('gbrain-reindex-multimodal'); 6h TTL.
  Cost prompt + 10s Ctrl-C grace window in TTY; auto-proceeds non-TTY.
  GBRAIN_NO_REEMBED=1 bypass. Checkpoint at
  ~/.gbrain/reindex-multimodal-checkpoint.json for resume. D23-#2
  auto-flip prompt at coverage=100% completion.
- cli.ts: `gbrain reindex --multimodal` dispatch with --limit, --dry-run,
  --cost-estimate, --no-embed, --yes, --json flags.
- doctor.ts: unified_multimodal_coverage check (D21 source-aware) +
  reports per-source % when search.unified_multimodal is on. Warns at
  <95% lowest source; fails when unified_multimodal_only=true AND
  lowest source <99%. Falls open to OK when column not yet present.
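
The doctor thresholds in the last bullet reduce to a small decision function. The sketch below is one reading of that bullet; the function name, the input shape, and the treatment of an empty source list as OK are assumptions.

```typescript
type CheckStatus = "ok" | "warn" | "fail";

// perSourcePercent: coverage (0-100) per source, or null when the column does not exist yet.
function unifiedCoverageStatus(
  perSourcePercent: Record<string, number> | null,
  unifiedOnly: boolean,
): CheckStatus {
  if (perSourcePercent === null) return "ok"; // column not present: fall open
  const values = Object.keys(perSourcePercent).map((key) => perSourcePercent[key] ?? 0);
  if (values.length === 0) return "ok"; // assumption: nothing to measure
  const lowest = Math.min(...values); // the weakest source drives the verdict
  if (unifiedOnly && lowest < 99) return "fail";
  if (lowest < 95) return "warn";
  return "ok";
}

console.log(unifiedCoverageStatus({ notes: 100, photos: 97 }, false));
console.log(unifiedCoverageStatus({ notes: 100, photos: 97 }, true));
```

Judging by the lowest source rather than the global average keeps one fully indexed large source from masking a small source with almost no coverage.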

Tests:
- unified-multimodal.test.ts (8 cases): schema migration v68 applies,
  reindex --dry-run + --cost-estimate + GBRAIN_NO_REEMBED bypass +
  zero-pending fast-path, hybridSearch unified routing forces voyage
  endpoint, D8 fail-open routes to text on empty unified, D8 strict
  blocks text fallback.
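
The D8 fail-open and strict-mode behaviors exercised by these tests can be summarized in a short routing sketch. All names here (`routeSearch`, `UnifiedRouteOpts`, the injected search callbacks) are hypothetical stand-ins for the real hybridSearch internals.

```typescript
interface Hit {
  id: string;
  score: number;
}

type Search = (query: string) => Promise<Hit[]>;

interface UnifiedRouteOpts {
  unifiedMultimodal: boolean;     // route through the unified column
  unifiedMultimodalOnly: boolean; // strict mode: never fall back to text
  warn?: (message: string) => void;
}

async function routeSearch(
  query: string,
  searchUnified: Search,
  searchText: Search,
  opts: UnifiedRouteOpts,
): Promise<Hit[]> {
  if (!opts.unifiedMultimodal) return searchText(query);

  const hits = await searchUnified(query);
  // Non-empty results, or strict mode, end the routing here.
  if (hits.length > 0 || opts.unifiedMultimodalOnly) return hits;

  // Fail-open: zero rows usually means the unified column is not populated yet.
  opts.warn?.("unified multimodal search returned zero rows; falling back to text path");
  return searchText(query);
}
```

Fail-open keeps search usable while a reindex is still in progress; strict mode exists for callers who would rather see an empty result than a silently different retrieval path.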

All 211 tests across the cross-modal + related suite pass; no
regression in voyage-multimodal / openai-compat-multimodal / search-mode
/ intent / search base suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cross-modal/4): LLM intent escalation for ambiguous modality

Commit 4 of the cross-modal search wave (opt-in default off).

When `search.cross_modal.llm_intent` is true AND the regex classifier
returned 'text' AND `isAmbiguousModalityQuery(query)` fires, hybridSearch
awaits a Haiku tie-break via gateway.chat() before routing. The
ambiguous-modality gate (introduced in Commit 1) ensures the LLM call
only fires on the narrow band where regex misses but a visual noun +
reference marker both fire — roughly <1% of queries with the flag on.

- src/core/search/llm-intent.ts: new module. `classifyModalityWithLLM`
  routes through gateway.chat() with a fixed system prompt ("Output
  exactly one word: text, image, or both"). 1s timeout via AbortController.
  `parseModality` is a pure exported helper that tolerates trailing
  punctuation + casing. Fail-open on every error path (gateway
  unavailable, timeout, parse failure, unrecognized output).
- src/core/search/hybrid.ts: escalation branch slots BEFORE the unified
  routing branch. Gated by: no explicit per-call crossModal opt, regex
  result == 'text', config flag on, ambiguity heuristic fires. Fail-open
  to regex result on any error from the LLM tie-break.
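
A tolerant parser in the spirit of `parseModality` might look like the sketch below. The body is an assumption; the commit only states that the helper is pure and tolerates trailing punctuation and casing.

```typescript
type Modality = "text" | "image" | "both";

const MODALITIES: readonly Modality[] = ["text", "image", "both"];

// Normalizes a one-word LLM reply; returns null for anything unrecognized
// so the caller can fall back to the regex result.
function parseModality(raw: string): Modality | null {
  const word = raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z]+$/, ""); // drop trailing punctuation or whitespace
  return MODALITIES.find((modality) => modality === word) ?? null;
}

console.log(parseModality("Image."));
console.log(parseModality("video"));
```

Returning `null` instead of guessing keeps the escalation fail-open: an unexpected reply never changes routing.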

Tests:
- llm-intent-escalation.test.ts (14 cases): parseModality tolerance
  matrix (text / image / both / trailing punct / whitespace /
  unrecognized / empty), classifyModalityWithLLM happy paths for all 3
  outputs, fail-open on throw / unrecognized output / gateway-not-
  configured, explicit-fallback-honored.
- llm-intent-hybrid-integration.test.ts (6 cases): hybridSearch
  escalation gate fires ONLY when flag-on + ambiguous; off when flag-off,
  unambiguous, regex-confident, or explicit per-call opt set; fail-open
  on LLM throw.

All 231 tests across the cross-modal + related suite pass; no
regression in voyage-multimodal / openai-compat-multimodal /
search-mode / intent / search base suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cross-modal/3): verify-gate fixes for full test suite

Three small fixes to pass the full unit + E2E sweep after the cross-modal
wave commits land.

- migrate.ts v67: drop date_trunc('day', created_at) from
  mcp_spend_log indexes. TIMESTAMPTZ truncation depends on session
  timezone and isn't IMMUTABLE, so Postgres rejects the function in
  the index expression with SQLSTATE 42P17. BTREE on
  (client_id, created_at) covers the per-day rollup query via range
  scan on created_at — same performance, no IMMUTABLE constraint.
- pglite-schema.ts + src/schema.sql: shorten the embedding_multimodal
  column comment. The longer version contained a comma inside a SQL
  line comment ("...search.unified_multimodal=true, all queries..."),
  which broke parseBaseTableColumns in test/schema-bootstrap-coverage
  (the parser splits on commas at depth-0 before stripping comments,
  so the comma inside the comment shortened the column-definition part
  and an "all" token from "all queries" got picked up as the next
  column name — silently hiding embedding_multimodal from coverage).
- schema-embedded.ts: regenerated via `bun run build:schema`.
- test/e2e/v030_1-integration-pglite.test.ts: listBackfills assertion
  extended to include the new `modality` entry registered in
  src/core/backfill-registry.ts as part of Commit 1.
- test/search/knobs-hash-reranker.test.ts: KNOBS_HASH_VERSION assertion
  updated from 2→3 to match the cross-modal-wave hash-key extension
  (D2 cache contamination fix). Same shape as the prior
  v0.32→v0.35 bump.
- test/unified-multimodal.test.ts: migrated process.env mutation to
  withEnv() helper to satisfy the scripts/check-test-isolation R1
  rule.
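
The comma-in-comment failure is easy to reproduce with any splitter that looks for top-level commas before removing comments. The commit works around it by shortening the comment; purely for illustration, the sketch below shows the alternative ordering, stripping `--` line comments first. These helpers are hypothetical and are not the project's parseBaseTableColumns; the comment stripping is naive and ignores `--` inside string literals.

```typescript
// Remove everything from "--" to end of line (naive: ignores string literals).
function stripLineComments(sql: string): string {
  return sql
    .split("\n")
    .map((line) => {
      const at = line.indexOf("--");
      return at === -1 ? line : line.slice(0, at);
    })
    .join("\n");
}

// Split on commas that are not nested inside parentheses.
function splitTopLevelCommas(body: string): string[] {
  const clean = stripLineComments(body);
  const parts: string[] = [];
  let depth = 0;
  let current = "";
  for (let i = 0; i < clean.length; i++) {
    const ch = clean.charAt(i);
    if (ch === "(") depth++;
    else if (ch === ")") depth--;
    if (ch === "," && depth === 0) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}

// The first token of each definition is taken as the column name.
function columnNames(body: string): string[] {
  return splitTopLevelCommas(body).map((part) => part.split(/\s+/)[0] ?? "");
}

const body = [
  "id INT,",
  "embedding_multimodal vector(1024), -- used when unified_multimodal=true, all queries",
  "price NUMERIC(10, 2),",
  "name TEXT",
].join("\n");
console.log(columnNames(body));
```

With comments stripped first, the comma inside the comment never reaches the splitter, so no stray "all" column appears.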

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cross-modal): VERSION + CHANGELOG + CLAUDE.md + spec doc + llms regen

Final docs commit for the cross-modal wave (v0.36.0.0).

- VERSION + package.json: bump 0.35.5.1 → 0.36.0.0
- CHANGELOG.md: full Garry-voice release entry with five-commit breakdown,
  the-numbers-that-matter table, what-this-means-for-you, and the
  required to-take-advantage-of-v0.36.0.0 block
- docs/issues/cross-modal-search.md: cherry-picked from PR #1127 head
  (164 lines, the original spec doc preserved as historical reference
  for Phase 2 + 3 background)
- CLAUDE.md: Key Files entries for src/core/ssrf-validate.ts,
  src/core/search/image-loader.ts, src/core/search/by-image.ts,
  src/core/search/llm-intent.ts, src/core/spend-log.ts,
  src/commands/reindex-multimodal.ts, plus extension annotations on
  src/core/search/query-intent.ts, src/core/search/mode.ts,
  src/core/search/hybrid.ts, src/core/backfill-registry.ts,
  src/core/migrate.ts (v67 + v68)
- llms-full.txt + llms.txt: regenerated via `bun run build:llms`

`bun run verify` clean (privacy + proposal-pii + test-names + jsonb +
source-id-projection + progress + test-isolation + wasm + admin-build +
admin-scope-drift + cli-exec + system-of-record + eval-glossary +
typecheck). `bun test test/build-llms.test.ts` clean (7/7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cross-modal): renumber migrations 67→69 + 68→70 post-master-merge

Master shipped its own v67 (`facts_typed_claim_columns`) during the
cross-modal wave's review cycle. The merge picked up both sides' v67
entries, breaking the migration-distinct-versions test. Renumbering
moves cross-modal's table + column ALTER off the collision:

- v67 mcp_spend_log → v69 mcp_spend_log
- v68 embedding_multimodal_column → v70 embedding_multimodal_column

References updated in CHANGELOG, CLAUDE.md, pglite-schema.ts, schema.sql.
schema-embedded.ts regenerated. llms-full.txt regenerated.

7006 unit tests pass, 0 fail. No test code touched — just version
renumbering plus comment refs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version 0.36.0.0 → 0.36.4.0

Bumping to v0.36.4.0 to land in the queue slot the user requested.
No behavior change; pure version bump across VERSION, package.json,
CHANGELOG.md header, llms-full.txt regen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
