Skip to content

docs(spec): postgres + pgvector + apache age substrate migration#19

Merged
jphein merged 1 commit into
mainfrom
docs/pgvector-age-migration-spec
May 11, 2026
Merged

docs(spec): postgres + pgvector + apache age substrate migration#19
jphein merged 1 commit into
mainfrom
docs/pgvector-age-migration-spec

Conversation

@jphein

@jphein jphein commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

Full substrate-swap design for moving mempalace from ChromaDB + sqlite-KG to a unified Postgres deployment with pgvector for embeddings and Apache AGE for the knowledge graph. One DB, two extensions, one connection pool, one backup story, one ACID boundary.

What's in the spec

  • Architecture: palace-daemon → mempalace.backends.postgres + mempalace.knowledge_graph_age → Postgres 15+ with pgvector + age
  • Schema: drawers (vector(384) embeddings, HNSW + GIN + partial BTrees backing fork rows 1 / 7), closets, backend_meta (version + dimension-mismatch guard), AGE graph mempalace_kg
  • Migration tool: new mempalace migrate-to-postgres CLI, 7 phases, restartable, idempotent, checkpointed via backend_meta. Targets ~7-15 min for the canonical 160K-drawer palace.
  • Cutover: deliberate operator sequence, no automatic config-file mutation
  • Rollback: .chromadb-backup-<date> directory preserved indefinitely; "restart daemon on backup" is a deterministic recovery path
  • Testing: BaseCollection conformance, migration idempotency, AGE KG parity with sqlite KG, dimension-mismatch guard, daemon-strict + Postgres composition

Upstream composition

Surface Source Our move
BaseCollection ABC upstream #413 (merged) Foundation
pgvector backend upstream #665 (OPEN) Primary anchor; fork-port if it stalls
Migration CLI shape upstream #574 (OPEN) Mirror mempalace migrate ergonomics + dimension-mismatch guard
AGE-backed KG none upstream Fork-side, novel — pitch upstream after production use

Comments will go up on MemPalace#665 and MemPalace#574 after this spec merges to main, pointing at the stable URL.

Test plan

Out of scope / deferred

  • Embedder pluggability (v1 holds ONNX MiniLM-L6-v2 / 384d fixed)
  • Backup endpoint rewrite (palace-daemon's territory)
  • Concurrent online migration with dual-write (decided against — complexity vs benefit doesn't pencil for a single-user palace with brief downtime budget)

🤖 Generated with Claude Code

Full substrate-swap design for moving mempalace from ChromaDB + sqlite-KG
to a unified Postgres deployment with pgvector for embeddings and Apache
AGE for the knowledge graph. Composes on top of upstream's existing
backend-seam work rather than reinventing it.

Compositions with upstream:
- MemPalace#413 (BaseCollection seam, merged) — foundation
- MemPalace#665 (pgvector / pg_sorted_heap backend, OPEN) — storage half, primary
  anchor; fork-port if it stalls
- MemPalace#574 (LanceDB substrate-swap + `mempalace migrate` ergonomics, OPEN) —
  template for the migration CLI shape + dimension-mismatch guard

Genuinely fork-side:
- `mempalace.knowledge_graph_age` — AGE-backed KG (Cypher MERGE for
  entities, CREATE for edges; properties carry today's sqlite-KG
  columns: relation_type, source, valid_from, valid_to, confidence)
- `mempalace migrate-to-postgres` CLI — 7-phase, restartable, idempotent
  migration tool; checkpoint per phase in backend_meta; full ChromaDB
  palace + sqlite KG → unified Postgres in ~7-15 min for the canonical
  160K-drawer palace

Operational shape:
- Offline batch + cutover (decided against dual-write streaming): brief
  daemon downtime, rollback is "restart daemon on the backup palace"
- Backend selection via MEMPALACE_BACKEND=postgres + MEMPALACE_KG_BACKEND=age
  env vars; readable from config.json too
- Composes with palace-daemon's existing /search /context /graph
  endpoints — substrate swap happens below the BaseCollection ABC, so
  the daemon's HTTP surface is unchanged
- Rows 33/34 daemon-strict routing is backend-agnostic — JSON-RPC
  forwarding to /mcp doesn't open a local backend in strict mode

Non-goals (deferred to v2):
- Embedder pluggability (v1 stays on ONNX MiniLM-L6-v2 / 384d fixed)
- Backup endpoint rewrite (palace-daemon territory, not mempalace)
- Concurrent online migration with dual-write

Next: writing-plans skill to decompose the implementable surfaces (pg
backend integration, AGE KG, migration tool) into ordered tasks with
file paths and test coverage targets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 11, 2026 02:07
@jphein jphein merged commit 9afc801 into main May 11, 2026
10 checks passed
@jphein jphein deleted the docs/pgvector-age-migration-spec branch May 11, 2026 02:11

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a fork-side design spec describing how MemPalace would migrate from the current ChromaDB + SQLite knowledge graph substrate to a unified Postgres deployment using pgvector (embeddings) and Apache AGE (knowledge graph), including schema, migration phases, cutover, rollback, and testing strategy.

Changes:

  • Introduces an end-to-end architecture for a Postgres-based backend + AGE-backed knowledge graph.
  • Proposes a Postgres schema (drawers/closets/meta) and indexing strategy.
  • Specifies a restartable, checkpointed mempalace migrate-to-postgres CLI with operational cutover/rollback steps and a test plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +102 to +110
embedding vector(384), -- nullable: closets may not have one
source_file text,
mtime double precision,
normalize_version integer,
importance integer DEFAULT 0
);

CREATE INDEX drawers_embedding_hnsw_idx ON mempalace.drawers
USING hnsw (embedding vector_cosine_ops)
Comment on lines +132 to +135
topic text,
metadata jsonb NOT NULL DEFAULT '{}'
);

Comment on lines +201 to +205
Each phase wrapped in a transaction. Phase checkpoint written to `backend_meta` on success. Re-run with `--resume` (or just rerunning the same command — `--resume` is the default behaviour when checkpoint rows exist) skips completed phases.

| # | Phase | Action | Idempotent? | Checkpoint key |
|---|-------|--------|-------------|----------------|
| 0 | **Preflight** | Connect to both source and target. Verify `pgvector` + `age` extensions are installed. Check that ChromaDB palace embedding-model/dim matches what we'll write into `backend_meta`. Check `PALACE_DAEMON_URL` is not reachable (refuses to proceed if daemon is responsive — would silently lose writes). | Yes (read-only checks) | — |
Comment on lines +207 to +208
| 2 | **Drawers** | Stream `chromadb.col.get(include=["embeddings","documents","metadatas"])` in batches of `--batch-size`. Build `unnest(...)` arrays. One `INSERT ... SELECT FROM unnest(...) ON CONFLICT (id) DO NOTHING` per batch. Track running count in `backend_meta.migration_drawer_count` for visibility. | Yes (`ON CONFLICT DO NOTHING`) | `migration_phase_drawers` |
| 3 | **Closets** | Same shape as drawers, smaller cardinality. Usually one or two batches. | Yes | `migration_phase_closets` |
Comment on lines +171 to +183
**Edge label** `RELATION` (carries all current sqlite-KG columns as edge properties):
```cypher
(s:Entity)-[r:RELATION {
relation_type: 'works_on',
source: 'drawer_abc123',
valid_from: '2026-04-25',
valid_to: null,
confidence: 0.9
}]->(o:Entity)
```

Why one edge label instead of one per relation type? Mempalace's KG today stores `relation_type` as a string column on a single `kg_triples` table — already homogeneous. Keeping AGE's edge label uniform preserves that simplicity; queries can filter on `r.relation_type` exactly as the current sqlite KG queries filter on `relation_type =`. A future redesign could shard relation types into separate edge labels for performance, but that's not needed today.

Comment on lines +216 to +218
- Each phase transactional. Crash mid-phase → re-run picks up where it left off based on the `migration_*_count` watermark or `IF NOT EXISTS` semantics.
- Phase 2/3 use `ON CONFLICT DO NOTHING` — re-running over already-copied rows is a no-op, not an error.
- Phase 5's edge creation is the most fragile (no natural primary key on edges in AGE). The watermark count + a "skip first N triples" resume strategy avoids duplicates.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants