docs(spec): postgres + pgvector + apache age substrate migration by jphein · Pull Request #19 · techempower-org/mempalace

jphein · 2026-05-11T02:07:14Z

Summary

Full substrate-swap design for moving mempalace from ChromaDB + sqlite-KG to a unified Postgres deployment with pgvector for embeddings and Apache AGE for the knowledge graph. One DB, two extensions, one connection pool, one backup story, one ACID boundary.

What's in the spec

Architecture: palace-daemon → mempalace.backends.postgres + mempalace.knowledge_graph_age → Postgres 15+ with pgvector + age
Schema: drawers (vector(384) embeddings, HNSW + GIN + partial BTrees backing fork rows 1 / 7), closets, backend_meta (version + dimension-mismatch guard), AGE graph mempalace_kg
Migration tool: new mempalace migrate-to-postgres CLI, 7 phases, restartable, idempotent, checkpointed via backend_meta. Targets ~7-15 min for the canonical 160K-drawer palace.
Cutover: deliberate operator sequence, no automatic config-file mutation
Rollback: .chromadb-backup-<date> directory preserved indefinitely; "restart daemon on backup" is a deterministic recovery path
Testing: BaseCollection conformance, migration idempotency, AGE KG parity with sqlite KG, dimension-mismatch guard, daemon-strict + Postgres composition

Upstream composition

Surface	Source	Our move
`BaseCollection` ABC	upstream #413 (merged)	Foundation
`pgvector` backend	upstream #665 (OPEN)	Primary anchor; fork-port if it stalls
Migration CLI shape	upstream #574 (OPEN)	Mirror `mempalace migrate` ergonomics + dimension-mismatch guard
AGE-backed KG	none upstream	Fork-side, novel — pitch upstream after production use

Comments will go up on MemPalace#665 and MemPalace#574 after this spec merges to main, pointing at the stable URL.

Test plan

Spec self-review — no TBD/TODO/placeholders; internally consistent; appropriately scoped for a single implementation plan; no ambiguous requirements
Copilot review of the spec text
After merge: comment on upstream Add optional PostgreSQL backend with pg_sorted_heap support MemPalace/mempalace#665 and feat: add LanceDB backend abstraction and migration path MemPalace/mempalace#574
Next: writing-plans skill decomposes the three implementable surfaces (pg backend integration, AGE KG, migration tool) into ordered tasks

Out of scope / deferred

Embedder pluggability (v1 holds ONNX MiniLM-L6-v2 / 384d fixed)
Backup endpoint rewrite (palace-daemon's territory)
Concurrent online migration with dual-write (decided against — complexity vs benefit doesn't pencil for a single-user palace with brief downtime budget)

🤖 Generated with Claude Code

Full substrate-swap design for moving mempalace from ChromaDB + sqlite-KG to a unified Postgres deployment with pgvector for embeddings and Apache AGE for the knowledge graph. Composes on top of upstream's existing backend-seam work rather than reinventing it. Compositions with upstream: - MemPalace#413 (BaseCollection seam, merged) — foundation - MemPalace#665 (pgvector / pg_sorted_heap backend, OPEN) — storage half, primary anchor; fork-port if it stalls - MemPalace#574 (LanceDB substrate-swap + `mempalace migrate` ergonomics, OPEN) — template for the migration CLI shape + dimension-mismatch guard Genuinely fork-side: - `mempalace.knowledge_graph_age` — AGE-backed KG (Cypher MERGE for entities, CREATE for edges; properties carry today's sqlite-KG columns: relation_type, source, valid_from, valid_to, confidence) - `mempalace migrate-to-postgres` CLI — 7-phase, restartable, idempotent migration tool; checkpoint per phase in backend_meta; full ChromaDB palace + sqlite KG → unified Postgres in ~7-15 min for the canonical 160K-drawer palace Operational shape: - Offline batch + cutover (decided against dual-write streaming): brief daemon downtime, rollback is "restart daemon on the backup palace" - Backend selection via MEMPALACE_BACKEND=postgres + MEMPALACE_KG_BACKEND=age env vars; readable from config.json too - Composes with palace-daemon's existing /search /context /graph endpoints — substrate swap happens below the BaseCollection ABC, so the daemon's HTTP surface is unchanged - Rows 33/34 daemon-strict routing is backend-agnostic — JSON-RPC forwarding to /mcp doesn't open a local backend in strict mode Non-goals (deferred to v2): - Embedder pluggability (v1 stays on ONNX MiniLM-L6-v2 / 384d fixed) - Backup endpoint rewrite (palace-daemon territory, not mempalace) - Concurrent online migration with dual-write Next: writing-plans skill to decompose the implementable surfaces (pg backend integration, AGE KG, migration tool) into ordered tasks with file paths and test coverage targets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds a fork-side design spec describing how MemPalace would migrate from the current ChromaDB + SQLite knowledge graph substrate to a unified Postgres deployment using pgvector (embeddings) and Apache AGE (knowledge graph), including schema, migration phases, cutover, rollback, and testing strategy.

Changes:

Introduces an end-to-end architecture for a Postgres-based backend + AGE-backed knowledge graph.
Proposes a Postgres schema (drawers/closets/meta) and indexing strategy.
Specifies a restartable, checkpointed mempalace migrate-to-postgres CLI with operational cutover/rollback steps and a test plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    embedding       vector(384),              -- nullable: closets may not have one
+    source_file     text,
+    mtime           double precision,
+    normalize_version integer,
+    importance      integer DEFAULT 0
+);
+
+CREATE INDEX drawers_embedding_hnsw_idx ON mempalace.drawers
+    USING hnsw (embedding vector_cosine_ops)


+    topic       text,
+    metadata    jsonb NOT NULL DEFAULT '{}'
+);
+


+Each phase wrapped in a transaction. Phase checkpoint written to `backend_meta` on success. Re-run with `--resume` (or just rerunning the same command — `--resume` is the default behaviour when checkpoint rows exist) skips completed phases.
+
+| # | Phase | Action | Idempotent? | Checkpoint key |
+|---|-------|--------|-------------|----------------|
+| 0 | **Preflight** | Connect to both source and target. Verify `pgvector` + `age` extensions are installed. Check that ChromaDB palace embedding-model/dim matches what we'll write into `backend_meta`. Check `PALACE_DAEMON_URL` is not reachable (refuses to proceed if daemon is responsive — would silently lose writes). | Yes (read-only checks) | — |


+| 2 | **Drawers** | Stream `chromadb.col.get(include=["embeddings","documents","metadatas"])` in batches of `--batch-size`. Build `unnest(...)` arrays. One `INSERT ... SELECT FROM unnest(...) ON CONFLICT (id) DO NOTHING` per batch. Track running count in `backend_meta.migration_drawer_count` for visibility. | Yes (`ON CONFLICT DO NOTHING`) | `migration_phase_drawers` |
+| 3 | **Closets** | Same shape as drawers, smaller cardinality. Usually one or two batches. | Yes | `migration_phase_closets` |


+**Edge label** `RELATION` (carries all current sqlite-KG columns as edge properties):
+```cypher
+(s:Entity)-[r:RELATION {
+    relation_type: 'works_on',
+    source:        'drawer_abc123',
+    valid_from:    '2026-04-25',
+    valid_to:      null,
+    confidence:    0.9
+}]->(o:Entity)
+```
+
+Why one edge label instead of one per relation type? Mempalace's KG today stores `relation_type` as a string column on a single `kg_triples` table — already homogeneous. Keeping AGE's edge label uniform preserves that simplicity; queries can filter on `r.relation_type` exactly as the current sqlite KG queries filter on `relation_type =`. A future redesign could shard relation types into separate edge labels for performance, but that's not needed today.
+


+- Each phase transactional. Crash mid-phase → re-run picks up where it left off based on the `migration_*_count` watermark or `IF NOT EXISTS` semantics.
+- Phase 2/3 use `ON CONFLICT DO NOTHING` — re-running over already-copied rows is a no-op, not an error.
+- Phase 5's edge creation is the most fragile (no natural primary key on edges in AGE). The watermark count + a "skip first N triples" resume strategy avoids duplicates.


Copilot AI review requested due to automatic review settings May 11, 2026 02:07

Copilot started reviewing on behalf of jphein May 11, 2026 02:07 View session

jphein merged commit 9afc801 into main May 11, 2026
10 checks passed

jphein deleted the docs/pgvector-age-migration-spec branch May 11, 2026 02:11

Copilot AI reviewed May 11, 2026

View reviewed changes

jphein mentioned this pull request May 11, 2026

docs(plan): pgvector + AGE migration implementation plan #20

Merged

3 tasks

jphein added the documentation Improvements or additions to documentation label May 14, 2026

This was referenced May 24, 2026

docs: True Memory comparison + landscape benchmark survey #174

Merged

fix(search): drop AGE regex branch + statement_timeout guard #228

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(spec): postgres + pgvector + apache age substrate migration#19

docs(spec): postgres + pgvector + apache age substrate migration#19
jphein merged 1 commit into
mainfrom
docs/pgvector-age-migration-spec

jphein commented May 11, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		\| 2 \| Drawers \| Stream `chromadb.col.get(include=["embeddings","documents","metadatas"])` in batches of `--batch-size`. Build `unnest(...)` arrays. One `INSERT ... SELECT FROM unnest(...) ON CONFLICT (id) DO NOTHING` per batch. Track running count in `backend_meta.migration_drawer_count` for visibility. \| Yes (`ON CONFLICT DO NOTHING`) \| `migration_phase_drawers` \|
		\| 3 \| Closets \| Same shape as drawers, smaller cardinality. Usually one or two batches. \| Yes \| `migration_phase_closets` \|

Conversation

jphein commented May 11, 2026

Summary

What's in the spec

Upstream composition

Test plan

Out of scope / deferred

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants