Add optional PostgreSQL backend with pg_sorted_heap support by skuznetsov · Pull Request #665 · MemPalace/mempalace

skuznetsov · 2026-04-12T03:32:26Z

What does this PR do?

Follow-up to #413. This adds an optional PostgreSQL palace backend while keeping ChromaDB as the default zero-config backend.

The PostgreSQL backend supports two extension paths:

pg_sorted_heap preferred path: sorted_heap table storage, svec embeddings, and lazy sorted_hnsw vector indexes.
pgvector fallback path: regular heap table, vector embeddings, and lazy hnsw vector indexes when only vector is available.

It also adds:

Backend selection via MEMPALACE_BACKEND=postgres / config backend: postgres and MEMPALACE_POSTGRES_DSN / config postgres_dsn.
Batched INSERT ... SELECT ... FROM unnest(...) ... ON CONFLICT writes for add/upsert, avoiding delete-then-add upserts.
Fail-closed metadata filter handling for unsupported operators and explicit empty id lists.
Lazy vector index creation using PostgreSQL catalog/stat estimates instead of an exact COUNT(*) threshold check.
Documentation and a helper script for installing pg_sorted_heap or pgvector from source.

ChromaDB remains the documented default and the public raw-mode benchmark path. The PostgreSQL backend is opt-in for larger, long-lived, or server/team deployments.

Related: #266

Merge order: This is the Stage 2 PostgreSQL backend follow-up and is intended to merge after the Stage 1 backend seam from #413. #413 is already merged into develop; this PR builds on that seam rather than replacing it.

How to test

uv run python -m pytest tests/ -v
- 617 passed, 106 deselected, 413 warnings
uv run pytest -q
- 617 passed, 106 deselected, 413 warnings
uv run ruff check .
- passed
git diff --check
- passed
Local PostgreSQL smoke checks were also run for both extension paths:
- pg_sorted_heap / svec
- pgvector / vector
- batch INSERT ... SELECT ... FROM unnest(...) ... ON CONFLICT
- estimated row-count based lazy index threshold check

Checklist

Tests pass (python -m pytest tests/ -v)
No hardcoded paths
Linter passes (ruff check .)

web3guru888

PostgreSQL Backend Review — Thorough Read-Through

Very solid PR. This is a well-architected backend addition that I've been hoping to see since #413 laid the seam. I read every line of postgres.py (631 LOC), the config changes, the mcp_server refactor, palace.py routing, tests, docs, and the install script. Here's what I found:

Strengths

1. Batched writes via INSERT ... SELECT ... FROM unnest()
This is the right approach for PostgreSQL — avoids N round-trips and the delete-then-add antipattern that ChromaDB's update() effectively does. The ON CONFLICT (id) DO UPDATE SET ... upsert is clean and atomic. The local dedup of rows_by_id before hitting the DB is a nice touch that prevents wasted work on duplicate IDs within a single batch.

2. Fail-closed metadata filter handling
_where_to_sql() raising ValueError on unsupported operators ($gt, $lt, $gte, $lte) rather than silently ignoring them is exactly right. ChromaDB supports these, but silent degradation in a database backend would cause subtle retrieval bugs. The explicit error forces callers to adapt.

3. Lazy vector index with catalog estimates
Using pg_stat_all_tables.n_live_tup / pg_class.reltuples instead of COUNT(*) for the 5,000-row threshold check is smart — avoids a sequential scan on every write batch. The _rows_since_index_check interval limiter prevents even the catalog query from running on every insert. And the _local_row_estimate fallback handles freshly-created tables where ANALYZE hasn't run yet.

4. mcp_server.py refactor is clean
Replacing the direct chromadb.PersistentClient import with palace.get_collection() is the right abstraction. The cache invalidation via _get_collection_cache_key() that composes (backend, target, collection_name, inode) handles both backends correctly. The _fetch_all_metadata loop change from while offset < total to while True with break-on-empty is safer since count() could race with concurrent writes.

5. estimated_count() on BaseCollection
Good addition — defaults to count() so ChromaDB callers are unaffected, but PostgreSQL can use catalog stats. The layers.py change to prefer estimated_count when available is a nice performance win for status checks.

6. Test coverage is thorough
315 lines of tests covering: input validation, where-clause parsing (including $or, $and, $in, $nin, $eq, $ne), empty-id rejection, batch conflict handling, dedup within a batch, lazy index creation, estimated count, and the backend factory's create/no-create paths. All without requiring a live PostgreSQL server — good use of monkeypatching.

Minor observations (not blocking)

1. _metadata_value() stringifies everything — metadata values go through str() before comparison, which means where={"count": 5} compares the string "5" against metadata->>'count'. This matches ChromaDB's behavior (which also stringifies in its SQLite layer), but it's worth a doc comment since PostgreSQL's jsonb could support typed comparisons. Could be a future enhancement.

2. The _embed() function uses a module-global _embedder — this is fine for single-process use, but if someone embeds MemPalace in a multi-threaded server, SentenceTransformer.encode() may not be thread-safe depending on the model backend. A threading.Lock around the lazy init would be defensive. Low priority since the typical deployment is single-process CLI.

3. Connection lifecycle — autocommit = True is the right choice for this use case (no long transactions), but if the connection drops mid-session (network blip to a remote PG), the next _get_conn() call will reconnect since it checks self._conn.closed. Good. One edge case: if psycopg2 raises OperationalError mid-query, the connection may be left in a bad state without closed being set. A try/except around cursor operations that resets self._conn = None on OperationalError would make it more resilient for long-lived server deployments.

4. sorted_heap table has PRIMARY KEY (wing, room, id) but also a unique index on (id) — this means id is globally unique across wings/rooms, which matches ChromaDB semantics. The composite PK gives sorted_heap the sort key it needs. Clean.

Impact for downstream consumers

This is significant for anyone running MemPalace in multi-agent or server environments. The ChromaDB backend works well for single-user CLI, but concurrent access from multiple agents (e.g., specialist agents each writing to the palace simultaneously) hits ChromaDB's SQLite locking. PostgreSQL with proper connection pooling would eliminate that bottleneck. The BaseCollection abstraction also opens the door for future backends (LanceDB per #574, etc.).

APPROVE — this is ready to merge. Clean architecture, solid test coverage, thorough docs. The minor observations above are suggestions for follow-up, not blockers.

Formalizes the BaseCollection/BaseBackend contract introduced as a seam in #413 into an interchangeability spec that third-party backends can build to. Driven by six in-flight backend PRs (#574, #643, #665, #697, #700, #381) each implementing the interface differently. Key decisions captured: entry-point distribution, typed QueryResult/ GetResult replacing Chroma dict shape, daemon-first multi-palace model via PalaceRef, required where-clause subset (incl. $contains), mandatory embedder injection with model-identity validation, capability tokens, shared pytest conformance suite, and a backend-neutral migrate/verify CLI.

Incorporates review feedback from skuznetsov (Postgres, #665) and dekoza (Lance, #574) on issue #737: - §1.5: split 'accepts embeddings=' (signature compliance) from 'persists embeddings as-is' (correctness). Adds supports_embeddings_passthrough capability; the former is universal, the latter is required to label a migration lossless. - §1.5: model identity check becomes a three-state machine (known_match / known_mismatch / unknown) so legacy palaces without recorded identity don't hard-fail on upgrade. - §1.4: makes explicit that supports_contains_fast is the ONLY performance floor the spec promises; without it callers MUST assume O(n). $contains is a correctness requirement, not a performance one. - §3.3: clarifies auto-detect is an upgrade-compat path only, never the selection mechanism for new palaces. - §8.2: migrate CLI refuses to run against a target lacking supports_embeddings_passthrough unless --accept-re-embed is passed; migration record now captures lossless status and model identities.

dekoza · 2026-04-13T10:30:35Z

(Updated version of my review comment above — adding a question for @web3guru888)

Concern: pg_sorted_heap as "preferred" path introduces a single-maintainer dependency

Thanks for the solid work on the backend abstraction — the lazy HNSW index creation and extension auto-detection are well thought out.

I want to flag a concern about the dual-path architecture that I do not think was addressed in @web3guru888's otherwise thorough review. The "preferred" backend (pg_sorted_heap) is maintained by the PR author at github.com/skuznetsov/pg_sorted_heap. This creates a situation where:

Availability: pg_sorted_heap is not packaged in any Linux distribution, not available on any managed PostgreSQL service (RDS, Cloud SQL, Neon, Supabase), and not listed on PGXN. The 215-line install script in this PR exists because there is no standard installation path. By contrast, pgvector is available everywhere.
Maintenance surface: Every SQL statement in the 631-line backend is forked into two code paths (DDL, types, index operators, casts). This doubles the review and maintenance burden for a path that, realistically, very few users will run.
Marginal benefit for this workload: The physical sort ordering benefits sequential scans within a wing/room, but the dominant access pattern here is HNSW vector search across all data, where heap layout is irrelevant.
Optics: A contributor introducing their own extension as the recommended path in someone else's project — even with a working fallback — is a pattern the project should evaluate carefully. I'm not questioning the author's intent, but the project should be cautious about taking on single-maintainer C extension dependencies that only the contributor can debug.

@web3guru888 — your review praised the composite PK design on the sorted_heap path and the dual-backend detection, but I'd appreciate your take on the supply-chain concern: should this project take a hard dependency path on an unpackaged extension authored by the same person submitting the PR? The pgvector fallback works identically for all practical purposes.

Suggestion: Ship this PR with pgvector as the only path. If pg_sorted_heap matures to the point where it has distro packages, managed-service availability, or independent adopters, it can be added later as an optional optimization — not the default.

This would cut ~200 lines of dual-path code, eliminate the install script, and remove a bus-factor-of-one dependency from the project.

skuznetsov · 2026-04-13T11:42:31Z

Thanks for raising this. I agree the availability / bus-factor concern is real in the general case, and I do not want this PR to make pg_sorted_heap a hard dependency for MemPalace.

I pushed 5976c2f to tighten the docs framing:

ChromaDB remains the zero-config default.
PostgreSQL remains opt-in.
Within PostgreSQL, pgvector is now documented as the broadly available managed-service / simpler supply-chain path.
pg_sorted_heap is now documented as an optional optimized self-managed path, not a blanket recommendation.
The docs also call out that pg_sorted_heap is listed in the PostgreSQL Software Catalogue, but is not bundled with managed PostgreSQL providers.

Update: the PGXN point was fair when raised. pg_sorted_heap is now also published on PGXN: https://pgxn.org/dist/pg_sorted_heap/ (0.13.0, published 2026-04-13). I would still keep the docs wording conservative: PGXN publication improves the standard discovery/download path, but it does not make the extension available on managed PostgreSQL providers.

I do not think removing the pg_sorted_heap path is the right tradeoff for this PR, because the fallback keeps the dependency optional and the dual-path support is useful for validating the backend seam against more than one PostgreSQL storage/index implementation. The runtime selection is also explicit in the docs now: if both extensions are present, MemPalace selects pg_sorted_heap; if an operator wants the pgvector path, install only vector.

So the intended boundary is: no new hard dependency for MemPalace, no managed-service requirement to use pg_sorted_heap, but keep the optimized self-managed path available for people who intentionally install that extension.

…nd registry (RFC 001 §10) Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres, #700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target to align against. Scope (this PR): - Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps existing callers working while the attribute-access migration proceeds. - BaseCollection is now kwargs-only across add/upsert/query/get/delete/update with ABC defaults for estimated_count/close/health and a non-atomic default update() (§1.1–1.2). - PalaceRef replaces raw path strings at the backend boundary (§2.2). - BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3). - mempalace.backends entry-point group + in-tree registry with resolve_backend_for_palace priority order matching §3.2–3.3. - ChromaCollection normalizes chroma returns into typed results; unknown where-clause operators raise UnsupportedFilterError (no silent drop, §1.4). - ChromaBackend absorbs the inode/mtime client-cache freshness check previously duplicated in mcp_server._get_client() (§10 + PR #757). - searcher.py migrated to typed-attribute access as the reference call site; remaining callers land in a follow-up. - pyproject: chroma registered via [project.entry-points."mempalace.backends"]. Out of scope (explicit follow-ups): - Full caller migration off the dict-compat shim across palace.py, mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py, palace_graph.py, cli.py, closet_llm.py. - Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5). - maintenance_state() / run_maintenance() benchmark hooks (§7.3). - AbstractBackendContractSuite full coverage (§7.1–7.2). - mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8). Tests: 970 passed (up from 967 on develop); new coverage for typed results, empty-result outer-shape preservation, \$regex rejection, registry lookup, priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection. Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).

Rebuild the stage 2 PostgreSQL backend on top of the RFC 001 backend contract merged in MemPalace#995 instead of carrying forward the stale Chroma-shaped adapter. The new backend implements BaseBackend/BaseCollection with PalaceRef, typed QueryResult/GetResult returns, registry discovery, and env/config backend selection while keeping Chroma as the default path. The PostgreSQL collection supports both pg_sorted_heap and pgvector. It prefers pg_sorted_heap when the extension is installed, falls back to pgvector when only vector is installed, writes rows with INSERT ... SELECT FROM unnest(...) and ON CONFLICT, preserves first-wins semantics for add() and last-wins semantics for upsert(), supports required metadata filters, returns optional embeddings, keeps exact count() separate from estimated_count(), and creates vector indexes lazily after the threshold. Text embedding reuses Chroma's default local embedding function so the postgres extra only adds psycopg2-binary instead of a second ML dependency stack. Add user-facing PostgreSQL backend documentation, a PGXS install helper for pg_sorted_heap/pgvector, README links, lockfile metadata for the postgres extra, and unit coverage for registry exposure, config/env routing, PostgreSQL validation, filter translation, batch upsert, typed result shapes, estimated count, and palace-level backend selection. Verification: - uv run --frozen python -m pytest tests/ -q --ignore=tests/benchmarks -> 1044 passed, 1 warning - uv run --frozen python -m pytest tests/test_backends.py tests/test_config.py tests/test_config_extra.py -q -> 77 passed - ruff check mempalace/backends/postgres.py mempalace/backends/registry.py mempalace/backends/__init__.py mempalace/config.py mempalace/palace.py tests/test_backends.py tests/test_config_extra.py - ruff format --check mempalace/backends/postgres.py mempalace/backends/registry.py mempalace/backends/__init__.py mempalace/config.py mempalace/palace.py tests/test_backends.py tests/test_config_extra.py - git diff --check && uv lock --check && bash -n scripts/install_pg_backend.sh - uv run --frozen --extra postgres import smoke for psycopg2/PostgresBackend - local PostgreSQL 18 integration smoke with pg_sorted_heap: add/upsert/query/get/delete/health and sorted_heap table AM - local PostgreSQL 18 integration smoke with pgvector fallback: add/query/count and heap table AM - local PostgreSQL 18 forced-threshold vector-index smoke for sorted_hnsw and hnsw - secret-pattern scan across changed files

skuznetsov · 2026-04-19T00:43:14Z

Follow-up after #995 / RFC 001 landed:

I rewrote this PR on top of the merged backend contract instead of keeping the old Chroma-shaped adapter. The current head is 83d5448b6166a67b27d3c903fcf69cf800b0be6c and now implements the RFC path directly:

PostgresBackend implements BaseBackend with get_collection(*, palace: PalaceRef, collection_name, create=False, options=None).
PostgresCollection implements BaseCollection and returns the typed QueryResult / GetResult shapes.
Backend selection now goes through the registry/config path from the RFC work, with Chroma still remaining the default backend.
The PostgreSQL path remains opt-in, with pgvector as the broadly available fallback and pg_sorted_heap as an optional self-managed optimized path when that extension is intentionally installed.

I also re-ran the hostile review pass after the rewrite. Current verification:

local full suite: uv run --frozen python -m pytest tests/ -q --ignore=tests/benchmarks -> 1044 passed, 1 warning
local targeted suite: 77 passed
ruff check, ruff format --check, uv lock --check, git diff --check, and install-script syntax check passed
local PostgreSQL 18 smoke with pg_sorted_heap verified add/upsert/query/get/delete/health and sorted_heap table AM
local PostgreSQL 18 smoke with pgvector fallback and forced vector-index creation was also run before the final rebase
GitHub CI is green on lint, version guard, Linux 3.9/3.11/3.13, macOS, and Windows

So the review surface should now be the RFC-based backend implementation, not the pre-#995 adapter shape.

@skuznetsov

…alace#1072 exist, not a write-from-scratch Previous version claimed Postgres was "out of scope — requires writing a new BaseBackend implementation". That was wrong: @skuznetsov's MemPalace#665 already ships a PostgreSQL backend (pg_sorted_heap preferred, pgvector fallback), and @malakhov-dmitrii's MemPalace#1072 wires MEMPALACE_BACKEND env var through palace._DEFAULT_BACKEND so entry-point backends actually get used. RFC 001 seam (MemPalace#413, MemPalace#995) is merged; registry.py advertises mempalace_postgres as the canonical example. Reframed as parallel track: palace-daemon is deployable today (no migration, wraps current ChromaDB palace); Postgres needs both PRs to land plus a drawer migration (export_palace() + importer — not code we'd write) but eliminates the entire 1.5.x failure class at the storage layer. Starting with palace-daemon since it's deployable now; the daemon is storage-agnostic so Postgres remains available as a later swap. Also fixed palace-daemon lock path: /tmp/palace-daemon-8085.lock (per-port), not /tmp/palace-daemon.lock.

Rebuild the stage 2 PostgreSQL backend on top of the RFC 001 backend contract merged in MemPalace#995 instead of carrying forward the stale Chroma-shaped adapter. The new backend implements BaseBackend/BaseCollection with PalaceRef, typed QueryResult/GetResult returns, registry discovery, and env/config backend selection while keeping Chroma as the default path. The PostgreSQL collection supports both pg_sorted_heap and pgvector. It prefers pg_sorted_heap when the extension is installed, falls back to pgvector when only vector is installed, writes rows with INSERT ... SELECT FROM unnest(...) and ON CONFLICT, preserves first-wins semantics for add() and last-wins semantics for upsert(), supports required metadata filters, returns optional embeddings, keeps exact count() separate from estimated_count(), and creates vector indexes lazily after the threshold. Text embedding reuses Chroma's default local embedding function so the postgres extra only adds psycopg2-binary instead of a second ML dependency stack. Add user-facing PostgreSQL backend documentation, a PGXS install helper for pg_sorted_heap/pgvector, README links, lockfile metadata for the postgres extra, and unit coverage for registry exposure, config/env routing, PostgreSQL validation, filter translation, batch upsert, typed result shapes, estimated count, and palace-level backend selection. Fork resolution notes (Jeffrey Hein, 2026-05-11): - README.md: kept fork README (per feedback_fork_readme_handling). - uv.lock: regenerated from auto-merged pyproject.toml so the fork's existing deps and MemPalace#665's psycopg2-binary extra land coherently. - mempalace/palace.py: took MemPalace#665's get_collection() rewrite + re-added `_DEFAULT_BACKEND = get_backend("chroma")` as a fork-only compatibility shim. mcp_server.py + tests still reference the module attribute (cache invalidation + monkeypatch); follow-up commits will migrate callers to the new abstraction. - tests/test_backends.py: took the union of import sets — fork-side HNSW helpers (_HNSW_MISSING_METADATA_DATA_FLOOR, _pin_hnsw_threads, _segment_appears_healthy, quarantine_invalid_hnsw_metadata, quarantine_stale_hnsw) plus MemPalace#665's new postgres helpers (_metadata_value, _parse_vector_literal, _vec_literal). (cherry picked from commit 83d5448)

…emPalace#665 Cherry-picking MemPalace#665 changed `palace.get_collection()` from `_DEFAULT_BACKEND.get_collection(palace_path, collection_name=...)` to `get_backend(backend_name).get_collection(palace=PalaceRef(...), collection_name=..., options=None)`. `get_configured_collection_name()` survives as a thin back-compat wrapper but the live path now reads `MempalaceConfig().collection_name` directly (env `MEMPALACE_COLLECTION_NAME` or file config). Updated fake_get_collection's signature to match the new contract, switched from monkeypatching the function to setting the env var, and asserts the PalaceRef + options arguments alongside the existing collection_name and create assertions. `palace._DEFAULT_BACKEND` is now aliased to `get_backend("chroma")` (see fork-only shim in palace.py); the monkeypatch on that attribute still routes the captured call because the default backend resolves to chroma. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After cherry-picking MemPalace#665, palace.get_collection() only resolved "use config default" when the caller passed exactly DEFAULT_COLLECTION_NAME. Fork-side callers (searcher.search_memories, convo_miner.process_file, sweeper.sweep_palace, sync.sync_palace, diary_ingest, etc.) pass collection_name=None explicitly per the fork's pre-MemPalace#665 convention, which bypassed the check and propagated None to chromadb. Result: 30 test failures in test_searcher.py with "argument 'name': 'NoneType' object cannot be converted to 'PyString'". Accepting both forms — None (fork) and DEFAULT_COLLECTION_NAME (MemPalace#665) — restores the green floor without disturbing MemPalace#665's structure. Eventual follow-up: migrate fork-side callers to omit the keyword or pass the sentinel, then tighten the type back to `str = DEFAULT_COLLECTION_NAME`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(backends): pgvector substrate via MemPalace#665 cherry-pick + smoke tests

jphein · 2026-05-11T22:33:46Z

Operator data point on the pg_sorted_heap-vs-pgvector decision: the jphein fork cherry-picked this PR onto fork main (commit 5e90c72 in #21) and has been running the pgvector fallback path against a homelab apache/age:release_PG16_1.6.0 container with postgresql-16-pgvector apt-installed on top — no pg_sorted_heap.

The fallback codepath works end-to-end:

3 smoke tests in tests/test_backends_postgres.py pass (BaseCollection conformance + drawer round-trip + L2 distance ordering on a vector(384) column)
palace.get_collection() routes through resolve_backend_for_palace cleanly with backend=postgres in config; PostgresCollection is exposed as the singleton via the entry-point registry
palace=PalaceRef(id=..., local_path=...) + options={"dsn": ...} plumb correctly from the caller boundary into psycopg2

@dekoza's bus-factor concern (pg_sorted_heap availability) doesn't reproduce on this deployment because we never hit that codepath — pgvector's the only vector type installed; _detect_extensions resolves to it; the sorted_heap branch is dead code on this image.

This isn't a vote against pg_sorted_heap as the preferred path when available — your 2026-04-19 rewrite kept the conditional resolution clean. It's a signal that the pgvector fallback is genuinely production-shaped, not just a placeholder. Happy to share the docker-compose service definition + smoke-test harness if useful for your own CI substrate.

Full rationale for the fork's "wait + cherry-pick" composition stance is at docs/internal/pgvector-665-decision.md; Plan-B trigger documented as 2026-06-08 if review stalls. Looking forward to this merging.

The plan was written 2026-05-10. Two days later PR #21 (feat/pgvector-age-impl) merged + commit a3ee623 landed the AGE skeleton — together completing all of Phase 0, Phase 1, and Task 2.1. Status header at top of the plan now captures the reconciliation: Phase 0 (preflight) ✅ Done 2026-05-11 Phase 1 (pgvector backend) ✅ Done 2026-05-11 (1.A.1, 1.2, 1.3, 1.4, 1.5) Phase 2.1 (AGE skeleton) ✅ Done 2026-05-11 Phase 2.2 (add_triple) ❌ Not done — canonical next task Phase 2.3 (temporal) ❌ Phase 2.4 (kg_backend cfg) ❌ Phase 3 (migration tool) ❌ Phase 4 (benchmark) ❌ 37 of 92 checkboxes ticked within done-task ranges. 1.B.* (fork-port path) sub-tasks remain unticked because we took the 1.A path (wait for MemPalace#665 + cherry-pick). Evidence for each ✅ entry is in the new status header — file paths, line numbers, and the test/CI fixtures that demonstrate the work. The canonical next task is Phase 2.2: implement add_triple() with Cypher MERGE/CREATE on the existing KnowledgeGraphAGE skeleton. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jphein · 2026-05-14T14:58:00Z

Operator follow-up to my 2026-05-11 comment above — surfacing a concurrency issue we hit running this code in production.

What we observed

On disks, palace-daemon's /mcp endpoint wedged for 28+ minutes. Three concurrent CREATE INDEX … USING hnsw queries held ACCESS EXCLUSIVE on mempalace_drawers, with INSERT/DELETE/secondary CREATE INDEX queued behind them.

Race

_maybe_create_vector_index does a lookup-then-create with no inter-connection guard:

cur.execute("SELECT 1 FROM pg_indexes WHERE indexname = %s", (index_name,))
if cur.fetchone():
    self._vector_index_ready = True
    return
# ...
cur.execute(
    self._sql.SQL("CREATE INDEX {} ON {} USING {} (embedding {})").format(...)
)

Under any concurrency setting that allows more than one writer (we run PALACE_MAX_CONCURRENCY=4), three writers tipping the VECTOR_INDEX_CHECK_INTERVAL_ROWS threshold in the same window each do the same SELECT, each see no index, each issue CREATE INDEX. All three then serialize on ACCESS EXCLUSIVE, full HNSW builds stack up, and concurrent writes block for the duration of the build.

Aggravating factor (separate, lower priority)

The existence check is by literal name {table_name}_vec_idx. Our migration tool had created a functionally-identical HNSW index on (embedding vector_cosine_ops) under a different name (mempalace_drawers_embedding_hnsw) — the check missed it and the lazy-create fired on every threshold cross even though a perfectly good HNSW already existed on the same column.

This second issue is more opinionated to fix (does mempalace own the index name, or accept any compatible HNSW?) — happy to defer that to a design conversation.

Minimal hardening that would have prevented the wedge

Two-line addition; keeps the existing flow:

# Inter-connection guard + DDL idempotency
with self._get_conn().cursor() as cur:
    cur.execute("SELECT pg_advisory_xact_lock(hashtext(%s))", (f"vec_idx:{self.table_name}",))
    cur.execute("SELECT 1 FROM pg_indexes WHERE indexname = %s", (index_name,))
    if cur.fetchone():
        self._vector_index_ready = True
        return
    # ...
    cur.execute(
        self._sql.SQL("CREATE INDEX IF NOT EXISTS {} ON {} USING {} (embedding {})").format(...)
    )

pg_advisory_xact_lock serializes the check+create across connections, auto-releases at commit.
IF NOT EXISTS belt-and-suspenders against the lock being skipped or the threshold being crossed mid-build by a backend that wasn't holding the advisory.

CREATE INDEX CONCURRENTLY would be a bigger but also worthwhile change — pgvector ≥ 0.5.0 supports it for hnsw, and the trade-off (≈2× build time, no lock) is favorable for write-heavy workloads. Happy to scope that separately.

Local mitigation

We unwedged out-of-band by canceling the duplicate CREATE INDEX backends and letting one finish — the deployment is healthy again. We also detached our daemon's hook caller so we don't block the harness on this case if it ever recurs.

Offer

If a fixup commit on this PR (minimal version above) would help, happy to put one up. Wanted to surface the diagnosis here first since this is your code path and you may want a different shape.

…ex race surfaced Captures today's two threads: - Phases 4.1 + 4.2 of the pgvector+AGE migration actually ran (271k drawers + KG triples migrated via the substrate-portable out-of-band pipeline since chromadb SIGSEGV blocked the canonical CLI tool). - A pgvector lazy-index race in mempalace's PostgresBackend wedged /mcp for 28+ min on first concurrent-writer load; recovered out-of- band, filed techempower-org/mempalace#73 with two-line hardening proposal, posted operator comment on MemPalace/mempalace#665. - palace-daemon hook.py now os.fork()s + closes all three stdio FDs in the child so a wedged daemon can't block the harness. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…1/4.2 executed Plan previously said Phases 4.1 and 4.2 were "Runbook shipped" and "Documented"; today they actually ran on the homelab palace. Updated the status table to "Executed 2026-05-14" with evidence pointers, and added a post-mortem section covering: - The substrate-portable out-of-band migration pipeline that was needed because chromadb SIGSEGV blocked migrate-to-postgres against the live 271k-drawer source palace. - The pgvector lazy-index race incident: _maybe_create_vector_index has both a lookup-vs-create race and a name-coupled existence check; filed as #73 with a two-line hardening proposal. - The hook detach work in palace-daemon that hardened the harness against any future daemon wedges of this class. - Open follow-ups (await maintainer response on upstream MemPalace#665 before pre-empting their fix; #73 stays pending; mempalace plugin hooks.json auto-update edge case still latent but self-healing on restart). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

_maybe_create_vector_index had a SELECT-then-CREATE race plus a name- coupled existence check. Three concurrent writers tipping VECTOR_INDEX_CHECK_INTERVAL_ROWS in the same window each did the same lookup (all saw empty), each issued `CREATE INDEX … USING hnsw` with no IF NOT EXISTS guard, each took ACCESS EXCLUSIVE, and they stacked — wedging mempalace_drawers for ~30 min on 271k drawers and blocking every other write behind them. Operator diagnosis at #73, observed in production on disks 2026-05-14. Hardening: - Wrap the check+create in `pg_advisory_xact_lock(hashtext('vec_idx: <table>'))` so concurrent writers serialize through the lock instead of racing. - The main backend connection is autocommit=True, and psycopg2 forbids toggling autocommit mid-flight (already hit in phase_1_schema). Use a short-lived side connection so the advisory has a real transaction to attach to. - Use `CREATE INDEX IF NOT EXISTS` so any path that slipped past the advisory (e.g. an out-of-band loader pre-creating the index under the expected name) is a no-op, not a duplicate-object error. Out of scope here: `CREATE INDEX CONCURRENTLY` (incompatible with the advisory-lock transaction; pgvector supports it but would need a separate code path), and the structural pg_index-by-column lookup (opens an API/design question about who owns the index name). Upstream parallel: operator comment posted on MemPalace#665 with the same minimal-hardening proposal. Conflict resolution at merge time accepted as the cost of fork-first patching. Refs: #73 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Serialize the lazy vector-index check/create path across PostgreSQL connections with a session-level advisory lock. The backend connection uses autocommit, so a transaction-level advisory lock would not protect the subsequent CREATE INDEX statement; the session lock is released explicitly in a finally block. Make vector-index DDL idempotent with CREATE INDEX IF NOT EXISTS and re-check for an existing index while holding the advisory lock. Also accept a compatible ready/valid HNSW or sorted_hnsw index on the embedding column even if it was created under a different name, which avoids rebuilding a functionally equivalent operator-provided index. Add regression coverage for serialized lazy index creation, compatible existing index detection, and advisory unlock on DDL failure. Verification: uv run pytest tests/test_backends.py -q; uv run pytest tests/ -q --ignore=tests/benchmarks; uv run ruff check mempalace/backends/postgres.py tests/test_backends.py; uv run ruff format --check mempalace/backends/postgres.py tests/test_backends.py; git diff --check. Note: repo-wide ruff format --check still reports pre-existing unrelated formatting drift in 9 tests.

skuznetsov · 2026-05-14T16:03:24Z

Thanks for the production report and the clear lock diagnosis, @jphein. The race is real, and your observation about multiple writers crossing the lazy-index threshold at the same time was the missing operational case.

I pushed 85a42f1 with the fix:

The lazy vector-index check/create path is now serialized across PostgreSQL connections with a session-level advisory lock.
I intentionally used pg_advisory_lock(...) / pg_advisory_unlock(...), not pg_advisory_xact_lock(...), because this backend connection currently runs with autocommit = True; a transaction-level advisory lock would be released at the end of the lock statement and would not protect the following CREATE INDEX.
The DDL is now CREATE INDEX IF NOT EXISTS as a second idempotency guard.
The code re-checks for an existing compatible index while holding the lock.
It also now accepts an existing ready/valid HNSW or sorted_hnsw index on embedding with the expected opclass even if the index was created under a different name, so operator-managed indexes like the one you described should not trigger repeated lazy creation.

Verification I ran locally:

uv run pytest tests/test_backends.py -q -> 45 passed
uv run pytest tests/ -q --ignore=tests/benchmarks -> 1047 passed, 1 warning
uv run ruff check mempalace/backends/postgres.py tests/test_backends.py
uv run ruff format --check mempalace/backends/postgres.py tests/test_backends.py
git diff --check
small PostgreSQL catalog sanity probe for the pg_index / pg_opclass compatible-index query shape

I agree CREATE INDEX CONCURRENTLY is worth considering separately, but I kept this commit focused on preventing duplicate lazy builds and avoiding the wedge you hit.

Resolve PR MemPalace#665 conflicts after develop advanced. Resolution details: - Keep the RFC backend registry path in palace.get_collection while preserving develop's optional collection_name semantics. - Update MCP reconnect/cache reset to close the selected registry backend instead of the removed palace._DEFAULT_BACKEND singleton. - Combine PostgreSQL backend tests with develop's Chroma HNSW quarantine tests. - Regenerate uv.lock from the merged pyproject so postgres, gpu, dml, and coreml extras are all represented. Verification: uv run pytest tests/test_backends.py tests/test_mcp_server.py::TestCacheInvalidation -q; uv run pytest tests/ -q --ignore=tests/benchmarks; uv run ruff check mempalace/palace.py mempalace/mcp_server.py mempalace/backends/postgres.py tests/test_backends.py tests/test_mcp_server.py; uv run ruff format --check mempalace/palace.py mempalace/mcp_server.py mempalace/backends/postgres.py tests/test_backends.py tests/test_mcp_server.py; uv lock --check; conflict-marker scan on mempalace/tests/uv.lock.

jphein · 2026-05-15T07:31:58Z

Third operator follow-up — substrate cutover is now production-stable on our fork. Posting numbers so they're indexed against this PR while it's still in flight.

State

Cutover complete on techempower-org/mempalace (formerly jphein/mempalace). Postgres backend + pgvector + Apache AGE serving live workload via palace-daemon on disks:8085. 273k drawers across 7 canonical rooms. Familiar (techempower-org/familiar.realm.watch) reads it over HTTP for live chat retrieval.
Hybrid retrieval shipped. candidate_strategy="hybrid" unions vector ∪ BM25 (postgres tsvector) ∪ graph-expanded candidates (AGE) and hybrid-reranks. Daemon-side endpoints: /search/keyword, /search/hybrid.
pgvector lazy-index race fix from my 2026-05-14 comment landed on our fork at 4566f8a and has held up under concurrent mining since. Recovery procedure from that comment exercised once cleanly during a separate incident.

Benchmark results — postgres backend, scale=small, 273k-drawer palace

8 of 9 mempalace tests/benchmarks/ suites passed end-to-end. (Skipped test_chromadb_stress — chromadb-specific, doesn't apply on postgres.) Wall: 1h 24m 25s on --bench-scale=small. Full JSON snapshot landed on our fork at techempower-org/mempalace#78.

Headlines:

Bench	Number	Notes
`test_ingest_bench` @ 100 files	6.5 files/s, 15.3 drawers/s	peak RSS 389 MB (Δ 25.5 MB), no allocator churn
`test_search_bench` p50 (hybrid)	125 ms	from a separate 2026-05-14 run, captured at 2026-05-14-search-bench-hybrid-cutover.json
`test_search_bench` p99 (4 workers)	200 ms	concurrent recall stable
`test_knowledge_graph_bench`	flat	latency stable 500 → 5000 scale points (AGE indexes holding)
`test_layers_bench`	sub-linear	RSS growth on unbounded fetch through 5000 scale — the chromadb-era quadratic does not reproduce on postgres
`test_mcp_bench` repeated calls	no leak	RSS plateau across long sessions
`test_memory_profile` heap top	identical shape vs chromadb-era	no regression on the search hot path post-cutover

The "no regression vs chromadb-era heap allocators" was the result I was most worried about given the substrate switch — it survived.

Methodology callouts for anyone implementing alternative encoders against this backend

chromadb.api.types.EmbeddingFunction 1.5+ added embed_query to the protocol. A custom encoder that defines only __call__ + name() works during Collection.upsert but raises AttributeError on Collection.query(query_texts=...). mempalace's searcher catches the AttributeError and silently falls back to BM25, so the swap looks like it ran but the encoder's embeddings are never actually exercised on queries. Subclassing EmbeddingFunction (which inherits a default embed_query that delegates to __call__) fixes it.

Worth flagging because this also affects anyone implementing the Backend interface (RFC 001 / #743) with a SentenceTransformer-shaped encoder — the protocol mismatch shows up at query time on the same layer. We landed a defensive doc note on our fork at techempower-org/mempalace#80; happy to upstream as a one-line comment in mempalace/embedding.py if useful.

Adjacent work this surfaced

3-way RRF reproduction at n=200 on the same backend (techempower-org/mempalace#77, tied to #1384): default ONNX + 2 adaptmem FT-Code checkpoints fuse to MRR 0.5101 (+0.0841 over best solo). Postgres backend with pg_trgm GIN + tsvector GIN handles the union-then-rerank fan-out cleanly.
n=200 git-derived probe set for chunking ablations at the same PR — replaces a hand-curated n=20 set whose paired bootstrap CIs all overlapped zero. Eliminates the "your probe set is hand-picked" objection for any future retrieval ablation against the postgres backend.

Happy to share probe-level rank data, raw bench JSONs, or the daemon-side /search/hybrid trace format if any of this is useful while #665 lands.

…37) Four call sites in cli.py still used the legacy positional signature ``backend.get_collection(palace_path, collection_name)`` from before PR #21's MemPalace#665 cherry-pick. They worked because ChromaBackend.get_collection accepts both forms via ``_normalize_get_collection_args`` (the answer to issue #37 question (a): the compat shim exists and is intentional). Migrating for consistency with palace.get_collection and searcher.search_memories, which already use the new shape. Future backends (postgres, in-memory test fakes) may not carry the same positional compat surface; converging the codebase on the kwargs form keeps the migration runway short when one of those tightens. Touched sites (cli.py): - cmd_purge match-count probe (line ~1192) - cmd_drawers list-by-wing/room probe (~1282) - cmd_repair_status drawer count (~1492) - cmd_extract palace open (~1645) One test asserted the old positional call shape — updated to match the new PalaceRef call shape. 75/75 cli tests pass. Closes #37 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…37) (#91) * refactor(cli): migrate get_collection call sites to PalaceRef kwargs (#37) Four call sites in cli.py still used the legacy positional signature ``backend.get_collection(palace_path, collection_name)`` from before PR #21's MemPalace#665 cherry-pick. They worked because ChromaBackend.get_collection accepts both forms via ``_normalize_get_collection_args`` (the answer to issue #37 question (a): the compat shim exists and is intentional). Migrating for consistency with palace.get_collection and searcher.search_memories, which already use the new shape. Future backends (postgres, in-memory test fakes) may not carry the same positional compat surface; converging the codebase on the kwargs form keeps the migration runway short when one of those tightens. Touched sites (cli.py): - cmd_purge match-count probe (line ~1192) - cmd_drawers list-by-wing/room probe (~1282) - cmd_repair_status drawer count (~1492) - cmd_extract palace open (~1645) One test asserted the old positional call shape — updated to match the new PalaceRef call shape. 75/75 cli tests pass. Closes #37 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: ruff format (CI lint pass-through) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

skuznetsov marked this pull request as ready for review April 12, 2026 03:44

skuznetsov requested review from bensig, igorls and milla-jovovich as code owners April 12, 2026 03:44

web3guru888 approved these changes Apr 12, 2026

View reviewed changes

This was referenced Apr 12, 2026

RFC: Storage backend plugin specification #737

Open

feat: support Chroma HttpClient + per-tenant collection prefix #697

Closed

igorls mentioned this pull request Apr 12, 2026

docs: RFC 001 — storage backend plugin specification #743

Open

5 tasks

This comment was marked as duplicate.

Sign in to view

igorls added area/install pip/uv/pipx/plugin install and packaging area/mcp MCP server and tools area/search Search and retrieval storage labels Apr 14, 2026

igorls mentioned this pull request Apr 18, 2026

refactor(backends): RFC 001 §10 cleanup — typed results, PalaceRef, registry #995

Merged

4 tasks

skuznetsov force-pushed the codex/pg-backend-stage2 branch 2 times, most recently from 21df7a8 to 16f8af7 Compare April 19, 2026 00:27

skuznetsov force-pushed the codex/pg-backend-stage2 branch from 16f8af7 to 83d5448 Compare April 19, 2026 00:32

This was referenced May 11, 2026

docs(spec): postgres + pgvector + apache age substrate migration techempower-org/mempalace#19

Merged

docs(plan): pgvector + AGE migration implementation plan techempower-org/mempalace#20

Merged

This was referenced May 11, 2026

feat(backends): pgvector substrate via #665 cherry-pick + smoke tests techempower-org/mempalace#21

Merged

docs(readme): pivot to four-layer model + Auto Dream vindication framing techempower-org/mempalace#22

Merged

jphein added a commit to techempower-org/mempalace that referenced this pull request May 11, 2026

Merge pull request #21 from jphein/feat/pgvector-age-impl

de24984

feat(backends): pgvector substrate via MemPalace#665 cherry-pick + smoke tests

jphein added the enhancement New feature or request label May 11, 2026

jphein mentioned this pull request May 14, 2026

PostgresBackend._maybe_create_vector_index: race + name-mismatch wedges database under concurrent writes techempower-org/mempalace#73

Closed

This was referenced May 22, 2026

refactor(palace): migrate off _DEFAULT_BACKEND compat shim (#35) techempower-org/mempalace#122

Merged

Monitor upstream #665 Plan-B trigger date (2026-06-08) techempower-org/mempalace#166

Open

This was referenced Jun 3, 2026

Add pluggable vector backends #1679

Merged

RFC 001 §7.3: implement backend maintenance hooks (observable run_maintenance) #1725

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optional PostgreSQL backend with pg_sorted_heap support#665

Add optional PostgreSQL backend with pg_sorted_heap support#665
skuznetsov wants to merge 3 commits into
MemPalace:developfrom
skuznetsov:codex/pg-backend-stage2

skuznetsov commented Apr 12, 2026 •

edited

Loading

Uh oh!

web3guru888 left a comment

Uh oh!

This comment was marked as duplicate.

Uh oh!

dekoza commented Apr 13, 2026

Uh oh!

skuznetsov commented Apr 13, 2026 •

edited

Loading

Uh oh!

skuznetsov commented Apr 19, 2026

Uh oh!

jphein commented May 11, 2026

Uh oh!

jphein commented May 14, 2026 •

edited

Loading

Uh oh!

skuznetsov commented May 14, 2026

Uh oh!

jphein commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

skuznetsov commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

How to test

Checklist

Uh oh!

web3guru888 left a comment

Choose a reason for hiding this comment

PostgreSQL Backend Review — Thorough Read-Through

Strengths

Minor observations (not blocking)

Impact for downstream consumers

Uh oh!

This comment was marked as duplicate.

Uh oh!

dekoza commented Apr 13, 2026

Uh oh!

skuznetsov commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skuznetsov commented Apr 19, 2026

Uh oh!

jphein commented May 11, 2026

Uh oh!

jphein commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What we observed

Race

Aggravating factor (separate, lower priority)

Minimal hardening that would have prevented the wedge

Local mitigation

Offer

Uh oh!

skuznetsov commented May 14, 2026

Uh oh!

jphein commented May 15, 2026

State

Benchmark results — postgres backend, scale=small, 273k-drawer palace

Methodology callouts for anyone implementing alternative encoders against this backend

Adjacent work this surfaced

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

skuznetsov commented Apr 12, 2026 •

edited

Loading

skuznetsov commented Apr 13, 2026 •

edited

Loading

jphein commented May 14, 2026 •

edited

Loading