Commit 8fd0b01

jphein and claude authored
fix(mcp): mempalace_kg_stats returns structured envelope on transient psycopg failures (#309)
Observed in production 2026-05-28 09:59 PDT (familiar): postgres got cgroup-OOM-killed under writethrough load. The mempalace_kg_stats MCP tool propagated the raw psycopg.OperationalError to the envelope as "Tool error in mempalace_kg_stats" — an opaque -32000 internal error with no signal that "retry in a moment" is the right response.

Wrap the tool in try/except for the two psycopg families that fire on dropped connections (OperationalError, InterfaceError) and return:

{"error": "backend_unavailable", "detail": "...", "retryable": true}

Other exceptions (cypher syntax, value validation, schema mismatch) still propagate — those are bugs, not transient backend state, and "retryable" would mask them.

The transient-error classifier is broken out as _is_transient_postgres_error() so other MCP tools can adopt the same shape without re-implementing the check. Broader refactor moving this into _call_kg is left as a follow-up.

Closes #299

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
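For illustration, here is a minimal sketch of how a caller might act on the `retryable` flag in that envelope. Nothing below is part of this commit: `call_with_retry`, its `attempts` and `base_delay` parameters, and the exponential backoff policy are all assumptions made for the example.

```python
import time
from typing import Any, Callable


def call_with_retry(
    tool: Callable[[], dict[str, Any]],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call ``tool`` and retry while it reports a retryable backend outage.

    Hypothetical helper: the envelope keys ("error", "retryable") come from
    the commit message; the retry policy itself is an assumption.
    """
    result: dict[str, Any] = {}
    for attempt in range(attempts):
        result = tool()
        is_retryable = (
            result.get("error") == "backend_unavailable"
            and result.get("retryable") is True
        )
        if not is_retryable:
            # Either a normal result or a non-retryable error payload.
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2**attempt)
    # Still unavailable after the last attempt: hand the envelope back as-is.
    return result
```

Exceptions raised by the tool are deliberately not caught here, which matches the commit's intent that non-transient failures stay visible as real errors.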
1 parent a3d8fa5 commit 8fd0b01

7 files changed

Lines changed: 348 additions & 156 deletions

FORK_CHANGELOG.md

Lines changed: 31 additions & 0 deletions
@@ -162,6 +162,37 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 *Files:* `mempalace/knowledge_graph_age.py`, `mempalace/kg_triple_worker.py`, `mempalace/mcp_server.py`, `tests/test_knowledge_graph_age.py`, `tests/test_kg_triple_worker.py`, `tests/test_mcp_server.py`
 
 
+### Fixed
+
+
+- **mempalace_kg_stats returns structured backend-unavailable envelope on transient psycopg failures (#299)** ([`HEAD`](https://github.com/techempower-org/mempalace/commit/HEAD))
+Observed in production 2026-05-28 09:59 PDT (familiar): postgres
+OOM-killed under writethrough load. The `mempalace_kg_stats` MCP
+tool propagated the raw `psycopg.OperationalError` to the
+envelope as `Tool error in mempalace_kg_stats` — an opaque
+-32000 internal error to the caller, with no signal that
+"retry in a moment" is the right response.
+
+Wraps `tool_kg_stats` with a try/except that catches the two
+psycopg families that fire on dropped connections
+(`OperationalError`, `InterfaceError`) and returns
+`{"error": "backend_unavailable", "detail": "...", "retryable": true}`.
+Other exceptions (cypher syntax, value validation, schema
+mismatch) still propagate — those are bugs, not transient
+backend state, and "retryable" would mask them.
+
+Transient-error classifier broken out as
+`_is_transient_postgres_error()` so other MCP tools can adopt
+the shape without re-implementing the family check. Broader
+`_call_kg` refactor left as a follow-up — the smallest-blast-
+radius fix is the right shape while
+techempower-org/familiar.realm.watch#50 (raise postgres
+MemoryMax cap) is being addressed.
+
+*Tests:* 3 — tests/test_mcp_server.py (psycopg.OperationalError surfaces structured envelope, psycopg.InterfaceError same, non-transient ValueError propagates)
+*Files:* `mempalace/mcp_server.py`, `tests/test_mcp_server.py`
+
+
 ## [2026-05-27]
 
 
README.md

Lines changed: 78 additions & 77 deletions
Large diffs are not rendered by default.

docs/fork-changes.yaml

Lines changed: 35 additions & 0 deletions
@@ -24,6 +24,41 @@
 
 entries:
 
+  - id: mcp-kg-stats-structured-error-envelope
+    date: 2026-05-28
+    bucket: Fixed
+    commit: HEAD
+    area: MCP
+    summary: "mempalace_kg_stats returns structured backend-unavailable envelope on transient psycopg failures (#299)"
+    tests: "3 — tests/test_mcp_server.py (psycopg.OperationalError surfaces structured envelope, psycopg.InterfaceError same, non-transient ValueError propagates)"
+    files:
+      - mempalace/mcp_server.py
+      - tests/test_mcp_server.py
+    body: |
+      Observed in production 2026-05-28 09:59 PDT (familiar): postgres
+      OOM-killed under writethrough load. The `mempalace_kg_stats` MCP
+      tool propagated the raw `psycopg.OperationalError` to the
+      envelope as `Tool error in mempalace_kg_stats` — an opaque
+      -32000 internal error to the caller, with no signal that
+      "retry in a moment" is the right response.
+
+      Wraps `tool_kg_stats` with a try/except that catches the two
+      psycopg families that fire on dropped connections
+      (`OperationalError`, `InterfaceError`) and returns
+      `{"error": "backend_unavailable", "detail": "...", "retryable": true}`.
+      Other exceptions (cypher syntax, value validation, schema
+      mismatch) still propagate — those are bugs, not transient
+      backend state, and "retryable" would mask them.
+
+      Transient-error classifier broken out as
+      `_is_transient_postgres_error()` so other MCP tools can adopt
+      the shape without re-implementing the family check. Broader
+      `_call_kg` refactor left as a follow-up — the smallest-blast-
+      radius fix is the right shape while
+      techempower-org/familiar.realm.watch#50 (raise postgres
+      MemoryMax cap) is being addressed.
+
+
   - id: cli-why-and-tunnels-fast-path
     date: 2026-05-28
     bucket: Added

mempalace/mcp_server.py

Lines changed: 45 additions & 2 deletions
@@ -2565,8 +2565,51 @@ def _timeline(kg):
 
 
 def tool_kg_stats():
-    """Knowledge graph overview: entities, triples, relationship types."""
-    return _call_kg(lambda kg: kg.stats())
+    """Knowledge graph overview: entities, triples, relationship types.
+
+    Returns a structured error envelope on transient postgres failures
+    (the connection dropped between `_call_kg` opening the handle and
+    `kg.stats()` finishing its query — typically caused by a postgres
+    OOM-kill or restart under load). The caller sees
+    ``{"error": "backend_unavailable", "retryable": True, ...}`` and can
+    surface "try again in a moment" instead of an opaque -32000
+    internal error. See techempower-org/mempalace#299.
+
+    Non-transient errors (cypher syntax, value-validation, schema
+    mismatch) still propagate — those need a real fix, not a retry.
+    """
+    try:
+        return _call_kg(lambda kg: kg.stats())
+    except Exception as e:  # noqa: BLE001
+        if _is_transient_postgres_error(e):
+            logger.warning("mempalace_kg_stats: backend transiently unavailable: %s", e)
+            return {
+                "error": "backend_unavailable",
+                "detail": str(e),
+                "retryable": True,
+            }
+        raise
+
+
+def _is_transient_postgres_error(exc: BaseException) -> bool:
+    """True when ``exc`` is a psycopg connection-dropped error.
+
+    Matches the same two error families ``kg_triple_worker._execute_with_retry``
+    catches (techempower-org/mempalace#298): ``OperationalError`` (server
+    closed the connection, OOM-restart, network blip) and
+    ``InterfaceError`` (the connection is closed, pool returned a dead
+    handle). Other psycopg errors — ``DataError``, ``ProgrammingError``
+    on the postgres side, syntax errors — are bugs, not transient
+    backend state, and should propagate.
+
+    Returns False if psycopg isn't importable (sqlite-only deployments
+    can't have postgres-transient errors).
+    """
+    try:
+        import psycopg
+    except ImportError:
+        return False
+    return isinstance(exc, (psycopg.OperationalError, psycopg.InterfaceError))
 
 
 # ==================== AGENT DIARY ====================
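The commit message notes that other MCP tools could adopt the same shape. One way that might look is a small decorator, sketched below under stated assumptions: `structured_on_transient` and `demo_tool` are hypothetical names, the classifier is passed in as a parameter so the example runs without psycopg, and the built-in `ConnectionError` stands in for the psycopg error families. This is not the `_call_kg` refactor the commit defers.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def structured_on_transient(
    tool_name: str,
    is_transient: Callable[[BaseException], bool],
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator: turn transient errors into the envelope.

    ``is_transient`` plays the role of ``_is_transient_postgres_error``;
    anything it rejects is re-raised unchanged.
    """

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return fn(*args, **kwargs)
            except Exception as e:  # noqa: BLE001
                if is_transient(e):
                    logger.warning(
                        "%s: backend transiently unavailable: %s", tool_name, e
                    )
                    return {
                        "error": "backend_unavailable",
                        "detail": str(e),
                        "retryable": True,
                    }
                raise  # bugs must stay visible, as in tool_kg_stats

        return wrapper

    return decorator


# Stand-in classifier for the demo; the real one checks psycopg families.
@structured_on_transient("demo_tool", lambda e: isinstance(e, ConnectionError))
def demo_tool() -> dict:
    raise ConnectionError("server closed the connection unexpectedly")
```

Calling `demo_tool()` returns the envelope instead of raising, while a `ValueError` raised inside a decorated function would still propagate to the caller.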

tests/test_mcp_server.py

Lines changed: 70 additions & 0 deletions
@@ -1812,6 +1812,76 @@ def test_kg_stats(self, monkeypatch, config, palace_path, seeded_kg):
         result = tool_kg_stats()
         assert result["entities"] >= 4
 
+    # --- Transient backend failures return structured envelope (#299) ---
+
+    def test_kg_stats_returns_structured_envelope_on_postgres_operational_error(
+        self, monkeypatch, config, palace_path, seeded_kg
+    ):
+        """When postgres drops a connection mid-stats (OOM-kill, network
+        blip, statement_timeout), ``tool_kg_stats`` should return a
+        structured ``{"error": "backend_unavailable", "retryable": True}``
+        envelope instead of propagating the raw ``psycopg.OperationalError``
+        as a -32000 internal error. Regression for techempower-org/mempalace#299.
+        """
+        import pytest
+
+        psycopg = pytest.importorskip("psycopg")
+
+        _patch_mcp_server(monkeypatch, config, seeded_kg)
+        from mempalace import mcp_server
+
+        def _boom(kg):
+            raise psycopg.OperationalError("server closed the connection unexpectedly")
+
+        monkeypatch.setattr(mcp_server, "_call_kg", lambda op: _boom(None))
+
+        result = mcp_server.tool_kg_stats()
+        assert result["error"] == "backend_unavailable"
+        assert result["retryable"] is True
+        assert "server closed" in result["detail"]
+
+    def test_kg_stats_returns_structured_envelope_on_postgres_interface_error(
+        self, monkeypatch, config, palace_path, seeded_kg
+    ):
+        """``psycopg.InterfaceError`` ("the connection is closed") — the
+        sibling family fired by a pool returning a dead handle — also
+        gets the structured envelope. See #299."""
+        import pytest
+
+        psycopg = pytest.importorskip("psycopg")
+
+        _patch_mcp_server(monkeypatch, config, seeded_kg)
+        from mempalace import mcp_server
+
+        def _boom(kg):
+            raise psycopg.InterfaceError("the connection is closed")
+
+        monkeypatch.setattr(mcp_server, "_call_kg", lambda op: _boom(None))
+
+        result = mcp_server.tool_kg_stats()
+        assert result["error"] == "backend_unavailable"
+        assert result["retryable"] is True
+
+    def test_kg_stats_propagates_non_transient_errors(
+        self, monkeypatch, config, palace_path, seeded_kg
+    ):
+        """A ``ValueError`` (cypher syntax, value-validation, etc.) is a
+        bug, not a transient backend state. It should propagate so the
+        MCP caller sees a real error instead of "retryable backend
+        unavailable" — that would mask the bug. See #299."""
+        import pytest
+
+        _patch_mcp_server(monkeypatch, config, seeded_kg)
+        from mempalace import mcp_server
+
+        def _boom(kg):
+            raise ValueError("cypher syntax error: unexpected token at position 42")
+
+        monkeypatch.setattr(mcp_server, "_call_kg", lambda op: _boom(None))
+
+        with pytest.raises(ValueError, match="cypher syntax error"):
+            mcp_server.tool_kg_stats()
+
     # --- Date validation at the MCP boundary (issue #1164) ---
 
     def test_kg_add_rejects_invalid_valid_from(self, monkeypatch, config, palace_path, kg):