Commit b87ce05

jphein and claude authored
feat(kg): SPOC temporal validity — context slot + auto-derived valid_from + as_of timeline (#161) (#294)
* feat(kg): SPOC temporal validity — context slot + auto-derived valid_from + as_of timeline

  KG triples gain a `context` fourth axis on the AGE backend that anchors each fact to its witnessing drawer/conversation. Read paths (`query_triples`, `query_entity`, `query_relationship`, `timeline`) all surface the slot; triples written before this change read back with `context=None` so consumers don't need a missing-key check. No AGE schema migration needed — just a new property on existing RELATION edges.

  The async KG-extraction worker (`kg_triple_worker.py`) now anchors every auto-extracted triple to `context=drawer:{drawer_id}` and falls back to the drawer's metadata `timestamp` / `filed_at` / `session_created_at` (first non-empty wins) for `valid_from` when the LLM extractor doesn't supply one. An explicit `valid_from` from the extractor — a date parsed out of the prose — still takes precedence.

  MCP-tool surface:

  * `mempalace_kg_add` accepts `context` (AGE backend stores; SQLite silently ignores so callers don't branch on backend).
  * `mempalace_kg_timeline` accepts `as_of` and validates it through the same ISO-8601 gate as `mempalace_kg_query`; the accepted value round-trips in the response.

  Tests: +20 (3631 → 3651). Worker unit tests cover the context-anchor, auto-derive priority, extractor-vs-drawer precedence, and the missing-timestamp open-interval path; AGE tests cover the context slot on every read path + `timeline(as_of=...)` filtering both with and without an entity filter; MCP tests cover boundary rejection on `as_of` and `context` plus the round-trip response shape.

  Closes #161

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docs): backtick-quote `_derive_valid_from` in fork-changes entry

  CI's markdownlint reads underscores following ", " as italic-emphasis markers (MD037), tripping on the unquoted `_derive_valid_from` test description. Backtick-quoting promotes it to inline-code and lets the lint pass.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1e6327a commit b87ce05
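
The `valid_from` precedence described in the commit message (an explicit extractor date first, then the drawer's `timestamp`, `filed_at`, `session_created_at`, otherwise left open) can be sketched roughly as follows. This is an illustration under assumptions, not the worker's code: `resolve_valid_from` and the sample metadata are hypothetical, and `derive_valid_from` only mirrors the behaviour the message attributes to `_derive_valid_from`.

```python
from typing import Optional

# Key order as stated in the commit message; first non-empty string wins.
TIMESTAMP_KEYS = ("timestamp", "filed_at", "session_created_at")


def derive_valid_from(metadata: Optional[dict]) -> Optional[str]:
    """Pick the first non-empty timestamp string from drawer metadata."""
    if not isinstance(metadata, dict):
        return None
    for key in TIMESTAMP_KEYS:
        value = metadata.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return None


def resolve_valid_from(extractor_value: Optional[str], metadata: Optional[dict]) -> Optional[str]:
    """Hypothetical helper: an explicit extractor date beats drawer metadata."""
    return extractor_value or derive_valid_from(metadata)


# Sample metadata is made up for illustration.
drawer_meta = {"timestamp": "", "filed_at": "2026-05-28T09:30:00"}

from_extractor = resolve_valid_from("2025-05-01", drawer_meta)  # extractor wins
from_drawer = resolve_valid_from(None, drawer_meta)             # falls back to filed_at
open_interval = resolve_valid_from(None, {})                    # nothing to derive
```

A `None` result corresponds to the open `valid_from` case, which the commit says read paths treat as "active since forever."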

12 files changed

Lines changed: 970 additions & 185 deletions

FORK_CHANGELOG.md

Lines changed: 44 additions & 0 deletions
@@ -68,6 +68,50 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 *Files:* `scripts/eval_fusion_ab.py`, `tests/test_eval_fusion_ab.py`, `docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.md`, `docs/research/2026-05-28-rrf-vs-hybrid-rerank-ab.json`


+- **KG triples gain SPOC context slot + worker auto-derives valid_from from drawer metadata (#161)** ([`HEAD`](https://github.com/techempower-org/mempalace/commit/HEAD))
+KG triples now carry a fourth axis — ``context`` — that anchors a
+fact to where it was witnessed (e.g. ``drawer:abc123``,
+``conversation:2026-05-28``). The ``add_triple`` write path on the
+AGE backend stores it as a property on the ``RELATION`` edge; every
+read path (``query_triples``, ``query_entity``, ``query_relationship``,
+``timeline``) surfaces it in the result dict. Triples written
+before this slot existed read back with ``context=None``, so
+consumers don't need a missing-key check.
+
+The async KG-extraction worker (``kg_triple_worker.py``) now:
+
+* **Anchors every auto-extracted triple** to its witnessing drawer
+via ``context=f"drawer:{drawer_id}"`` — the SPOC fourth axis is
+always populated on auto-derived facts.
+* **Auto-derives ``valid_from``** from the drawer's metadata
+when the LLM extractor doesn't supply one. Priority order is
+``timestamp`` (sweeper / convo_miner) → ``filed_at`` (legacy diary)
+→ ``session_created_at`` (opencode adapter); first non-empty
+wins. Missing keys leave ``valid_from`` open, which read paths
+already treat as "active since forever."
+* **Defers to the extractor** when it does emit an explicit
+``valid_from`` — a date the LLM parsed out of the prose
+("starting May 2025") is more specific than the drawer's
+authored time and takes precedence.
+
+The MCP-tool surface grew matching parameters:
+
+* ``mempalace_kg_add`` accepts ``context`` (AGE backend stores;
+SQLite silently ignores so callers don't need to branch on
+backend).
+* ``mempalace_kg_timeline`` accepts ``as_of`` and validates it
+through the same ISO-8601 gate as ``mempalace_kg_query``. The
+accepted value round-trips in the response so callers can echo
+the temporal slice.
+
+No AGE schema migration was needed — the slot is just an
+additional property on existing edges. Triples written before this
+change continue to read back cleanly with ``context=None``.
+
+*Tests:* 20 — tests/test_kg_triple_worker.py (context-cypher inclusion/omission, `_derive_valid_from` priority/None paths, worker anchors context=drawer:id, derives valid_from from metadata timestamp, extractor valid_from wins over drawer timestamp, missing timestamp writes open valid_from); tests/test_knowledge_graph_age.py (add_triple persists context, optional/omitted, query_entity returns context, timeline with as_of, timeline without entity respects as_of, timeline returns context field); tests/test_mcp_server.py (kg_timeline rejects invalid as_of, includes as_of in response, default omits as_of, kg_add rejects context with null bytes)
+*Files:* `mempalace/knowledge_graph_age.py`, `mempalace/kg_triple_worker.py`, `mempalace/mcp_server.py`, `tests/test_knowledge_graph_age.py`, `tests/test_kg_triple_worker.py`, `tests/test_mcp_server.py`
+
+
 ## [2026-05-27]


README.md

Lines changed: 77 additions & 76 deletions
Large diffs are not rendered by default.
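
The changelog entry above says `mempalace_kg_timeline` validates `as_of` through an ISO-8601 gate and echoes the accepted value in its response; the listed tests add that invalid values are rejected and that the key is omitted by default. A rough sketch of that contract follows. Everything here is an assumption made for illustration: `validate_as_of`, `timeline_response`, and the `timeline` response key are hypothetical names, and `datetime.fromisoformat` stands in for the project's actual gate, which is not shown on this page.

```python
from datetime import datetime
from typing import Optional


def validate_as_of(as_of: Optional[str]) -> Optional[str]:
    """Hypothetical gate: return the value unchanged, or raise ValueError."""
    if as_of is None:
        return None
    if not isinstance(as_of, str) or "\x00" in as_of:
        raise ValueError("as_of must be an ISO-8601 string")
    # Stand-in check; raises ValueError on malformed input.
    datetime.fromisoformat(as_of)
    return as_of


def timeline_response(facts: list, as_of: Optional[str] = None) -> dict:
    """Hypothetical response shape: echo as_of only when one was supplied."""
    response = {"timeline": facts}
    accepted = validate_as_of(as_of)
    if accepted is not None:
        response["as_of"] = accepted
    return response


with_slice = timeline_response([], as_of="2026-05-28")
without_slice = timeline_response([])
```

Returning the accepted string unchanged, rather than a re-serialized timestamp, is one simple way to make the value round-trip exactly as the caller sent it.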

docs/fork-changes.yaml

Lines changed: 54 additions & 0 deletions
@@ -76,6 +76,60 @@ entries:
       2009); a sweep is a follow-up if RRF is competitive enough to
       be worth refining.

+  - id: kg-spoc-temporal-validity
+    date: 2026-05-28
+    bucket: Added
+    commit: HEAD
+    area: Search
+    summary: "KG triples gain SPOC context slot + worker auto-derives valid_from from drawer metadata (#161)"
+    tests: "20 — tests/test_kg_triple_worker.py (context-cypher inclusion/omission, `_derive_valid_from` priority/None paths, worker anchors context=drawer:id, derives valid_from from metadata timestamp, extractor valid_from wins over drawer timestamp, missing timestamp writes open valid_from); tests/test_knowledge_graph_age.py (add_triple persists context, optional/omitted, query_entity returns context, timeline with as_of, timeline without entity respects as_of, timeline returns context field); tests/test_mcp_server.py (kg_timeline rejects invalid as_of, includes as_of in response, default omits as_of, kg_add rejects context with null bytes)"
+    files:
+      - mempalace/knowledge_graph_age.py
+      - mempalace/kg_triple_worker.py
+      - mempalace/mcp_server.py
+      - tests/test_knowledge_graph_age.py
+      - tests/test_kg_triple_worker.py
+      - tests/test_mcp_server.py
+    body: |
+      KG triples now carry a fourth axis — ``context`` — that anchors a
+      fact to where it was witnessed (e.g. ``drawer:abc123``,
+      ``conversation:2026-05-28``). The ``add_triple`` write path on the
+      AGE backend stores it as a property on the ``RELATION`` edge; every
+      read path (``query_triples``, ``query_entity``, ``query_relationship``,
+      ``timeline``) surfaces it in the result dict. Triples written
+      before this slot existed read back with ``context=None``, so
+      consumers don't need a missing-key check.
+
+      The async KG-extraction worker (``kg_triple_worker.py``) now:
+
+      * **Anchors every auto-extracted triple** to its witnessing drawer
+        via ``context=f"drawer:{drawer_id}"`` — the SPOC fourth axis is
+        always populated on auto-derived facts.
+      * **Auto-derives ``valid_from``** from the drawer's metadata
+        when the LLM extractor doesn't supply one. Priority order is
+        ``timestamp`` (sweeper / convo_miner) → ``filed_at`` (legacy diary)
+        → ``session_created_at`` (opencode adapter); first non-empty
+        wins. Missing keys leave ``valid_from`` open, which read paths
+        already treat as "active since forever."
+      * **Defers to the extractor** when it does emit an explicit
+        ``valid_from`` — a date the LLM parsed out of the prose
+        ("starting May 2025") is more specific than the drawer's
+        authored time and takes precedence.
+
+      The MCP-tool surface grew matching parameters:
+
+      * ``mempalace_kg_add`` accepts ``context`` (AGE backend stores;
+        SQLite silently ignores so callers don't need to branch on
+        backend).
+      * ``mempalace_kg_timeline`` accepts ``as_of`` and validates it
+        through the same ISO-8601 gate as ``mempalace_kg_query``. The
+        accepted value round-trips in the response so callers can echo
+        the temporal slice.
+
+      No AGE schema migration was needed — the slot is just an
+      additional property on existing edges. Triples written before this
+      change continue to read back cleanly with ``context=None``.
+
   - id: cli-bulk-move-relocation
     date: 2026-05-27
     bucket: Added
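
The entry above states that an open `valid_from` is treated as "active since forever" and that legacy triples read back with `context=None`. A small in-memory sketch of how an `as_of` slice could behave under those two rules is given here. It is not the AGE query: `active_at` and the sample facts are hypothetical, and only the lower bound is modelled because the entry does not describe an end bound.

```python
from datetime import datetime
from typing import Optional


def active_at(fact: dict, as_of: Optional[str]) -> bool:
    """Hypothetical filter: is this fact visible in the as_of slice?"""
    if as_of is None:
        return True  # no temporal slice requested
    valid_from = fact.get("valid_from")
    if valid_from is None:
        return True  # open interval: "active since forever"
    return datetime.fromisoformat(valid_from) <= datetime.fromisoformat(as_of)


# Made-up facts; the second one imitates a triple written before the
# context slot existed, so both valid_from and context are None.
facts = [
    {"object": "acme", "valid_from": "2025-05-01", "context": "drawer:abc123"},
    {"object": "paris", "valid_from": None, "context": None},
    {"object": "globex", "valid_from": "2026-06-01", "context": "conversation:2026-05-28"},
]

visible = [f for f in facts if active_at(f, "2026-05-28")]
contexts = [f["context"] for f in visible]
```

Because the `context` key is always present, consumers can read `f["context"]` directly and treat `None` as "no recorded witness" without a missing-key check.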

mempalace/kg_triple_worker.py

Lines changed: 88 additions & 3 deletions
@@ -197,6 +197,40 @@ async def _fetch_drawer_text_async(conn, drawer_id: str) -> Optional[str]:
     return row[0] if row else None


+async def _fetch_drawer_text_and_metadata_async(
+    conn, drawer_id: str
+) -> tuple[Optional[str], Optional[dict]]:
+    """Return (document, metadata) for ``drawer_id`` or (None, None) if missing.
+
+    Companion to ``_fetch_drawer_text_async`` used by the SPOC temporal
+    pipeline (techempower-org/mempalace#161): the metadata blob carries the
+    drawer's ``timestamp`` / ``filed_at`` from the upstream mining or
+    conversation stage, which we map to ``valid_from`` on every extracted
+    triple. A drawer without those keys still extracts cleanly — the worker
+    just omits the auto-derived ``valid_from``.
+    """
+    async with conn.cursor() as cur:
+        await cur.execute(
+            "SELECT document, metadata FROM mempalace_drawers WHERE id = %s LIMIT 1",
+            (drawer_id,),
+        )
+        row = await cur.fetchone()
+    if not row:
+        return None, None
+    document = row[0]
+    metadata = row[1]
+    if isinstance(metadata, str):
+        # psycopg occasionally returns jsonb as text depending on type cast;
+        # tolerate both so downstream code can use a dict consistently.
+        try:
+            import json as _json
+
+            metadata = _json.loads(metadata) if metadata else {}
+        except (ValueError, TypeError):
+            metadata = {}
+    return document, metadata if isinstance(metadata, dict) else {}
+
+
 def _fetch_drawer_text(conn, drawer_id: str) -> Optional[str]:
     """Return the ``document`` column for ``drawer_id``, or None if absent."""
     with conn.cursor() as cur:
@@ -208,6 +242,35 @@ def _fetch_drawer_text(conn, drawer_id: str) -> Optional[str]:
     return row[0] if row else None


+# Metadata keys, in priority order, that the worker maps to a triple's
+# ``valid_from`` when the LLM extractor didn't supply one. ``timestamp``
+# is the standard sweeper / convo_miner field; ``filed_at`` is the older
+# diary stamp; ``session_created_at`` covers opencode adapters. First
+# non-empty wins. Centralized so a future format change has one edit
+# point. See techempower-org/mempalace#161 for the SPOC rollout.
+_DRAWER_TIMESTAMP_KEYS = ("timestamp", "filed_at", "session_created_at")
+
+
+def _derive_valid_from(metadata: Optional[dict]) -> Optional[str]:
+    """Pull the drawer's authored time from metadata for SPOC valid_from.
+
+    Returns the first non-empty string value found at one of
+    ``_DRAWER_TIMESTAMP_KEYS``. Returns ``None`` when no candidate key is
+    populated — in that case the triple is written with an open
+    ``valid_from`` (NULL), which read paths already treat as "active
+    since forever" (see ``KnowledgeGraphAGE.query_triples`` as_of
+    semantics). Sanitization happens at the AGE layer
+    (``sanitize_iso_temporal``); this helper only selects the candidate.
+    """
+    if not isinstance(metadata, dict):
+        return None
+    for key in _DRAWER_TIMESTAMP_KEYS:
+        value = metadata.get(key)
+        if isinstance(value, str) and value.strip():
+            return value
+    return None
+
+
 async def _mark_completed_async(conn, drawer_id: str, triple_count: int) -> None:
     async with conn.cursor() as cur:
         await cur.execute(
@@ -459,6 +522,7 @@ def _add_triple_cypher(
     valid_from: Optional[str],
     confidence: float,
     raw_relation_type: Optional[str] = None,
+    context: Optional[str] = None,
 ) -> str:
     """Render the inlined Cypher source for a single ``add_triple`` write.

@@ -467,6 +531,10 @@ def _add_triple_cypher(
     Upstream callers (``extract_triples``) already strip nothing extra,
     so a hostile LLM output that happens to embed ``$mp_age_q$`` will
     fail loudly here rather than escape the SQL boundary.
+
+    ``context`` is the SPOC anchor (techempower-org/mempalace#161) — set
+    by the worker to ``drawer:{drawer_id}`` on every auto-extracted
+    triple so consumers can trace a fact back to its witnessing drawer.
     """
     # Build the property map keys dynamically — a Cypher property map
     # rejects bare ``NULL`` as a value (``SyntaxError: a name constant is
@@ -492,6 +560,9 @@ def _add_triple_cypher(
     if raw_relation_type is not None:
         prop_pairs.append("raw_relation_type: $rrt")
         params["rrt"] = raw_relation_type
+    if context is not None:
+        prop_pairs.append("context: $ctx")
+        params["ctx"] = context

     cypher = f"""
     MERGE (s:Entity {{name: $subj}})
@@ -526,6 +597,7 @@ async def add_triple(
     valid_from: Optional[str] = None,
     confidence: float = DEFAULT_TRIPLE_CONFIDENCE,
     raw_relation_type: Optional[str] = None,
+    context: Optional[str] = None,
 ) -> None:
     # Defense in depth: reject any value carrying the AGE outer
     # dollar-quote tag before the inlining step. ``_cypher_literal``
@@ -540,6 +612,8 @@ async def add_triple(
         _cypher_literal(valid_from)
     if raw_relation_type is not None:
         _cypher_literal(raw_relation_type)
+    if context is not None:
+        _cypher_literal(context)

     cypher_inlined = _add_triple_cypher(
         subject,
@@ -549,6 +623,7 @@ async def add_triple(
         valid_from=valid_from,
         confidence=confidence,
         raw_relation_type=raw_relation_type,
+        context=context,
     )
     # AGE expects cypher() first arg as a single-quoted string literal
     # ("name constant"). psycopg3 binds %s as a server-side $1 param
@@ -633,13 +708,22 @@ async def _process_one(
     """
     try:
         async with pool.conn() as conn:
-            text = await _fetch_drawer_text_async(conn, drawer.drawer_id)
+            text, drawer_metadata = await _fetch_drawer_text_and_metadata_async(
+                conn, drawer.drawer_id
+            )
         if not text:
             async with pool.conn() as conn:
                 await _mark_completed_async(conn, drawer.drawer_id, 0)
             stats.drawers_processed += 1
             return

+        # SPOC temporal scoping (#161): if the LLM extractor doesn't supply
+        # a valid_from on a triple, fall back to the drawer's authored
+        # time. This anchors auto-extracted facts to when they were
+        # witnessed even when the LLM doesn't infer a date from the prose.
+        derived_valid_from = _derive_valid_from(drawer_metadata)
+        drawer_context = f"drawer:{drawer.drawer_id}"
+
         triples = await _extract_under_sem(http_client, endpoint, model, text, sem)

         for t in triples:
@@ -663,9 +747,10 @@ async def _process_one(
                     t.subject,
                     mapped.relation_type or t.predicate,
                     t.object,
-                    source=f"drawer:{drawer.drawer_id}",
-                    valid_from=t.valid_from,
+                    source=drawer_context,
+                    valid_from=t.valid_from or derived_valid_from,
                     raw_relation_type=mapped.raw_relation_type,
+                    context=drawer_context,
                 )
             except Exception as e:  # noqa: BLE001
                 logger.warning(
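
The `_add_triple_cypher` hunks above build the Cypher property map dynamically because, per the code comment, a property map rejects a bare `NULL` value. The pattern of emitting a key only when its value is present can be sketched on its own as follows. This is a simplified illustration, not the project's function: `build_prop_map` is a hypothetical name, and the `$conf` and `$vf` placeholders are assumptions (only `$rrt` and `$ctx` appear in the diff).

```python
from typing import Optional


def build_prop_map(
    valid_from: Optional[str],
    confidence: float,
    raw_relation_type: Optional[str] = None,
    context: Optional[str] = None,
) -> tuple[str, dict]:
    """Hypothetical helper: render a property map that skips None values."""
    prop_pairs = ["confidence: $conf"]
    params: dict = {"conf": confidence}
    optional_props = (
        ("valid_from", "vf", valid_from),
        ("raw_relation_type", "rrt", raw_relation_type),
        ("context", "ctx", context),
    )
    for name, placeholder, value in optional_props:
        if value is not None:
            # Only present values get a key, so no NULL reaches the map.
            prop_pairs.append(f"{name}: ${placeholder}")
            params[placeholder] = value
    return "{" + ", ".join(prop_pairs) + "}", params


prop_map, params = build_prop_map(None, 0.8, context="drawer:abc123")
```

With this shape, a triple written without `context` simply has no such property on its edge, which is consistent with the changelog's note that no schema migration was needed.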
