feat(#251 follow-up): authoring-tier backfill worker by jayzalowitz · Pull Request #271 · jayzalowitz/skytwin

jayzalowitz · 2026-05-12T23:20:43Z

Summary

Pages indexed before #252 (Layer 1) have no authoringTier on metadata, which silently disables Layer 2 for their corpora — the multiplier reads metadata.authoringTier and there's nothing to read. This adds a worker that fills it in retroactively, plus the connector now persists the raw classification headers so reclassification works for every new signal going forward.

Closes (partial): #251 backfill follow-up. Pre-Layer-1 signals that don't carry classification headers stay untagged after this — full recovery for those needs a Gmail re-fetch (separate sub-issue, lower priority).

What runs now

The worker schedules runTierBackfillJob every hour. Each pass:

1. Queries brain_pages for rows where metadata->>'authoringTier' IS NULL, joins on brain_signals via source_ref, and returns up to 200 page/signal pairs.
2. For each page, tries two reclassification paths in order:
   - Trust the signal: copy signal.data.authoringTier to page metadata when it already exists. Cheap, lossless, same tier the connector produced at ingest time.
   - Reclassify: run classifyEmailAuthoringTier locally on the raw to / cc / inReplyTo / listUnsubscribe / listId / labels headers stored in signal.data.
3. Writes the result via updatePageMetadata (sets both authoringTier and a normalized fromAddress for the per-sender bulk-hide action shipped in PR #270, tier-aware privacy controls: pin / hide / hide-sender).
4. Logs an "unreclassifiable" count for signals that qualify for neither path and leaves them alone.

Idempotent: re-running on a fully-tagged corpus returns 0 from the find query and the pass is a no-op.
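The two-path decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in tier-backfill.ts: `resolveTier`, `Resolution`, and the injected `Classifier` type are hypothetical names, the real classifyEmailAuthoringTier signature is not shown in this PR, and the "at least one raw header present" check is a guess at how the job decides a signal is reclassifiable.

```typescript
// Hypothetical shapes for illustration; the real types in the PR may differ.
type SignalData = Record<string, unknown>;

// Stand-in for classifyEmailAuthoringTier, injected so this sketch does not
// invent the real classifier's rules. Returns null when it cannot decide.
type Classifier = (headers: SignalData) => string | null;

type Resolution =
  | { path: "copiedFromSignal"; tier: string }
  | { path: "reclassified"; tier: string }
  | { path: "unreclassifiable" };

// Raw header fields the PR says are stored on signal.data.
const HEADER_KEYS = ["to", "cc", "inReplyTo", "listUnsubscribe", "listId", "labels"] as const;

function resolveTier(data: SignalData, classify: Classifier): Resolution {
  // Path 1: trust the signal when it already carries a tier.
  const existing = data["authoringTier"];
  if (typeof existing === "string" && existing.length > 0) {
    return { path: "copiedFromSignal", tier: existing };
  }

  // Path 2: reclassify, but only if some raw header survived on the signal
  // (assumption: a signal with no headers at all cannot be classified).
  const hasHeaders = HEADER_KEYS.some((key) => data[key] !== undefined && data[key] !== null);
  if (hasHeaders) {
    const tier = classify(data);
    if (tier !== null) {
      return { path: "reclassified", tier };
    }
  }

  // Neither path applies: the caller counts the row and leaves the page alone.
  return { path: "unreclassifiable" };
}
```

Keeping the decision pure (no database calls) makes each path easy to unit-test in isolation; the metadata write can then be a thin wrapper around the result.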

Engine changes

findPagesMissingAuthoringTier(userId | null, limit) new adapter helper. JOIN brain_pages ↔ brain_signals on source_ref = id, filter on tier-missing, optional user scope. Returns { page_id, user_id, signal_data }[].
apps/worker/src/jobs/tier-backfill.ts new worker job + scheduled in apps/worker/src/index.ts at TIER_BACKFILL_INTERVAL_MS = 60 * 60 * 1000. Bounded by batchSize (default 200).
Gmail connector messageToSignal now also stamps to, cc, inReplyTo, listUnsubscribe on signal.data. The classifier already consumed these; now they're preserved in the signal row for future reclassification.
In-memory mirror of the find query for tests.
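As a rough illustration of what the find query does, an in-memory version might look like the sketch below. The `Page` and `Signal` shapes and the name `findMissingTierInMemory` are assumptions made for this example; unlike the real helper, which takes only `(userId | null, limit)`, this version receives the page and signal collections explicitly so it stays self-contained. Only the returned row shape `{ page_id, user_id, signal_data }` comes from the PR description.

```typescript
// Hypothetical in-memory shapes; the adapter's real row types may differ.
interface Page {
  id: string;
  userId: string;
  sourceRef: string | null; // id of the backing signal, if any
  metadata: Record<string, unknown>;
}

interface Signal {
  id: string;
  data: Record<string, unknown>;
}

// Row shape named in the PR description.
interface MissingTierRow {
  page_id: string;
  user_id: string;
  signal_data: Record<string, unknown>;
}

function findMissingTierInMemory(
  pages: Page[],
  signals: Signal[],
  userId: string | null,
  limit: number,
): MissingTierRow[] {
  const signalsById = new Map(signals.map((s) => [s.id, s] as const));
  const rows: MissingTierRow[] = [];

  for (const page of pages) {
    if (rows.length >= limit) break; // cap per-pass work
    if (userId !== null && page.userId !== userId) continue; // optional user scope
    if (page.metadata["authoringTier"] != null) continue; // already tagged

    // Inner-join semantics: pages without a backing signal are skipped.
    const signal = page.sourceRef === null ? undefined : signalsById.get(page.sourceRef);
    if (signal === undefined) continue;

    rows.push({ page_id: page.id, user_id: page.userId, signal_data: signal.data });
  }
  return rows;
}
```

Because tagged pages drop out of the filter, repeated passes shrink the result set until it is empty, which is what makes the job idempotent.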

Tests

9 new worker unit tests (tier-backfill.test.ts) cover signal-tier copy, header reclassification (SENT label + List-Unsubscribe → newsletter), unreclassifiable count, failed-update isolation, find-query throw → empty summary, fromAddress omission when missing, userId scope, default null scope.
4 new in-memory repository tests on findPagesMissingAuthoringTier: tier-present page excluded, signal-missing page skipped, userId scoping, limit cap.

Test plan

pnpm build --concurrency=1 → 35/35 packages.
pnpm test → 70/70 turbo tasks green.
pnpm --filter @skytwin/worker test -- tier-backfill → 9 pass.
pnpm --filter @skytwin/memory-gbrain-crdb-adapter test → 80 pass / 6 skipped (DB-gated).

Deferred

Pre-#252 signals (#252: stamp authoring tier on email signals, Layer 1 + Layer 3 minimal) that don't carry classification headers (oldest mail in a long-connected mailbox) stay untagged after this lands. The worker logs them as unreclassifiable. Recovering their tier needs an OAuth-token-dependent re-fetch from Gmail, tracked as a separate sub-issue.

🤖 Generated with Claude Code

Pages indexed before Layer 1 of #251 had no `authoringTier` on metadata, which silently disabled Layer 2 for their corpora. This adds a worker that fills in the tier retroactively, plus the connector now persists the raw classification headers so reclassification works going forward.

Engine:
- New adapter helper `findPagesMissingAuthoringTier(userId|null, limit)` joins brain_pages ↔ brain_signals via `source_ref = id`, filters on pages where `metadata->>'authoringTier' IS NULL`, optional user scope.
- `apps/worker/src/jobs/tier-backfill.ts`: the job. Two reclassification paths:
  1. Trust the signal: copy `signal.data.authoringTier` to page metadata when it exists (post-#252 paths that bypassed the metadata projection for any reason).
  2. Reclassify: run the classifier locally on the raw `to` / `cc` / `inReplyTo` / `listUnsubscribe` / `listId` / `labels` headers.
  Pages whose signal carries neither path are counted as "unreclassifiable" and left alone. Pre-Layer-1 signals that don't preserve classification headers need a Gmail re-fetch (separate sub-issue, lower priority).
- Gmail connector `messageToSignal` now also stamps `to`, `cc`, `inReplyTo`, `listUnsubscribe` on `signal.data` so future reclassification has source data. No behavior change to the existing classifier path; just preserves raw inputs.
- In-memory adapter mirror for tests.

Scheduling:
- Worker runs the job hourly (`TIER_BACKFILL_INTERVAL_MS = 60 * 60 * 1000`). Idempotent: once a corpus is fully tagged the find query returns 0 rows and the pass becomes a no-op.
- Batch size 200 per pass, plenty for any reasonable mailbox to converge over a few hours.

Tests: 9 worker, 4 adapter. All green. 70/70 turbo tasks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Adds an hourly worker backfill to retroactively populate brain_pages.metadata.authoringTier (and fromAddress) for pages indexed before authoring-tier stamping existed, and persists the Gmail classifier’s raw header inputs onto brain_signals.data so future reclassification is possible without re-fetching from Gmail.

Changes:

Add findPagesMissingAuthoringTier(userId|null, limit) to the CRDB adapter (plus in-memory mirror + tests) to locate tier-missing pages with backing signals.
Add runTierBackfillJob worker job and schedule it hourly from the main worker loop.
Extend Gmail messageToSignal to persist to, cc, inReplyTo, and listUnsubscribe into signal.data for downstream reclassification.
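To make the connector change concrete, the sketch below shows one way the four raw headers could be pulled from a Gmail message's header list (the Gmail API exposes them as name/value pairs). `extractClassificationHeaders` is a hypothetical helper, not the code in gmail-connector.ts, and storing each value as the raw header string is an assumption; the connector may parse or normalize addresses before persisting them.

```typescript
// Gmail API messages expose headers as { name, value } pairs.
interface GmailHeader {
  name: string;
  value: string;
}

// RFC header name (lowercased) -> field name stamped on signal.data per the PR.
const HEADER_TO_FIELD: ReadonlyArray<readonly [string, string]> = [
  ["to", "to"],
  ["cc", "cc"],
  ["in-reply-to", "inReplyTo"],
  ["list-unsubscribe", "listUnsubscribe"],
];

function extractClassificationHeaders(headers: GmailHeader[]): Record<string, string> {
  // Header names are case-insensitive, so index them lowercased.
  const lookup = new Map<string, string>();
  for (const header of headers) {
    const key = header.name.toLowerCase();
    if (!lookup.has(key)) lookup.set(key, header.value); // keep the first occurrence
  }

  // Only emit fields that were actually present on the message.
  const fields: Record<string, string> = {};
  for (const [headerName, field] of HEADER_TO_FIELD) {
    const value = lookup.get(headerName);
    if (value !== undefined) fields[field] = value;
  }
  return fields;
}
```

Omitting absent headers, rather than writing empty strings, lets a later reclassification pass distinguish "header not present" from "header present but empty".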

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/memory-gbrain-crdb-adapter/src/repository.ts | Adds `findPagesMissingAuthoringTier` query + row shaping for worker consumption. |
| packages/memory-gbrain-crdb-adapter/src/index.ts | Re-exports the new repository function and row type. |
| packages/memory-gbrain-crdb-adapter/src/in-memory-repository.ts | Adds in-memory mirror implementation of `findPagesMissingAuthoringTier`. |
| packages/memory-gbrain-crdb-adapter/src/tests/in-memory-repository.test.ts | Tests for the in-memory `findPagesMissingAuthoringTier` behavior (scope/limit/etc.). |
| packages/connectors/src/gmail-connector.ts | Persists raw classifier header inputs onto `signal.data`. |
| CHANGELOG.md | Documents the new backfill worker and connector data shape additions. |
| apps/worker/src/jobs/tier-backfill.ts | New backfill job implementing “trust signal tier” then “reclassify from headers” paths. |
| apps/worker/src/index.ts | Schedules the tier backfill job hourly in the worker loop. |
| apps/worker/src/tests/tier-backfill.test.ts | Unit tests for the worker job’s classification/update/error paths. |


jayzalowitz · 2026-05-12T23:26:27Z

+    user_id: row.user_id,
+    signal_data:
+      typeof row.signal_data === 'string'
+        ? (JSON.parse(row.signal_data) as Record<string, unknown>)


Addressed in 0727fb6. Switched to the file-local parseJson helper (same one used by parsePageRow / parseSettingsRow / parseEntityRow etc.). Returns null on JSON.parse failure; coerce to {} so a malformed signal row just logs as "unreclassifiable" instead of crashing the worker pass.
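The failure mode and the fix can be illustrated with a small sketch. `safeParseJson` and `toSignalData` are hypothetical stand-ins: the body of the repository's file-local parseJson helper is not shown in this thread, so this only mirrors the described behavior (null on parse failure, coerced to `{}` by the caller).

```typescript
// Narrow an unknown value to a plain object, rejecting null and arrays.
function asRecord(value: unknown): Record<string, unknown> | null {
  return value !== null && typeof value === "object" && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : null;
}

// Hypothetical stand-in for the file-local parseJson helper:
// returns null instead of throwing when the input is malformed.
function safeParseJson(value: unknown): Record<string, unknown> | null {
  if (typeof value !== "string") {
    return asRecord(value); // driver already decoded the JSONB column
  }
  try {
    return asRecord(JSON.parse(value));
  } catch {
    return null; // malformed JSON: report, don't crash the pass
  }
}

// Coerce to {} so a bad row flows through as "unreclassifiable".
function toSignalData(raw: unknown): Record<string, unknown> {
  return safeParseJson(raw) ?? {};
}
```

With this shape, one corrupt signal row costs a single unreclassifiable count instead of aborting the remaining rows in the batch.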

jayzalowitz · 2026-05-12T23:26:28Z

+ * those came from a non-signal write path (episode, entity) and don't
+ * have classifiable email headers.
+ *
+ * Limit is mandatory and caps the worker's per-pass work — a thousand


Addressed in 0727fb6. Updated the comment to match — "the worker's default batch size is 200; callers can pass any value (lower for tests, higher if catching up a large back-catalog manually)."

jayzalowitz · 2026-05-12T23:26:29Z

+      patch['fromAddress'] = result.fromAddress;
+    }
+    try {
+      await updatePageMetadata(row.user_id, row.page_id, patch);


Addressed in 0727fb6. The worker now checks the return value — affected === 0 is treated as a failure (race where the page was deleted between find + update, or ownership mismatch). Bumps summary.failed, logs with pageId/userId, skips the success counters so the report no longer silently overcounts. New unit test (counts updatePageMetadata returning 0 affected rows as failed) covers the path.
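A minimal sketch of the accounting rule described in this reply, assuming a summary object with the counter names mentioned in the thread (`copiedFromSignal`, `reclassified`, `failed`). The `recordUpdate` helper and the `BackfillSummary` interface are hypothetical; the real job also logs the pageId/userId on failure, which is omitted here.

```typescript
// Counter names taken from the thread; the interface itself is an assumption.
interface BackfillSummary {
  copiedFromSignal: number;
  reclassified: number;
  unreclassifiable: number;
  failed: number;
}

type SuccessPath = "copiedFromSignal" | "reclassified";

// Record the outcome of one metadata update.
// `affected` is the row count returned by the update call.
function recordUpdate(summary: BackfillSummary, path: SuccessPath, affected: number): boolean {
  if (affected === 0) {
    // Page vanished between find and update, or ownership mismatch:
    // count it as a failure and do NOT bump the success counter.
    summary.failed += 1;
    return false;
  }
  summary[path] += 1;
  return true;
}
```

Returning a boolean lets the caller decide whether to log the failure with page and user identifiers without coupling the counter logic to a logger.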

Three findings on the backfill worker, all valid:

1. findPagesMissingAuthoringTier did a bare `JSON.parse(row.signal_data)` when the driver returned JSONB as a string. One malformed signal row would have thrown and tanked the whole worker pass. Switched to the file-local `parseJson` helper (the same one parsePageRow / parseSettingsRow / etc. use), which returns null on parse failure; coerce to {} so the worker logs the row as "unreclassifiable" and keeps going.
2. Doc comment claimed "a thousand pages per cycle is the default in the worker" but the actual default is 200. Updated.
3. The worker was discarding updatePageMetadata's affected-row count. A 0 return (page disappeared between find + update, or ownership mismatch) was getting counted as a successful copy/reclass, a silent data lie. Now treated as failed: incremented `summary.failed`, logged with pageId/userId, no copiedFromSignal/reclassified bump. New unit test covers the race path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Cross-references the 12 PRs that landed across #251 Layer 1+2+follow-ups (authoring tier, tier-weighted retrieval, pin/hide, backfill, real-embedding ablation), #193 Lifebook follow-ups (capabilities filter, provenance wing filter, per-Lifebook briefing), #179 mobile voice, and #187 AC#4 (Piper TTS) against the project's user-facing docs.

README.md:
- Version badge 0.6.17.0 → 0.6.21.0
- Package/app count "14 packages and 6 apps" → "29 packages and 7 apps"
- Project Status reflects the v0.6 series (embedded LLM, tier-aware memory, per-Lifebook surfaces, voice loop)
- "What works today" adds mobile voice capture + the on-device embedded LLM stack (llama.cpp / whisper.cpp / Piper TTS) with the /api/voice/transcribe and /api/voice/synthesize endpoints

CLAUDE.md:
- llm-client row notes the `embedded` provider and the estimateLlmCostCents() helper
- New embedded-llm row covers llama.cpp / whisper.cpp / Piper TTS
- connectors row notes the AuthoringTier classifier (#251 Layer 1)
- memory-gbrain-crdb-adapter row notes Layer 2 tier-weighted RRF scoring, pin/hide controls (#270), and the backfill worker (#271)
- mobile app row notes voice capture via expo-audio + the desktop transcribe round-trip
- New twin-mcp-server app row

No CHANGELOG changes: each PR's entry was authored by /ship and covers its own slice accurately.
No TODOS.md changes: the two open P3s (real production tour mode, multi-instance demo rate limiting) remain blocked on the same product decisions; nothing in this sweep closes them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


Copilot AI review requested due to automatic review settings May 12, 2026 23:20

Copilot started reviewing on behalf of jayzalowitz May 12, 2026 23:21

Copilot AI reviewed May 12, 2026

jayzalowitz merged commit 9519394 into main May 12, 2026
8 checks passed

This was referenced May 13, 2026

feat(#251 follow-up): real-embedding ablation result + opt-in test #272

Merged

docs: sync README + CLAUDE.md with v0.6.18-0.6.21 merge sweep #273

Merged

jayzalowitz mentioned this pull request May 18, 2026

Epic: Capability Acquisition Loop — an MCP-native autonomous twin (OSS launch v1) #195

Open

jayzalowitz merged 2 commits into main from jayzalowitz/251-backfill
