feat(#251 follow-up): authoring-tier backfill worker #271
Pages indexed before Layer 1 of #251 had no `authoringTier` on metadata, which silently disabled Layer 2 for their corpora. This adds a worker that fills in the tier retroactively, and the connector now persists the raw classification headers so reclassification works going forward.

Engine:
- New adapter helper `findPagesMissingAuthoringTier(userId|null, limit)` joins brain_pages ↔ brain_signals via `source_ref = id` and filters on pages where `metadata->>'authoringTier' IS NULL`, with an optional user scope.
- `apps/worker/src/jobs/tier-backfill.ts`: the job. Two reclassification paths:
  1. Trust the signal: copy `signal.data.authoringTier` to page metadata when it exists (post-#252 paths that bypassed the metadata projection for any reason).
  2. Reclassify: run the classifier locally on the raw `to` / `cc` / `inReplyTo` / `listUnsubscribe` / `listId` / `labels` headers.

  Pages whose signal supports neither path are counted as "unreclassifiable" and left alone. Pre-Layer-1 signals that don't preserve classification headers need a Gmail re-fetch (separate sub-issue, lower priority).
- Gmail connector `messageToSignal` now also stamps `to`, `cc`, `inReplyTo`, `listUnsubscribe` on `signal.data` so future reclassification has source data. No behavior change to the existing classifier path; it just preserves raw inputs.
- In-memory adapter mirror for tests.

Scheduling:
- Worker runs the job hourly (`TIER_BACKFILL_INTERVAL_MS = 60 * 60 * 1000`). Idempotent: once a corpus is fully tagged the find query returns 0 rows and the pass becomes a no-op.
- Batch size 200 per pass, plenty for any reasonable mailbox to converge over a few hours.

Tests: 9 worker, 4 adapter. All green. 70/70 turbo tasks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds an hourly worker backfill to retroactively populate brain_pages.metadata.authoringTier (and fromAddress) for pages indexed before authoring-tier stamping existed, and persists the Gmail classifier’s raw header inputs onto brain_signals.data so future reclassification is possible without re-fetching from Gmail.
Changes:
- Add `findPagesMissingAuthoringTier(userId|null, limit)` to the CRDB adapter (plus in-memory mirror + tests) to locate tier-missing pages with backing signals.
- Add `runTierBackfillJob` worker job and schedule it hourly from the main worker loop.
- Extend Gmail `messageToSignal` to persist `to`, `cc`, `inReplyTo`, and `listUnsubscribe` into `signal.data` for downstream reclassification.
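As a rough illustration of the lookup described in the first bullet, here is a minimal in-memory sketch. It is not the adapter's code: the function name `findTierMissingInMemory`, the `PageRow` / `SignalRow` shapes, and the choice to pass the tables in as arrays are all assumptions made for the example. Only the join on `source_ref = id`, the tier-missing filter, the optional user scope, the limit, and the `{ page_id, user_id, signal_data }` result shape come from the PR.

```typescript
// Hypothetical row shapes for the sketch; the real tables have more columns.
interface PageRow {
  id: string;
  user_id: string;
  source_ref: string | null;
  metadata: Record<string, unknown>;
}

interface SignalRow {
  id: string;
  data: Record<string, unknown>;
}

interface TierMissingRow {
  page_id: string;
  user_id: string;
  signal_data: Record<string, unknown>;
}

// Assumed stand-in for the adapter helper: returns pages that have no
// authoringTier on metadata AND a backing signal, optionally scoped to a user.
function findTierMissingInMemory(
  pages: PageRow[],
  signals: SignalRow[],
  userId: string | null,
  limit: number,
): TierMissingRow[] {
  const signalsById = new Map<string, SignalRow>();
  for (const signal of signals) signalsById.set(signal.id, signal);

  const out: TierMissingRow[] = [];
  for (const page of pages) {
    if (out.length >= limit) break;
    // Optional user scope: null means "all users".
    if (userId !== null && page.user_id !== userId) continue;
    // Mirrors `metadata->>'authoringTier' IS NULL`: missing key or null value.
    if (page.metadata['authoringTier'] != null) continue;
    // Inner-join semantics: pages without a backing signal are skipped.
    if (page.source_ref === null) continue;
    const signal = signalsById.get(page.source_ref);
    if (signal === undefined) continue;
    out.push({ page_id: page.id, user_id: page.user_id, signal_data: signal.data });
  }
  return out;
}
```

Because a page drops out of the result as soon as its tier is written, repeated calls over a fully tagged corpus return an empty array, which is what makes the hourly pass a no-op.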
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/memory-gbrain-crdb-adapter/src/repository.ts | Adds findPagesMissingAuthoringTier query + row shaping for worker consumption. |
| packages/memory-gbrain-crdb-adapter/src/index.ts | Re-exports the new repository function and row type. |
| packages/memory-gbrain-crdb-adapter/src/in-memory-repository.ts | Adds in-memory mirror implementation of findPagesMissingAuthoringTier. |
| packages/memory-gbrain-crdb-adapter/src/tests/in-memory-repository.test.ts | Tests for the in-memory findPagesMissingAuthoringTier behavior (scope/limit/etc.). |
| packages/connectors/src/gmail-connector.ts | Persists raw classifier header inputs onto signal.data. |
| CHANGELOG.md | Documents the new backfill worker and connector data shape additions. |
| apps/worker/src/jobs/tier-backfill.ts | New backfill job implementing “trust signal tier” then “reclassify from headers” paths. |
| apps/worker/src/index.ts | Schedules the tier backfill job hourly in the worker loop. |
| apps/worker/src/tests/tier-backfill.test.ts | Unit tests for the worker job’s classification/update/error paths. |
    user_id: row.user_id,
    signal_data:
      typeof row.signal_data === 'string'
        ? (JSON.parse(row.signal_data) as Record<string, unknown>)
Addressed in 0727fb6. Switched to the file-local parseJson helper (same one used by parsePageRow / parseSettingsRow / parseEntityRow etc.). Returns null on JSON.parse failure; coerce to {} so a malformed signal row just logs as "unreclassifiable" instead of crashing the worker pass.
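The failure-tolerant parsing described in this reply can be sketched roughly as follows. The repository's `parseJson` helper is not shown in the PR, so `parseJsonOrNull` and `toSignalData` below are hypothetical stand-ins that only reproduce the stated behavior: null on a `JSON.parse` failure, then coercion to `{}`.

```typescript
// Hypothetical stand-in for the file-local helper: never throws,
// returns null when the input is not valid JSON.
function parseJsonOrNull(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    return null;
  }
}

// Normalizes a JSONB column that the driver may return either as a string
// or as an already-parsed object. Anything unusable becomes {} so the
// caller can treat the row as having no classification data.
function toSignalData(raw: unknown): Record<string, unknown> {
  const parsed = typeof raw === 'string' ? parseJsonOrNull(raw) : raw;
  if (parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed)) {
    return parsed as Record<string, unknown>;
  }
  return {};
}
```

With this shape, a malformed row yields an empty object, so it would fall through to the "unreclassifiable" count instead of aborting the pass.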
     * those came from a non-signal write path (episode, entity) and don't
     * have classifiable email headers.
     *
     * Limit is mandatory and caps the worker's per-pass work — a thousand
Addressed in 0727fb6. Updated the comment to match — "the worker's default batch size is 200; callers can pass any value (lower for tests, higher if catching up a large back-catalog manually)."
      patch['fromAddress'] = result.fromAddress;
    }
    try {
      await updatePageMetadata(row.user_id, row.page_id, patch);
Addressed in 0727fb6. The worker now checks the return value — affected === 0 is treated as a failure (race where the page was deleted between find + update, or ownership mismatch). Bumps summary.failed, logs with pageId/userId, skips the success counters so the report no longer silently overcounts. New unit test (counts updatePageMetadata returning 0 affected rows as failed) covers the path.
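The accounting rule in this reply can be illustrated with a small pure helper. This is a sketch under assumptions: `BackfillSummary` only lists the counters named in the PR (`copiedFromSignal`, `reclassified`, `failed`), and `recordUpdateResult` is a hypothetical function that receives the affected-row count after the update has already been awaited. Logging and the surrounding try/catch are omitted.

```typescript
// Counters named in the PR; the real summary may carry more fields.
interface BackfillSummary {
  copiedFromSignal: number;
  reclassified: number;
  failed: number;
}

type SuccessPath = 'copiedFromSignal' | 'reclassified';

// Hypothetical helper: decides which counter to bump from the number of
// rows the metadata update reported as affected. Returns true on success.
function recordUpdateResult(
  summary: BackfillSummary,
  path: SuccessPath,
  affected: number,
): boolean {
  if (affected === 0) {
    // The page vanished between find and update, or the owner did not match:
    // count it as a failure and leave the success counters untouched.
    summary.failed += 1;
    return false;
  }
  summary[path] += 1;
  return true;
}
```

Keeping the decision in one place means both paths (copy and reclassify) share the same rule, so neither can overcount on a zero-row update.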
Three findings on the backfill worker, all valid:
1. findPagesMissingAuthoringTier did a bare `JSON.parse(row.signal_data)`
when the driver returned JSONB as a string. One malformed signal row
would have thrown and tanked the whole worker pass. Switched to the
file-local `parseJson` helper (the same one parsePageRow / parseSettingsRow
/ etc. use) — returns null on parse failure; coerce to {} so the
worker logs the row as "unreclassifiable" and keeps going.
2. Doc comment claimed "a thousand pages per cycle is the default in the
worker" but the actual default is 200. Updated.
3. The worker was discarding updatePageMetadata's affected-row count.
A 0 return (page disappeared between find + update, or ownership
mismatch) was getting counted as a successful copy/reclass — silent
data lie. Now treated as failed: incremented `summary.failed`,
logged with pageId/userId, no copiedFromSignal/reclassified bump.
New unit test covers the race path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-references the 12 PRs that landed across #251 Layer 1+2+follow-ups (authoring tier, tier-weighted retrieval, pin/hide, backfill, real-embedding ablation), #193 Lifebook follow-ups (capabilities filter, provenance wing filter, per-Lifebook briefing), #179 mobile voice, and #187 AC#4 (Piper TTS) against the project's user-facing docs.

README.md:
- Version badge 0.6.17.0 → 0.6.21.0
- Package/app count "14 packages and 6 apps" → "29 packages and 7 apps"
- Project Status reflects the v0.6 series (embedded LLM, tier-aware memory, per-Lifebook surfaces, voice loop)
- "What works today" adds mobile voice capture + the on-device embedded LLM stack (llama.cpp / whisper.cpp / Piper TTS) with the /api/voice/transcribe and /api/voice/synthesize endpoints

CLAUDE.md:
- llm-client row notes the `embedded` provider and the estimateLlmCostCents() helper
- New embedded-llm row covers llama.cpp / whisper.cpp / Piper TTS
- connectors row notes the AuthoringTier classifier (#251 Layer 1)
- memory-gbrain-crdb-adapter row notes Layer 2 tier-weighted RRF scoring, pin/hide controls (#270), and the backfill worker (#271)
- mobile app row notes voice capture via expo-audio + the desktop transcribe round-trip
- New twin-mcp-server app row

No CHANGELOG changes: each PR's entry was authored by /ship and covers its own slice accurately.

No TODOS.md changes: the two open P3s (real production tour mode, multi-instance demo rate limiting) remain blocked on the same product decisions; nothing in this sweep closes them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Pages indexed before #252 (Layer 1) have no `authoringTier` on metadata, which silently disables Layer 2 for their corpora: the multiplier reads `metadata.authoringTier` and there's nothing to read. This adds a worker that fills it in retroactively, and the connector now persists the raw classification headers so reclassification works for every new signal going forward.

Closes (partial): #251 backfill follow-up. Pre-Layer-1 signals that don't carry classification headers stay untagged after this; full recovery for those needs a Gmail re-fetch (separate sub-issue, lower priority).
What runs now
The worker schedules `runTierBackfillJob` every hour. Each pass:
1. Queries `brain_pages` for rows where `metadata->>'authoringTier' IS NULL`, joins on `brain_signals` via `source_ref`, returns up to 200 pairs.
2. Copies `signal.data.authoringTier` to page metadata when it already exists. Cheap, lossless, same tier the connector produced at ingest time.
3. Otherwise runs `classifyEmailAuthoringTier` locally on the raw `to` / `cc` / `inReplyTo` / `listUnsubscribe` / `listId` / `labels` headers stored in `signal.data`.
4. Writes the result via `updatePageMetadata` (sets both `authoringTier` and a normalized `fromAddress` for the per-sender bulk-hide action shipped in PR #270, tier-aware privacy controls: pin / hide / hide-sender).

Idempotent: re-running on a fully-tagged corpus returns 0 from the find query and the pass is a no-op.
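The per-row decision in a pass can be sketched as a pure function. This is an illustration under assumptions, not the job's code: `decideTier`, the `Decision` union, and the `Classifier` signature are hypothetical, and the classifier is injected because the real signature of `classifyEmailAuthoringTier` is not shown in the PR. The header key names are the ones listed above.

```typescript
// Header fields the PR says are stored on signal.data for reclassification.
const HEADER_KEYS = ['to', 'cc', 'inReplyTo', 'listUnsubscribe', 'listId', 'labels'] as const;

type Decision =
  | { kind: 'copiedFromSignal'; tier: string }
  | { kind: 'reclassified'; tier: string }
  | { kind: 'unreclassifiable' };

// Assumed classifier shape: takes whatever headers survived, returns a tier
// or null when it cannot decide. The real classifier may differ.
type Classifier = (headers: Record<string, unknown>) => string | null;

function decideTier(signalData: Record<string, unknown>, classify: Classifier): Decision {
  // Path 1: trust the tier the connector already stamped on the signal.
  const existing = signalData['authoringTier'];
  if (typeof existing === 'string' && existing.length > 0) {
    return { kind: 'copiedFromSignal', tier: existing };
  }

  // Path 2: reclassify from whichever raw headers the signal preserved.
  const headers: Record<string, unknown> = {};
  for (const key of HEADER_KEYS) {
    if (signalData[key] != null) headers[key] = signalData[key];
  }
  if (Object.keys(headers).length === 0) {
    // Pre-Layer-1 signal with no headers: leave the page alone.
    return { kind: 'unreclassifiable' };
  }
  const tier = classify(headers);
  return tier === null ? { kind: 'unreclassifiable' } : { kind: 'reclassified', tier };
}
```

Separating the decision from the database write keeps each path easy to unit test with a stub classifier, and the caller only has to map the three outcomes onto its summary counters.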
Engine changes
- `findPagesMissingAuthoringTier(userId | null, limit)`: new adapter helper. JOIN `brain_pages` ↔ `brain_signals` on `source_ref = id`, filter on tier-missing, optional user scope. Returns `{ page_id, user_id, signal_data }[]`.
- `apps/worker/src/jobs/tier-backfill.ts`: new worker job, scheduled in `apps/worker/src/index.ts` at `TIER_BACKFILL_INTERVAL_MS = 60 * 60 * 1000`. Bounded by `batchSize` (default 200).
- `messageToSignal` now also stamps `to`, `cc`, `inReplyTo`, `listUnsubscribe` on `signal.data`. The classifier already consumed these; now they're preserved in the signal row for future reclassification.

Tests
- 9 worker tests (`tier-backfill.test.ts`) cover signal-tier copy, header reclassification (SENT label + List-Unsubscribe → newsletter), unreclassifiable count, failed-update isolation, find-query throw → empty summary, fromAddress omission when missing, userId scope, default null scope.
- 4 adapter tests for `findPagesMissingAuthoringTier`: tier-present page excluded, signal-missing page skipped, userId scoping, limit cap.

Test plan
- `pnpm build --concurrency=1` → 35/35 packages.
- `pnpm test` → 70/70 turbo tasks green.
- `pnpm --filter @skytwin/worker test -- tier-backfill` → 9 pass.
- `pnpm --filter @skytwin/memory-gbrain-crdb-adapter test` → 80 pass / 6 skipped (DB-gated).

Deferred
- Pre-Layer-1 signals that don't carry classification headers are counted as `unreclassifiable`. Recovering their tier needs an OAuth-token-dependent re-fetch from Gmail, tracked as a separate sub-issue.

🤖 Generated with Claude Code