
feat(#251 follow-up): authoring-tier backfill worker #271

Merged
jayzalowitz merged 2 commits into main from jayzalowitz/251-backfill
May 12, 2026


@jayzalowitz

Owner

Summary

Pages indexed before #252 (Layer 1) have no authoringTier on metadata, which silently disables Layer 2 for their corpora — the multiplier reads metadata.authoringTier and there's nothing to read. This adds a worker that fills it in retroactively, plus the connector now persists the raw classification headers so reclassification works for every new signal going forward.

Closes (partial): #251 backfill follow-up. Pre-Layer-1 signals that don't carry classification headers stay untagged after this — full recovery for those needs a Gmail re-fetch (separate sub-issue, lower priority).

What runs now

The worker schedules runTierBackfillJob every hour. Each pass:

  1. Queries brain_pages for rows where metadata->>'authoringTier' IS NULL, joins on brain_signals via source_ref, returns up to 200 pairs.
  2. For each page, tries two reclassification paths in order:
    • Trust the signal — copy signal.data.authoringTier to page metadata when it already exists. Cheap, lossless, same tier the connector produced at ingest time.
    • Reclassify — run classifyEmailAuthoringTier locally on the raw to / cc / inReplyTo / listUnsubscribe / listId / labels headers stored in signal.data.
  3. Writes the result via updatePageMetadata, setting both authoringTier and a normalized fromAddress. The latter feeds the per-sender bulk-hide action shipped in PR #270 (tier-aware privacy controls: pin / hide / hide-sender).
  4. Logs an "unreclassifiable" count for signals carrying neither path and leaves them alone.

Idempotent: re-running on a fully-tagged corpus returns 0 from the find query and the pass is a no-op.
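
The two-path decision in step 2 can be sketched as a pure function. This is only an illustration under stated assumptions: `resolveTier`, the `Resolution` type, and the rule "no raw headers present means unreclassifiable" are hypothetical, and the classifier is passed in as a parameter because the real signature of classifyEmailAuthoringTier is not shown in this PR.

```typescript
// Hypothetical shapes; the real types in tier-backfill.ts may differ.
type SignalData = Record<string, unknown>;

type Resolution =
  | { path: "copiedFromSignal"; tier: string }
  | { path: "reclassified"; tier: string }
  | { path: "unreclassifiable" };

// Header keys the PR description says the classifier reads from signal.data.
const HEADER_KEYS = ["to", "cc", "inReplyTo", "listUnsubscribe", "listId", "labels"] as const;

// Stand-in for classifyEmailAuthoringTier (its real signature is an assumption here).
type Classifier = (headers: SignalData) => string;

function resolveTier(data: SignalData, classify: Classifier): Resolution {
  // Path 1: trust the signal when it already carries a tier.
  const existing = data["authoringTier"];
  if (typeof existing === "string" && existing.length > 0) {
    return { path: "copiedFromSignal", tier: existing };
  }

  // Path 2: reclassify from whichever raw headers were persisted.
  const headers: SignalData = {};
  for (const key of HEADER_KEYS) {
    const value = data[key];
    if (value !== undefined && value !== null) headers[key] = value;
  }

  // Assumption: a signal with no persisted headers cannot be reclassified.
  if (Object.keys(headers).length === 0) return { path: "unreclassifiable" };
  return { path: "reclassified", tier: classify(headers) };
}

// Usage with a stub classifier (the non-newsletter tier name is made up for the sketch).
const stubClassifier: Classifier = (h) =>
  h["listUnsubscribe"] !== undefined ? "newsletter" : "stub-default";

const copied = resolveTier({ authoringTier: "newsletter" }, stubClassifier);
const reclassified = resolveTier({ labels: ["SENT"], listUnsubscribe: "<mailto:x@example.com>" }, stubClassifier);
const skipped = resolveTier({ subject: "no headers stored" }, stubClassifier);
```

Keeping the decision separate from the database write makes each path easy to unit-test without a repository.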

Engine changes

  • findPagesMissingAuthoringTier(userId | null, limit): new adapter helper. JOINs brain_pages ↔ brain_signals on source_ref = id, filters on tier-missing, optional user scope. Returns { page_id, user_id, signal_data }[].
  • apps/worker/src/jobs/tier-backfill.ts: new worker job, scheduled in apps/worker/src/index.ts at TIER_BACKFILL_INTERVAL_MS = 60 * 60 * 1000. Bounded by batchSize (default 200).
  • Gmail connector messageToSignal now also stamps to, cc, inReplyTo, listUnsubscribe on signal.data. The classifier already consumed these; now they're preserved in the signal row for future reclassification.
  • In-memory mirror of the find query for tests.
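
For orientation, an in-memory version of the find query might look roughly like the sketch below. It is an assumption-laden illustration, not the code in in-memory-repository.ts: the `Page` and `Signal` shapes are hypothetical, and the stores are passed as extra arguments so the block is self-contained (the real helper takes only `userId` and `limit`).

```typescript
// Hypothetical in-memory shapes; the real repository types may differ.
interface Page {
  id: string;
  userId: string;
  sourceRef: string | null;
  metadata: Record<string, unknown>;
}

interface Signal {
  id: string;
  data: Record<string, unknown>;
}

// Row shape named in the PR description.
interface MissingTierRow {
  page_id: string;
  user_id: string;
  signal_data: Record<string, unknown>;
}

function findPagesMissingAuthoringTier(
  pages: Page[],
  signals: Map<string, Signal>,
  userId: string | null,
  limit: number,
): MissingTierRow[] {
  const rows: MissingTierRow[] = [];
  for (const page of pages) {
    if (rows.length >= limit) break; // cap the per-pass work
    if (userId !== null && page.userId !== userId) continue; // optional user scope
    // Mirrors `metadata->>'authoringTier' IS NULL`: key absent or JSON null.
    if (page.metadata["authoringTier"] != null) continue;
    // Mirrors the inner join: pages without a backing signal are skipped.
    const signal = page.sourceRef === null ? undefined : signals.get(page.sourceRef);
    if (signal === undefined) continue;
    rows.push({ page_id: page.id, user_id: page.userId, signal_data: signal.data });
  }
  return rows;
}

// Small fixture covering the four cases the adapter tests name.
const signalStore = new Map<string, Signal>([
  ["s1", { id: "s1", data: { labels: ["SENT"] } }],
  ["s2", { id: "s2", data: { to: "a@example.com" } }],
  ["s3", { id: "s3", data: {} }],
]);
const pageStore: Page[] = [
  { id: "p1", userId: "u1", sourceRef: "s1", metadata: {} },
  { id: "p2", userId: "u1", sourceRef: "s2", metadata: { authoringTier: "newsletter" } },
  { id: "p3", userId: "u2", sourceRef: "s3", metadata: {} },
  { id: "p4", userId: "u1", sourceRef: "missing", metadata: {} },
];

const allUsers = findPagesMissingAuthoringTier(pageStore, signalStore, null, 200);
const onlyU2 = findPagesMissingAuthoringTier(pageStore, signalStore, "u2", 200);
const capped = findPagesMissingAuthoringTier(pageStore, signalStore, null, 1);
```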

Tests

  • 9 new worker unit tests (tier-backfill.test.ts) cover signal-tier copy, header reclassification (SENT label + List-Unsubscribe → newsletter), unreclassifiable count, failed-update isolation, find-query throw → empty summary, fromAddress omission when missing, userId scope, default null scope.
  • 4 new in-memory repository tests on findPagesMissingAuthoringTier: tier-present page excluded, signal-missing page skipped, userId scoping, limit cap.

Test plan

  • pnpm build --concurrency=1 → 35/35 packages.
  • pnpm test → 70/70 turbo tasks green.
  • pnpm --filter @skytwin/worker test -- tier-backfill → 9 pass.
  • pnpm --filter @skytwin/memory-gbrain-crdb-adapter test → 80 pass / 6 skipped (DB-gated).

Deferred

🤖 Generated with Claude Code

Pages indexed before Layer 1 of #251 had no `authoringTier` on metadata,
which silently disabled Layer 2 for their corpora. This adds a worker
that fills in the tier retroactively, plus the connector now persists
the raw classification headers so reclassification works going forward.

Engine:
- New adapter helper `findPagesMissingAuthoringTier(userId|null, limit)`
  joins brain_pages ↔ brain_signals via `source_ref = id`, filters on
  pages where `metadata->>'authoringTier' IS NULL`, optional user scope.
- `apps/worker/src/jobs/tier-backfill.ts`: the job. Two reclassification
  paths:
    1. Trust the signal — copy `signal.data.authoringTier` to page
       metadata when it exists (post-#252 paths that bypassed the
       metadata projection for any reason).
    2. Reclassify — run the classifier locally on the raw `to` / `cc` /
       `inReplyTo` / `listUnsubscribe` / `listId` / `labels` headers.
  Pages whose signal carries neither path are counted as
  "unreclassifiable" and left alone — pre-Layer-1 signals that don't
  preserve classification headers need a Gmail re-fetch (separate
  sub-issue, lower priority).
- Gmail connector `messageToSignal` now also stamps `to`, `cc`,
  `inReplyTo`, `listUnsubscribe` on `signal.data` so future
  reclassification has source data. No behavior change to the existing
  classifier path; just preserves raw inputs.
- In-memory adapter mirror for tests.

Scheduling:
- Worker runs the job hourly (`TIER_BACKFILL_INTERVAL_MS = 60 * 60 *
  1000`). Idempotent: once a corpus is fully tagged the find query
  returns 0 rows and the pass becomes a no-op.
- Batch size 200 per pass, plenty for any reasonable mailbox to
  converge over a few hours.

Tests: 9 worker, 4 adapter. All green. 70/70 turbo tasks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 23:20

Copilot AI left a comment


Pull request overview

Adds an hourly worker backfill to retroactively populate brain_pages.metadata.authoringTier (and fromAddress) for pages indexed before authoring-tier stamping existed, and persists the Gmail classifier’s raw header inputs onto brain_signals.data so future reclassification is possible without re-fetching from Gmail.

Changes:

  • Add findPagesMissingAuthoringTier(userId|null, limit) to the CRDB adapter (plus in-memory mirror + tests) to locate tier-missing pages with backing signals.
  • Add runTierBackfillJob worker job and schedule it hourly from the main worker loop.
  • Extend Gmail messageToSignal to persist to, cc, inReplyTo, and listUnsubscribe into signal.data for downstream reclassification.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/memory-gbrain-crdb-adapter/src/repository.ts | Adds findPagesMissingAuthoringTier query + row shaping for worker consumption. |
| packages/memory-gbrain-crdb-adapter/src/index.ts | Re-exports the new repository function and row type. |
| packages/memory-gbrain-crdb-adapter/src/in-memory-repository.ts | Adds in-memory mirror implementation of findPagesMissingAuthoringTier. |
| packages/memory-gbrain-crdb-adapter/src/tests/in-memory-repository.test.ts | Tests for the in-memory findPagesMissingAuthoringTier behavior (scope/limit/etc.). |
| packages/connectors/src/gmail-connector.ts | Persists raw classifier header inputs onto signal.data. |
| CHANGELOG.md | Documents the new backfill worker and connector data shape additions. |
| apps/worker/src/jobs/tier-backfill.ts | New backfill job implementing “trust signal tier” then “reclassify from headers” paths. |
| apps/worker/src/index.ts | Schedules the tier backfill job hourly in the worker loop. |
| apps/worker/src/tests/tier-backfill.test.ts | Unit tests for the worker job’s classification/update/error paths. |


user_id: row.user_id,
signal_data:
typeof row.signal_data === 'string'
? (JSON.parse(row.signal_data) as Record<string, unknown>)

Owner Author

Addressed in 0727fb6. Switched to the file-local parseJson helper (same one used by parsePageRow / parseSettingsRow / parseEntityRow etc.). Returns null on JSON.parse failure; coerce to {} so a malformed signal row just logs as "unreclassifiable" instead of crashing the worker pass.
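
A minimal sketch of the tolerant parsing described in this reply, for readers without the diff open. The names `parseJsonOrNull` and `toSignalData` are hypothetical stand-ins; the repository's actual parseJson helper is not shown in this thread and may differ in detail.

```typescript
// Hypothetical stand-in for the file-local parseJson helper: null on failure.
function parseJsonOrNull(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    return null;
  }
}

// Coerce whatever the driver returned (string or already-parsed JSONB) into a
// plain object, falling back to {} so a bad row is skipped rather than thrown on.
function toSignalData(raw: unknown): Record<string, unknown> {
  const parsed = typeof raw === "string" ? parseJsonOrNull(raw) : raw;
  if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
    return parsed as Record<string, unknown>;
  }
  return {};
}

const fromString = toSignalData('{"to":"a@example.com"}');
const fromMalformed = toSignalData("{not json");
const fromObject = toSignalData({ labels: ["SENT"] });
```

With the `{}` fallback, a malformed row carries neither a tier nor headers, so it lands in the "unreclassifiable" count instead of aborting the pass.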

* those came from a non-signal write path (episode, entity) and don't
* have classifiable email headers.
*
* Limit is mandatory and caps the worker's per-pass work — a thousand

Owner Author

Addressed in 0727fb6. Updated the comment to match — "the worker's default batch size is 200; callers can pass any value (lower for tests, higher if catching up a large back-catalog manually)."

Comment thread apps/worker/src/jobs/tier-backfill.ts Outdated
patch['fromAddress'] = result.fromAddress;
}
try {
await updatePageMetadata(row.user_id, row.page_id, patch);

Owner Author

Addressed in 0727fb6. The worker now checks the return value — affected === 0 is treated as a failure (race where the page was deleted between find + update, or ownership mismatch). Bumps summary.failed, logs with pageId/userId, skips the success counters so the report no longer silently overcounts. New unit test (counts updatePageMetadata returning 0 affected rows as failed) covers the path.
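
The accounting rule described here can be sketched as a small pure function. The counter names (`copiedFromSignal`, `reclassified`, `failed`) come from the commit message; the `BackfillSummary` interface and `recordUpdateOutcome` function are hypothetical, and the real job also logs pageId/userId, which is omitted.

```typescript
// Hypothetical summary shape using the counter names from the commit message.
interface BackfillSummary {
  copiedFromSignal: number;
  reclassified: number;
  failed: number;
}

type SuccessPath = "copiedFromSignal" | "reclassified";

// `affected` stands for the row count returned by updatePageMetadata.
function recordUpdateOutcome(summary: BackfillSummary, path: SuccessPath, affected: number): void {
  if (affected === 0) {
    // Page vanished between find and update, or ownership mismatch: not a success.
    summary.failed += 1;
    return;
  }
  summary[path] += 1;
}

const summary: BackfillSummary = { copiedFromSignal: 0, reclassified: 0, failed: 0 };
recordUpdateOutcome(summary, "copiedFromSignal", 1);
recordUpdateOutcome(summary, "reclassified", 0); // lost the race, counted as failed
recordUpdateOutcome(summary, "reclassified", 1);
```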

Three findings on the backfill worker, all valid:

1. findPagesMissingAuthoringTier did a bare `JSON.parse(row.signal_data)`
   when the driver returned JSONB as a string. One malformed signal row
   would have thrown and tanked the whole worker pass. Switched to the
   file-local `parseJson` helper (the same one parsePageRow / parseSettingsRow
   / etc. use) — returns null on parse failure; coerce to {} so the
   worker logs the row as "unreclassifiable" and keeps going.

2. Doc comment claimed "a thousand pages per cycle is the default in the
   worker" but the actual default is 200. Updated.

3. The worker was discarding updatePageMetadata's affected-row count.
   A 0 return (page disappeared between find + update, or ownership
   mismatch) was getting counted as a successful copy/reclass — silent
   data lie. Now treated as failed: incremented `summary.failed`,
   logged with pageId/userId, no copiedFromSignal/reclassified bump.
   New unit test covers the race path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jayzalowitz jayzalowitz merged commit 9519394 into main May 12, 2026
8 checks passed
jayzalowitz added a commit that referenced this pull request May 13, 2026
Cross-references the 12 PRs that landed across #251 Layer 1+2+follow-ups
(authoring tier, tier-weighted retrieval, pin/hide, backfill, real-
embedding ablation), #193 Lifebook follow-ups (capabilities filter,
provenance wing filter, per-Lifebook briefing), #179 mobile voice, and
#187 AC#4 (Piper TTS) against the project's user-facing docs.

README.md:
- Version badge 0.6.17.0 → 0.6.21.0
- Package/app count "14 packages and 6 apps" → "29 packages and 7 apps"
- Project Status reflects the v0.6 series (embedded LLM, tier-aware
  memory, per-Lifebook surfaces, voice loop)
- "What works today" adds mobile voice capture + the on-device
  embedded LLM stack (llama.cpp / whisper.cpp / Piper TTS) with the
  /api/voice/transcribe and /api/voice/synthesize endpoints

CLAUDE.md:
- llm-client row notes the `embedded` provider and the
  estimateLlmCostCents() helper
- New embedded-llm row covers llama.cpp / whisper.cpp / Piper TTS
- connectors row notes the AuthoringTier classifier (#251 Layer 1)
- memory-gbrain-crdb-adapter row notes Layer 2 tier-weighted RRF
  scoring, pin/hide controls (#270), and the backfill worker (#271)
- mobile app row notes voice capture via expo-audio + the desktop
  transcribe round-trip
- New twin-mcp-server app row

No CHANGELOG changes — each PR's entry was authored by /ship and
covers its own slice accurately. No TODOS.md changes — the two open
P3s (real production tour mode, multi-instance demo rate limiting)
remain blocked on the same product decisions; nothing in this sweep
closes them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>