Multi-source generalization: extractors consume normalized signals

# Multi-source generalization: extractors consume normalized signals, not email

## Context

The reference product that seeded these specs is email-only. SkyTwin is not — it
ingests calendar, filesystem (idle-miner), voice transcripts, and more, all
normalized to one signal shape. Specs 02 (commitments), 03 (deadlines), and 06
(security alerts) were drafted with email-shaped inputs, which would wrongly bind
generic capabilities to one channel. This spec makes them source-agnostic by routing
every extractor through the normalized `RawSignal` contract and defines which sources
each capability covers. Get this right once and every new connector inherits the
intelligence for free; get it wrong and we re-implement commitment/deadline/security
logic per channel.

## Current State

Verified 2026-06-06. The architecture is already source-agnostic past ingest; the
gap is that the inputs defined in specs 02, 03, and 06 were email-shaped.

- `packages/connectors/src/connector-interface.ts:5-11` — the normalized contract:
  ```ts
  interface RawSignal {
    id: string;
    source: string;
    type: string;
    data: Record<string, unknown>;
    timestamp: Date;
  }
  ```
  Every connector emits this. `source` is the channel string (`gmail`,
  `google_calendar`, etc.); `data` holds the channel-specific payload.
- `packages/decision-engine/src/situation-interpreter.ts:89-94` — already reads
  `source`, `type`, `subject`, `category`, `body` generically off the raw event. It
  is channel-neutral by construction.
- `packages/decision-engine/src/situation-interpreter.ts:73-84` — `deriveProvenance`
  reads `source` + `authoringTier` from the event (top-level or nested `data`),
  source-agnostic.
- `packages/connectors/src/authoring-tier.ts:11-14` — explicit comment: the
  `authoringTier` field is "deliberately channel-agnostic" and anticipates
  `authored_originated` / `received_personal` tiers for non-email channels.
- `packages/connectors/src/google-calendar-connector.ts:206` — calendar already
  stamps `authoringTier` via `classifyCalendarAuthoringTier` (#251 Phase 3). So
  multi-channel authoring classification is already a live pattern, not hypothetical.
- Sources emitting signals today: `gmail` (`gmail-connector.ts:414`),
  `google_calendar` (`google-calendar-connector.ts:214`), `email`/`calendar` mocks,
  filesystem signals from `@skytwin/idle-miner`, and voice transcripts via
  `/api/voice/transcribe` (whisper.cpp, `@skytwin/embedded-llm`).

## Proposed Change

1. **Normalized text accessor** — one helper that pulls displayable/extractable text
   and metadata from any `RawSignal`, so extractors never touch `data` shape directly.
   ```ts
   // packages/decision-engine/src/signal-text.ts
   export interface SignalText {
     source: string;              // 'gmail' | 'google_calendar' | 'filesystem' | 'voice' | ...
     title: string;               // subject / event title / file name / "voice note"
     body: string;                // body / event description / file excerpt / transcript
     authoringTier?: AuthoringTier;
     authoredByUser: boolean;     // derived: tier ∈ {user_sent_*, authored_*}
     occurredAt: Date;            // timestamp anchor for relative deadlines
     participants: string[];      // recipients / attendees / collaborators
   }
   export function toSignalText(signal: RawSignal): SignalText;
   ```
   A per-source adapter map fills the fields (email → subject/body/to; calendar →
   summary/description/attendees; filesystem → filename/excerpt; voice →
   "voice note"/transcript). Unknown sources fall back to best-effort
   `data.title`/`data.body` and `authoredByUser = false` (fail safe).
2. **Refactor extractors to consume `SignalText`** — specs 02, 03, 06 take
   `SignalText` instead of email-specific inputs. Their logic is unchanged; only the
   input adapter differs. This is the one edit that makes them multi-source.
3. **Extend `AuthoringTier` for non-email channels** — add `authored_originated`
   (user created this doc/event/note) and `received_shared` (someone shared it with
   the user) per the `authoring-tier.ts:11-14` plan. `authoredByUser` in `SignalText`
   maps both email `user_sent_*` and the new `authored_*` tiers to `true`.
4. **Source coverage matrix** — make per-capability/per-source coverage explicit and
   tested, so "does commitment extraction run on voice notes?" has a defined answer:

   | Capability (spec)        | email | calendar | filesystem | voice | chat/MCP* |
   |--------------------------|:-----:|:--------:|:----------:|:-----:|:---------:|
   | Commitment extr. (02)    |  ✅   |  ✅ desc |    ❌      |  ✅   |  future   |
   | Deadline extr. (03)      |  ✅   |  ✅      |  ✅ TODOs  |  ✅   |  future   |
   | Security alert (06)      |  ✅   |  ❌      |    ❌      |  ❌   |  future   |
   | Topic clustering (04)    |  ✅   |  ✅      |    ✅      |  ✅   |  ✅       |
   | Entity linking (05)      |  ✅   |  ✅      |    ✅      |  ✅   |  ✅       |

   `*chat/MCP` = not a connector yet; matrix reserves the slot so adding one is config,
   not new extractor code. Rationale per cell:
   - Commitments come from *authored* content: sent mail, calendar event
     descriptions the user wrote, **transcribed voice notes** (a strong source — the
     user literally says "I'll do X"). Not filesystem (code/files aren't promises).
   - Deadlines appear in any text body, including idle-miner TODO/deadline comments
     scraped from project files.
   - Security alerts are inbound-notification shaped — email today; SMS/push later.
     Calendar/filesystem/voice don't carry breach alerts, so leaving them off is
     correct, not a gap.
   - Clustering + entity linking are already source-agnostic (operate on `SignalText`
     + tagged domain), so they cover everything.
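
A minimal sketch of how the accessor in step 1 might look, assuming a per-source adapter map keyed on `source`. The keys read from `data` (`subject`, `summary`, `filename`, `excerpt`, `transcript`, `to`, `attendees`, `authoringTier`) and the helper names (`adapters`, `fallbackAdapter`, `str`, `strList`) are illustrative assumptions, not the connectors' confirmed payloads, and `authoringTier` is typed as a plain string here in place of the real `AuthoringTier`.

```ts
// Sketch only: payload keys and helper names below are hypothetical.
interface RawSignal {
  id: string;
  source: string;
  type: string;
  data: Record<string, unknown>;
  timestamp: Date;
}

interface SignalText {
  source: string;
  title: string;
  body: string;
  authoringTier?: string; // stand-in for the real AuthoringTier type
  authoredByUser: boolean;
  occurredAt: Date;
  participants: string[];
}

type TextFields = Pick<SignalText, 'title' | 'body' | 'participants'>;
type Adapter = (data: Record<string, unknown>) => TextFields;

// Coerce unknown payload values defensively instead of trusting their shape.
const str = (v: unknown): string => (typeof v === 'string' ? v : '');
const strList = (v: unknown): string[] =>
  Array.isArray(v) ? v.filter((x): x is string => typeof x === 'string') : [];

// One adapter per known source; extractors never read `data` themselves.
const adapters = new Map<string, Adapter>([
  ['gmail', (d) => ({ title: str(d['subject']), body: str(d['body']), participants: strList(d['to']) })],
  ['google_calendar', (d) => ({ title: str(d['summary']), body: str(d['description']), participants: strList(d['attendees']) })],
  ['filesystem', (d) => ({ title: str(d['filename']), body: str(d['excerpt']), participants: [] })],
  ['voice', (d) => ({ title: 'voice note', body: str(d['transcript']), participants: [] })],
]);

// Best-effort fallback for sources without an adapter.
const fallbackAdapter: Adapter = (d) => ({ title: str(d['title']), body: str(d['body']), participants: [] });

export function toSignalText(signal: RawSignal): SignalText {
  const adapter = adapters.get(signal.source);
  const fields = (adapter ?? fallbackAdapter)(signal.data);
  const rawTier = signal.data['authoringTier'];
  const authoringTier = typeof rawTier === 'string' ? rawTier : undefined;
  // Fail safe: an unknown source is never treated as user-authored, whatever its tier says.
  const authoredByUser =
    adapter !== undefined &&
    authoringTier !== undefined &&
    (authoringTier.startsWith('user_sent_') || authoringTier.startsWith('authored_'));
  return { source: signal.source, ...fields, authoringTier, authoredByUser, occurredAt: signal.timestamp };
}
```

Under these assumptions, adding a connector means adding one entry to the map; until that entry exists, its signals still yield text but are treated as not authored by the user.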

## Acceptance Criteria

1. `toSignalText` returns correct `title`/`body`/`authoredByUser`/`occurredAt` for a
   `gmail`, a `google_calendar`, a `filesystem`, and a `voice` signal fixture.
2. An unknown `source` returns best-effort text with `authoredByUser = false` (fail
   safe — never defaults to "authored").
3. Specs 02/03/06 extractors accept `SignalText` and produce the same results on
   their email-only fixtures as before the refactor (no behavior regression).
4. Commitment extraction runs on a `voice` transcript fixture and a calendar
   description fixture; returns `[]` for a `filesystem` signal (matrix enforced).
5. `authored_originated` and `received_shared` tiers exist and map correctly through
   `authoredByUser`.
6. The coverage matrix is encoded as a tested config (a capability×source allowlist),
   not scattered `if (source === 'gmail')` checks.
7. Every test in the Testing Plan is written and passing, and the existing test
   suite passes unchanged.
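
One possible encoding of criterion 6, sketched under assumptions: the capability keys, the `COVERAGE` table, and the `isCovered`/`gate` helpers are hypothetical names, and the source strings are the ones listed for `SignalText.source`. The chat/MCP column is left out because no connector, and therefore no source string, exists for it yet.

```ts
// Sketch only: capability names and helper names are hypothetical.
type Capability =
  | 'commitment'       // spec 02
  | 'deadline'         // spec 03
  | 'security_alert'   // spec 06
  | 'topic_clustering' // spec 04
  | 'entity_linking';  // spec 05

const ALL_SOURCES = ['gmail', 'google_calendar', 'filesystem', 'voice'];

// Capability x source allowlist mirroring the coverage matrix.
const COVERAGE: Record<Capability, ReadonlySet<string>> = {
  commitment: new Set(['gmail', 'google_calendar', 'voice']),
  deadline: new Set(ALL_SOURCES),
  security_alert: new Set(['gmail']),
  topic_clustering: new Set(ALL_SOURCES),
  entity_linking: new Set(ALL_SOURCES),
};

export function isCovered(capability: Capability, source: string): boolean {
  return COVERAGE[capability].has(source);
}

// Wrap an extractor so that uncovered sources short-circuit to an empty result.
export function gate<S extends { source: string }, R>(
  capability: Capability,
  extract: (signal: S) => R[],
): (signal: S) => R[] {
  return (signal) => (isCovered(capability, signal.source) ? extract(signal) : []);
}
```

With a wrapper like this, criterion 4 (commitment extraction returns `[]` for a `filesystem` signal) follows from the table rather than from a branch inside the extractor, and reserving a future source is a one-line config change.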

## Testing Plan

| Layer       | What                                                              | Count |
|-------------|------------------------------------------------------------------|-------|
| Unit        | `toSignalText` per source (email/calendar/filesystem/voice)      | +4 |
| Unit        | Unknown source fail-safe (`authoredByUser=false`)                | +2 |
| Unit        | New tier mapping → `authoredByUser`                              | +2 |
| Unit        | Coverage-matrix allowlist gates each extractor per source        | +5 |
| Integration | Same extractor, two sources, parity vs. email-only baseline      | +3 |
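
The tier-mapping unit tests could target a small predicate along these lines. This is a sketch, not the contents of `authoring-tier.ts`: it assumes the existing email tiers all share a `user_sent_` prefix (as the `user_sent_*` shorthand suggests), models only the tiers named in this spec, and uses the hypothetical name `isAuthoredByUser`.

```ts
// Sketch only: the real AuthoringTier union has members not modeled here.
type NewChannelTier = 'authored_originated' | 'received_shared';
type AuthoringTier = `user_sent_${string}` | NewChannelTier;

// True only for tiers that mean "the user wrote this"; a missing tier fails safe.
export function isAuthoredByUser(tier: AuthoringTier | undefined): boolean {
  if (tier === undefined) return false;
  return tier.startsWith('user_sent_') || tier.startsWith('authored_');
}
```

Matching on prefix rather than on an explicit list keeps the mapping stable if more `authored_*` tiers are added later, at the cost of relying on the naming convention holding.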

## Rollback Plan

`toSignalText` is additive. If the refactor of 02/03/06 regresses, revert those
extractors to their email-only input and keep `toSignalText` unused — no schema or
connector changes to undo. The new `AuthoringTier` values are additive enum members;
unused values are harmless.

## Effort Estimate

- `signal-text.ts` + per-source adapters: ~4h
- `AuthoringTier` extension + mapping: ~2h
- Coverage-matrix config + gating: ~2h
- Refactor 02/03/06 inputs to `SignalText`: ~3h
- Tests: ~4h

Total: ~2 days. Do this BEFORE or alongside 02/03/06 so they're born multi-source.

## Files Reference

| File | Change |
|------|--------|
| `packages/decision-engine/src/signal-text.ts` | New: normalized text accessor + adapters |
| `packages/connectors/src/connector-interface.ts:5-11` | Reference (the `RawSignal` contract) |
| `packages/connectors/src/authoring-tier.ts:17-23` | Add `authored_originated`, `received_shared` |
| `packages/decision-engine/src/commitment-extractor.ts` (spec 02) | Consume `SignalText` |
| `packages/decision-engine/src/deadline-extractor.ts` (spec 03) | Consume `SignalText` |
| `packages/decision-engine/src/situation-interpreter.ts` (spec 06) | Security markers on `SignalText.body` |
| coverage-matrix config | New: capability×source allowlist |

## Out of Scope

- Building new connectors (chat/MCP/SMS). This reserves their slots; it doesn't add
  them.
- Per-source ML tuning. One extractor, source-tagged input; tune later if a source
  underperforms.

## Related

- Foundational for specs 02, 03, 06 (they should consume `SignalText`).
- 04 and 05 are already source-agnostic; this formalizes that they stay so.
- Builds on `AuthoringTier` (#251), including the calendar Phase 3 already shipped.

