Multi-source generalization: extractors consume normalized signals #480

@jayzalowitz

Multi-source generalization: extractors consume normalized signals, not email

Context

The reference product that seeded these specs is email-only. SkyTwin is not — it
ingests calendar, filesystem (idle-miner), voice transcripts, and more, all
normalized to one signal shape. Specs 02 (commitments), 03 (deadlines), and 06
(security alerts) were drafted with email-shaped inputs, which would wrongly bind
generic capabilities to one channel. This spec makes them source-agnostic by routing
every extractor through the normalized RawSignal contract and defines which sources
each capability covers. Get this right once and every new connector inherits the
intelligence for free; get it wrong and we re-implement commitment/deadline/security
logic per channel.

Current State

Verified 2026-06-06. The architecture is already source-agnostic past ingest — the
gap is that the spec inputs weren't.

  • packages/connectors/src/connector-interface.ts:5-11 — the normalized contract:
    interface RawSignal { id: string; source: string; type: string; data: Record<string, unknown>; timestamp: Date; }
    Every connector emits this. source is the channel string (gmail,
    google_calendar, etc.); data holds the channel-specific payload.
  • packages/decision-engine/src/situation-interpreter.ts:89-94 — already reads
    source, type, subject, category, body generically off the raw event. It
    is channel-neutral by construction.
  • packages/decision-engine/src/situation-interpreter.ts:73-84 (deriveProvenance)
    reads source + authoringTier from the event (top-level or nested data), so it is
    also source-agnostic.
  • packages/connectors/src/authoring-tier.ts:11-14 — explicit comment: the
    authoringTier field is "deliberately channel-agnostic" and anticipates
    authored_originated / received_personal tiers for non-email channels.
  • packages/connectors/src/google-calendar-connector.ts:206 — calendar already
    stamps authoringTier via classifyCalendarAuthoringTier (#251 Phase 3, "Memory
    bootstrap: weight user-sent emails higher than received"). So multi-channel
    authoring classification is already a live pattern, not hypothetical.
  • Sources emitting signals today: gmail (gmail-connector.ts:414),
    google_calendar (google-calendar-connector.ts:214), email/calendar mocks,
    filesystem signals from @skytwin/idle-miner, and voice transcripts via
    /api/voice/transcribe (whisper.cpp, @skytwin/embedded-llm).
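To make the contract concrete, here is a minimal sketch of two signals from different channels sharing the RawSignal envelope. The interface is copied from the first bullet; the `type` values and the keys inside `data` are assumptions for illustration, not the connectors' actual payloads, and `envelope` is a hypothetical helper.

```typescript
// The RawSignal contract as quoted in the first bullet above.
interface RawSignal {
  id: string;
  source: string;
  type: string;
  data: Record<string, unknown>;
  timestamp: Date;
}

// Hypothetical fixtures: the `type` values and the keys inside `data` are
// illustrative assumptions, not the connectors' real payloads.
const emailSignal: RawSignal = {
  id: "sig-email-1",
  source: "gmail",
  type: "email",
  data: { subject: "Q3 report", body: "I'll send the draft by Friday." },
  timestamp: new Date("2026-06-01T09:00:00Z"),
};

const calendarSignal: RawSignal = {
  id: "sig-cal-1",
  source: "google_calendar",
  type: "event",
  data: { summary: "Q3 review", description: "Bring the draft report." },
  timestamp: new Date("2026-06-02T15:30:00Z"),
};

// Code past ingest can rely on the envelope fields alone, whatever `data` holds.
function envelope(signal: RawSignal): string {
  return `${signal.source}/${signal.type}@${signal.timestamp.toISOString()}`;
}
```

Only `data` differs in shape between the two fixtures, which is why the extractors need an adapter for `data` and nothing else.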

Proposed Change

  1. Normalized text accessor — one helper that pulls displayable/extractable text
    and metadata from any RawSignal, so extractors never touch data shape directly.

    // packages/decision-engine/src/signal-text.ts
    export interface SignalText {
      source: string;              // 'gmail' | 'google_calendar' | 'filesystem' | 'voice' | ...
      title: string;               // subject / event title / file name / "voice note"
      body: string;                // body / event description / file excerpt / transcript
      authoringTier?: AuthoringTier;
      authoredByUser: boolean;     // derived: tier ∈ {user_sent_*, authored_*}
      occurredAt: Date;            // timestamp anchor for relative deadlines
      participants: string[];      // recipients / attendees / collaborators
    }
    export function toSignalText(signal: RawSignal): SignalText;

    A per-source adapter map fills the fields (email → subject/body/to; calendar →
    summary/description/attendees; filesystem → filename/excerpt; voice →
    "voice note"/transcript). Unknown sources fall back to best-effort
    data.title/data.body and authoredByUser = false (fail safe).
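One way the adapter map and the fail-safe fallback could fit together is sketched below. This is an illustration under stated assumptions, not the implementation: the `data` keys are guessed from the field names in this spec, `authoringTier` is read from `data` only and typed as a plain string, and the helper names (`str`, `strList`, `ADAPTERS`, `fallbackAdapter`, `isAuthoredTier`) are hypothetical.

```typescript
interface RawSignal {
  id: string;
  source: string;
  type: string;
  data: Record<string, unknown>;
  timestamp: Date;
}

interface SignalText {
  source: string;
  title: string;
  body: string;
  authoringTier?: string; // assumption: stands in for the real AuthoringTier type
  authoredByUser: boolean;
  occurredAt: Date;
  participants: string[];
}

type Adapter = (data: Record<string, unknown>) => Pick<SignalText, "title" | "body" | "participants">;

// Defensive readers: `data` is untyped, so coerce instead of trusting it.
const str = (v: unknown): string => (typeof v === "string" ? v : "");
const strList = (v: unknown): string[] =>
  Array.isArray(v) ? v.filter((x): x is string => typeof x === "string") : [];

// Per-source adapters. The `data` keys are assumptions, not verified payloads.
const ADAPTERS = new Map<string, Adapter>([
  ["gmail", (d) => ({ title: str(d.subject), body: str(d.body), participants: strList(d.to) })],
  ["google_calendar", (d) => ({ title: str(d.summary), body: str(d.description), participants: strList(d.attendees) })],
  ["filesystem", (d) => ({ title: str(d.filename), body: str(d.excerpt), participants: [] })],
  ["voice", (d) => ({ title: "voice note", body: str(d.transcript), participants: [] })],
]);

// Best-effort fallback for sources without an adapter.
const fallbackAdapter: Adapter = (d) => ({ title: str(d.title), body: str(d.body), participants: [] });

// Mirrors "tier ∈ {user_sent_*, authored_*}" from the interface comment.
const isAuthoredTier = (tier: string): boolean =>
  tier.startsWith("user_sent_") || tier.startsWith("authored_");

function toSignalText(signal: RawSignal): SignalText {
  const adapter = ADAPTERS.get(signal.source);
  const tier = str(signal.data.authoringTier);
  const text: SignalText = {
    source: signal.source,
    ...(adapter ?? fallbackAdapter)(signal.data),
    // Fail safe: an unknown source is never treated as user-authored.
    authoredByUser: adapter !== undefined && isAuthoredTier(tier),
    occurredAt: signal.timestamp,
  };
  if (tier !== "") text.authoringTier = tier;
  return text;
}
```

Using a Map rather than an object literal keeps a source string such as "constructor" from resolving to an inherited property, so only registered sources ever get an adapter.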

  2. Refactor extractors to consume SignalText — specs 02, 03, 06 take
    SignalText instead of email-specific inputs. Their logic is unchanged; only the
    input adapter differs. This is the one edit that makes them multi-source.

  3. Extend AuthoringTier for non-email channels — add authored_originated
    (user created this doc/event/note) and received_shared (someone shared it with
    the user) per the authoring-tier.ts:11-14 plan. authoredByUser in SignalText
    maps both email user_sent_* and the new authored_* tiers to true.

  4. Source coverage matrix — make per-capability/per-source coverage explicit and
    tested, so "does commitment extraction run on voice notes?" has a defined answer:

    Capability (spec)        email   calendar   filesystem   voice   chat/MCP*
    Commitment extr. (02)    ✅      ✅ desc    ❌           ✅      future
    Deadline extr. (03)      ✅      ✅         ✅ TODOs     ✅      future
    Security alert (06)      ✅      ❌         ❌           ❌      future
    Topic clustering (04)    ✅      ✅         ✅           ✅      ✅
    Entity linking (05)      ✅      ✅         ✅           ✅      ✅

    *chat/MCP = not a connector yet; matrix reserves the slot so adding one is config,
    not new extractor code. Rationale per cell:

    • Commitments come from authored content: sent mail, calendar event
      descriptions the user wrote, transcribed voice notes (a strong source — the
      user literally says "I'll do X"). Not filesystem (code/files aren't promises).
    • Deadlines appear in any text body, including idle-miner TODO/deadline comments
      scraped from project files.
    • Security alerts are inbound-notification shaped — email today; SMS/push later.
      Calendar/filesystem/voice don't carry breach alerts, so leaving them off is
      correct, not a gap.
    • Clustering + entity linking are already source-agnostic (operate on SignalText
      + tagged domain), so they cover everything.
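The matrix can be held as a capability×source allowlist that wraps each extractor, roughly as sketched here. The capability keys, `COVERAGE`, `isCovered`, `gated`, and the toy regex extractor are all hypothetical names for illustration. The allowlist contents follow the rationale bullets above, and chat/MCP is left out because no connector exists yet.

```typescript
type Capability =
  | "commitments"
  | "deadlines"
  | "security_alerts"
  | "topic_clustering"
  | "entity_linking";

const ALL_SOURCES = ["gmail", "google_calendar", "filesystem", "voice"];

// Allowlist transcribed from the rationale bullets (an assumption about the
// final config shape, not an existing file).
const COVERAGE: Record<Capability, ReadonlySet<string>> = {
  commitments: new Set(["gmail", "google_calendar", "voice"]),
  deadlines: new Set(ALL_SOURCES),
  security_alerts: new Set(["gmail"]),
  topic_clustering: new Set(ALL_SOURCES),
  entity_linking: new Set(ALL_SOURCES),
};

function isCovered(capability: Capability, source: string): boolean {
  return COVERAGE[capability].has(source);
}

// Only the fields this sketch needs; the real input would be SignalText.
interface SignalTextLike {
  source: string;
  body: string;
}

// Wraps an extractor so uncovered sources yield [] without the extractor
// itself ever inspecting `source`.
function gated<T>(
  capability: Capability,
  extract: (signal: SignalTextLike) => T[],
): (signal: SignalTextLike) => T[] {
  return (signal) => (isCovered(capability, signal.source) ? extract(signal) : []);
}

// Toy stand-in for the spec 02 extractor: grabs "I'll ..." / "I will ..." clauses.
const toyCommitments = gated<string>(
  "commitments",
  (signal) => signal.body.match(/\bI(?:'ll| will) [^.!?]+/g) ?? [],
);
```

With this shape, adding a chat or MCP connector later means adding its source string to the relevant sets, and the extractors stay untouched.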

Acceptance Criteria

  1. toSignalText returns correct title/body/authoredByUser/occurredAt for a
    gmail, a google_calendar, a filesystem, and a voice signal fixture.
  2. An unknown source returns best-effort text with authoredByUser = false (fail
    safe — never defaults to "authored").
  3. Specs 02/03/06 extractors accept SignalText and produce identical results to
    their email-only fixtures (no behavior regression from the refactor).
  4. Commitment extraction runs on a voice transcript fixture and a calendar
    description fixture; returns [] for a filesystem signal (matrix enforced).
  5. authored_originated and received_shared tiers exist and map correctly through
    authoredByUser.
  6. The coverage matrix is encoded as a tested config (a capability×source allowlist),
    not scattered if (source === 'gmail') checks.
  7. Tests written and passing. No degradation of existing functionality.

Testing Plan

Layer        What                                                         Count
Unit         toSignalText per source (email/calendar/filesystem/voice)    +4
Unit         Unknown source fail-safe (authoredByUser=false)              +2
Unit         New tier mapping → authoredByUser                            +2
Unit         Coverage-matrix allowlist gates each extractor per source    +5
Integration  Same extractor, two sources, parity vs. email-only baseline  +3

Rollback Plan

toSignalText is additive. If the refactor of 02/03/06 regresses, revert those
extractors to their email-only input and keep toSignalText unused — no schema or
connector changes to undo. The new AuthoringTier values are additive enum members;
unused values are harmless.

Effort Estimate

  • signal-text.ts + per-source adapters: ~4h
  • AuthoringTier extension + mapping: ~2h
  • Coverage-matrix config + gating: ~2h
  • Refactor 02/03/06 inputs to SignalText: ~3h
  • Tests: ~4h

Total: ~2 days. Do this BEFORE or alongside 02/03/06 so they're born multi-source.

Files Reference

File                                                             Change
packages/decision-engine/src/signal-text.ts                      New: normalized text accessor + adapters
packages/connectors/src/connector-interface.ts:5-11              Reference (the RawSignal contract)
packages/connectors/src/authoring-tier.ts:17-23                  Add authored_originated, received_shared
packages/decision-engine/src/commitment-extractor.ts (spec 02)   Consume SignalText
packages/decision-engine/src/deadline-extractor.ts (spec 03)     Consume SignalText
packages/decision-engine/src/situation-interpreter.ts (spec 06)  Security markers on SignalText.body
coverage-matrix config                                           New: capability×source allowlist

Out of Scope

  • Building new connectors (chat/MCP/SMS). This reserves their slots; it doesn't add
    them.
  • Per-source ML tuning. One extractor, source-tagged input; tune later if a source
    underperforms.
