Multi-source generalization: extractors consume normalized signals, not email
Context
The reference product that seeded these specs is email-only. SkyTwin is not — it
ingests calendar, filesystem (idle-miner), voice transcripts, and more, all
normalized to one signal shape. Specs 02 (commitments), 03 (deadlines), and 06
(security alerts) were drafted with email-shaped inputs, which would wrongly bind
generic capabilities to one channel. This spec makes them source-agnostic by routing
every extractor through the normalized RawSignal contract and defines which sources
each capability covers. Get this right once and every new connector inherits the
intelligence for free; get it wrong and we re-implement commitment/deadline/security
logic per channel.
Current State
Verified 2026-06-06. The architecture is already source-agnostic past ingest — the
gap is that the spec inputs weren't.
packages/connectors/src/connector-interface.ts:5-11 — the normalized RawSignal contract.
Every connector emits it: source is the channel string (gmail, google_calendar, etc.) and data holds the channel-specific payload.
packages/decision-engine/src/situation-interpreter.ts:89-94 — already reads source, type, subject, category, body generically off the raw event. It
is channel-neutral by construction.
packages/decision-engine/src/situation-interpreter.ts:73-84 — deriveProvenance
reads source + authoringTier from the event (top-level or nested data),
source-agnostic.
packages/connectors/src/authoring-tier.ts:11-14 — explicit comment: the authoringTier field is "deliberately channel-agnostic" and anticipates authored_originated / received_personal tiers for non-email channels.
packages/connectors/src/google-calendar-connector.ts:206 — calendar already
stamps authoringTier via classifyCalendarAuthoringTier (#251, "Memory bootstrap: weight user-sent emails higher than received", Phase 3). So
multi-channel authoring classification is already a live pattern, not hypothetical.
Sources emitting signals today: gmail (gmail-connector.ts:414), google_calendar (google-calendar-connector.ts:214), email/calendar mocks,
filesystem signals from @skytwin/idle-miner, and voice transcripts via /api/voice/transcribe (whisper.cpp, @skytwin/embedded-llm).
Proposed Change
Normalized text accessor — one helper that pulls displayable/extractable text
and metadata from any RawSignal, so extractors never touch data shape directly.
A per-source adapter map fills the fields (email → subject/body/to; calendar →
summary/description/attendees; filesystem → filename/excerpt; voice →
"voice note"/transcript). Unknown sources fall back to best-effort data.title/data.body and authoredByUser = false (fail safe).
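A minimal sketch of what such an accessor could look like. Everything here is an assumption made for illustration: `RawSignalLike` is a stand-in rather than the real contract in connector-interface.ts, the `data` keys are taken from the adapter list above, and `timestamp` is a hypothetical field used to fill `occurredAt`.

```typescript
// Sketch only: RawSignalLike, the data keys, and `timestamp` are assumptions,
// not the real contract in connector-interface.ts.
interface RawSignalLike {
  source: string;                // channel string, e.g. "gmail"
  data: Record<string, unknown>; // channel-specific payload
  timestamp?: string;            // assumed event time
  authoringTier?: string;        // may also be nested under data
}

interface SignalText {
  source: string;
  title: string;
  body: string;
  authoredByUser: boolean;
  occurredAt: string | null;
}

type TextAdapter = (data: Record<string, unknown>) => { title: string; body: string };

// Coerce an unknown payload value to text without throwing.
const str = (v: unknown): string => (typeof v === "string" ? v : "");

// A Map rather than a plain object, so a source string such as "constructor"
// cannot resolve to an inherited property.
const adapters = new Map<string, TextAdapter>([
  ["gmail", (d) => ({ title: str(d["subject"]), body: str(d["body"]) })],
  ["google_calendar", (d) => ({ title: str(d["summary"]), body: str(d["description"]) })],
  ["filesystem", (d) => ({ title: str(d["filename"]), body: str(d["excerpt"]) })],
  ["voice", (d) => ({ title: "voice note", body: str(d["transcript"]) })],
]);

// Best-effort fallback for sources with no registered adapter.
const fallback: TextAdapter = (d) => ({ title: str(d["title"]), body: str(d["body"]) });

function toSignalText(signal: RawSignalLike): SignalText {
  const adapter = adapters.get(signal.source);
  const { title, body } = (adapter ?? fallback)(signal.data);
  // The tier may sit at the top level or inside data.
  const tier = str(signal.authoringTier ?? signal.data["authoringTier"]);
  const authored = tier.startsWith("user_sent_") || tier.startsWith("authored_");
  return {
    source: signal.source,
    title,
    body,
    // Fail safe: an unknown source is never treated as authored.
    authoredByUser: adapter !== undefined && authored,
    occurredAt: signal.timestamp ?? null,
  };
}
```

With this shape, supporting a new connector would mean registering one more adapter entry; extractors that accept `SignalText` would not need to change.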
Refactor extractors to consume SignalText — specs 02, 03, 06 take SignalText instead of email-specific inputs. Their logic is unchanged; only the
input adapter differs. This is the one edit that makes them multi-source.
Extend AuthoringTier for non-email channels — add authored_originated
(user created this doc/event/note) and received_shared (someone shared it with
the user) per the authoring-tier.ts:11-14 plan. authoredByUser in SignalText
maps both email user_sent_* and the new authored_* tiers to true.
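One way the tier-to-flag mapping might be kept exhaustive is sketched below. The names `NonEmailTier`, `NON_EMAIL_AUTHORED`, and `tierToAuthoredByUser` are hypothetical, and the existing email tiers are represented only by their `user_sent_` prefix because their exact member names are not listed here.

```typescript
// Sketch only: stands in for the real union in authoring-tier.ts.
type NonEmailTier = "authored_originated" | "received_shared";

// Exhaustive table: adding a member to NonEmailTier without deciding
// whether it counts as authored becomes a compile error.
const NON_EMAIL_AUTHORED: Record<NonEmailTier, boolean> = {
  authored_originated: true, // user created this doc/event/note
  received_shared: false,    // someone shared it with the user
};

function isNonEmailTier(tier: string): tier is NonEmailTier {
  return Object.prototype.hasOwnProperty.call(NON_EMAIL_AUTHORED, tier);
}

function tierToAuthoredByUser(tier: string | undefined): boolean {
  if (!tier) return false;                         // missing tier: fail safe
  if (tier.startsWith("user_sent_")) return true;  // existing email tiers
  return isNonEmailTier(tier) ? NON_EMAIL_AUTHORED[tier] : false;
}
```

Unrecognized tier strings fall through to `false`, which matches the fail-safe rule used for unknown sources.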
Source coverage matrix — make per-capability/per-source coverage explicit and
tested, so "does commitment extraction run on voice notes?" has a defined answer:
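| Capability (spec) | email | calendar | filesystem | voice | chat/MCP* |
| --- | --- | --- | --- | --- | --- |
| Commitment extr. (02) | ✅ | ✅ desc | ❌ | ✅ | future |
| Deadline extr. (03) | ✅ | ✅ | ✅ TODOs | ✅ | future |
| Security alert (06) | ✅ | ❌ | ❌ | ❌ | future |
| Topic clustering (04) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Entity linking (05) | ✅ | ✅ | ✅ | ✅ | ✅ |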
*chat/MCP = not a connector yet; the matrix reserves the slot so adding one is config,
not new extractor code.
Rationale per cell:
Commitments come from authored content: sent mail, calendar event
descriptions the user wrote, transcribed voice notes (a strong source — the
user literally says "I'll do X"). Not filesystem (code/files aren't promises).
Deadlines appear in any text body, including idle-miner TODO/deadline comments
scraped from project files.
Security alerts are inbound-notification shaped — email today; SMS/push later.
Calendar/filesystem/voice don't carry breach alerts, so leaving them off is
correct, not a gap.
Clustering + entity linking are already source-agnostic (operate on SignalText
tagged domain), so they cover everything.
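The coverage matrix could be encoded as a capability×source allowlist along the following lines. This is a sketch under stated assumptions: the capability keys, `SourcedText`, `covers`, and `gate` are hypothetical names, and the toy extractor exists only to show the gate, not the spec 02 logic.

```typescript
// Sketch only: capability keys and helper names are hypothetical.
type Capability =
  | "commitment"
  | "deadline"
  | "security_alert"
  | "topic_clustering"
  | "entity_linking";

// chat/MCP is omitted because it is not a connector yet; enabling it later
// would be one more string in the relevant sets.
const ALL_SOURCES = ["gmail", "google_calendar", "filesystem", "voice"];

const COVERAGE: Record<Capability, ReadonlySet<string>> = {
  commitment: new Set(["gmail", "google_calendar", "voice"]),
  deadline: new Set(ALL_SOURCES),
  security_alert: new Set(["gmail"]),
  topic_clustering: new Set(ALL_SOURCES),
  entity_linking: new Set(ALL_SOURCES),
};

function covers(capability: Capability, source: string): boolean {
  return COVERAGE[capability].has(source);
}

// Minimal slice of SignalText needed by the gate.
interface SourcedText {
  source: string;
  body: string;
}

// Wrap an extractor so it returns [] for sources the matrix excludes.
function gate<T>(
  capability: Capability,
  extract: (s: SourcedText) => T[],
): (s: SourcedText) => T[] {
  return (s) => (covers(capability, s.source) ? extract(s) : []);
}

// Toy extractor for illustration only.
const extractCommitments = gate("commitment", (s) =>
  /\bI'll\b/.test(s.body) ? [s.body] : [],
);
```

Keeping the allowlist in one table means a question like "does commitment extraction run on voice notes?" is answered by `covers("commitment", "voice")` rather than by per-source conditionals inside each extractor.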
Acceptance Criteria
toSignalText returns correct title/body/authoredByUser/occurredAt for a gmail, a google_calendar, a filesystem, and a voice signal fixture.
An unknown source returns best-effort text with authoredByUser = false (fail
safe — never defaults to "authored").
Specs 02/03/06 extractors accept SignalText and produce identical results to
their email-only fixtures (no behavior regression from the refactor).
Commitment extraction runs on a voice transcript fixture and a calendar
description fixture; returns [] for a filesystem signal (matrix enforced).
authored_originated and received_shared tiers exist and map correctly through authoredByUser.
The coverage matrix is encoded as a tested config (a capability×source allowlist),
not scattered if (source === 'gmail') checks.
Tests written and passing. No degradation of existing functionality.
Testing Plan
| Layer | What | Count |
| --- | --- | --- |
| Unit | toSignalText per source (email/calendar/filesystem/voice) | +4 |
| Unit | Unknown source fail-safe (authoredByUser=false) | +2 |
| Unit | New tier mapping → authoredByUser | +2 |
| Unit | Coverage-matrix allowlist gates each extractor per source | +5 |
| Integration | Same extractor, two sources, parity vs. email-only baseline | +3 |
Rollback Plan
toSignalText is additive. If the refactor of 02/03/06 regresses, revert those
extractors to their email-only input and keep toSignalText unused — no schema or
connector changes to undo. The new AuthoringTier values are additive enum members;
unused values are harmless.
Effort Estimate
signal-text.ts + per-source adapters: ~4h
AuthoringTier extension + mapping: ~2h
Coverage-matrix config + gating: ~2h
Refactor 02/03/06 inputs to SignalText: ~3h
Tests: ~4h
Total: ~2 days. Do this BEFORE or alongside 02/03/06 so they're born multi-source.
Files Reference
packages/decision-engine/src/signal-text.ts
packages/connectors/src/connector-interface.ts:5-11 (RawSignal contract)
packages/connectors/src/authoring-tier.ts:17-23: authored_originated, received_shared
packages/decision-engine/src/commitment-extractor.ts (spec 02): SignalText
packages/decision-engine/src/deadline-extractor.ts (spec 03): SignalText
packages/decision-engine/src/situation-interpreter.ts (spec 06): SignalText.body
Out of Scope
them.
underperforms.
Related
SignalText).
AuthoringTier (#251, "Memory bootstrap: weight user-sent emails higher than received"), including the calendar Phase 3 already shipped.