Signal topic clustering for the digest #477

@jayzalowitz

Context

A reference AI-inbox product groups awareness items into a handful of named topic
clusters (one per life-area, each holding several related messages) so the user
reads by theme instead of one flat list. SkyTwin has the raw materials — a
life-domain extractor and a Lifebook that tags signals to domains — but it has no
step that clusters the current window of signals into presentable topic groups for
the briefing. Today the "Signals" section is a flat importance-ranked list. Cluster
quality is also where SkyTwin can beat the reference product: the reference digest
visibly mis-files items (it put a developer-infra shutdown under personal finance),
and SkyTwin's twin model + domain extractor can do better.

Current State

Verified 2026-06-06.

  • packages/policy-prompts/prompts/domain-extraction/v1.md — extracts 3-10 stable
    life domains for a user (e.g. "Software Development", "Personal Finance"). This is
    a profile-level operation (the user's domains), not a per-window clustering of
    current signals.
  • packages/db/src/repositories/lifebook-repository.ts — signals are tagged to
    domains (manually or by inference) and stored. Tagging ≠ clustering: there is no
    step that takes "the last N signals" and produces topic groups for display.
  • packages/twin-model/src/analyzers/cross-domain-analyzer.ts — detects behavioral
    traits (e.g. cautious_spender, quick_responder), not topic groups.
  • packages/policy-prompts/prompts/briefing-prose/v1.md:49 — briefing sections are
    fixed (Meetings / Tasks / Signals); Signals is a flat list with no sub-grouping.

Proposed Change

Add a clustering step that takes the briefing's input window of awareness signals and
groups them into named topic clusters, each scoped to a life domain, for the Topics
section produced by spec 01. Two-tier strategy (LLM + deterministic fallback), aligned
with the existing domain-extraction approach.

A cluster is a named group of related signals sharing a topic within a domain.
Synthetic example output:

[
  { "domain": "finance", "title": "Subscriptions & billing", "signalRefs": ["s1","s4","s9"] },
  { "domain": "work",    "title": "Vendor onboarding",        "signalRefs": ["s2","s7"] }
]
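
Since the LLM path returns JSON, the payload probably needs a runtime shape check before anything downstream trusts it. A minimal sketch, assuming only the three fields shown in the example above; `ClusterShape`, `isClusterShape`, and `parseClusters` are hypothetical names, not existing SkyTwin code:

```typescript
// Hypothetical shape for one parsed cluster; fields follow the synthetic example above.
interface ClusterShape {
  domain: string;
  title: string;
  signalRefs: string[];
}

// Type guard: true only when every expected field is present with the right type.
function isClusterShape(value: unknown): value is ClusterShape {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["domain"] === "string" &&
    typeof v["title"] === "string" &&
    Array.isArray(v["signalRefs"]) &&
    v["signalRefs"].every((r: unknown) => typeof r === "string")
  );
}

// Parses a JSON string and rejects anything that is not an array of clusters.
function parseClusters(json: string): ClusterShape[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed) || !parsed.every(isClusterShape)) {
    throw new Error("cluster payload does not match the expected shape");
  }
  return parsed;
}
```

A rejected payload is a natural trigger for the deterministic fallback, though that wiring is a design choice this sketch does not make.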

Implementation Details

  1. New module packages/decision-engine/src/topic-clusterer.ts (or a twin-model
    analyzer — place beside cross-domain-analyzer.ts if it needs profile context):
    export interface ClusterInput {
      signals: Array<{ ref: string; domain: string | null; subject: string; summary: string }>;
      knownDomains: string[];   // from domain-extraction, to anchor cluster domains
      maxClusters: number;      // default 8 (matches the reference product's shape)
    }
    export interface TopicCluster {
      domain: string;
      title: string;            // short human label
      signalRefs: string[];
      confidence: number;
    }
    export function clusterSignals(input: ClusterInput): TopicCluster[];
  2. Anchor to known domains — clusters must map to a domain from
    domain-extraction output when one fits; only mint an "Other / Misc" bucket for
    genuine no-fit signals. This is the precision lever that beats the reference
    product's mis-filing: a signal's domain is decided against the user's actual
    domain set, not a generic taxonomy.
  3. Strategy:
    • LLM: versioned policy-prompts/topic-clustering/v1.md, JSON-schema output
      (array of TopicCluster), input is the signal window + known domains.
    • Deterministic fallback: group by the domain already tagged on each signal
      (from Lifebook / situation-interpreter extractDomain,
      situation-interpreter.ts:306-334); title = domain name. Lower quality, zero
      cost, always available.
  4. Bounded output — at most maxClusters; overflow merges into the
    lowest-confidence cluster or "More updates" (the reference product caps similarly).
    Log when merging happens (no silent truncation).
  5. Wiring — runs before the briefing prompt; its TopicCluster[] populates spec
    01's topics array. Each cluster's signalRefs preserve citations.
  6. Stability — same signals on a re-render should produce the same clusters
    (titles may vary with LLM temperature; pin clustering temperature low and key
    dedup on domain+signalRefs, not on title text).
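
The deterministic path, the cap, and the stability key from steps 3, 4, and 6 can be sketched together. This is one possible reading, not the planned implementation. Assumptions: `clusterByTaggedDomain` and `clusterKey` are hypothetical names; confidence values are placeholders; overflow keeps the largest groups and merges the rest into a "More updates" cluster whose domain is also "More updates" (so known-domain signals never land in "Other"); `maxClusters` is at least 1.

```typescript
interface SignalInput { ref: string; domain: string | null; subject: string; summary: string }
interface ClusterInput { signals: SignalInput[]; knownDomains: string[]; maxClusters: number }
interface TopicCluster { domain: string; title: string; signalRefs: string[]; confidence: number }

const OTHER = "Other";
const MORE_UPDATES = "More updates";

function clusterByTaggedDomain(
  input: ClusterInput,
  log: (msg: string) => void = (msg) => console.warn(msg),
): TopicCluster[] {
  const known = new Set(input.knownDomains);
  const groups = new Map<string, string[]>();
  // Group by the tagged domain; untagged or unknown domains go to "Other".
  for (const s of input.signals) {
    const domain = s.domain !== null && known.has(s.domain) ? s.domain : OTHER;
    const refs = groups.get(domain) ?? [];
    refs.push(s.ref);
    groups.set(domain, refs);
  }
  const clusters: TopicCluster[] = Array.from(groups, ([domain, signalRefs]) => ({
    domain,
    title: domain, // fallback title is the domain name
    signalRefs,
    confidence: domain === OTHER ? 0 : 1, // placeholder values (assumption)
  }));
  if (clusters.length <= input.maxClusters) return clusters;

  // Over the cap: keep the largest groups, merge the rest into one bucket, and log it.
  const bySize = clusters.slice().sort((a, b) => b.signalRefs.length - a.signalRefs.length);
  const keep = bySize.slice(0, input.maxClusters - 1);
  const overflow = bySize.slice(keep.length);
  const mergedRefs: string[] = [];
  for (const c of overflow) mergedRefs.push(...c.signalRefs);
  log(`topic-clusterer: merged ${overflow.length} clusters into "${MORE_UPDATES}"`);
  return [
    ...keep,
    { domain: MORE_UPDATES, title: MORE_UPDATES, signalRefs: mergedRefs, confidence: 0 },
  ];
}

// Stability key from step 6: domain plus sorted refs, never the title text.
function clusterKey(c: TopicCluster): string {
  return `${c.domain}|${c.signalRefs.slice().sort().join(",")}`;
}
```

Passing the logger as a parameter keeps the merge observable in unit tests without capturing console output.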

Acceptance Criteria

  1. Given 10 awareness signals spanning 3 known domains, clusterSignals returns ≤8
    clusters, each mapped to a known domain or an explicit "Other" bucket.
  2. Every input signal appears in exactly one cluster (no omissions, no duplicates).
  3. A signal whose tagged domain matches a known domain is never placed in "Other".
  4. Output never exceeds maxClusters; when input would exceed it, a merge occurs and
    is logged.
  5. With no LLM, the deterministic fallback groups by tagged domain and still returns
    valid clusters.
  6. Cluster signalRefs round-trip to the original signals (citations preserved).
  7. Tests written and passing. No degradation of existing functionality.
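
Criteria 2 and 6 reduce to one partition check that unit tests could reuse. A minimal sketch, assuming input refs are unique; `checkPartition` and `ClusterLike` are hypothetical names:

```typescript
interface ClusterLike { domain: string; signalRefs: string[] }

// Returns a list of human-readable problems; an empty list means the
// clusters form an exact partition of the input refs.
function checkPartition(inputRefs: string[], clusters: ClusterLike[]): string[] {
  const problems: string[] = [];
  const counts = new Map<string, number>();
  for (const c of clusters) {
    for (const ref of c.signalRefs) counts.set(ref, (counts.get(ref) ?? 0) + 1);
  }
  // Every input ref must be cited exactly once.
  for (const ref of inputRefs) {
    const n = counts.get(ref) ?? 0;
    if (n === 0) problems.push(`missing: ${ref}`);
    if (n > 1) problems.push(`duplicated: ${ref}`);
  }
  // Every cited ref must round-trip to an input signal.
  const inputSet = new Set(inputRefs);
  counts.forEach((_n, ref) => {
    if (!inputSet.has(ref)) problems.push(`unknown ref: ${ref}`);
  });
  return problems;
}
```

Returning problems instead of throwing lets a single test assert on all violations at once.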

Testing Plan

| Layer | What | Count |
| --- | --- | --- |
| Unit | Partition completeness + no-overlap invariants | +3 |
| Unit | Known-domain anchoring; "Other" only for no-fit | +3 |
| Unit | maxClusters cap + merge logging | +2 |
| Unit | Deterministic fallback grouping by tagged domain | +2 |
| Integration | clusterer → spec 01 topics payload with citations intact | +2 |

Rollback Plan

Additive and flagged (TOPIC_CLUSTERING=off). With it off, spec 01's Topics section
falls back to a flat domain-tagged list (the deterministic path), which is the
current behavior. No schema or data changes to reverse.
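
A sketch of how the flag might be read. Assumptions: clustering is on unless the flag is explicitly "off", and the environment is passed in rather than read globally; `topicClusteringEnabled` is a hypothetical name.

```typescript
type Env = Record<string, string | undefined>;

// Treats only an explicit "off" (any casing) as disabled; unset means enabled (assumption).
function topicClusteringEnabled(env: Env): boolean {
  return (env["TOPIC_CLUSTERING"] ?? "on").toLowerCase() !== "off";
}
```

Taking `env` as an argument keeps the gate testable without mutating process state.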

Effort Estimate

  • Clusterer module + types: ~3h
  • LLM prompt + schema: ~3h
  • Deterministic fallback + cap/merge: ~2h
  • Wiring into briefing: ~2h
  • Tests: ~3h

Total: ~1.5-2 days.

Files Reference

| File | Change |
| --- | --- |
| packages/decision-engine/src/topic-clusterer.ts | New: clustering + types |
| packages/policy-prompts/prompts/topic-clustering/v1.md | New: LLM template + schema |
| packages/policy-prompts/prompts/domain-extraction/v1.md | Reference (supplies known domains) |
| packages/decision-engine/src/situation-interpreter.ts:306-334 | Reference (per-signal domain tag) |
| briefing generator | Feed TopicCluster[] into spec 01 topics |

Out of Scope

  • Persisting clusters as durable entities (clusters are per-briefing presentation).
  • Cross-window topic continuity ("this topic continued from yesterday").
  • Entity-level linking inside a cluster — spec 05.

Related

  • Consumes domains from domain-extraction.
  • Produces the topics array consumed by spec 01.
  • Spec 05 dedups entities that recur across clusters.
