# Signal topic clustering for the digest

## Context

A reference AI-inbox product groups awareness items into a handful of named topic
clusters (one per life-area, each holding several related messages) so the user
reads by theme instead of one flat list. SkyTwin has the raw materials — a
life-domain extractor and a Lifebook that tags signals to domains — but it has no
step that *clusters the current window of signals into presentable topic groups* for
the briefing. Today the "Signals" section is a flat importance-ranked list. Cluster
quality is also where SkyTwin can beat the reference product: the reference digest
visibly mis-files items (it put a developer-infra shutdown under personal finance),
and SkyTwin's twin model + domain extractor can do better.

## Current State

Verified 2026-06-06.

- `packages/policy-prompts/prompts/domain-extraction/v1.md` — extracts 3-10 stable
  life domains for a user (e.g. "Software Development", "Personal Finance"). This is
  a *profile-level* operation (the user's domains), not a per-window clustering of
  current signals.
- `packages/db/src/repositories/lifebook-repository.ts` — signals are tagged to
  domains (manually or by inference) and stored. Tagging ≠ clustering: there is no
  step that takes "the last N signals" and produces topic groups for display.
- `packages/twin-model/src/analyzers/cross-domain-analyzer.ts` — detects behavioral
  *traits* (e.g. cautious_spender, quick_responder), not topic groups.
- `packages/policy-prompts/prompts/briefing-prose/v1.md:49` — briefing sections are
  fixed (Meetings / Tasks / Signals); Signals is a flat list with no sub-grouping.

## Proposed Change

Add a clustering step that takes the briefing's input window of awareness signals and
groups them into named topic clusters, each scoped to a life domain, for the Topics
section produced by spec 01. Two-tier strategy (LLM + deterministic fallback), aligned
with the existing domain-extraction approach.

A cluster is a named group of related signals sharing a topic within a domain.
Synthetic example output:

```json
[
  { "domain": "finance", "title": "Subscriptions & billing", "signalRefs": ["s1","s4","s9"] },
  { "domain": "work",    "title": "Vendor onboarding",        "signalRefs": ["s2","s7"] }
]
```
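
Because the LLM tier returns this shape as JSON, a consumer would likely want to validate it before use. The following is a minimal sketch of such a runtime guard, not the project's actual validator: `ClusterShape`, `isClusterShape`, and `parseClusters` are hypothetical names, and the check covers only the three fields shown in the example.

```typescript
// Hypothetical shape mirroring the example above; not the project's actual type.
interface ClusterShape {
  domain: string;
  title: string;
  signalRefs: string[];
}

// Type guard: true only when `value` carries the three fields with the expected types.
function isClusterShape(value: unknown): value is ClusterShape {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const refs = v["signalRefs"];
  return (
    typeof v["domain"] === "string" &&
    typeof v["title"] === "string" &&
    Array.isArray(refs) &&
    refs.every((r: unknown) => typeof r === "string")
  );
}

// Parses raw JSON text; returns the clusters, or null if anything is malformed.
function parseClusters(raw: string): ClusterShape[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON at all
  }
  if (!Array.isArray(parsed)) return null;
  return parsed.every(isClusterShape) ? (parsed as ClusterShape[]) : null;
}
```

Returning `null` on any malformed element gives the caller a single signal to drop to the deterministic fallback instead of rendering a partial result.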

### Implementation Details

1. **New module** `packages/decision-engine/src/topic-clusterer.ts` (or a twin-model
   analyzer — place beside `cross-domain-analyzer.ts` if it needs profile context):
   ```ts
   export interface ClusterInput {
     signals: Array<{ ref: string; domain: string | null; subject: string; summary: string }>;
     knownDomains: string[];   // from domain-extraction, to anchor cluster domains
     maxClusters: number;      // default 8 (matches the reference product's shape)
   }
   export interface TopicCluster {
     domain: string;
     title: string;            // short human label
     signalRefs: string[];
     confidence: number;
   }
   export function clusterSignals(input: ClusterInput): TopicCluster[];
   ```
2. **Anchor to known domains** — clusters must map to a domain from
   `domain-extraction` output when one fits; only mint an "Other / Misc" bucket for
   genuine no-fit signals. This is the precision lever that beats the reference
   product's mis-filing: a signal's domain is decided against the user's *actual*
   domain set, not a generic taxonomy.
3. **Strategy**:
   - **LLM:** versioned `policy-prompts/topic-clustering/v1.md`, JSON-schema output
     (array of `TopicCluster`), input is the signal window + known domains.
   - **Deterministic fallback:** group by the `domain` already tagged on each signal
     (from Lifebook / situation-interpreter `extractDomain`,
     `situation-interpreter.ts:306-334`); title = domain name. Lower quality, zero
     cost, always available.
4. **Bounded output** — at most `maxClusters`; overflow merges into the
   lowest-confidence cluster or "More updates" (the reference product caps similarly).
   Log when merging happens (no silent truncation).
5. **Wiring** — runs before the briefing prompt; its `TopicCluster[]` populates spec
   01's `topics` array. Each cluster's `signalRefs` preserve citations.
6. **Stability** — same signals on a re-render should produce the same clusters
   by membership. Titles can still vary between LLM runs, so pin the clustering
   temperature low and key dedup on `domain` + `signalRefs`, not on title text.
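
Steps 3, 4, and 6 above can be sketched together for the deterministic tier. This is an illustration under stated assumptions, not the intended implementation: the helper names (`groupByTaggedDomain`, `capClusters`, `clusterKey`), the `"Other"` and `"More updates"` labels used as domain values, the placeholder confidence, and the injected `log` callback are all hypothetical. Of the two overflow options the spec allows, the sketch picks the "More updates" bucket.

```typescript
// Types restated from the spec so the sketch is self-contained.
interface Signal { ref: string; domain: string | null; subject: string; summary: string }
interface TopicCluster { domain: string; title: string; signalRefs: string[]; confidence: number }

const OTHER = "Other";               // assumed label for the no-fit bucket
const MORE_UPDATES = "More updates"; // assumed label for the overflow bucket
const FALLBACK_CONFIDENCE = 0.5;     // assumed placeholder; the spec defines no value

// Deterministic fallback: one cluster per tagged known domain, title = domain name.
// Untagged signals and signals tagged with an unknown domain go to "Other".
function groupByTaggedDomain(signals: Signal[], knownDomains: string[]): TopicCluster[] {
  const known = new Set(knownDomains);
  const byDomain = new Map<string, string[]>();
  for (const s of signals) {
    const domain = s.domain !== null && known.has(s.domain) ? s.domain : OTHER;
    const refs = byDomain.get(domain) ?? [];
    refs.push(s.ref);
    byDomain.set(domain, refs);
  }
  // Map preserves insertion order, so output order follows first appearance.
  return Array.from(byDomain.entries()).map(([domain, signalRefs]) => ({
    domain, title: domain, signalRefs, confidence: FALLBACK_CONFIDENCE,
  }));
}

// Cap: keep the (maxClusters - 1) highest-confidence clusters and merge the rest
// into a single overflow cluster, logging the merge so nothing is dropped silently.
function capClusters(
  clusters: TopicCluster[],
  maxClusters: number,
  log: (msg: string) => void,
): TopicCluster[] {
  if (maxClusters < 1) throw new RangeError("maxClusters must be >= 1");
  if (clusters.length <= maxClusters) return clusters;
  const ranked = [...clusters].sort((a, b) => b.confidence - a.confidence);
  const kept = ranked.slice(0, maxClusters - 1);
  const overflow = ranked.slice(maxClusters - 1);
  log(`merged ${overflow.length} clusters into "${MORE_UPDATES}"`);
  const merged: TopicCluster = {
    domain: MORE_UPDATES,
    title: MORE_UPDATES,
    signalRefs: ([] as string[]).concat(...overflow.map((c) => c.signalRefs)),
    confidence: Math.min(...overflow.map((c) => c.confidence)),
  };
  return [...kept, merged];
}

// Title-independent identity: same domain + same set of refs gives the same key.
function clusterKey(c: TopicCluster): string {
  return JSON.stringify([c.domain, [...c.signalRefs].sort()]);
}
```

Keeping the overflow bucket distinct from "Other" means a signal tagged with a known domain is never relabeled as no-fit by the cap, though whether that satisfies acceptance criterion 3 is a decision for the spec.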

## Acceptance Criteria

1. Given 10 awareness signals spanning 3 known domains, `clusterSignals` returns ≤8
   clusters, each mapped to a known domain or an explicit "Other" bucket.
2. No signal appears in two clusters; every input signal appears in exactly one.
3. A signal whose tagged domain matches a known domain is never placed in "Other".
4. Output never exceeds `maxClusters`; when input would exceed it, a merge occurs and
   is logged.
5. With no LLM, the deterministic fallback groups by tagged domain and still returns
   valid clusters.
6. Cluster `signalRefs` round-trip to the original signals (citations preserved).
7. Tests written and passing. No degradation of existing functionality.
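
Criteria 2 and 6 amount to a partition invariant: every input ref appears in exactly one cluster, and no cluster cites a ref that was not in the input. A minimal sketch of a checker the unit tests could share follows; `checkPartition` and `ClusterRefs` are hypothetical names, not existing project code.

```typescript
// Only the field the invariant needs; any TopicCluster satisfies this structurally.
interface ClusterRefs { signalRefs: string[] }

// Returns a list of violations; an empty list means the clusters partition the input.
function checkPartition(inputRefs: string[], clusters: ClusterRefs[]): string[] {
  const problems: string[] = [];
  const expected = new Set(inputRefs);
  const seen = new Set<string>();
  for (const c of clusters) {
    for (const ref of c.signalRefs) {
      if (!expected.has(ref)) problems.push(`unknown ref: ${ref}`);
      else if (seen.has(ref)) problems.push(`duplicate ref: ${ref}`);
      seen.add(ref);
    }
  }
  for (const ref of inputRefs) {
    if (!seen.has(ref)) problems.push(`missing ref: ${ref}`);
  }
  return problems;
}
```

Returning the violations rather than a boolean lets a failing test report which signal was dropped or double-filed.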

## Testing Plan

| Layer       | What                                                            | Count |
|-------------|-----------------------------------------------------------------|-------|
| Unit        | Partition completeness + no-overlap invariants                  | +3 |
| Unit        | Known-domain anchoring; "Other" only for no-fit                 | +3 |
| Unit        | maxClusters cap + merge logging                                 | +2 |
| Unit        | Deterministic fallback grouping by tagged domain                | +2 |
| Integration | clusterer → spec 01 `topics` payload with citations intact      | +2 |

## Rollback Plan

Additive and behind a flag: set `TOPIC_CLUSTERING=off` to disable it. With the flag
off, spec 01's Topics section falls back to a flat domain-tagged list (the
deterministic path), which is the current behavior. No schema or data changes to
reverse.
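
The flag check itself could look like the sketch below. The spec names only the `off` value, so the rest is an assumption: here an unset variable or any other value leaves clustering enabled, and the function name `isTopicClusteringEnabled` is hypothetical.

```typescript
// Assumption: only "off" (case-insensitive, surrounding whitespace ignored)
// disables clustering; an unset variable or any other value leaves it enabled.
function isTopicClusteringEnabled(env: Record<string, string | undefined>): boolean {
  return (env["TOPIC_CLUSTERING"] ?? "").trim().toLowerCase() !== "off";
}
```

Taking the environment as a parameter (for example `isTopicClusteringEnabled(process.env)` in Node) keeps the check testable without mutating global state.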

## Effort Estimate

- Clusterer module + types: ~3h
- LLM prompt + schema: ~3h
- Deterministic fallback + cap/merge: ~2h
- Wiring into briefing: ~2h
- Tests: ~3h

Total: ~1.5-2 days.

## Files Reference

| File | Change |
|------|--------|
| `packages/decision-engine/src/topic-clusterer.ts` | New: clustering + types |
| `packages/policy-prompts/prompts/topic-clustering/v1.md` | New: LLM template + schema |
| `packages/policy-prompts/prompts/domain-extraction/v1.md` | Reference (supplies known domains) |
| `packages/decision-engine/src/situation-interpreter.ts:306-334` | Reference (per-signal domain tag) |
| briefing generator | Feed `TopicCluster[]` into spec 01 `topics` |

## Out of Scope

- Persisting clusters as durable entities (clusters are per-briefing presentation).
- Cross-window topic continuity ("this topic continued from yesterday").
- Entity-level linking inside a cluster — spec 05.

## Related

- Consumes domains from `domain-extraction`.
- Produces the `topics` array consumed by spec 01.
- Spec 05 dedups entities that recur across clusters.

