# Entity cross-linking across signals

## Context

A reference AI-inbox product cross-references the same entity wherever it appears —
one person or one vendor shows up in a to-do, in a topic cluster, and in a related
thread, and the digest treats them as the same thing. It also sometimes *repeats* the
same entity in two places because its linking is imperfect, which is the failure mode
to avoid. SkyTwin deduplicates signals only by message identity; it has no notion that
two different signals refer to the same person, organization, or thing. Without entity
linking, the briefing can list the same underlying matter three times under three
clusters, and the twin can't reason about "everything touching entity X."

## Current State

Verified 2026-06-06.

- `apps/worker/src/signal-dedupe.ts:40-54` — `SignalDeduper` keys on
  `${signal.source}:${signal.id}` with a TTL (`DEFAULT_TTL_MS = 24h`,
  `signal-dedupe.ts:37`) and per-user capacity (`DEFAULT_MAX_PER_USER = 5000`,
  `signal-dedupe.ts:38`). This prevents re-ingesting the *same message*; it does
  nothing about two distinct messages referencing the same entity.
- No entity extraction (people / orgs / things) exists in the signal pipeline.
- Substrate that *could* hold entity links already exists — and more of it than first
  assumed (confirmed during review):
  - `@skytwin/memory-port` already defines `MemoryPort.recordEntity(KnowledgeEntity)`
    (`packages/memory-port/src/port.ts:59`) and a `KnowledgeEntity` interface
    (`id, userId, name, entityType, attributes, firstSeenAt, lastSeenAt`). So entity
    WRITE is already a contract method — this spec REUSES it, it does not invent it.
  - `@skytwin/memory-gbrain` (default) — vector + tsvector RRF over `brain_*` tables.
  - `@skytwin/memory-mempalace` — knowledge graph with temporal triples + episodic.
  - **Missing:** nothing extracts entities from signals to feed `recordEntity`, AND
    there is no READ-by-entity method (`getSignalsForEntity`) on `MemoryPort` yet.
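
To make the gap concrete, here is a minimal sketch of why message-identity keys cannot link entities. The `SignalRef` shape, the `dedupeKey` helper, and the sample addresses are hypothetical; only the `${signal.source}:${signal.id}` key format comes from the notes above.

```ts
// Hypothetical minimal signal shape; the real worker type may differ.
interface SignalRef {
  source: string;
  id: string;
  senderAddress?: string;
}

// Same key format as described above: `${signal.source}:${signal.id}`.
function dedupeKey(signal: SignalRef): string {
  return `${signal.source}:${signal.id}`;
}

const invoice: SignalRef = { source: 'gmail', id: 'm-101', senderAddress: 'billing@acme.example' };
const reminder: SignalRef = { source: 'gmail', id: 'm-102', senderAddress: 'billing@acme.example' };

// Distinct messages yield distinct keys, so both pass the deduper
// even though they come from the same sender.
const keysDiffer = dedupeKey(invoice) !== dedupeKey(reminder);
const sameSender = invoice.senderAddress === reminder.senderAddress;
```

Both flags evaluate to `true`: the deduper sees two unrelated messages, and nothing downstream records that they share a sender.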

## Proposed Change

Add an entity extraction + resolution step that pulls named entities (person, org,
thing/topic) from signals, resolves each to a stable `entityId` (linking mentions
across signals), and writes them to the memory backend's graph so retrieval and the
briefing can group by entity and avoid repeating one matter across clusters.

This is the heaviest spec — it introduces an entity store and a resolution problem
(when are two mentions the same entity?). Recommend landing it last.

### Implementation Details

1. **New module** `packages/decision-engine/src/entity-extractor.ts`:
   ```ts
   export type EntityKind = 'person' | 'org' | 'thing';
   export interface ExtractedEntity {
     kind: EntityKind;
     surface: string;        // text as it appeared ("the vendor", "Acme")
     normalized: string;     // canonical key for matching (lowercased, stripped)
     signalRef: string;
     confidence: number;
   }
   export function extractEntities(signal: {
     ref: string; subject: string; body: string; senderAddress?: string;
   }): ExtractedEntity[];
   ```
2. **Resolution** — `packages/decision-engine/src/entity-resolver.ts` maps an
   `ExtractedEntity` to a stable `entityId`:
   - **People:** prefer email address as the strong key (exact, no fuzzy needed);
     fall back to normalized display name only within a thread.
   - **Orgs/things:** normalized-string exact match first; fuzzy match (token
     overlap above a threshold) gated behind a `floorRatio`-style confidence bar so
     weak matches don't merge unrelated entities. Conservative: when unsure, mint a
     NEW entityId rather than wrongly merge (a false merge is worse than a false
     split — it corrupts the graph).
3. **Storage** — reuse the existing `MemoryPort.recordEntity(KnowledgeEntity)`
   (`packages/memory-port/src/port.ts:59`) to persist resolved entities; carry
   `entityId`, `kind`→`entityType`, `surface`/`signalRef`/provenance in
   `KnowledgeEntity.attributes` (or extend the interface if a first-class field reads
   cleaner — decide in the spike below). Works against gbrain + mempalace via the port;
   do NOT bind to one backend.
4. **Briefing dedup** — when spec 04 produces clusters, collapse signals that share a
   primary `entityId` into one cluster line with multiple citations, instead of
   repeating the matter across clusters. This is the concrete win: the reference
   product's "same thing listed twice" bug does not happen.
5. **Query surface** — `getSignalsForEntity(entityId)` does NOT exist on `MemoryPort`
   today; this spec ADDS it to the contract (`packages/memory-port/src/port.ts`) and
   implements it in both the gbrain and mempalace adapters. Read-only; no auto-actions.
   **Pre-work spike (1-2h):** confirm whether `getSignalsForEntity` belongs on
   `MemoryPort` vs. a separate entity-query service, and whether `KnowledgeEntity`
   needs a provenance field vs. stashing it in `attributes`. Lock both before coding.
6. **Provenance preserved** — entities extracted from `untrusted_external` signals
   are tagged as such; the graph records origin so downstream consumers never treat
   an inbound-asserted entity claim as trusted (safety invariant #8).
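
The resolution rules in step 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: the `Mention` shape (with `emailAddress` and `threadId`), the Jaccard token overlap, the `0.8` bar, and the `ent-N` id format are all hypothetical choices.

```ts
type EntityKind = 'person' | 'org' | 'thing';

// Hypothetical mention shape: `emailAddress` and `threadId` are assumed fields,
// not part of the `ExtractedEntity` interface in this spec.
interface Mention {
  kind: EntityKind;
  normalized: string;
  emailAddress?: string;
  threadId?: string;
}

// Jaccard overlap of whitespace-separated tokens, in [0, 1].
function tokenOverlap(a: string, b: string): number {
  const ta = new Set(a.split(/\s+/).filter(Boolean));
  const tb = new Set(b.split(/\s+/).filter(Boolean));
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => { if (tb.has(t)) shared += 1; });
  return shared / (ta.size + tb.size - shared);
}

class EntityResolver {
  private readonly byKey = new Map<string, string>(); // exact key -> entityId
  private readonly known: { kind: EntityKind; normalized: string; entityId: string }[] = [];
  private nextId = 1;

  // 0.8 is an illustrative bar, not a tuned value.
  constructor(private readonly fuzzyBar: number = 0.8) {}

  resolve(m: Mention): string {
    const key = this.exactKey(m);
    if (key !== undefined) {
      const hit = this.byKey.get(key);
      if (hit !== undefined) return hit;
    }
    // Fuzzy matching applies to orgs/things only; people never merge on name similarity.
    let entityId: string | undefined = m.kind === 'person' ? undefined : this.bestFuzzyMatch(m);
    if (entityId === undefined) {
      // Mint on doubt: a false split is cheaper than a false merge.
      entityId = `ent-${this.nextId++}`;
      if (m.kind !== 'person') {
        this.known.push({ kind: m.kind, normalized: m.normalized, entityId });
      }
    }
    if (key !== undefined) this.byKey.set(key, entityId);
    return entityId;
  }

  // People: email is the strong key; a bare name is only a key inside one thread.
  // A person with neither email nor thread gets no key, so it always mints.
  private exactKey(m: Mention): string | undefined {
    if (m.kind !== 'person') return `${m.kind}:${m.normalized}`;
    if (m.emailAddress) return `person:email:${m.emailAddress.trim().toLowerCase()}`;
    if (m.threadId) return `person:name:${m.threadId}:${m.normalized}`;
    return undefined;
  }

  private bestFuzzyMatch(m: Mention): string | undefined {
    let best: string | undefined;
    let bestScore = 0;
    for (const k of this.known) {
      if (k.kind !== m.kind) continue;
      const score = tokenOverlap(k.normalized, m.normalized);
      if (score >= this.fuzzyBar && score > bestScore) {
        best = k.entityId;
        bestScore = score;
      }
    }
    return best;
  }
}
```

In this sketch only newly minted orgs/things are added to the fuzzy candidate list, so a fuzzy-merged alias cannot pull in further near-matches transitively. That is one way to keep merges conservative; whether the real resolver needs it depends on how the bar is tuned.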

## Acceptance Criteria

1. Two distinct signals mentioning the same person (same email address) resolve to
   one `entityId`.
2. Two signals mentioning different people with similar display names but different
   addresses resolve to two distinct `entityId`s (no false merge).
3. An org name below the fuzzy-match confidence bar mints a new `entityId` rather
   than merging into a near-match.
4. `getSignalsForEntity(entityId)` returns all and only the signals linked to that
   entity.
5. In a briefing window where one matter spans 3 signals across 2 clusters, the
   matter renders once with 3 citations (no cross-cluster repetition).
6. Entity records carry the originating signal's provenance.
7. Works against both gbrain and mempalace via `MemoryPort` (no backend-specific
   code in the extractor/resolver).
8. Tests written and passing. No degradation of existing functionality.
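
Criterion 5 can be pictured with a small sketch of the collapse step. The `ClusteredSignal` and `BriefingLine` shapes and the `collapseByEntity` name are hypothetical, and keeping the first-seen summary while recording every contributing cluster is an assumption about how the merged line should read.

```ts
// Hypothetical input: one row per signal after spec 04 clustering.
interface ClusteredSignal {
  signalRef: string;
  cluster: string;
  primaryEntityId?: string; // unset when no entity was resolved
  summary: string;
}

interface BriefingLine {
  entityId?: string;
  summary: string;
  citations: string[]; // signal refs backing this line
  clusters: string[];  // every cluster that contributed a signal
}

// Collapse signals sharing a primary entityId into one line with many citations.
// Signals without an entityId are passed through as their own lines.
function collapseByEntity(signals: ClusteredSignal[]): BriefingLine[] {
  const lines: BriefingLine[] = [];
  const byEntity = new Map<string, BriefingLine>();

  for (const s of signals) {
    const existing =
      s.primaryEntityId !== undefined ? byEntity.get(s.primaryEntityId) : undefined;
    if (existing !== undefined) {
      existing.citations.push(s.signalRef);
      if (existing.clusters.indexOf(s.cluster) === -1) existing.clusters.push(s.cluster);
      continue;
    }
    const line: BriefingLine = {
      entityId: s.primaryEntityId,
      summary: s.summary,
      citations: [s.signalRef],
      clusters: [s.cluster],
    };
    lines.push(line);
    if (s.primaryEntityId !== undefined) byEntity.set(s.primaryEntityId, line);
  }
  return lines;
}

const briefing = collapseByEntity([
  { signalRef: 's1', cluster: 'billing', primaryEntityId: 'ent-1', summary: 'Acme invoice overdue' },
  { signalRef: 's2', cluster: 'vendors', primaryEntityId: 'ent-1', summary: 'Acme contract renewal' },
  { signalRef: 's3', cluster: 'billing', primaryEntityId: 'ent-1', summary: 'Acme payment reminder' },
  { signalRef: 's4', cluster: 'travel', summary: 'Flight check-in opens' },
]);
```

Here `briefing` has two lines: one for `ent-1` with three citations drawn from two clusters, and one for the unlinked travel signal.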

## Testing Plan

| Layer       | What                                                                | Count |
|-------------|---------------------------------------------------------------------|-------|
| Unit        | Extraction: person/org/thing from synthetic bodies                  | +5 |
| Unit        | Resolution: email-key merge; name-collision split; fuzzy-bar reject | +5 |
| Unit        | Conservative no-merge-when-unsure behavior                          | +2 |
| Integration | Write to MemoryPort → `getSignalsForEntity` round-trip (gbrain)     | +2 |
| Integration | Same against mempalace backend (port parity)                        | +2 |
| Integration | Briefing collapses cross-cluster repeated matter to one line        | +2 |
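
The round-trip rows could first be exercised against an in-memory fake before touching either backend. The sketch below is hedged throughout: the field types on `KnowledgeEntity`, the synchronous method signatures, the `EntityReadWritePort` name, and the convention of carrying `signalRef` in `attributes` are assumptions (the real `MemoryPort` contract may well be async and shaped differently).

```ts
// Field names follow the spec; the field TYPES here are assumed.
interface KnowledgeEntity {
  id: string;
  userId: string;
  name: string;
  entityType: string;
  attributes: Record<string, unknown>;
  firstSeenAt: Date;
  lastSeenAt: Date;
}

// Hypothetical slice of the port: existing write plus the proposed read.
interface EntityReadWritePort {
  recordEntity(entity: KnowledgeEntity): void;
  getSignalsForEntity(entityId: string): string[];
}

class InMemoryEntityPort implements EntityReadWritePort {
  private readonly signalsByEntity = new Map<string, string[]>();

  recordEntity(entity: KnowledgeEntity): void {
    // Assumed convention: the originating signal ref rides in attributes.signalRef.
    const ref = entity.attributes['signalRef'];
    if (typeof ref !== 'string') return;
    const refs = this.signalsByEntity.get(entity.id) ?? [];
    if (refs.indexOf(ref) === -1) refs.push(ref);
    this.signalsByEntity.set(entity.id, refs);
  }

  getSignalsForEntity(entityId: string): string[] {
    // Return a copy so callers cannot mutate the store.
    return (this.signalsByEntity.get(entityId) ?? []).slice();
  }
}

// Test helper: build an entity record for one mention.
function mention(id: string, name: string, signalRef: string): KnowledgeEntity {
  const now = new Date();
  return {
    id,
    userId: 'user-1',
    name,
    entityType: 'org',
    attributes: { signalRef },
    firstSeenAt: now,
    lastSeenAt: now,
  };
}

const port = new InMemoryEntityPort();
port.recordEntity(mention('ent-1', 'Acme', 'gmail:m-101'));
port.recordEntity(mention('ent-1', 'Acme', 'gmail:m-102'));
port.recordEntity(mention('ent-2', 'Northwind', 'gmail:m-103'));
```

With this fake, `port.getSignalsForEntity('ent-1')` returns the two Acme refs and nothing from `ent-2`, which is the "all and only" property from acceptance criterion 4.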

## Rollback Plan

Flagged (`ENTITY_LINKING=off`). With it off, no entities are written and the briefing
keeps spec 04's behavior (possible cross-cluster repetition, i.e. parity with the
reference product). Entity rows are additive in the memory backend; orphaned rows are
harmless and can be left or swept. Resolution false-merges are the main risk — the
conservative "mint-on-doubt" policy bounds blast radius; a bad merge affects only the
two entities involved and is reversible by re-running extraction after tuning the bar.
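
A minimal sketch of the flag gate, assuming the flag is read from an environment-style map. Only the `off` value is named in this spec, so treating an unset or unrecognized value as enabled is an assumption, as are the helper names `entityLinkingEnabled` and `runEntityLinking`.

```ts
type Env = Record<string, string | undefined>;

// Assumption: anything other than an explicit `off` leaves linking enabled.
function entityLinkingEnabled(env: Env): boolean {
  const raw = env['ENTITY_LINKING'];
  return (raw ?? '').trim().toLowerCase() !== 'off';
}

// Runs the linking step only when the flag allows it.
// Returns the number of entities written (0 when disabled).
function runEntityLinking<T>(
  env: Env,
  signals: T[],
  linkEntities: (signals: T[]) => number,
): number {
  if (!entityLinkingEnabled(env)) return 0; // flag off: write nothing
  return linkEntities(signals);
}
```

In a Node worker the first argument would presumably be `process.env`; passing it in explicitly keeps the gate easy to test.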

## Effort Estimate

- Entity extractor: ~4h
- Resolver (keys + fuzzy bar + conservative policy): ~6h
- MemoryPort surface (wire up existing `recordEntity`; add `getSignalsForEntity` in both adapters): ~4h
- Briefing collapse integration: ~3h
- Tests (incl. dual-backend): ~6h

Total: ~23h (~3 days), not counting the 1-2h pre-work spike. Largest spec in the set; sequence last.

## Files Reference

| File | Change |
|------|--------|
| `packages/decision-engine/src/entity-extractor.ts` | New: entity extraction |
| `packages/decision-engine/src/entity-resolver.ts` | New: mention → stable entityId |
| `packages/memory-port/*` | Add `getSignalsForEntity` (entity read) to the `MemoryPort` contract; reuse the existing `recordEntity` for writes |
| `apps/worker/src/signal-dedupe.ts` | Reference (this is message-dedup; entity-link is separate) |
| briefing generator + spec 04 clusterer | Collapse by primary entityId |

## Out of Scope

- A full relationship graph between entities ("X works at Y"). Mention-linking only.
- Cross-user entity sharing (entities are per-user).
- Coreference resolution for pronouns/aliases beyond thread scope.

## Related

- Builds on the memory backends (`@skytwin/memory-gbrain`, `@skytwin/memory-mempalace`).
- Dedups across spec 04 clusters; sequenced after 01-04.

