Entity cross-linking across signals #478

@jayzalowitz

Context

A reference AI-inbox product cross-references the same entity wherever it appears —
one person or one vendor shows up in a to-do, in a topic cluster, and in a related
thread, and the digest treats them as the same thing. It also sometimes repeats the
same entity in two places because its linking is imperfect, which is the failure mode
to avoid. SkyTwin deduplicates signals only by message identity; it has no notion that
two different signals refer to the same person, organization, or thing. Without entity
linking, the briefing can list the same underlying matter three times under three
clusters, and the twin can't reason about "everything touching entity X."

Current State

Verified 2026-06-06.

  • apps/worker/src/signal-dedupe.ts:40-54: SignalDeduper keys on
    ${signal.source}:${signal.id} with a TTL (DEFAULT_TTL_MS = 24h,
    signal-dedupe.ts:37) and per-user capacity (DEFAULT_MAX_PER_USER = 5000,
    signal-dedupe.ts:38). This prevents re-ingesting the same message; it does
    nothing about two distinct messages referencing the same entity.
  • No entity extraction (people / orgs / things) exists in the signal pipeline.
  • Substrate that could hold entity links already exists — and more of it than first
    assumed (confirmed during review):
    • @skytwin/memory-port already defines MemoryPort.recordEntity(KnowledgeEntity)
      (packages/memory-port/src/port.ts:59) and a KnowledgeEntity interface
      (id, userId, name, entityType, attributes, firstSeenAt, lastSeenAt). So entity
      WRITE is already a contract method — this spec REUSES it, it does not invent it.
    • @skytwin/memory-gbrain (default) — vector + tsvector RRF over brain_* tables.
    • @skytwin/memory-mempalace — knowledge graph with temporal triples + episodic.
    • Missing: nothing extracts entities from signals to feed recordEntity, AND
      there is no READ-by-entity method (getSignalsForEntity) on MemoryPort yet.
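To make the gap concrete, here is a minimal sketch of message-identity keying. Only the `${signal.source}:${signal.id}` key format comes from the description above; the `InboxSignal` shape, the `dedupeKey` helper, and the sample values are illustrative assumptions, not the actual SignalDeduper code.

```typescript
// Hypothetical signal shape; only `source` and `id` are named in the spec text.
interface InboxSignal {
  source: string;
  id: string;
  senderAddress?: string;
}

// Mirrors the key format described above: `${signal.source}:${signal.id}`.
function dedupeKey(signal: InboxSignal): string {
  return `${signal.source}:${signal.id}`;
}

const invoice: InboxSignal = {
  source: "gmail",
  id: "m-101",
  senderAddress: "billing@acme.example",
};
const reminder: InboxSignal = {
  source: "gmail",
  id: "m-102",
  senderAddress: "billing@acme.example",
};

// Distinct messages get distinct keys, so both pass message-level dedup
// even though they come from the same sender.
const sameMessage = dedupeKey(invoice) === dedupeKey(reminder);
const sameSender = invoice.senderAddress === reminder.senderAddress;
console.log({ sameMessage, sameSender });
```

Both signals survive dedup as intended, but nothing records that they concern the same sender; that is the link this spec adds.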

Proposed Change

Add an entity extraction + resolution step that pulls named entities (person, org,
thing/topic) from signals, resolves each to a stable entityId (linking mentions
across signals), and writes them to the memory backend's graph so retrieval and the
briefing can group by entity and avoid repeating one matter across clusters.

This is the heaviest spec — it introduces an entity store and a resolution problem
(when are two mentions the same entity?). Recommend landing it last.

Implementation Details

  1. New module packages/decision-engine/src/entity-extractor.ts:
    export type EntityKind = 'person' | 'org' | 'thing';
    export interface ExtractedEntity {
      kind: EntityKind;
      surface: string;        // text as it appeared ("the vendor", "Acme")
      normalized: string;     // canonical key for matching (lowercased, stripped)
      signalRef: string;
      confidence: number;
    }
    export function extractEntities(signal: {
      ref: string; subject: string; body: string; senderAddress?: string;
    }): ExtractedEntity[];
  2. Resolution: packages/decision-engine/src/entity-resolver.ts maps an
    ExtractedEntity to a stable entityId:
    • People: prefer email address as the strong key (exact, no fuzzy needed);
      fall back to normalized display name only within a thread.
    • Orgs/things: normalized-string exact match first; fuzzy match (token
      overlap above a threshold) gated behind a floorRatio-style confidence bar so
      weak matches don't merge unrelated entities. Conservative: when unsure, mint a
      NEW entityId rather than wrongly merge (a false merge is worse than a false
      split — it corrupts the graph).
  3. Storage — reuse the existing MemoryPort.recordEntity(KnowledgeEntity)
    (packages/memory-port/src/port.ts:59) to persist resolved entities; carry
    entityId, kind → entityType, surface/signalRef/provenance in
    KnowledgeEntity.attributes (or extend the interface if a first-class field reads
    cleaner — decide in the spike below). Works against gbrain + mempalace via the port;
    do NOT bind to one backend.
  4. Briefing dedup — when spec 04 produces clusters, collapse signals that share a
    primary entityId into one cluster line with multiple citations, instead of
    repeating the matter across clusters. This is the concrete win: the reference
    product's "same thing listed twice" bug does not happen.
  5. Query surface: getSignalsForEntity(entityId) does NOT exist on MemoryPort
    today; this spec ADDS it to the contract (packages/memory-port/src/port.ts) and
    implements it in both the gbrain and mempalace adapters. Read-only; no auto-actions.
    Pre-work spike (1-2h): confirm whether getSignalsForEntity belongs on
    MemoryPort vs. a separate entity-query service, and whether KnowledgeEntity
    needs a provenance field vs. stashing it in attributes. Lock both before coding.
  6. Provenance preserved — entities extracted from untrusted_external signals
    are tagged as such; the graph records origin so downstream consumers never treat
    an inbound-asserted entity claim as trusted (safety invariant #8).
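The resolution rules in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned entity-resolver.ts: the `Mention` shape, the `EntityResolver` class, the choice of Jaccard token overlap, and the 0.8 default bar are all hypothetical, and the thread-scoped display-name fallback is left out.

```typescript
type EntityKind = "person" | "org" | "thing";

// Hypothetical mention shape for this sketch; `emailAddress` is an assumed field.
interface Mention {
  kind: EntityKind;
  normalized: string; // canonical key, already lowercased and stripped
  emailAddress?: string;
}

interface KnownEntity {
  id: string;
  kind: EntityKind;
  key: string; // email address for people, normalized name otherwise
}

function uniqueTokens(s: string): string[] {
  const seen: string[] = [];
  for (const t of s.split(/\s+/)) {
    if (t.length > 0 && seen.indexOf(t) === -1) seen.push(t);
  }
  return seen;
}

// Jaccard overlap of whitespace-separated tokens. The spec only asks for
// "token overlap above a threshold"; choosing Jaccard is an assumption.
function tokenOverlap(a: string, b: string): number {
  const ta = uniqueTokens(a);
  const tb = uniqueTokens(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const shared = ta.filter((t) => tb.indexOf(t) !== -1).length;
  return shared / (ta.length + tb.length - shared);
}

class EntityResolver {
  private known: KnownEntity[] = [];
  private nextId = 1;
  private fuzzyBar: number;

  // 0.8 is a placeholder bar, not a value taken from the spec.
  constructor(fuzzyBar: number = 0.8) {
    this.fuzzyBar = fuzzyBar;
  }

  resolve(m: Mention): string {
    if (m.kind === "person") {
      // No address: mint rather than guess from the display name.
      if (!m.emailAddress) return this.newId();
      const address = m.emailAddress.trim().toLowerCase();
      return this.findExact("person", address) ?? this.remember("person", address);
    }
    const exact = this.findExact(m.kind, m.normalized);
    if (exact !== undefined) return exact;
    // Fuzzy pass: best same-kind candidate, accepted only at or above the bar.
    let bestId: string | undefined;
    let bestScore = 0;
    for (const e of this.known) {
      if (e.kind !== m.kind) continue;
      const score = tokenOverlap(e.key, m.normalized);
      if (score > bestScore) {
        bestScore = score;
        bestId = e.id;
      }
    }
    if (bestId !== undefined && bestScore >= this.fuzzyBar) return bestId;
    return this.remember(m.kind, m.normalized); // mint-on-doubt
  }

  private findExact(kind: EntityKind, key: string): string | undefined {
    for (const e of this.known) {
      if (e.kind === kind && e.key === key) return e.id;
    }
    return undefined;
  }

  private remember(kind: EntityKind, key: string): string {
    const id = this.newId();
    this.known.push({ id, kind, key });
    return id;
  }

  private newId(): string {
    return "ent-" + this.nextId++;
  }
}
```

With the default bar, "acme corp" and "acme labs" share one of three distinct tokens (overlap of about 0.33), so they stay separate entities; two person mentions merge only when their lowercased addresses are identical.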

Acceptance Criteria

  1. Two distinct signals mentioning the same person (same email address) resolve to
    one entityId.
  2. Two signals mentioning different people with similar display names but different
    addresses resolve to two distinct entityIds (no false merge).
  3. An org name below the fuzzy-match confidence bar mints a new entityId rather
    than merging into a near-match.
  4. getSignalsForEntity(entityId) returns all and only the signals linked to that
    entity.
  5. In a briefing window where one matter spans 3 signals across 2 clusters, the
    matter renders once with 3 citations (no cross-cluster repetition).
  6. Entity records carry the originating signal's provenance.
  7. Works against both gbrain and mempalace via MemoryPort (no backend-specific
    code in the extractor/resolver).
  8. Tests written and passing. No degradation of existing functionality.
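Criterion 5 can be pictured with a small sketch of the collapse step. The `ClusteredSignal` and `BriefingLine` shapes and the `collapseByEntity` helper are hypothetical, since spec 04's actual cluster output is not shown here; keeping the first signal's summary as the line text is also an assumption.

```typescript
// Hypothetical shapes: spec 04's real cluster output is not shown in this issue.
interface ClusteredSignal {
  signalRef: string;
  cluster: string;
  primaryEntityId?: string; // undefined when no entity was resolved
  summary: string;
}

interface BriefingLine {
  summary: string;
  citations: string[];
}

// Signals sharing a primary entityId collapse into one line with several
// citations; signals without an entity each keep their own line.
function collapseByEntity(signals: ClusteredSignal[]): BriefingLine[] {
  const lines: BriefingLine[] = [];
  const lineIndexByEntity: { [entityId: string]: number } = {};
  for (const s of signals) {
    const id = s.primaryEntityId;
    if (id !== undefined && Object.prototype.hasOwnProperty.call(lineIndexByEntity, id)) {
      lines[lineIndexByEntity[id]].citations.push(s.signalRef);
      continue;
    }
    if (id !== undefined) lineIndexByEntity[id] = lines.length;
    lines.push({ summary: s.summary, citations: [s.signalRef] });
  }
  return lines;
}

// One matter spanning 3 signals across 2 clusters, plus an unrelated signal.
const briefing = collapseByEntity([
  { signalRef: "gmail:m-101", cluster: "billing", primaryEntityId: "ent-1", summary: "Acme invoice overdue" },
  { signalRef: "gmail:m-102", cluster: "billing", primaryEntityId: "ent-1", summary: "Acme payment reminder" },
  { signalRef: "gmail:m-103", cluster: "vendors", primaryEntityId: "ent-1", summary: "Acme contract renewal" },
  { signalRef: "gmail:m-200", cluster: "travel", summary: "Flight check-in opens" },
]);
console.log(briefing);
```

Here the three Acme signals render as a single line with three citations, and the travel signal stays on its own line.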

Testing Plan

| Layer | What | Count |
| --- | --- | --- |
| Unit | Extraction: person/org/thing from synthetic bodies | +5 |
| Unit | Resolution: email-key merge; name-collision split; fuzzy-bar reject | +5 |
| Unit | Conservative no-merge-when-unsure behavior | +2 |
| Integration | Write to MemoryPort → getSignalsForEntity round-trip (gbrain) | +2 |
| Integration | Same against mempalace backend (port parity) | +2 |
| Integration | Briefing collapses cross-cluster repeated matter to one line | +2 |
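Before wiring the real backends, the write → read round-trip could be exercised against an in-memory fake. The sketch below is an assumption-heavy stand-in: the KnowledgeEntity field names come from the Current State section, but the field types, the `signalRefs`/`provenance` layout inside `attributes`, the synchronous signatures, and the `EntityPort` and `InMemoryEntityPort` names are hypothetical (the real port methods may well be async).

```typescript
// Field names follow the KnowledgeEntity description; types are assumed.
interface KnowledgeEntity {
  id: string;
  userId: string;
  name: string;
  entityType: string;
  attributes: { signalRefs: string[]; provenance: string };
  firstSeenAt: Date;
  lastSeenAt: Date;
}

// Narrow, synchronous stand-in for the two port methods this spec touches.
interface EntityPort {
  recordEntity(entity: KnowledgeEntity): void;
  getSignalsForEntity(entityId: string): string[];
}

class InMemoryEntityPort implements EntityPort {
  private entities: KnowledgeEntity[] = [];

  recordEntity(entity: KnowledgeEntity): void {
    for (const existing of this.entities) {
      if (existing.id !== entity.id) continue;
      // Re-recording a known entity merges its signal refs without duplicates.
      for (const ref of entity.attributes.signalRefs) {
        if (existing.attributes.signalRefs.indexOf(ref) === -1) {
          existing.attributes.signalRefs.push(ref);
        }
      }
      if (entity.lastSeenAt.getTime() > existing.lastSeenAt.getTime()) {
        existing.lastSeenAt = entity.lastSeenAt;
      }
      return;
    }
    // Copy the refs so later caller-side mutation cannot leak into the store.
    this.entities.push({
      ...entity,
      attributes: { ...entity.attributes, signalRefs: entity.attributes.signalRefs.slice() },
    });
  }

  getSignalsForEntity(entityId: string): string[] {
    for (const e of this.entities) {
      if (e.id === entityId) return e.attributes.signalRefs.slice();
    }
    return [];
  }
}

// Helper for building sample records; all values are made up.
function sampleEntity(id: string, label: string, signalRef: string, seenAt: Date): KnowledgeEntity {
  return {
    id,
    userId: "user-1",
    name: label,
    entityType: "org",
    attributes: { signalRefs: [signalRef], provenance: "untrusted_external" },
    firstSeenAt: seenAt,
    lastSeenAt: seenAt,
  };
}

const fakePort = new InMemoryEntityPort();
fakePort.recordEntity(sampleEntity("ent-1", "Acme", "gmail:m-101", new Date("2026-06-01T00:00:00Z")));
fakePort.recordEntity(sampleEntity("ent-1", "Acme", "gmail:m-102", new Date("2026-06-02T00:00:00Z")));
fakePort.recordEntity(sampleEntity("ent-2", "Globex", "gmail:m-200", new Date("2026-06-02T00:00:00Z")));
console.log(fakePort.getSignalsForEntity("ent-1"));
```

The same assertions ("all and only the linked signals") would then be repeated against the gbrain and mempalace adapters to check port parity.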

Rollback Plan

Gated behind a flag: ENTITY_LINKING=off disables it. With it off, no entities are written and the briefing
keeps spec 04's behavior (possible cross-cluster repetition, i.e. parity with the
reference product). Entity rows are additive in the memory backend; orphaned rows are
harmless and can be left or swept. Resolution false-merges are the main risk — the
conservative "mint-on-doubt" policy bounds blast radius; a bad merge affects only the
two entities involved and is reversible by re-running extraction after tuning the bar.
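A minimal sketch of the flag gate, assuming the flag is read from an environment-style map. ENTITY_LINKING is the flag named above; treating only the literal "off" as disabled (and an unset flag as enabled) is an assumption about its semantics, and `entityLinkingEnabled` and `linkEntities` are hypothetical names.

```typescript
type Env = { [name: string]: string | undefined };

// Assumption: only "off" (case-insensitive) disables the feature.
function entityLinkingEnabled(env: Env): boolean {
  const raw = (env["ENTITY_LINKING"] || "").trim().toLowerCase();
  return raw !== "off";
}

// Hypothetical pipeline step: `write` stands in for the entity write path.
// Returns how many signals were handed to it.
function linkEntities(env: Env, signalRefs: string[], write: (ref: string) => void): number {
  if (!entityLinkingEnabled(env)) return 0; // flag off: nothing is written
  for (const ref of signalRefs) write(ref);
  return signalRefs.length;
}

const writesWhenOff: string[] = [];
const writesWhenOn: string[] = [];
linkEntities({ ENTITY_LINKING: "off" }, ["gmail:m-101", "gmail:m-102"], (ref) => writesWhenOff.push(ref));
linkEntities({ ENTITY_LINKING: "on" }, ["gmail:m-101", "gmail:m-102"], (ref) => writesWhenOn.push(ref));
console.log({ writesWhenOff, writesWhenOn });
```

Because the gate sits in front of the write path, turning the flag off stops new entity rows without touching rows already stored.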

Effort Estimate

  • Entity extractor: ~4h
  • Resolver (keys + fuzzy bar + conservative policy): ~6h
  • MemoryPort write/read surface: ~4h
  • Briefing collapse integration: ~3h
  • Tests (incl. dual-backend): ~6h

Total: ~3 days. Largest spec in the set; sequence last.

Files Reference

| File | Change |
| --- | --- |
| packages/decision-engine/src/entity-extractor.ts | New: entity extraction |
| packages/decision-engine/src/entity-resolver.ts | New: mention → stable entityId |
| packages/memory-port/* | Add entity write/read to the MemoryPort contract |
| apps/worker/src/signal-dedupe.ts | Reference (this is message-dedup; entity-link is separate) |
| briefing generator + spec 04 clusterer | Collapse by primary entityId |

Out of Scope

  • A full relationship graph between entities ("X works at Y"). Mention-linking only.
  • Cross-user entity sharing (entities are per-user).
  • Coreference resolution beyond thread scope for pronouns/aliases.

Related

  • Builds on the memory backends (@skytwin/memory-gbrain, @skytwin/memory-mempalace).
  • Dedups across spec 04 clusters; sequenced after 01-04.
