test(twin-model): cover preference-archaeologist action keys, multi-group, expiry, supportingEvidence cap by jayzalowitz · Pull Request #92 · jayzalowitz/skytwin

jayzalowitz · 2026-04-27T03:26:12Z

Summary

Closes the twin-model preference test gap from the session-start audit. `preference-archaeologist` had 4 tests covering the basics (empty, single-group proposal, no re-propose explicit, confidence scaling) but left several behaviors uncovered.

Additions (+8 tests, 4 → 12 in preference-archaeologist):

`extractAction` fallback chain: `data.preference_key` wins when `data.action` is absent; `data.behavior` wins when neither of the first two are present; nothing recognizable → skip
Multi-group analysis: distinct `domain:action` pairs each emit their own proposal when each meets the 5-item threshold
Sub-threshold groups in a mixed batch: only the >=5 group emits
`supportingEvidence` cap: even with 25 items, the proposal lists 10
`expiresAt` window: 30 days from `detectedAt` (5s slack to absorb timing)
Non-explicit existing preferences do NOT block re-proposal — only `source === 'explicit'` is treated as user-confirmed

Net coverage: twin-model 97 → 105 tests.

(TrustTierEngine — also called out by the audit — already has 21 tests in `@skytwin/policy-engine`. Skipping.)

Test plan

`pnpm --filter @skytwin/twin-model test` — 105/105 (was 97, +8)
`pnpm --filter @skytwin/twin-model lint` — clean
`pnpm test` — 38/38 turbo tasks

🤖 Generated with Claude Code

…roup, expiry, and supportingEvidence cap Closes the twin-model preference test gap from the session-start audit. preference-archaeologist had 4 tests covering the basics (empty, single-group proposal, no re-propose explicit, confidence scaling) but left several behaviors uncovered. Additions (+8 tests, 4 → 12): - extractAction fallback chain: data.preference_key wins when data.action is absent; data.behavior wins when neither of the first two are present; nothing recognizable returns null and the item is skipped - Multi-group: distinct domain:action pairs each get their own proposal when each meets the 5-item threshold - Sub-threshold groups in a mixed batch: only the >=5 group emits - supportingEvidence cap: even with 25 items, the proposal lists 10 - expiresAt window: 30 days from detectedAt (5s slack to absorb timing) - Non-explicit existing preferences do NOT block re-proposal — only source === 'explicit' is treated as user-confirmed Net coverage: twin-model 97 → 105 tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR expands test coverage for PreferenceArchaeologist in @skytwin/twin-model, closing gaps identified in the session-start audit by exercising additional proposal-generation behaviors.

Changes:

Added tests for action-key extraction fallback (action → preference_key → behavior → skip).
Added tests for multi-group proposal generation and threshold filtering.
Added tests for proposal metadata rules (supportingEvidence cap, expiry window, explicit vs inferred preference blocking).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-27T03:28:10Z

+    const before = Date.now();
+    const proposals = await archaeologist.analyze('user_1');
+    expect(proposals).toHaveLength(1);
+    const proposal = proposals[0]!;
+    const expectedMs = 30 * 24 * 60 * 60 * 1000;
+    const actualMs = proposal.expiresAt.getTime() - before;
+    // Allow some slack for the few ms between `before` capture and proposal creation
+    expect(actualMs).toBeGreaterThan(expectedMs - 5_000);
+    expect(actualMs).toBeLessThan(expectedMs + 5_000);


The assertion here doesn’t actually verify “30 days from detectedAt” (it compares expiresAt to before, not to proposal.detectedAt). This can both mask a real bug (expiresAt drifting away from detectedAt) and introduce flakiness if analyze() ever takes >5s. Consider asserting against proposal.expiresAt - proposal.detectedAt (or using vi.useFakeTimers() / vi.setSystemTime() to make the timing deterministic).

Suggested change

const before = Date.now();

const proposals = await archaeologist.analyze('user_1');

expect(proposals).toHaveLength(1);

const proposal = proposals[0]!;

const expectedMs = 30 * 24 * 60 * 60 * 1000;

const actualMs = proposal.expiresAt.getTime() - before;

// Allow some slack for the few ms between `before` capture and proposal creation

expect(actualMs).toBeGreaterThan(expectedMs - 5_000);

expect(actualMs).toBeLessThan(expectedMs + 5_000);

const proposals = await archaeologist.analyze('user_1');

expect(proposals).toHaveLength(1);

const proposal = proposals[0]!;

const expectedMs = 30 * 24 * 60 * 60 * 1000;

const actualMs = proposal.expiresAt.getTime() - proposal.detectedAt.getTime();

expect(actualMs).toBe(expectedMs);

Copilot · 2026-04-27T03:28:10Z

+      evidence.push({
+        id: `ev_${i}`,
+        userId: 'user_1',
+        source: 'test',
+        type: 'observation',
+        data: { preference_key: 'protect_morning_focus' },
+        domain: 'calendar',
+        timestamp: new Date(),
+      });


These tests hand-construct TwinEvidence objects, duplicating fields already covered by the makeEvidence() helper at the top of the file. Using makeEvidence(..., overrides) would reduce repetition and make future schema changes (e.g., new required fields on TwinEvidence) less likely to break only some tests.

Captures the second half of the audit cleanup work that landed after #87: - #88 config/core test coverage (was 0/missing helpers) - #89 connectors gmail + calendar pure-logic tests (8 → 43) - #90 llm-client URL validation hardening (zone IDs, trailing dots, CGNAT) - #91 adapter-discovery factory shape check + manifest defaultConfig - #92 preference-archaeologist coverage (action keys, multi-group, expiry) - #93 worker SignalDeduper extraction + 11 tests Pure docs. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 27, 2026 03:26

Copilot started reviewing on behalf of jayzalowitz April 27, 2026 03:26 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

jayzalowitz merged commit 5437e57 into main Apr 27, 2026
12 checks passed

jayzalowitz deleted the jayzalowitz/twin-preference-tests branch April 27, 2026 03:40

jayzalowitz mentioned this pull request Apr 27, 2026

docs: extend [Unreleased] CHANGELOG for #88-#93 #94

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(twin-model): cover preference-archaeologist action keys, multi-group, expiry, supportingEvidence cap#92

test(twin-model): cover preference-archaeologist action keys, multi-group, expiry, supportingEvidence cap#92
jayzalowitz merged 1 commit into
mainfrom
jayzalowitz/twin-preference-tests

jayzalowitz commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jayzalowitz commented Apr 27, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants