test(twin-model): cover preference-archaeologist action keys, multi-group, expiry, supportingEvidence cap#92
Conversation
…roup, expiry, and supportingEvidence cap Closes the twin-model preference test gap from the session-start audit. preference-archaeologist had 4 tests covering the basics (empty, single-group proposal, no re-propose explicit, confidence scaling) but left several behaviors uncovered. Additions (+8 tests, 4 → 12): - extractAction fallback chain: data.preference_key wins when data.action is absent; data.behavior wins when neither of the first two are present; nothing recognizable returns null and the item is skipped - Multi-group: distinct domain:action pairs each get their own proposal when each meets the 5-item threshold - Sub-threshold groups in a mixed batch: only the >=5 group emits - supportingEvidence cap: even with 25 items, the proposal lists 10 - expiresAt window: 30 days from detectedAt (5s slack to absorb timing) - Non-explicit existing preferences do NOT block re-proposal — only source === 'explicit' is treated as user-confirmed Net coverage: twin-model 97 → 105 tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR expands test coverage for PreferenceArchaeologist in @skytwin/twin-model, closing gaps identified in the session-start audit by exercising additional proposal-generation behaviors.
Changes:
- Added tests for action-key extraction fallback (
action→preference_key→behavior→ skip). - Added tests for multi-group proposal generation and threshold filtering.
- Added tests for proposal metadata rules (supportingEvidence cap, expiry window, explicit vs inferred preference blocking).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const before = Date.now(); | ||
| const proposals = await archaeologist.analyze('user_1'); | ||
| expect(proposals).toHaveLength(1); | ||
| const proposal = proposals[0]!; | ||
| const expectedMs = 30 * 24 * 60 * 60 * 1000; | ||
| const actualMs = proposal.expiresAt.getTime() - before; | ||
| // Allow some slack for the few ms between `before` capture and proposal creation | ||
| expect(actualMs).toBeGreaterThan(expectedMs - 5_000); | ||
| expect(actualMs).toBeLessThan(expectedMs + 5_000); |
There was a problem hiding this comment.
The assertion here doesn’t actually verify “30 days from detectedAt” (it compares expiresAt to before, not to proposal.detectedAt). This can both mask a real bug (expiresAt drifting away from detectedAt) and introduce flakiness if analyze() ever takes >5s. Consider asserting against proposal.expiresAt - proposal.detectedAt (or using vi.useFakeTimers() / vi.setSystemTime() to make the timing deterministic).
| const before = Date.now(); | |
| const proposals = await archaeologist.analyze('user_1'); | |
| expect(proposals).toHaveLength(1); | |
| const proposal = proposals[0]!; | |
| const expectedMs = 30 * 24 * 60 * 60 * 1000; | |
| const actualMs = proposal.expiresAt.getTime() - before; | |
| // Allow some slack for the few ms between `before` capture and proposal creation | |
| expect(actualMs).toBeGreaterThan(expectedMs - 5_000); | |
| expect(actualMs).toBeLessThan(expectedMs + 5_000); | |
| const proposals = await archaeologist.analyze('user_1'); | |
| expect(proposals).toHaveLength(1); | |
| const proposal = proposals[0]!; | |
| const expectedMs = 30 * 24 * 60 * 60 * 1000; | |
| const actualMs = proposal.expiresAt.getTime() - proposal.detectedAt.getTime(); | |
| expect(actualMs).toBe(expectedMs); |
| evidence.push({ | ||
| id: `ev_${i}`, | ||
| userId: 'user_1', | ||
| source: 'test', | ||
| type: 'observation', | ||
| data: { preference_key: 'protect_morning_focus' }, | ||
| domain: 'calendar', | ||
| timestamp: new Date(), | ||
| }); |
There was a problem hiding this comment.
These tests hand-construct TwinEvidence objects, duplicating fields already covered by the makeEvidence() helper at the top of the file. Using makeEvidence(..., overrides) would reduce repetition and make future schema changes (e.g., new required fields on TwinEvidence) less likely to break only some tests.
Captures the second half of the audit cleanup work that landed after #87: - #88 config/core test coverage (was 0/missing helpers) - #89 connectors gmail + calendar pure-logic tests (8 → 43) - #90 llm-client URL validation hardening (zone IDs, trailing dots, CGNAT) - #91 adapter-discovery factory shape check + manifest defaultConfig - #92 preference-archaeologist coverage (action keys, multi-group, expiry) - #93 worker SignalDeduper extraction + 11 tests Pure docs. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Closes the twin-model preference test gap from the session-start audit. `preference-archaeologist` had 4 tests covering the basics (empty, single-group proposal, no re-propose explicit, confidence scaling) but left several behaviors uncovered.
Additions (+8 tests, 4 → 12 in preference-archaeologist):
Net coverage: twin-model 97 → 105 tests.
(TrustTierEngine — also called out by the audit — already has 21 tests in `@skytwin/policy-engine`. Skipping.)
Test plan
🤖 Generated with Claude Code