Skip to content

test(twin-model): cover preference-archaeologist action keys, multi-group, expiry, supportingEvidence cap#92

Merged
jayzalowitz merged 1 commit into
mainfrom
jayzalowitz/twin-preference-tests
Apr 27, 2026
Merged

test(twin-model): cover preference-archaeologist action keys, multi-group, expiry, supportingEvidence cap#92
jayzalowitz merged 1 commit into
mainfrom
jayzalowitz/twin-preference-tests

Conversation

@jayzalowitz

Copy link
Copy Markdown
Owner

Summary

Closes the twin-model preference test gap from the session-start audit. `preference-archaeologist` had 4 tests covering the basics (empty, single-group proposal, no re-propose explicit, confidence scaling) but left several behaviors uncovered.

Additions (+8 tests, 4 → 12 in preference-archaeologist):

  • `extractAction` fallback chain: `data.preference_key` wins when `data.action` is absent; `data.behavior` wins when neither of the first two are present; nothing recognizable → skip
  • Multi-group analysis: distinct `domain:action` pairs each emit their own proposal when each meets the 5-item threshold
  • Sub-threshold groups in a mixed batch: only the >=5 group emits
  • `supportingEvidence` cap: even with 25 items, the proposal lists 10
  • `expiresAt` window: 30 days from `detectedAt` (5s slack to absorb timing)
  • Non-explicit existing preferences do NOT block re-proposal — only `source === 'explicit'` is treated as user-confirmed

Net coverage: twin-model 97 → 105 tests.

(TrustTierEngine — also called out by the audit — already has 21 tests in `@skytwin/policy-engine`. Skipping.)

Test plan

  • `pnpm --filter @skytwin/twin-model test` — 105/105 (was 97, +8)
  • `pnpm --filter @skytwin/twin-model lint` — clean
  • `pnpm test` — 38/38 turbo tasks

🤖 Generated with Claude Code

…roup, expiry, and supportingEvidence cap

Closes the twin-model preference test gap from the session-start audit.
preference-archaeologist had 4 tests covering the basics (empty,
single-group proposal, no re-propose explicit, confidence scaling) but
left several behaviors uncovered.

Additions (+8 tests, 4 → 12):

- extractAction fallback chain: data.preference_key wins when data.action
  is absent; data.behavior wins when neither of the first two are
  present; nothing recognizable returns null and the item is skipped
- Multi-group: distinct domain:action pairs each get their own proposal
  when each meets the 5-item threshold
- Sub-threshold groups in a mixed batch: only the >=5 group emits
- supportingEvidence cap: even with 25 items, the proposal lists 10
- expiresAt window: 30 days from detectedAt (5s slack to absorb timing)
- Non-explicit existing preferences do NOT block re-proposal — only
  source === 'explicit' is treated as user-confirmed

Net coverage: twin-model 97 → 105 tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 03:26

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands test coverage for PreferenceArchaeologist in @skytwin/twin-model, closing gaps identified in the session-start audit by exercising additional proposal-generation behaviors.

Changes:

  • Added tests for action-key extraction fallback (actionpreference_keybehavior → skip).
  • Added tests for multi-group proposal generation and threshold filtering.
  • Added tests for proposal metadata rules (supportingEvidence cap, expiry window, explicit vs inferred preference blocking).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +263 to +271
const before = Date.now();
const proposals = await archaeologist.analyze('user_1');
expect(proposals).toHaveLength(1);
const proposal = proposals[0]!;
const expectedMs = 30 * 24 * 60 * 60 * 1000;
const actualMs = proposal.expiresAt.getTime() - before;
// Allow some slack for the few ms between `before` capture and proposal creation
expect(actualMs).toBeGreaterThan(expectedMs - 5_000);
expect(actualMs).toBeLessThan(expectedMs + 5_000);

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion here doesn’t actually verify “30 days from detectedAt” (it compares expiresAt to before, not to proposal.detectedAt). This can both mask a real bug (expiresAt drifting away from detectedAt) and introduce flakiness if analyze() ever takes >5s. Consider asserting against proposal.expiresAt - proposal.detectedAt (or using vi.useFakeTimers() / vi.setSystemTime() to make the timing deterministic).

Suggested change
const before = Date.now();
const proposals = await archaeologist.analyze('user_1');
expect(proposals).toHaveLength(1);
const proposal = proposals[0]!;
const expectedMs = 30 * 24 * 60 * 60 * 1000;
const actualMs = proposal.expiresAt.getTime() - before;
// Allow some slack for the few ms between `before` capture and proposal creation
expect(actualMs).toBeGreaterThan(expectedMs - 5_000);
expect(actualMs).toBeLessThan(expectedMs + 5_000);
const proposals = await archaeologist.analyze('user_1');
expect(proposals).toHaveLength(1);
const proposal = proposals[0]!;
const expectedMs = 30 * 24 * 60 * 60 * 1000;
const actualMs = proposal.expiresAt.getTime() - proposal.detectedAt.getTime();
expect(actualMs).toBe(expectedMs);

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +172
evidence.push({
id: `ev_${i}`,
userId: 'user_1',
source: 'test',
type: 'observation',
data: { preference_key: 'protect_morning_focus' },
domain: 'calendar',
timestamp: new Date(),
});

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests hand-construct TwinEvidence objects, duplicating fields already covered by the makeEvidence() helper at the top of the file. Using makeEvidence(..., overrides) would reduce repetition and make future schema changes (e.g., new required fields on TwinEvidence) less likely to break only some tests.

Copilot uses AI. Check for mistakes.
@jayzalowitz jayzalowitz merged commit 5437e57 into main Apr 27, 2026
12 checks passed
@jayzalowitz jayzalowitz deleted the jayzalowitz/twin-preference-tests branch April 27, 2026 03:40
jayzalowitz added a commit that referenced this pull request Apr 27, 2026
Captures the second half of the audit cleanup work that landed after
#87:

- #88 config/core test coverage (was 0/missing helpers)
- #89 connectors gmail + calendar pure-logic tests (8 → 43)
- #90 llm-client URL validation hardening (zone IDs, trailing dots, CGNAT)
- #91 adapter-discovery factory shape check + manifest defaultConfig
- #92 preference-archaeologist coverage (action keys, multi-group, expiry)
- #93 worker SignalDeduper extraction + 11 tests

Pure docs.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants