Tracking parent: #80171
Depends on: Phase 1 #80172 (drift classifier)
Goal
Eva's request: "loop 3 agents on difficult scenarios for testing based on a real jsonl session history." The harness takes captured session transcripts, replays them through fresh sessions on each runtime, and diffs the trajectories.
This catches the regression class where a synthetic prompt looks fine but a real long-running session — with its accumulated context, tool-call history, and edge-case branching — exposes drift.
Scope
Curated jsonl fixture set, not real-customer transcripts. Real-customer transcript ingestion is a separate concern (PII, consent, retention). This PR ships with a small curated set checked into the repo (3–5 transcripts the maintainers approve) so the harness can land without depending on a customer-data pipeline.
Concrete deliverables
Code
- New extensions/qa-lab/src/jsonl-replay.ts: reads a directory of jsonl session transcripts, extracts user-turn boundaries, and replays each transcript through both runtimes via the Phase 1 orchestrator. Public API:
    export type JsonlReplayInput = {
      directory: string;
      runtimePair: ["pi", "codex"];
      providerMode: "mock-openai" | "live-frontier";
    };
    export type JsonlReplayResult = {
      transcripts: Array<{
        transcriptPath: string;
        userTurnCount: number;
        cells: { pi: RuntimeParityCell[]; codex: RuntimeParityCell[] }; // one per turn
        drift: Array<RuntimeParityResult["drift"]>; // one per turn
        firstDriftAtTurn?: number; // for triage
      }>;
    };
- New extensions/qa-lab/src/jsonl-replay.test.ts: unit tests for the user-turn extraction, plus an integration test against the curated fixtures.
- New qa/scenarios/jsonl-replay/<curated-name>.jsonl: 3–5 maintainer-approved fixtures. Strip PII, fix any external dependencies (URLs, channel ids).
- Extend extensions/qa-lab/src/cli.ts: add qa jsonl-replay --runtime-pair pi,codex --transcripts qa/scenarios/jsonl-replay.
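The user-turn extraction step might look roughly like the sketch below. It is illustrative only: the transcript line shape (`role`, `content`) and the names `TranscriptLine`, `UserTurn`, `parseLine`, and `extractUserTurns` are assumptions, not the real session schema or the module's actual API. It works on the file contents as a string so it stays independent of how the directory is read.

```typescript
// Hypothetical transcript line shape; the real session schema may differ.
type TranscriptLine = {
  role: "system" | "user" | "assistant" | "tool";
  content?: string;
};

// Hypothetical boundary record for one replayable user turn.
type UserTurn = {
  turnIndex: number; // 0-based position among user turns
  lineIndex: number; // 0-based line number in the jsonl file
  content: string;
};

// Parse one jsonl line, returning undefined for anything unusable.
function parseLine(raw: string): TranscriptLine | undefined {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value === "object" && value !== null && "role" in value) {
      return value as TranscriptLine;
    }
  } catch {
    // Partial transcripts can end mid-line; unparsable lines are skipped.
  }
  return undefined;
}

// Collect user turns, skipping system prompts, tool-only lines,
// blank lines, truncated lines, and user turns with empty content.
function extractUserTurns(jsonl: string): UserTurn[] {
  const turns: UserTurn[] = [];
  jsonl.split("\n").forEach((raw, lineIndex) => {
    if (raw.trim() === "") return;
    const line = parseLine(raw);
    if (line === undefined || line.role !== "user") return;
    const content = typeof line.content === "string" ? line.content.trim() : "";
    if (content === "") return; // nothing to replay
    turns.push({ turnIndex: turns.length, lineIndex, content });
  });
  return turns;
}
```

Keeping `lineIndex` alongside `turnIndex` is one way to let a drift report point back at the exact line in the fixture; whether the real harness needs that is an open design choice.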
Tests
- User-turn extraction unit test (handle edge cases: tool-only turns, system prompts, empty turns, partial transcripts).
- "First drift at turn N" reporter test — long sessions are useless if the report just says "drifted somewhere"; the report must surface the earliest divergent turn.
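A minimal sketch of the "first drift at turn N" reporter follows. The shape of `RuntimeParityResult["drift"]` is not shown in this issue, so `DriftEntry` is a hypothetical stand-in, and `firstDriftAtTurn` and `describeDrift` are illustrative names. The sketch also assumes turns are reported as 0-based indices.

```typescript
// Hypothetical stand-in for RuntimeParityResult["drift"]; the real type
// comes from Phase 1 and may look different.
type DriftEntry = { kind: "none" | "tool-call" | "final-text" };

// Return the 0-based index of the earliest turn flagged by isDrift,
// or undefined when no turn drifted.
function firstDriftAtTurn<D>(
  drift: readonly D[],
  isDrift: (entry: D) => boolean,
): number | undefined {
  const index = drift.findIndex(isDrift);
  return index === -1 ? undefined : index;
}

// Build a one-line summary that names the earliest divergent turn
// instead of only saying that drift happened somewhere.
function describeDrift(transcriptPath: string, drift: readonly DriftEntry[]): string {
  const turn = firstDriftAtTurn(drift, (entry) => entry.kind !== "none");
  if (turn === undefined) {
    return `${transcriptPath}: no drift across ${drift.length} turns`;
  }
  const entry = drift[turn];
  return `${transcriptPath}: first drift at turn ${turn} of ${drift.length} (${entry?.kind ?? "unknown"})`;
}
```

Passing the drift predicate in as a parameter keeps the lookup independent of how the Phase 1 classifier encodes "no drift".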
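Acceptance criteria
- qa jsonl-replay --runtime-pair pi,codex --transcripts <dir> runs each jsonl through both runtimes and produces a per-transcript drift report.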
Out of scope
- Customer transcript ingestion pipeline.
- Live-mode replay against live-frontier — the curated fixtures are mock-mode.
- Three-agent loop (Eva mentioned "loop 3 agents on difficult scenarios"). For this phase the harness loops one agent through each transcript's user turns. A multi-agent variant can be a follow-up if needed.
Open questions for the maintainers
- Where should the curated fixture set live? Suggested: qa/scenarios/jsonl-replay/. Need maintainer sign-off on 3–5 specific transcripts to use.
- Should we ship a qa-lab helper to scrub a real jsonl into a fixture (PII removal, URL substitution)? Probably yes, in a follow-up; too much scope for this PR.
- Three-agent loop — confirm with @Eva-⚡🐑 whether "loop 3 agents" means three concurrent agents in a session, three sequential replay runs for stability sampling, or three different captured agents. The fix shape differs.
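To make the scrub-helper question above more concrete, here is a rough sketch of what URL substitution and a very basic email redaction could look like. Everything in it is hypothetical: the function name `scrubTranscript`, the regular expressions, and the `example.invalid` placeholders are assumptions for illustration, and a regex pass like this is nowhere near a complete PII review.

```typescript
// Hypothetical scrubber: rewrites URLs to stable numbered placeholders and
// replaces email addresses with a fixed placeholder. Operates on raw jsonl
// text, so it assumes URLs end at whitespace, a double quote, or a backslash.
function scrubTranscript(jsonl: string): string {
  const urlIds = new Map<string, number>();
  const withoutUrls = jsonl.replace(/https?:\/\/[^\s"\\]+/g, (url) => {
    let id = urlIds.get(url);
    if (id === undefined) {
      id = urlIds.size + 1;
      urlIds.set(url, id);
    }
    // The same original URL always maps to the same placeholder, so
    // repeated references inside one transcript stay consistent.
    return `https://example.invalid/url-${id}`;
  });
  return withoutUrls.replace(
    /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
    "user@example.invalid",
  );
}
```

Numbering the placeholders rather than collapsing every URL to one value is a deliberate choice here: it preserves which turns refer to the same resource, which may matter when comparing trajectories.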