[Codex×Pi parity Phase 5] JSONL session-replay harness #80176

**Tracking parent:** #80171
**Depends on:** Phase 1 #80172 (drift classifier)

## Goal

This phase implements Eva's request to "loop 3 agents on difficult scenarios for testing based on a real jsonl session history." The harness takes captured session transcripts, replays each one through a fresh session on each runtime, and diffs the resulting trajectories.

This catches the regression class where a synthetic prompt looks fine but a real long-running session — with its accumulated context, tool-call history, and edge-case branching — exposes drift.
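
As a rough illustration of the replay-and-diff idea, the loop below feeds each user turn to both runtimes while keeping a separate accumulated history per runtime. Everything here is an assumption made for the sketch: `TurnRunner`, `TurnComparison`, and `replayTranscript` are hypothetical names, the real per-turn execution would go through the Phase 1 orchestrator, and the exact string comparison is only a stand-in for the Phase 1 drift classifier.

```typescript
// Hypothetical types for illustration; the real orchestrator and drift
// classifier come from Phase 1 (#80172) and may look quite different.
type Runtime = "pi" | "codex";
type TurnRunner = (runtime: Runtime, history: string[], userTurn: string) => Promise<string>;
type TurnComparison = { turn: number; pi: string; codex: string; diverged: boolean };

async function replayTranscript(
  userTurns: string[],
  runTurn: TurnRunner,
): Promise<TurnComparison[]> {
  const comparisons: TurnComparison[] = [];
  // Each runtime accumulates its own context so later turns see earlier replies.
  const history: Record<Runtime, string[]> = { pi: [], codex: [] };
  let turn = 0;
  for (const userTurn of userTurns) {
    const [pi, codex] = await Promise.all([
      runTurn("pi", history.pi, userTurn),
      runTurn("codex", history.codex, userTurn),
    ]);
    history.pi.push(userTurn, pi);
    history.codex.push(userTurn, codex);
    // Stand-in for the drift classifier: exact string inequality.
    comparisons.push({ turn, pi, codex, diverged: pi !== codex });
    turn += 1;
  }
  return comparisons;
}
```

Injecting the runner keeps the loop testable without a live provider; a fake runner can simulate drift that only appears once enough context has accumulated.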

## Scope

**Curated jsonl fixture set, not real-customer transcripts.** Real-customer transcript ingestion is a separate concern (PII, consent, retention). This PR ships with a small curated set checked into the repo (3–5 transcripts the maintainers approve) so the harness can land without depending on a customer-data pipeline.

## Concrete deliverables

### Code

- **New** `extensions/qa-lab/src/jsonl-replay.ts` — reads a directory of jsonl session transcripts, extracts user-turn boundaries, replays each through both runtimes via the Phase 1 orchestrator. Public API:
  ```ts
  export type JsonlReplayInput = {
    directory: string;
    runtimePair: ["pi", "codex"];
    providerMode: "mock-openai" | "live-frontier";
  };
  export type JsonlReplayResult = {
    transcripts: Array<{
      transcriptPath: string;
      userTurnCount: number;
      cells: { pi: RuntimeParityCell[]; codex: RuntimeParityCell[] };  // one per turn
      drift: Array<RuntimeParityResult["drift"]>;                       // one per turn
      firstDriftAtTurn?: number;                                        // for triage
    }>;
  };
  ```
- **New** `extensions/qa-lab/src/jsonl-replay.test.ts` — unit tests for the user-turn extraction, plus integration test against the curated fixtures.
- **New** `qa/scenarios/jsonl-replay/<curated-name>.jsonl` — 3–5 maintainer-approved fixtures. Strip PII and pin or stub any external dependencies (URLs, channel ids) so replay needs no network access.
- **Extend** `extensions/qa-lab/src/cli.ts` — `qa jsonl-replay --runtime-pair pi,codex --transcripts qa/scenarios/jsonl-replay`.
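
The user-turn extraction step in `jsonl-replay.ts` could look roughly like the sketch below. It assumes each JSONL line is an object with a `role` string and a string `content`; that record shape, the `UserTurn` type, and the `extractUserTurns` name are all hypothetical, and the real transcript schema may differ.

```typescript
// Hypothetical record shape; the real transcript schema may differ.
type TranscriptRecord = { role?: unknown; content?: unknown };
type UserTurn = { turnIndex: number; lineNumber: number; text: string };

function extractUserTurns(jsonl: string): UserTurn[] {
  const turns: UserTurn[] = [];
  jsonl.split(/\r?\n/).forEach((line, i) => {
    const raw = line.trim();
    if (raw === "") return; // blank line
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      // Partial transcript: an unparseable line (for example a truncated
      // final line) is skipped rather than treated as fatal.
      return;
    }
    if (parsed === null || typeof parsed !== "object") return;
    const record = parsed as TranscriptRecord;
    // Skips system prompts, assistant replies, and tool-only records.
    if (record.role !== "user") return;
    if (typeof record.content !== "string") return;
    const text = record.content.trim();
    if (text === "") return; // empty user turn
    // turnIndex is zero-based; lineNumber is one-based for error reporting.
    turns.push({ turnIndex: turns.length, lineNumber: i + 1, text });
  });
  return turns;
}
```

Keeping the source line number on each turn makes it easier to point a drift report back at the exact fixture line.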

### Tests

- User-turn extraction unit test (handle edge cases: tool-only turns, system prompts, empty turns, partial transcripts).
- "First drift at turn N" reporter test. A long-session report is useless if it only says "drifted somewhere", so the report must surface the earliest divergent turn.
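
A minimal sketch of how the earliest divergent turn might be derived from the per-turn drift array. `TurnDrift` is a hypothetical stand-in for `RuntimeParityResult["drift"]`, whose real shape is defined in Phase 1; the `"none"` sentinel, the zero-based turn numbering, and the `formatDriftLine` helper are likewise assumptions.

```typescript
// Hypothetical stand-in for RuntimeParityResult["drift"]; the Phase 1 type may differ.
type TurnDrift = { kind: "none" | "text" | "tool-call" };

// Returns the zero-based index of the first turn with drift, or undefined if none.
function firstDriftAtTurn(drift: TurnDrift[]): number | undefined {
  const index = drift.findIndex((d) => d.kind !== "none");
  return index === -1 ? undefined : index;
}

// One summary line per transcript, suitable for a triage report.
function formatDriftLine(transcriptPath: string, drift: TurnDrift[]): string {
  const first = firstDriftAtTurn(drift);
  return first === undefined
    ? `${transcriptPath}: no drift across ${drift.length} turns`
    : `${transcriptPath}: first drift at turn ${first} of ${drift.length}`;
}
```

A reporter test can then assert on both cases: a transcript with no drift leaves `firstDriftAtTurn` undefined, and a transcript with several drifting turns reports only the earliest one.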

## Acceptance criteria

- [ ] `qa jsonl-replay --runtime-pair pi,codex --transcripts <dir>` runs each jsonl through both runtimes and produces a per-transcript drift report.
- [ ] The report surfaces the **earliest divergent turn** per transcript (this is what makes long-session bugs triageable).
- [ ] Curated fixture set checked in (3–5 transcripts), maintainer-approved, PII-stripped, no external network dependencies.
- [ ] Integration test running the harness against the curated set in `mock-openai` mode passes in under 5 minutes.
- [ ] No real-customer data in the repo.

## Out of scope

- Customer transcript ingestion pipeline.
- Live-mode replay against live-frontier — the curated fixtures are mock-mode.
- Three-agent loop (Eva mentioned "loop 3 agents on difficult scenarios"). For this phase the harness loops one agent through each transcript's user turns. A multi-agent variant can be a follow-up if needed.

## Open questions for the maintainers

- Where should the curated fixture set live? Suggested: `qa/scenarios/jsonl-replay/`. Need maintainer sign-off on 3–5 specific transcripts to use.
- Should we ship a `qa-lab` helper to scrub a real jsonl into a fixture (PII removal, URL substitution)? Probably yes, in a follow-up — too much scope for this PR.
- Three-agent loop: confirm with @Eva-⚡🐑 whether "loop 3 agents" means three concurrent agents in a session, three sequential replay runs for stability sampling, or three different captured agents. The implementation differs in each case.

## References

- Tracking parent: #80171
- Phase 1: #80172
- Eva's request: maintainer thread (Yesterday): "Loop 3 agents on difficult scenarios for testing based on a real jsonl session history"

