[Codex×Pi parity Phase 5] JSONL session-replay harness #80176

@100yenadmin

Tracking parent: #80171
Depends on: Phase 1 #80172 (drift classifier)

Goal

Motivated by Eva's request to "loop 3 agents on difficult scenarios for testing based on a real jsonl session history": take captured session transcripts, replay them through fresh sessions on each runtime, and diff the trajectories.

This catches the regression class where a synthetic prompt looks fine but a real long-running session — with its accumulated context, tool-call history, and edge-case branching — exposes drift.

Scope

Curated jsonl fixture set, not real-customer transcripts. Real-customer transcript ingestion is a separate concern (PII, consent, retention). This PR ships with a small curated set checked into the repo (3–5 transcripts the maintainers approve) so the harness can land without depending on a customer-data pipeline.

Concrete deliverables

Code

  • New extensions/qa-lab/src/jsonl-replay.ts — reads a directory of jsonl session transcripts, extracts user-turn boundaries, replays each through both runtimes via the Phase 1 orchestrator. Public API:
    export type JsonlReplayInput = {
      directory: string;
      runtimePair: ["pi", "codex"];
      providerMode: "mock-openai" | "live-frontier";
    };
    export type JsonlReplayResult = {
      transcripts: Array<{
        transcriptPath: string;
        userTurnCount: number;
        cells: { pi: RuntimeParityCell[]; codex: RuntimeParityCell[] };  // one per turn
        drift: Array<RuntimeParityResult["drift"]>;                       // one per turn
        firstDriftAtTurn?: number;                                        // for triage
      }>;
    };
  • New extensions/qa-lab/src/jsonl-replay.test.ts — unit tests for the user-turn extraction, plus integration test against the curated fixtures.
  • New qa/scenarios/jsonl-replay/<curated-name>.jsonl — 3–5 maintainer-approved fixtures. Strip PII, fix any external dependencies (URLs, channel ids).
  • Extend extensions/qa-lab/src/cli.ts: qa jsonl-replay --runtime-pair pi,codex --transcripts qa/scenarios/jsonl-replay.
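
The user-turn extraction step could look roughly like the sketch below. It is illustrative only: the issue does not specify the session jsonl schema, so the `TranscriptRecord` shape (a `role` plus optional string `content`) is an assumption, and `UserTurn` and `extractUserTurns` are hypothetical names rather than the planned API. The function takes the file contents as a string so it stays independent of how transcripts are read from the directory.

```typescript
// ASSUMPTION: each jsonl line is an object with a `role` and optional string `content`.
// The real transcript schema is not given in this issue.
type TranscriptRecord = {
  role: "system" | "user" | "assistant" | "tool";
  content?: string;
};

// Hypothetical shape for one replayable turn.
type UserTurn = {
  index: number;                // 0-based position among replayable user turns
  prompt: string;               // user text that opens the turn
  recorded: TranscriptRecord[]; // original assistant/tool records up to the next user turn
};

function isTranscriptRecord(value: unknown): value is TranscriptRecord {
  if (typeof value !== "object" || value === null) return false;
  const { role, content } = value as { role?: unknown; content?: unknown };
  const roleOk =
    role === "system" || role === "user" || role === "assistant" || role === "tool";
  return roleOk && (content === undefined || typeof content === "string");
}

function extractUserTurns(jsonl: string): UserTurn[] {
  const turns: UserTurn[] = [];
  let current: UserTurn | undefined;
  for (const line of jsonl.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // partial transcript: skip a truncated or malformed line
    }
    if (!isTranscriptRecord(parsed)) continue;
    if (parsed.role === "user") {
      const prompt = (parsed.content ?? "").trim();
      if (prompt === "") {
        current = undefined; // empty user turn: nothing to replay, drop its replies
        continue;
      }
      current = { index: turns.length, prompt, recorded: [] };
      turns.push(current);
    } else if (parsed.role !== "system" && current !== undefined) {
      // Assistant and tool records never open a turn; they attach to the current one.
      current.recorded.push(parsed);
    }
  }
  return turns;
}
```

In this sketch, system prompts never open a turn, tool-only replies stay attached to the user turn that triggered them, and a truncated final line is skipped rather than failing the whole transcript. Whether empty user turns should be dropped or replayed as-is is a design choice the unit tests would need to pin down.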

Tests

  • User-turn extraction unit test (handle edge cases: tool-only turns, system prompts, empty turns, partial transcripts).
  • "First drift at turn N" reporter test — long sessions are useless if the report just says "drifted somewhere"; the report must surface the earliest divergent turn.
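
As a minimal sketch of what the "first drift at turn N" reporter test would exercise: the shape of `RuntimeParityResult["drift"]` comes from Phase 1 and is not shown in this issue, so the `TurnDrift` type with a boolean `drifted` flag is an assumption, as is the choice of a 0-based turn index.

```typescript
// ASSUMPTION: stand-in for RuntimeParityResult["drift"]; the real Phase 1 type may differ.
type TurnDrift = { drifted: boolean; summary?: string };

// Returns the 0-based index of the earliest divergent turn,
// or undefined when no turn drifted.
function firstDriftAtTurn(drift: readonly TurnDrift[]): number | undefined {
  const index = drift.findIndex((turn) => turn.drifted);
  return index === -1 ? undefined : index;
}
```

Returning `undefined` instead of `-1` matches the optional `firstDriftAtTurn?: number` field in `JsonlReplayResult`, so a clean transcript simply omits the field.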

Acceptance criteria

  • qa jsonl-replay --runtime-pair pi,codex --transcripts <dir> runs each jsonl through both runtimes and produces a per-transcript drift report.
  • The report surfaces the earliest divergent turn per transcript (this is what makes long-session bugs triageable).
  • Curated fixture set checked in (3–5 transcripts), maintainer-approved, PII-stripped, no external network dependencies.
  • Integration test running the harness against the curated set on mock-openai mode passes in <5min.
  • No real-customer data in the repo.
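
One possible shape for the per-transcript drift report is sketched below. The output format is hypothetical, `TranscriptSummary` is an illustrative subset of the `JsonlReplayResult` transcript entry, and `formatDriftReport` is not a name from the issue.

```typescript
// Illustrative subset of one JsonlReplayResult transcript entry.
type TranscriptSummary = {
  transcriptPath: string;
  userTurnCount: number;
  firstDriftAtTurn?: number; // 0-based in this sketch (an assumption)
};

// Renders one line per transcript plus a summary line.
// The exact wording and layout are hypothetical.
function formatDriftReport(transcripts: readonly TranscriptSummary[]): string {
  const lines = transcripts.map((t) =>
    t.firstDriftAtTurn === undefined
      ? `OK    ${t.transcriptPath} (${t.userTurnCount} turns, no drift)`
      : `DRIFT ${t.transcriptPath} (first drift at turn index ${t.firstDriftAtTurn} of ${t.userTurnCount} turns)`
  );
  const drifted = transcripts.filter((t) => t.firstDriftAtTurn !== undefined).length;
  lines.push(`${drifted}/${transcripts.length} transcripts drifted`);
  return lines.join("\n");
}
```

Putting the earliest divergent turn on the same line as the transcript path keeps the triage signal visible without opening the full per-turn diff.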

Out of scope

  • Customer transcript ingestion pipeline.
  • Live-mode replay against live-frontier — the curated fixtures are mock-mode.
  • Three-agent loop (Eva mentioned "loop 3 agents on difficult scenarios"). For this phase the harness loops one agent through each transcript's user turns. A multi-agent variant can be a follow-up if needed.

Open questions for the maintainers

  • Where should the curated fixture set live? Suggested: qa/scenarios/jsonl-replay/. Need maintainer sign-off on 3–5 specific transcripts to use.
  • Should we ship a qa-lab helper to scrub a real jsonl into a fixture (PII removal, URL substitution)? Probably yes, in a follow-up — too much scope for this PR.
  • Three-agent loop — confirm with @Eva-⚡🐑 whether "loop 3 agents" means three concurrent agents in a session, three sequential replay runs for stability sampling, or three different captured agents. The fix shape differs.
