fix: replay behavior for parent + subgraphs! #7038

Merged: Sydney Runkle (sydney-runkle) merged 38 commits into main from sr/diabolical on Mar 10, 2026

Sydney Runkle (sydney-runkle) commented Mar 6, 2026

Summary

Fix time travel (replay and fork) for graphs with interrupts and subgraphs.

Problem

Two issues with replaying/forking from earlier checkpoints:

  1. Stale interrupt values during replay — Replays incorrectly reused cached RESUME values from prior interrupt() calls, so interrupts silently returned stale answers instead of re-firing.

  2. Wrong subgraph state during time travel — Subgraphs always loaded their latest checkpoint instead of the one corresponding to the parent's historical state. This caused subgraphs to skip execution or produce incorrect results during replay/fork.
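
To make the first issue concrete, here is a minimal toy model of it, not the library's implementation. The `PendingWrite` tuple, the `strip_stale_resume` and `cached_interrupt_answer` helpers, and the `genuine_resume` flag are hypothetical names for this sketch, and the `"__resume__"` channel name is an assumption. It shows how a cached `RESUME` write makes an interrupt return an old answer, and how filtering those writes on replay lets the interrupt re-fire.

```python
from typing import Any, NamedTuple, Optional

RESUME = "__resume__"  # assumed channel name, for illustration only


class PendingWrite(NamedTuple):
    """Hypothetical stand-in for a cached (task, channel, value) write."""
    task_id: str
    channel: str
    value: Any


def strip_stale_resume(
    writes: list[PendingWrite], *, is_replaying: bool, genuine_resume: bool
) -> list[PendingWrite]:
    """Drop cached RESUME writes on a replay; keep them on a genuine resume."""
    if is_replaying and not genuine_resume:
        return [w for w in writes if w.channel != RESUME]
    return list(writes)


def cached_interrupt_answer(writes: list[PendingWrite], task_id: str) -> Optional[Any]:
    """Return the cached answer for a task, or None if the interrupt must re-fire."""
    for w in writes:
        if w.task_id == task_id and w.channel == RESUME:
            return w.value
    return None


writes = [
    PendingWrite("task-1", RESUME, "old answer"),
    PendingWrite("task-1", "messages", "hi"),
]

# Without filtering, a replay would silently reuse the old answer.
print(cached_interrupt_answer(writes, "task-1"))

# With filtering, nothing is cached, so the interrupt has to fire again.
replayed = strip_stale_resume(writes, is_replaying=True, genuine_resume=False)
print(cached_interrupt_answer(replayed, "task-1"))
```

Only the `RESUME` channel is filtered in this sketch; other cached writes for the same task pass through untouched.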

Changes

Code changes span libs/langgraph/langgraph/pregel/_loop.py, libs/langgraph/langgraph/_internal/_constants.py, and a new libs/langgraph/langgraph/_internal/_replay.py module:

  • Strip stale RESUME writes on replay — During replays, cached RESUME writes are filtered out so interrupt() re-fires instead of returning old values. Genuine resumes (Command(resume=...)) preserve these writes.

  • Rename skip_done_tasks → is_replaying — Clearer naming for the flag that tracks whether the current run is replaying from a specific checkpoint.

  • New ReplayState class (_replay.py) — Encapsulates subgraph checkpoint loading during time-travel. Tracks a parent checkpoint ID upper bound and which subgraph namespaces have already loaded their pre-replay checkpoint. On the first visit to a subgraph namespace, it loads the latest checkpoint created before the replay point (via checkpointer.list(..., before=...) with limit=1). On subsequent visits (e.g. the same subgraph in a later loop iteration), it falls back to normal latest-checkpoint loading. The task-id suffix is stripped from namespaces so the same logical subgraph is recognized across loop iterations.

  • New CONFIG_KEY_REPLAY_STATE config key — The parent graph creates a ReplayState instance and passes it to subgraphs via config. For forks (source=update), the replay state uses the fork's parent checkpoint ID since the fork was created after the subgraph's original checkpoints. The single ReplayState instance is shared by reference across all derived configs within one parent execution.

  • Subgraph checkpoint loading in __enter__/__aenter__ — When a subgraph detects a ReplayState in its config, it delegates checkpoint loading to ReplayState.get_checkpoint/aget_checkpoint instead of using the default get_tuple. It also clears CONFIG_KEY_RESUMING so _first re-applies input and recreates ephemeral routing channels.
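
The first-visit versus later-visit rule described for ReplayState can be sketched as below. This is a simplified model under stated assumptions, not the code in _replay.py: `ToySaver`, `ToyCheckpoint`, `ToyReplayState`, and `logical_ns` are hypothetical names, `list_checkpoints` stands in for `checkpointer.list(..., before=...)` but takes a plain ID instead of a config, checkpoint IDs are assumed to sort chronologically as strings, and the `|` and `:` namespace separators are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

NS_SEP = "|"  # assumed separator between nesting levels
NS_END = ":"  # assumed separator between node name and task id


def logical_ns(checkpoint_ns: str) -> str:
    """Strip task-id suffixes, e.g. 'child:abc|inner:def' becomes 'child|inner'."""
    return NS_SEP.join(part.split(NS_END)[0] for part in checkpoint_ns.split(NS_SEP))


@dataclass(frozen=True)
class ToyCheckpoint:
    id: str  # assumed to sort chronologically
    ns: str  # checkpoint namespace, including task ids


class ToySaver:
    """Hypothetical in-memory stand-in for a checkpointer."""

    def __init__(self) -> None:
        self._checkpoints: list[ToyCheckpoint] = []

    def put(self, checkpoint: ToyCheckpoint) -> None:
        self._checkpoints.append(checkpoint)

    def list_checkpoints(
        self, ns: str, before: Optional[str] = None, limit: Optional[int] = None
    ) -> list[ToyCheckpoint]:
        """Newest-first checkpoints for a logical namespace, optionally bounded."""
        matches = [
            c for c in self._checkpoints
            if logical_ns(c.ns) == ns and (before is None or c.id < before)
        ]
        matches.sort(key=lambda c: c.id, reverse=True)
        return matches[:limit]


@dataclass
class ToyReplayState:
    checkpoint_id: str  # parent checkpoint ID used as the upper bound
    visited: set[str] = field(default_factory=set)

    def get_checkpoint(self, checkpoint_ns: str, saver: ToySaver) -> Optional[ToyCheckpoint]:
        ns = logical_ns(checkpoint_ns)
        if ns in self.visited:
            # Later visit in the same run: normal latest-checkpoint loading.
            found = saver.list_checkpoints(ns, limit=1)
        else:
            # First visit: latest checkpoint created before the replay point.
            self.visited.add(ns)
            found = saver.list_checkpoints(ns, before=self.checkpoint_id, limit=1)
        return found[0] if found else None
```

In this model, one `ToyReplayState` instance would be created per parent execution and handed to every subgraph, so the `visited` set is shared across loop iterations, mirroring the shared-by-reference behavior described above.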

Tests

New test files test_time_travel.py (~2500 lines) and test_time_travel_async.py (~2200 lines) covering:

  • Replay and fork with interrupts (single and multiple)
  • Replay and fork for graphs with and without subgraphs
  • Correct subgraph checkpoint restoration during parent time travel
  • get_state with subgraph state during replay

@sydney-runkle Sydney Runkle (sydney-runkle) changed the title from "chore: this is evil" to "fix: replay behavior for parent + subgraphs!" Mar 6, 2026
@sydney-runkle Sydney Runkle (sydney-runkle) marked this pull request as ready for review March 9, 2026 16:52
@sydney-runkle Sydney Runkle (sydney-runkle) merged commit 9c2deac into main Mar 10, 2026
66 checks passed
@sydney-runkle Sydney Runkle (sydney-runkle) deleted the sr/diabolical branch March 10, 2026 01:21
xingshuozhu1998 pushed a commit to xingshuozhu1998/langgraph that referenced this pull request May 1, 2026
Christian Bromann (christian-bromann) added a commit to langchain-ai/langgraphjs that referenced this pull request Jun 10, 2026
## Summary

Ports Python time-travel fixes
([#7038](langchain-ai/langgraph#7038),
[#7115](langchain-ai/langgraph#7115),
[#7498](langchain-ai/langgraph#7498),
[#7499](langchain-ai/langgraph#7499)) into
`@langchain/langgraph` so replay/fork behave correctly with interrupts
and nested subgraphs.

- **Stale `RESUME` on replay** — Replaying from a checkpoint before an
interrupt no longer consumes cached resume writes; interrupts re-fire
with the correct payload.
- **Subgraph checkpoint loading on time travel** — Introduces
`ReplayState` (`CONFIG_KEY_REPLAY_STATE`) so nested subgraphs load the
checkpoint that existed at the replay point on first visit, then resume
normal head loading within the same run.
- **Parent fork checkpoints on replay** — Time travel runs through
`PregelLoop._first()` (not `stream()` delegation on the parent
`Pregel`), creating an eager `source: "fork"` checkpoint and propagating
`ReplayState` to subgraphs.
- **Direct-to-subgraph time travel** — `getState()` subgraph delegation
is guarded with `CONFIG_KEY_READ`; direct subgraph configs strip stale
`RESUME` writes and prefer explicit `checkpoint_id` over
`checkpoint_map` when both are set.
- **Streaming** — Fixes subgraph interrupt namespace when streaming with
`subgraphs: true` (empty `checkpoint_ns` no longer becomes `[""]`;
parent emits interrupts under the deepest `checkpoint_map` namespace).
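
The `[""]` pitfall named in the streaming bullet is easy to reproduce. The sketch below is in Python rather than TypeScript and is not the ported code: `ns_to_path` is a hypothetical helper and the `|` separator is an assumption. Splitting an empty string yields one empty segment, so the root namespace needs an explicit guard.

```python
NS_SEP = "|"  # assumed namespace separator, for illustration only


def ns_to_path(checkpoint_ns: str) -> tuple[str, ...]:
    """Split a checkpoint namespace into segments; the root graph maps to ()."""
    return tuple(checkpoint_ns.split(NS_SEP)) if checkpoint_ns else ()


# A bare split turns the empty root namespace into one empty segment.
naive_root = "".split(NS_SEP)

# The guarded helper keeps the root namespace empty.
guarded_root = ns_to_path("")
nested = ns_to_path("child:1|inner:2")
```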

Closes #2325 (supersedes the earlier partial port).

### Implementation notes

| Area | Change |
|------|--------|
| `pregel/replay.ts` | New `ReplayState` class (mirrors Python) |
| `pregel/loop.ts` | Replay/time-travel detection, fork creation, `RESUME` stripping, `ReplayState` wiring, stream namespace helpers |
| `pregel/index.ts` | `getState` subgraph delegation guard only (removed `stream()` bypass that skipped parent fork creation) |
| Tests | `time_travel.test.ts` (14), `time_travel_extended.test.ts` (33), shared `time_travel_helpers.ts`, Vitest matchers `toBeInterrupted` / `toHaveInterruptValue` |

---------

Co-authored-by: Cursor <cursoragent@cursor.com>