fix: time travel when going back to interrupt node #7498

Merged
Sydney Runkle (sydney-runkle) merged 4 commits into main from sr/time-travel-bug on Apr 16, 2026

@sydney-runkle Sydney Runkle (sydney-runkle) commented Apr 13, 2026

Collaborator

Fix: Create fork checkpoint on subgraph time travel

Problem

When time-traveling to a subgraph checkpoint that has an interrupt and then resuming, the resume loaded the wrong state: it picked up the original execution's latest checkpoint instead of the time-traveled one.

This happened because replaying from a subgraph checkpoint never created a new parent checkpoint. If the replay hit an interrupt before after_tick() ran, no checkpoint was written at all, so the parent's "latest" checkpoint was still the old one from the original execution.

Fix

When the loop detects a time-travel replay (not an update_state fork), it now eagerly writes a fork checkpoint at the start of the tick. This ensures:

  1. The parent thread's latest checkpoint points to the replayed state
  2. Subsequent Command(resume=...) calls find the correct checkpoint
  3. Stale INTERRUPT pending writes from the old checkpoint are cleared (they reference old task IDs)

Additionally, the subgraph replay logic now uses the parent checkpoint ID (from prev_checkpoint_config) when resolving subgraph checkpoints during time-travel, matching the existing behavior for update_state forks.
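
The effect of the eager fork write can be illustrated with a toy in-memory model. This is a minimal sketch, not the code in `_loop.py`: `ToySaver`, `Checkpoint`, and `start_tick` are hypothetical names, the `"__interrupt__"` channel name is an assumption, and the time-travel detection rule (the loaded checkpoint is not the latest and is not itself an update or fork) is a simplification made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count

INTERRUPT = "__interrupt__"  # assumed channel name for interrupt writes


@dataclass
class Checkpoint:
    id: str
    parent_id: str | None
    source: str  # "loop", "update", or "fork"
    # Pending writes as (task_id, channel, value) tuples.
    pending_writes: list[tuple[str, str, object]] = field(default_factory=list)


class ToySaver:
    """Hypothetical stand-in for a checkpointer: 'latest' is the last write."""

    def __init__(self) -> None:
        self.checkpoints: dict[str, Checkpoint] = {}
        self.latest_id: str | None = None
        self._fork_ids = count(1)

    def put(self, checkpoint: Checkpoint) -> None:
        self.checkpoints[checkpoint.id] = checkpoint
        self.latest_id = checkpoint.id

    def start_tick(self, loaded_id: str) -> Checkpoint:
        """Begin a tick from `loaded_id`, forking eagerly on time travel."""
        loaded = self.checkpoints[loaded_id]
        # Simplified detection rule (an assumption, see the lead-in).
        is_time_traveling = (
            loaded_id != self.latest_id and loaded.source not in ("update", "fork")
        )
        if not is_time_traveling:
            return loaded
        fork = Checkpoint(
            id=f"F{next(self._fork_ids)}",
            parent_id=loaded.id,
            source="fork",
            # Drop interrupt writes: they belong to task IDs from the old run.
            pending_writes=[w for w in loaded.pending_writes if w[1] != INTERRUPT],
        )
        self.put(fork)  # the fork becomes the thread's latest checkpoint
        return fork


saver = ToySaver()
saver.put(Checkpoint("C1", None, "loop"))
saver.put(Checkpoint("C2", "C1", "loop", [("task-old", INTERRUPT, "ask_1")]))
saver.put(Checkpoint("C5", "C2", "loop"))

fork = saver.start_tick("C2")  # time travel back to C2
print(saver.latest_id, fork.parent_id, fork.pending_writes)
```

Because the fork is written before any node runs, a replay that stops at an interrupt still leaves the thread's latest pointer on the replayed branch, which is what a later resume needs to find.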

Checkpoint flow diagrams

Before fix: time travel leaves no fork

Original execution:

  C0 (start) --> C1 (step_a) --> C2 (ask_1 interrupt) --> C3 (resume) --> C4 (ask_2 interrupt) --> C5 (done)

Time travel to C2 (subgraph config):

  Replay runs... hits interrupt... no new checkpoint written.
  Parent "latest" is still C5.

  Command(resume="new_answer"):
    Loads C5 (wrong!) instead of the replayed C2 state.

After fix: time travel creates a fork

Original execution:

  C0 --> C1 --> C2 --> C3 --> C4 --> C5 (done)

Time travel to C2 (subgraph config):

  C0 --> C1 --> C2 --> C3 --> C4 --> C5
                  \
                   F1 (fork, source="fork")  <-- new latest

  Command(resume="new_answer"):
    Loads F1 (correct!) --> resumes from the right state.

  After full resume:

  C0 --> C1 --> C2 --> C3 --> C4 --> C5
                  \
                   F1 --> F2 (ask_1 result) --> F3 (ask_2 interrupt) --> F4 (done)

Manual fork via update_state (unchanged)

  C0 --> C1 --> C2 --> C3
                  \
                   U1 (source="update")  <-- created by update_state()

  This path already worked. The fix skips update/fork sources
  so existing behavior is preserved.
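
The branch structure in these diagrams can be modelled as parent links between checkpoints. The sketch below is illustrative only: the `PARENTS` mapping and the `lineage` helper are hypothetical, and the IDs are the labels from the diagrams rather than real checkpoint IDs.

```python
from typing import Optional

# Parent links taken from the "after fix" diagram: F1 branches off C2.
PARENTS: dict[str, Optional[str]] = {
    "C0": None, "C1": "C0", "C2": "C1", "C3": "C2", "C4": "C3", "C5": "C4",
    "F1": "C2", "F2": "F1", "F3": "F2", "F4": "F3",
}


def lineage(checkpoint_id: str) -> list[str]:
    """Return checkpoint IDs from `checkpoint_id` back to the root."""
    chain: list[str] = []
    current: Optional[str] = checkpoint_id
    while current is not None:
        chain.append(current)
        current = PARENTS[current]
    return chain


forked = lineage("F4")    # the time-traveled branch
original = lineage("C5")  # the original execution, left untouched

# Checkpoints shared by both branches, oldest first.
shared = [cid for cid in reversed(original) if cid in forked]
print(forked, original, shared)
```

Walking back from F4 reaches C2 and then the shared prefix, while the original C3 to C5 chain remains reachable from C5.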

Changes

  • libs/langgraph/langgraph/pregel/_loop.py:
    • Extract is_time_traveling flag from the existing replay detection logic for reuse
    • Write a fork checkpoint (source="fork") eagerly at the start of a time-travel tick, before execution begins
    • Clear stale INTERRUPT pending writes when creating the fork (they reference old task IDs that won't match the new checkpoint)
    • Unify subgraph replay ID resolution: check source in ("update", "fork") instead of a separate is_time_traveling condition, since the new fork checkpoint now has source="fork"
  • libs/langgraph/tests/test_time_travel.py and test_time_travel_async.py: Added 4 new test cases (sync + async):
    • test_replay_from_before_interrupt_then_resume — replays from a checkpoint before an interrupt, resumes with a new answer, and verifies the full checkpoint history (source, next, values) at each stage
    • test_subgraph_time_travel_resume_from_first_interrupt — time-travels to a subgraph's first interrupt, resumes both interrupts with new answers, and verifies the fork creates a new branch while preserving the original
    • test_subgraph_time_travel_resume_from_second_interrupt — time-travels to a subgraph's second interrupt, resumes with a new answer, and verifies the first interrupt's original answer is preserved
    • test_subgraph_time_travel_checkpoint_pattern — verifies the fork checkpoint branches from the correct replay point and that the full checkpoint tree is correct after resume
  • libs/langgraph/tests/test_pregel.py / test_pregel_async.py: Updated existing test_weather_subgraph_state to account for the new fork checkpoint appearing in history (history length increases by 1)
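
The unified subgraph replay ID resolution described above might look roughly like the following. This is a sketch under stated assumptions, not the code in `_loop.py`: the function name and its parameters are hypothetical, and it assumes the parent ID of interest is the `checkpoint_id` carried by `prev_checkpoint_config`.

```python
from typing import Optional


def parent_id_for_subgraph_replay(
    source: str,
    checkpoint_id: str,
    prev_checkpoint_id: Optional[str],
) -> str:
    """Pick the parent checkpoint ID used to look up subgraph checkpoints.

    Assumption: for "update" and "fork" checkpoints, the subgraph state to
    replay is the one recorded under the checkpoint they branched from.
    """
    if source in ("update", "fork") and prev_checkpoint_id is not None:
        return prev_checkpoint_id
    return checkpoint_id


print(parent_id_for_subgraph_replay("fork", "F1", "C2"))
print(parent_id_for_subgraph_replay("loop", "C3", "C2"))
```

Treating both sources through one membership check means the fork written by a time-travel replay follows the same lookup path as a manual `update_state` fork.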

@sydney-runkle Sydney Runkle (sydney-runkle) merged commit 51cbdbd into main Apr 16, 2026
102 of 128 checks passed
@sydney-runkle Sydney Runkle (sydney-runkle) deleted the sr/time-travel-bug branch April 16, 2026 12:29
Sydney Runkle (sydney-runkle) added a commit that referenced this pull request Apr 27, 2026
…ist (#7582)

## Summary

Fixes #7498 — `MESSAGE_COERCION_FAILURE` when resuming threads
checkpointed before v1.0.1.

**Root cause:** PR #6269 (v1.0.1) added an `_allowed_json_modules`
security gate to `JsonPlusSerializer._reviver`. The gate defaults to
`None`, so old `"json"`-format checkpoint blobs containing `lc=2`
constructor dicts (the pre-msgpack serialization format for pydantic
objects like `HumanMessage`) are now returned as raw dicts instead of
being reconstructed. Those raw dicts reach `add_messages →
convert_to_messages`, which sees `type="constructor"` and raises
`MESSAGE_COERCION_FAILURE`. Fresh first-turn messages are unaffected
because current `dumps_typed` only writes `"msgpack"` blobs.

**Fix:** `_reviver` now reconstructs `lc=2` blobs whose target class is
already in `SAFE_MSGPACK_TYPES` — the same curated allowlist already
used by the msgpack deserialization path (includes all standard
LangChain message types). Unknown classes are still blocked, preserving
the security intent of #6269.
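
The allowlist gate can be illustrated with a small standalone reviver. This is a toy sketch, not the code in `jsonplus.py`: `SAFE_TYPES` stands in for `SAFE_MSGPACK_TYPES`, the `reviver` function is hypothetical, and a standard-library class replaces the LangChain message types so the example runs without extra dependencies.

```python
import importlib
import json

# Toy allowlist of (module, class name) pairs; stands in for SAFE_MSGPACK_TYPES.
SAFE_TYPES = {("datetime", "timedelta")}


def reviver(value: dict):
    """Rebuild lc=2 constructor dicts only when the target is allowlisted."""
    if value.get("lc") == 2 and value.get("type") == "constructor" and value.get("id"):
        *module_parts, name = value["id"]
        module_name = ".".join(module_parts)
        if (module_name, name) in SAFE_TYPES:
            cls = getattr(importlib.import_module(module_name), name)
            return cls(**value.get("kwargs", {}))
    # Unknown targets (and ordinary dicts) are returned untouched.
    return value


safe_blob = json.dumps(
    {"lc": 2, "type": "constructor", "id": ["datetime", "timedelta"],
     "kwargs": {"seconds": 90}}
)
blocked_blob = json.dumps(
    {"lc": 2, "type": "constructor", "id": ["pprint", "pprint"], "kwargs": {}}
)

revived = json.loads(safe_blob, object_hook=reviver)
blocked = json.loads(blocked_blob, object_hook=reviver)
print(type(revived).__name__, type(blocked).__name__)
```

The module is imported only after the allowlist check passes, so a blob naming an unlisted class never triggers an import or a constructor call and simply comes back as a raw dict.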

## Changes

- `libs/checkpoint/langgraph/checkpoint/serde/jsonplus.py` — add
`_is_safe_json_type()` helper; update `_reviver` and
`_check_allowed_json_modules` to allow safe types without an explicit
allowlist
- `libs/checkpoint/tests/test_jsonplus.py` — two new regression tests:
safe-type `lc=2` blobs revive correctly; unknown-type `lc=2` blobs stay
blocked

## Test plan

- [ ] `test_lc2_json_safe_type_revives_without_allowlist` —
`HumanMessage`/`AIMessage` lc=2 JSON blobs round-trip to proper
`BaseMessage` objects with no allowlist configured
- [ ] `test_lc2_json_unknown_type_stays_blocked_without_allowlist` —
`pprint.pprint` lc=2 blob still returns raw dict (not reconstructed)
- [ ] `test_deserde_invalid_module` — existing behaviour unchanged
- [ ] Full `test_jsonplus.py` suite: 93/93 passing

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Christian Bromann (christian-bromann) added a commit to langchain-ai/langgraphjs that referenced this pull request Jun 10, 2026
## Summary

Ports Python time-travel fixes
([#7038](langchain-ai/langgraph#7038),
[#7115](langchain-ai/langgraph#7115),
[#7498](langchain-ai/langgraph#7498),
[#7499](langchain-ai/langgraph#7499)) into
`@langchain/langgraph` so replay/fork behave correctly with interrupts
and nested subgraphs.

- **Stale `RESUME` on replay** — Replaying from a checkpoint before an
interrupt no longer consumes cached resume writes; interrupts re-fire
with the correct payload.
- **Subgraph checkpoint loading on time travel** — Introduces
`ReplayState` (`CONFIG_KEY_REPLAY_STATE`) so nested subgraphs load the
checkpoint that existed at the replay point on first visit, then resume
normal head loading within the same run.
- **Parent fork checkpoints on replay** — Time travel runs through
`PregelLoop._first()` (not `stream()` delegation on the parent
`Pregel`), creating an eager `source: "fork"` checkpoint and propagating
`ReplayState` to subgraphs.
- **Direct-to-subgraph time travel** — `getState()` subgraph delegation
is guarded with `CONFIG_KEY_READ`; direct subgraph configs strip stale
`RESUME` writes and prefer explicit `checkpoint_id` over
`checkpoint_map` when both are set.
- **Streaming** — Fixes subgraph interrupt namespace when streaming with
`subgraphs: true` (empty `checkpoint_ns` no longer becomes `[""]`;
parent emits interrupts under the deepest `checkpoint_map` namespace).
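
The "replay point on first visit, then head" rule for subgraph checkpoint loading can be sketched as follows. This is written in Python for consistency with the original fix and is not the `ReplayState` implementation: `ToyReplayState`, `pick_checkpoint`, and the assumption that checkpoint IDs sort in creation order are all hypothetical choices for illustration.

```python
from __future__ import annotations


class ToyReplayState:
    """Hypothetical stand-in for the ReplayState described above."""

    def __init__(self, replay_checkpoint_id: str) -> None:
        self.replay_checkpoint_id = replay_checkpoint_id
        self._visited: set[str] = set()

    def is_first_visit(self, checkpoint_ns: str) -> bool:
        """True only the first time a subgraph namespace is seen in this run."""
        if checkpoint_ns in self._visited:
            return False
        self._visited.add(checkpoint_ns)
        return True


def pick_checkpoint(state: ToyReplayState, ns: str, history: list[str]) -> str:
    """Choose which subgraph checkpoint to load.

    `history` lists the namespace's checkpoint IDs oldest first; IDs are
    assumed to sort in creation order.
    """
    if state.is_first_visit(ns):
        at_or_before = [cid for cid in history if cid <= state.replay_checkpoint_id]
        if at_or_before:
            return at_or_before[-1]  # state as it existed at the replay point
    return history[-1]  # normal head loading


state = ToyReplayState(replay_checkpoint_id="005")
history = ["002", "004", "007", "009"]
first = pick_checkpoint(state, "child", history)
second = pick_checkpoint(state, "child", history)
print(first, second)
```

Tracking visits per namespace keeps the replay-point lookup from repeating once the subgraph has started writing new checkpoints in the same run.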

Closes #2325 (supersedes the earlier partial port).

### Implementation notes

| Area | Change |
|------|--------|
| `pregel/replay.ts` | New `ReplayState` class (mirrors Python) |
| `pregel/loop.ts` | Replay/time-travel detection, fork creation,
`RESUME` stripping, `ReplayState` wiring, stream namespace helpers |
| `pregel/index.ts` | `getState` subgraph delegation guard only (removed
`stream()` bypass that skipped parent fork creation) |
| Tests | `time_travel.test.ts` (14), `time_travel_extended.test.ts`
(33), shared `time_travel_helpers.ts`, Vitest matchers `toBeInterrupted`
/ `toHaveInterruptValue` |

---------

Co-authored-by: Cursor <cursoragent@cursor.com>