
feat: allow graph to graceful shutdown/drain by request #7274

Merged
William FH (hinthornw) merged 9 commits into main from wfh/graceful
Apr 29, 2026

@hinthornw William FH (hinthornw) commented Mar 25, 2026

Summary

Adds cooperative drain support for Pregel runs so a graph can be asked to stop at the next superstep boundary, persist its checkpoint, and surface a resumable terminal exception.

  • New RunControl (in langgraph.runtime) — a thread-safe handle whose request_drain(reason="shutdown") sets a single flag.
  • New GraphDrained(GraphBubbleUp) exception (in langgraph.errors) raised when a run exits early due to drain. Carries the reason string.
  • New control: RunControl | None kwarg on invoke / ainvoke / stream / astream / stream_v2 / astream_v2. Wired through to Runtime.control, so nodes can read runtime.control.drain_requested / drain_reason and even call request_drain() from inside a node.
  • Stream transformers learn "drained" as a terminal SubgraphStatus.
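
To make the shape of the handle concrete, here is a minimal sketch of a thread-safe drain flag. `SketchRunControl` is a hypothetical stand-in written for this description, not the `RunControl` class added by the PR; only the names `request_drain`, `drain_requested`, and `drain_reason` come from the summary above, and overwriting the reason on repeated calls is an assumption.

```python
import threading
from typing import Optional


class SketchRunControl:
    """Hypothetical stand-in for a drain handle; not langgraph's RunControl."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requested = False
        self._reason: Optional[str] = None

    def request_drain(self, reason: str = "shutdown") -> None:
        # Assumption: a later call simply overwrites the stored reason.
        with self._lock:
            self._requested = True
            self._reason = reason

    @property
    def drain_requested(self) -> bool:
        with self._lock:
            return self._requested

    @property
    def drain_reason(self) -> Optional[str]:
        with self._lock:
            return self._reason
```

Any thread can call `request_drain()`, and the loop or a node only ever reads the two properties.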

The intended use is hooking SIGTERM (or any external supervisor signal) to control.request_drain("sigterm") so an in-flight graph run can stop cleanly and be resumed later from the saved checkpoint.
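
A sketch of the SIGTERM wiring described above, using only the standard `signal` module. `StandInControl`, `make_drain_handler`, and `install_sigterm_drain` are hypothetical names used for illustration; in a real service the object passed in would be the run's `RunControl`.

```python
import signal


class StandInControl:
    """Hypothetical stand-in exposing the three names the PR describes."""

    def __init__(self) -> None:
        self.drain_requested = False
        self.drain_reason = None

    def request_drain(self, reason: str = "shutdown") -> None:
        self.drain_reason = reason
        self.drain_requested = True


def make_drain_handler(control, reason: str = "sigterm"):
    """Build a signal handler that only flips the drain flag."""

    def _handler(signum, frame):
        # Keep the handler tiny: no I/O and no blocking calls.
        control.request_drain(reason)

    return _handler


def install_sigterm_drain(control):
    """Register the handler for SIGTERM and return the previous handler.

    signal.signal() raises ValueError outside the main thread, so call
    this during process start-up.
    """
    return signal.signal(signal.SIGTERM, make_drain_handler(control))
```

The handler does nothing except request the drain; the run itself stops later, at its next superstep boundary.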

Semantics: cooperative, between-superstep

request_drain() flips a flag. The Pregel loop checks it at the top of each tick(), after the previous superstep's writes have been applied and checkpointed. It never preempts work that is already running.

| Scenario | Behavior |
|---|---|
| Node mid-execution (blocking I/O, sleeps, etc.) | Runs to completion. Drain takes effect on the next superstep. |
| Node with a retry policy currently retrying | Retry loop runs to exhaustion or success (drain is not checked between retries). Drain takes effect on the next superstep. |
| Functional API: @entrypoint with pending @task futures | Entrypoint and all dispatched tasks complete; drain takes effect after the entrypoint returns. |
| Graph naturally finishes on the same tick where drain was requested (no more tasks) | Treated as done; returns normally. No GraphDrained is raised. The caller can inspect control.drain_requested afterwards to distinguish a drained-but-completed run from a normal one. |
| More tasks remain | Raises GraphDrained(reason). The checkpoint of the last completed superstep is saved (also under durability="exit"). Resume with invoke(None, config) / ainvoke(None, config). |
| Subgraph requests drain | GraphDrained bubbles up through the parent loop and stops it at its own next superstep boundary; the parent's checkpoint is saved and resumable. |
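
The between-superstep check can be illustrated with a toy loop. This is a sketch of the idea only, not the Pregel implementation: `SketchDrained`, `SketchControl`, and `run_supersteps` are hypothetical, and a plain list stands in for the checkpointer. How the toy treats a control that is already drained before the first step is not claimed to match the library.

```python
class SketchDrained(Exception):
    """Hypothetical stand-in for GraphDrained; carries the drain reason."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


class SketchControl:
    """Hypothetical stand-in for the drain handle."""

    def __init__(self):
        self.drain_requested = False
        self.drain_reason = None

    def request_drain(self, reason="shutdown"):
        self.drain_reason = reason
        self.drain_requested = True


def run_supersteps(steps, control, checkpoints):
    """Run callables in order, checking the drain flag only between steps."""
    for index, step in enumerate(steps):
        # Top of the tick: earlier steps are already applied and checkpointed.
        if control.drain_requested:
            raise SketchDrained(control.drain_reason)
        step(control)              # never preempted once it has started
        checkpoints.append(index)  # stand-in for saving a checkpoint
    # No steps left: finish normally even if a drain was requested.
    return "done"
```

A step that requests a drain still completes and is checkpointed. The exception is raised only if another step would have followed, which mirrors the "terminal step" and "more tasks remain" rows of the table.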

Drain does not cancel asyncio tasks or kill threads. Pair it with a graceful timeout + task.cancel() (or process exit) if you need a hard upper bound — see test_drain_then_cancel_after_graceful_timeout for the recommended pattern.
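
A minimal sketch of pairing a drain request with a hard upper bound, using only `asyncio`. It is not the code of the referenced test; `drain_then_cancel` and its parameters are hypothetical names, and `request_drain` stands for any zero-argument callable such as a bound `control.request_drain`.

```python
import asyncio


async def drain_then_cancel(task, request_drain, grace_seconds):
    """Ask for a drain, wait up to grace_seconds, then cancel as a fallback.

    Returns "drained" if the task finished on its own within the grace
    period, otherwise "cancelled".
    """
    request_drain()
    done, _pending = await asyncio.wait({task}, timeout=grace_seconds)
    if task in done:
        return "drained"
    # Grace period expired: fall back to hard cancellation.
    task.cancel()
    # Wait until the cancellation has actually been processed.
    await asyncio.gather(task, return_exceptions=True)
    return "cancelled"
```

When the helper returns "drained", the caller should still await the task to retrieve its result or the drain exception.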

Usage

import logging

from langgraph.errors import GraphDrained
from langgraph.runtime import RunControl

log = logging.getLogger(__name__)

control = RunControl()

# In a signal handler, supervisor, etc.:
# control.request_drain("sigterm")

# `graph`, `input`, and `config` are supplied by the application.
try:
    result = graph.invoke(input, config, control=control)
    if control.drain_requested:
        # finished naturally on the same tick where drain was requested
        ...
except GraphDrained as e:
    # checkpoint saved; resume later with graph.invoke(None, config)
    log.info("graph drained: %s", e.reason)

Test plan

  • Sync + async drain stops the next superstep (test_run_control_request_drain_stops_future_steps[_async])
  • Drain on the terminal step finishes normally (test_drain_requested_in_terminal_step_finishes_normally[_async])
  • durability="exit" persists a resumable checkpoint on drain (test_drain_with_exit_durability_persists_resume_checkpoint)
  • Subgraph drain bubbles up and parent resumes correctly (test_drain_from_subgraph_can_resume_parent)
  • External thread / task triggering drain mid-run (test_external_drain_concurrent_sync / _async)
  • Drain + hard cancel after graceful timeout (test_drain_then_cancel_after_graceful_timeout)
  • Functional API: in-flight @task futures still resolve after request_drain() (test_request_drain_allows_inflight_[a]call_scheduling)
  • control kwarg wired through stream_v2 (test_stream_v2_accepts_control_for_drain)
  • Runtime.merge preserves control (test_merge_runtime_preserves_run_control)

@longquanzheng Quanzheng Long (longquanzheng) changed the title from "Wfh/graceful" to "feat: allow graph to graceful shutdown/drain by request" Mar 27, 2026
@longquanzheng Quanzheng Long (longquanzheng) marked this pull request as ready for review March 28, 2026 00:02
Calling a `@task` from inside an async `@entrypoint` requires Python 3.11+
contextvars support to propagate the runnable config; on 3.10 it raises
`Called get_config outside of a runnable context`. Mark the test with the
existing NEEDS_CONTEXTVARS skip, matching the convention used by every other
async-entrypoint+task test in this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hinthornw William FH (hinthornw) merged commit 40ab009 into main Apr 29, 2026
67 checks passed
@hinthornw William FH (hinthornw) deleted the wfh/graceful branch April 29, 2026 22:23
Christian Bromann (christian-bromann) added a commit to langchain-ai/langgraphjs that referenced this pull request Jun 10, 2026
## Summary

Ports Python PR langchain-ai/langgraph#7274
("allow graph to graceful shutdown/drain by request") to LangGraphJS.
Adds cooperative, between-superstep draining so a run can be asked to
stop at the next superstep boundary, persist its checkpoint, and surface
a resumable terminal error.

This is the JS PR for the **Graph draining / graceful shutdown** parity
unit.

## What's added

- **`RunControl`** (new `pregel/runtime.ts`, exported from
`@langchain/langgraph`): a run-scoped handle with `requestDrain(reason =
"shutdown")` and read-only `drainRequested` / `drainReason`.
- **`GraphDrained`** (`errors.ts`): a `GraphBubbleUp` subclass carrying
`reason`, thrown when a run exits early due to drain. Plus an
`isGraphDrained` guard.
- **`control` option** on `invoke` / `stream` / `streamEvents` and the
functional-API equivalents of `invoke`. It is surfaced on
`runtime.control` (nodes can read it or call `requestDrain()`) and
propagated into subgraphs. When none is passed, a fresh `RunControl` is
created for each run.

## Semantics (cooperative, between-superstep)

`requestDrain()` flips a flag. The Pregel loop checks it at the top of
each `tick()`, **after** the previous superstep's writes have been
applied and checkpointed and the next tasks have been prepared. It never
preempts work that is already running.

| Scenario | Behavior |
|---|---|
| Node mid-execution | Runs to completion; drain takes effect at the
next superstep. |
| Graph naturally finishes on the same tick where drain was requested |
Returns normally (status `done`). No `GraphDrained`. Caller can inspect
`control.drainRequested`. |
| More tasks remain | Saves the last completed superstep's checkpoint
(also under `durability: "exit"`) and throws `GraphDrained(reason)`.
Resume with `invoke(null, config)`. |
| Subgraph requests drain | `GraphDrained` bubbles up through the parent
loop and stops it at its own next boundary; the parent's checkpoint is
saved and resumable. |

Draining does **not** cancel async work. Pair it with an `AbortSignal`
if you need a hard upper bound (see the `drain then cancel after a
graceful timeout` test).

## Files

- `errors.ts` — `GraphDrained` + `isGraphDrained`
- `pregel/runtime.ts` — `RunControl`
- `pregel/runnable_types.ts` — `control?: RunControl` on `Runtime`
- `pregel/types.ts` — `control` on `PregelOptions`
- `pregel/utils/config.ts`, `constants.ts` — config-key wiring
- `pregel/loop.ts` — `"draining"` status + drain check at the tick
boundary
- `pregel/index.ts` — option wiring + raising `GraphDrained`
- `pregel/runner.ts` — subgraph drain bubble-up handling

## Tests

`libs/langgraph-core/src/tests/run_control.test.ts` (14 tests, all sync
+ async where applicable):
drain stops the next step (sync/async), terminal-step drain finishes
normally, exit- and default-durability resume, pre-drained control,
subgraph → parent bubble + resume, external concurrent drain,
drain-then-cancel via `AbortSignal`, reading/`requestDrain()` via
`runtime.control`, `stream()` accepts control, and functional-API
in-flight `task` futures still resolve. Full package suite passes (1358
+ 14, 0 failures); lint and format are clean.

## Notable divergence from Python

Python added `"drained"` to a local `SubgraphStatus` literal. The JS v3
stream lifecycle uses `AgentStatus` from the external
`@langchain/protocol` package, which has no `"drained"` member, so
`GraphDrained` propagates through streams as the terminal error rather
than as a new lifecycle status. The parity-relevant signal — the
`GraphDrained` exception — is what consumers catch. Noted in the
changeset.

## Source

- Python PR: langchain-ai/langgraph#7274
- Parity plan section: Graph draining / graceful shutdown