fix(worker): emit error chunks when a runner dies mid-command by pandego · Pull Request #1645 · exo-explore/exo

pandego · 2026-03-02T12:58:26Z

Summary

Track in-flight tasks in RunnerSupervisor, not only pending tasks that have not yet been acknowledged.
When _check_runner() detects a crashed runner, emit ChunkGenerated(ErrorChunk) for each in-flight command task (TextGeneration, ImageGeneration, ImageEdits).
Keep the existing RunnerStatusUpdated(RunnerFailed) emission so the planner and state still transition correctly.
Add a unit test for the supervisor crash path to ensure an error chunk is emitted before the failed runner status.
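
For readers without the diff open, the crash path described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual exo code: SupervisorSketch, Task, TaskKind, the LOAD_MODEL kind, and every field name are hypothetical stand-ins, and the real RunnerSupervisor, ChunkGenerated, ErrorChunk, and RunnerStatusUpdated types in the repository likely have different shapes.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class TaskKind(Enum):
    TEXT_GENERATION = auto()
    IMAGE_GENERATION = auto()
    IMAGE_EDITS = auto()
    LOAD_MODEL = auto()  # hypothetical non-command task; receives no error chunk


# Task kinds that have a client stream waiting on them.
COMMAND_KINDS = {
    TaskKind.TEXT_GENERATION,
    TaskKind.IMAGE_GENERATION,
    TaskKind.IMAGE_EDITS,
}


@dataclass(frozen=True)
class Task:
    task_id: str
    command_id: str
    kind: TaskKind


@dataclass(frozen=True)
class ErrorChunk:
    error_message: str


@dataclass(frozen=True)
class ChunkGenerated:
    command_id: str
    chunk: ErrorChunk


@dataclass(frozen=True)
class RunnerFailed:
    error_message: str


@dataclass(frozen=True)
class RunnerStatusUpdated:
    runner_id: str
    status: RunnerFailed


@dataclass
class SupervisorSketch:
    """Hypothetical stand-in for the supervisor's crash handling."""

    runner_id: str
    in_flight: dict[str, Task] = field(default_factory=dict)
    events: list[object] = field(default_factory=list)

    def start_task(self, task: Task) -> None:
        # Track the task for as long as the runner is working on it.
        self.in_flight[task.task_id] = task

    def finish_task(self, task_id: str) -> None:
        self.in_flight.pop(task_id, None)

    def check_runner(self, runner_alive: bool, cause: str = "runner exited") -> None:
        if runner_alive:
            return
        # First, unblock every stream that was waiting on this runner.
        for task in self.in_flight.values():
            if task.kind in COMMAND_KINDS:
                self.events.append(ChunkGenerated(task.command_id, ErrorChunk(cause)))
        self.in_flight.clear()
        # Then report the failure so the planner and state can transition.
        self.events.append(RunnerStatusUpdated(self.runner_id, RunnerFailed(cause)))
```

In this sketch, a crash with one text-generation task in flight yields a ChunkGenerated event followed by a RunnerStatusUpdated event, which is the ordering the new unit test is described as checking.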

Why

#1586 reports streams that can hang forever when runners crash during warmup/loading. This keeps failure signaling at the runner-supervisor layer, matching maintainer guidance in the issue thread.

Validation

attempted: uv run pytest src/exo/worker/tests/unittests/test_runner/test_runner_supervisor.py
blocked locally by environment disk exhaustion while uv tried to materialize heavy CUDA wheels (No space left on device during nvidia-cudnn-cu13 extraction)

I kept the change scoped and added a targeted unit test for the failure path.

Evanev7 · 2026-03-03T10:39:11Z

looks good, waiting on #1642 before this i reckon

pandego · 2026-03-03T14:33:39Z

Quick transparency update on this push:

I fixed the concrete nix flake check typecheck errors reported in CI for the new regression test (test_runner_supervisor.py) and rebased on latest main.
I ran targeted local validation, but full local env parity is limited on this runner (native build toolchain/maturin path), so I could not reproduce the full CI stack end-to-end before push.
This update is intended to unblock CI verification of the typed fix directly in the project’s canonical checks.

If anything else fails in CI, I’ll follow up with a minimal patch quickly.

Evanev7

sweet

pandego force-pushed the fix/1586-runner-supervisor-error-chunk branch from 468c335 to 8ca4f8d on March 3, 2026 14:28

Evanev7 force-pushed the fix/1586-runner-supervisor-error-chunk branch 2 times, most recently from 0fdcbb4 to c432bca on March 4, 2026 15:51

pandego added 2 commits March 4, 2026 15:53

fix(worker): emit command error chunks when runner crashes

832b94d

test(worker): tighten typing in runner supervisor regression test

a75ce4c

Evanev7 force-pushed the fix/1586-runner-supervisor-error-chunk branch from c432bca to a75ce4c on March 4, 2026 15:53

lint fix

45d8b05

Evanev7 approved these changes Mar 4, 2026

Evanev7 enabled auto-merge (squash) March 4, 2026 16:09

change wording

9d195d2

Evanev7 force-pushed the fix/1586-runner-supervisor-error-chunk branch from 92a52aa to 9d195d2 on March 4, 2026 16:58

Evanev7 merged commit 8485805 into exo-explore:main Mar 4, 2026
3 checks passed

Evanev7 merged 4 commits into exo-explore:main from pandego:fix/1586-runner-supervisor-error-chunk