fix(worker): emit error chunks when a runner dies mid-command #1645

Merged
Evanev7 merged 4 commits into exo-explore:main from pandego:fix/1586-runner-supervisor-error-chunk
Mar 4, 2026

Conversation

Contributor

@pandego pandego commented Mar 2, 2026

Closes #1586

Summary

  • track in-flight tasks in RunnerSupervisor (not only unacknowledged pending tasks)
  • when _check_runner() detects a crashed runner, emit ChunkGenerated(ErrorChunk) for each in-flight command task (TextGeneration, ImageGeneration, ImageEdits)
  • keep existing RunnerStatusUpdated(RunnerFailed) emission so planner/state still transition correctly
  • add a unit test for supervisor crash path to ensure an error chunk is emitted before failed runner status
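The crash path described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SupervisorSketch`, `Task`, `handle_runner_crash`, and every field name are hypothetical stand-ins, and only the event names (`ChunkGenerated`, `ErrorChunk`, `RunnerStatusUpdated`, `RunnerFailed`) and the three command kinds come from the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Command task kinds named in the PR summary; other kinds get no error chunk.
COMMAND_KINDS = frozenset({"TextGeneration", "ImageGeneration", "ImageEdits"})


@dataclass(frozen=True)
class Task:  # hypothetical shape, for illustration only
    task_id: str
    command_id: str
    kind: str


@dataclass(frozen=True)
class ErrorChunk:  # field name is an assumption
    error_message: str


@dataclass(frozen=True)
class ChunkGenerated:  # field names are assumptions
    command_id: str
    chunk: ErrorChunk


@dataclass(frozen=True)
class RunnerFailed:  # field name is an assumption
    error_message: str


@dataclass(frozen=True)
class RunnerStatusUpdated:  # field names are assumptions
    runner_id: str
    status: RunnerFailed


Event = Union[ChunkGenerated, RunnerStatusUpdated]


@dataclass
class SupervisorSketch:
    """Toy stand-in for RunnerSupervisor that records emitted events in a list."""

    runner_id: str
    in_flight: Dict[str, Task] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def start_task(self, task: Task) -> None:
        # Track every task handed to the runner, not only unacknowledged ones.
        self.in_flight[task.task_id] = task

    def finish_task(self, task_id: str) -> None:
        # A completed task no longer needs an error chunk on crash.
        self.in_flight.pop(task_id, None)

    def handle_runner_crash(self, cause: str) -> None:
        # 1. Emit an error chunk for each in-flight command task so its
        #    stream receives a terminal chunk.
        for task in self.in_flight.values():
            if task.kind in COMMAND_KINDS:
                self.events.append(
                    ChunkGenerated(task.command_id, ErrorChunk(cause))
                )
        self.in_flight.clear()
        # 2. Then emit the failed status, as before, so planner/state
        #    can still transition.
        self.events.append(
            RunnerStatusUpdated(self.runner_id, RunnerFailed(cause))
        )
```

In this sketch, a supervisor with one in-flight `TextGeneration` task and one non-command task would record exactly one `ChunkGenerated` followed by one `RunnerStatusUpdated` after `handle_runner_crash(...)`, which is the ordering the new unit test is described as checking.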

Why

#1586 reports that streams can hang forever when a runner crashes during warmup/loading. This PR keeps failure signaling at the runner-supervisor layer, matching maintainer guidance in the issue thread.

Validation

  • attempted: uv run pytest src/exo/worker/tests/unittests/test_runner/test_runner_supervisor.py
  • blocked locally by environment disk exhaustion while uv tried to materialize heavy CUDA wheels (No space left on device during nvidia-cudnn-cu13 extraction)

I kept the change scoped and added a targeted unit test for the failure path.

Member

Evanev7 commented Mar 3, 2026

looks good, waiting on #1642 before this i reckon

pandego force-pushed the fix/1586-runner-supervisor-error-chunk branch from 468c335 to 8ca4f8d on March 3, 2026 14:28
Contributor Author

pandego commented Mar 3, 2026

Quick transparency update on this push:

  • I fixed the concrete nix flake check typecheck errors reported in CI for the new regression test (test_runner_supervisor.py) and rebased on latest main.
  • I ran targeted local validation, but full local env parity is limited on this runner (native build toolchain/maturin path), so I could not reproduce the full CI stack end-to-end before push.
  • This update is intended to unblock CI verification of the typed fix directly in the project’s canonical checks.

If anything else fails in CI, I’ll follow up with a minimal patch quickly.

Evanev7 force-pushed the fix/1586-runner-supervisor-error-chunk branch 2 times, most recently from 0fdcbb4 to c432bca on March 4, 2026 15:51
Evanev7 force-pushed the fix/1586-runner-supervisor-error-chunk branch from c432bca to a75ce4c on March 4, 2026 15:53
Member

@Evanev7 Evanev7 left a comment

sweet

Evanev7 enabled auto-merge (squash) on March 4, 2026 16:09
Evanev7 force-pushed the fix/1586-runner-supervisor-error-chunk branch from 92a52aa to 9d195d2 on March 4, 2026 16:58
Evanev7 merged commit 8485805 into exo-explore:main on Mar 4, 2026
3 checks passed
Development

Successfully merging this pull request may close these issues.

API streams hang forever when runner process crashes during warmup/loading

2 participants