fix(worker): emit error chunks when a runner dies mid-command #1645
Merged
Evanev7 merged 4 commits into exo-explore:main on Mar 4, 2026
Member
looks good, waiting on #1642 before this i reckon
Contributor
Author
Quick transparency update on this push:
If anything else fails in CI, I’ll follow up with a minimal patch quickly.
Closes #1586
Summary
- Track in-flight tasks in RunnerSupervisor (not only unacknowledged pending tasks)
- When _check_runner() detects a crashed runner, emit ChunkGenerated(ErrorChunk) for each in-flight command task (TextGeneration, ImageGeneration, ImageEdits)
- Keep the existing RunnerStatusUpdated(RunnerFailed) emission so planner/state still transition correctly

Why
#1586 reports streams that can hang forever when runners crash during warmup/loading. This keeps failure signaling at the runner-supervisor layer, matching maintainer guidance in the issue thread.

Validation
- uv run pytest src/exo/worker/tests/unittests/test_runner/test_runner_supervisor.py
- ("No space left on device" during nvidia-cudnn-cu13 extraction)

I kept the change scoped and added a targeted unit test for the failure path.
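For readers skimming the thread, the failure path described in the Summary can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: SupervisorSketch, Task, TaskKind, the LOAD_MODEL kind, the field names, and the ordering of error chunks before the status update are all assumptions made for the example. Only the event names (ChunkGenerated, ErrorChunk, RunnerStatusUpdated, RunnerFailed) and the three command task kinds come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskKind(Enum):
    TEXT_GENERATION = "TextGeneration"
    IMAGE_GENERATION = "ImageGeneration"
    IMAGE_EDITS = "ImageEdits"
    LOAD_MODEL = "LoadModel"  # hypothetical non-command task, for contrast


# Task kinds that have a client stream waiting on them.
COMMAND_KINDS = {
    TaskKind.TEXT_GENERATION,
    TaskKind.IMAGE_GENERATION,
    TaskKind.IMAGE_EDITS,
}


@dataclass(frozen=True)
class Task:  # hypothetical shape
    task_id: str
    command_id: str
    kind: TaskKind


@dataclass(frozen=True)
class ErrorChunk:
    error_message: str


@dataclass(frozen=True)
class ChunkGenerated:
    command_id: str
    chunk: ErrorChunk


@dataclass(frozen=True)
class RunnerFailed:
    error_message: str


@dataclass(frozen=True)
class RunnerStatusUpdated:
    runner_id: str
    status: RunnerFailed


@dataclass
class SupervisorSketch:
    """Toy supervisor: tracks in-flight tasks and records emitted events."""

    runner_id: str
    in_flight: dict[str, Task] = field(default_factory=dict)
    events: list[object] = field(default_factory=list)

    def start_task(self, task: Task) -> None:
        self.in_flight[task.task_id] = task

    def finish_task(self, task_id: str) -> None:
        self.in_flight.pop(task_id, None)

    def check_runner(self, runner_alive: bool, reason: str = "runner exited") -> None:
        if runner_alive:
            return
        # Unblock every stream that was waiting on the dead runner.
        for task in self.in_flight.values():
            if task.kind in COMMAND_KINDS:
                self.events.append(ChunkGenerated(task.command_id, ErrorChunk(reason)))
        self.in_flight.clear()
        # Still report the runner failure so planner/state can transition.
        self.events.append(RunnerStatusUpdated(self.runner_id, RunnerFailed(reason)))
```

In this sketch, a crash with one text-generation task and one model-load task in flight produces a single error chunk for the text-generation command, followed by the runner-failed status update. The model-load task has no stream to unblock, so it yields no chunk.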