
fail_hard on BaseExceptions in user tasks#7330

Closed
gjoseph92 wants to merge 1 commit into dask:main from gjoseph92:baseexceptions-in-tasks-2

Conversation

@gjoseph92
Collaborator

If user code raises a BaseException, it kills the whole worker (via fail_hard).

Semantically, I'm still back and forth on whether this approach or #5997 makes more sense. But this is probably less controversial.

Closes #5958

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92
Collaborator Author

I was curious why this test passed on main with SystemExit, but not BaseException. I'm guessing it's because asyncio special-cases SystemExit in lots of places. Particularly here, when running a callback on the event loop.
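The special-casing can be seen in a minimal standalone sketch (plain asyncio, no distributed code, all names made up here): a callback that raises an arbitrary BaseException is caught by the loop and routed to its exception handler, whereas SystemExit is re-raised and escapes asyncio.run entirely.

```python
import asyncio

class Boom(BaseException):
    """A BaseException that is *not* SystemExit or KeyboardInterrupt."""

def raise_boom():
    raise Boom("raised inside an event-loop callback")

def raise_system_exit():
    raise SystemExit(1)

dropped = []  # exceptions the loop routed to its exception handler

async def run_callback(cb):
    loop = asyncio.get_running_loop()
    # Capture what the loop would otherwise just log and drop
    loop.set_exception_handler(lambda loop, ctx: dropped.append(ctx["exception"]))
    loop.call_soon(cb)
    await asyncio.sleep(0.01)  # give the callback a chance to run

# A plain BaseException is caught by the loop and merely reported:
asyncio.run(run_callback(raise_boom))
print([type(e).__name__ for e in dropped])  # ['Boom']

# SystemExit is re-raised by the loop and escapes asyncio.run entirely:
escaped = False
try:
    asyncio.run(run_callback(raise_system_exit))
except SystemExit:
    escaped = True
print("SystemExit escaped:", escaped)  # True
```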

@gjoseph92
Collaborator Author

@graingert curious if you have any ideas of better ways to do this. The path is basically:

    task = asyncio.create_task(
        self.execute(inst.key, stimulus_id=inst.stimulus_id),
        name=f"execute({inst.key})",
    )
    task.add_done_callback(self._handle_stimulus_from_task)

The _handle_stimulus_from_task callback just does this:

    try:
        # This *should* never raise any other exceptions
        stim = task.result()
    except asyncio.CancelledError:
        # This should exclusively happen in Worker.close()
        return

So the problem is that task.result() raises a BaseException inside the _handle_stimulus_from_task callback. That exception then propagates up into asyncio and just gets dropped.

Broadly, the problem is:

    # This *should* never raise any other exceptions

Any unhandled exception in a worker async instruction would be dropped, and would probably cause a deadlock. It feels to me like handling these errors should be the responsibility of the WorkerStateMachine that's creating the tasks in the first place.
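The failure mode can be reproduced in a standalone sketch (the names execute, handle_stimulus, and Doom are illustrative, not distributed's API): the done-callback only expects CancelledError, so any other BaseException re-raised by task.result() escapes into the loop's exception handler and is merely logged.

```python
import asyncio

class Doom(BaseException):
    """Stands in for a BaseException raised by user code."""

async def execute():
    # Illustrative stand-in for Worker.execute running a user task
    raise Doom("user task raised a BaseException")

dropped = []  # what the loop's exception handler receives

def handle_stimulus(task: asyncio.Task) -> None:
    # Mirrors the done-callback pattern quoted above
    try:
        # "This *should* never raise any other exceptions"
        stim = task.result()
    except asyncio.CancelledError:
        # "This should exclusively happen in Worker.close()"
        return
    print("got stimulus:", stim)  # never reached in this sketch

async def main():
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(lambda loop, ctx: dropped.append(ctx["exception"]))
    task = asyncio.create_task(execute(), name="execute()")
    task.add_done_callback(handle_stimulus)
    await asyncio.sleep(0.01)

asyncio.run(main())
# Doom never reached any handler that could act on it; it was just dropped:
print(type(dropped[0]).__name__)  # Doom
```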

@github-actions
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    18 files  +3   18 suites  +3   0s ⏱️ -6h 36m 59s
    520 tests  -2 701   502 ✔️ -2 635   14 💤 -69   4 ❌ +3
    4 672 runs  -19 145   4 322 ✔️ -18 586   323 💤 -585   27 ❌ +26

For more details on these failures, see this check.

Results for commit 65c78f0. ± Comparison against base commit 69b74f9.

@gjoseph92
Collaborator Author

As you can see, this breaks CI because it kills the pytest process partway through a couple of tests:

distributed/tests/test_client.py::test_scatter_compute_lose
distributed/deploy/tests/test_adaptive.py::test_adapt_quickly

I think this indicates some weakness in how the worker state machine design handles errors in its async tasks.

For test_scatter_compute_lose, it turns out that gather_dep is raising a CancelledError (from get_data_from_worker) when it tries to fetch data from the closing worker. This should probably be a GatherDepNetworkFailureEvent. Instead, the CancelledError is ignored and no stimulus happens. This kinda feels like a potential deadlock.

For test_adapt_quickly, there's some interaction between the worker closing and fail_hard that I'm not understanding yet.

But broadly, since we don't have a structured way of handling exceptions in async tasks in the worker state machine (such as set_exception_handler), it's hard to predict what small changes along this code path will do.
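One possible shape for such structured handling, sketched here under assumptions (this is not distributed's implementation, and run_instruction, handle_event, and fail_hard are made-up names): wrap each async instruction so that every outcome, whether a result, a cancellation, or any BaseException, is turned into an explicit call back into the state machine, leaving nothing to escape through a done-callback.

```python
import asyncio

class Boom(BaseException):
    """Stands in for a BaseException escaping a user task."""

failures = []  # what fail_hard received

async def run_instruction(coro, handle_event, fail_hard):
    """Wrap one async instruction so every outcome is handled explicitly."""
    try:
        stim = await coro
    except asyncio.CancelledError:
        return  # expected only during shutdown
    except BaseException as exc:
        fail_hard(exc)  # e.g. log loudly and shut the worker down
        return
    handle_event(stim)  # the normal path: feed the result back in

async def bad_instruction():
    raise Boom("user task escaped")

async def main():
    await run_instruction(bad_instruction(), print, failures.append)

asyncio.run(main())
print(type(failures[0]).__name__)  # Boom
```

Because the wrapper is itself the task body, there is no window in which an exception can surface only inside a done-callback and get routed to the loop's default handler.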

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Nov 18, 2022
@gjoseph92
Collaborator Author

I think #5997 may be a better approach. Closing for now.

@gjoseph92 gjoseph92 closed this Nov 18, 2022

Development

Successfully merging this pull request may close these issues.

BaseException in task leads to task never completing
