
fail_hard on BaseExceptions in user tasks#7330

Closed
gjoseph92 wants to merge 1 commit into dask:main from gjoseph92:baseexceptions-in-tasks-2

Conversation

@gjoseph92
Collaborator

If user code raises a BaseException, it kills the whole worker (via fail_hard).

Semantically, I'm still back and forth on whether this approach or #5997 makes more sense. But this is probably less controversial.

Closes #5958

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92
Collaborator Author

I was curious why this test passed on main with SystemExit, but not BaseException. I'm guessing it's because asyncio special-cases SystemExit in lots of places. Particularly here, when running a callback on the event loop.
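The special-casing can be seen in a minimal standalone sketch (plain asyncio, no distributed code, all names made up here): a callback that raises an arbitrary BaseException is caught by the loop and routed to its exception handler, whereas SystemExit is re-raised and escapes asyncio.run entirely.

```python
import asyncio

class Boom(BaseException):
    """A BaseException that is *not* SystemExit or KeyboardInterrupt."""

def raise_boom():
    raise Boom("raised inside an event-loop callback")

def raise_system_exit():
    raise SystemExit(1)

dropped = []  # exceptions the loop routed to its exception handler

async def run_callback(cb):
    loop = asyncio.get_running_loop()
    # Capture what the loop would otherwise just log and drop
    loop.set_exception_handler(lambda loop, ctx: dropped.append(ctx["exception"]))
    loop.call_soon(cb)
    await asyncio.sleep(0.01)  # give the callback a chance to run

# A plain BaseException is caught by the loop and merely reported:
asyncio.run(run_callback(raise_boom))
print([type(e).__name__ for e in dropped])  # ['Boom']

# SystemExit is re-raised by the loop and escapes asyncio.run entirely:
escaped = False
try:
    asyncio.run(run_callback(raise_system_exit))
except SystemExit:
    escaped = True
print("SystemExit escaped:", escaped)  # True
```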

@gjoseph92
Collaborator Author

@graingert curious if you have any ideas of better ways to do this. The path is basically:

    task = asyncio.create_task(
        self.execute(inst.key, stimulus_id=inst.stimulus_id),
        name=f"execute({inst.key})",
    )
    task.add_done_callback(self._handle_stimulus_from_task)

The _handle_stimulus_from_task callback just does this:

    try:
        # This *should* never raise any other exceptions
        stim = task.result()
    except asyncio.CancelledError:
        # This should exclusively happen in Worker.close()
        return

So the problem is that task.result() raises a BaseException inside the _handle_stimulus_from_task callback. That exception then propagates up into asyncio and just gets dropped.

Broadly, the problem is:

    # This *should* never raise any other exceptions

Any unhandled exception in a worker async instruction would be dropped, and would probably cause a deadlock. It feels to me like handling these errors should be the responsibility of the WorkerStateMachine that's creating the tasks in the first place.
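The failure mode can be reproduced in a standalone sketch (the names execute, handle_stimulus, and Doom are illustrative, not distributed's API): the done-callback only expects CancelledError, so any other BaseException re-raised by task.result() escapes into the loop's exception handler and is merely logged.

```python
import asyncio

class Doom(BaseException):
    """Stands in for a BaseException raised by user code."""

async def execute():
    # Illustrative stand-in for Worker.execute running a user task
    raise Doom("user task raised a BaseException")

dropped = []  # what the loop's exception handler receives

def handle_stimulus(task: asyncio.Task) -> None:
    # Mirrors the done-callback pattern quoted above
    try:
        # "This *should* never raise any other exceptions"
        stim = task.result()
    except asyncio.CancelledError:
        # "This should exclusively happen in Worker.close()"
        return
    print("got stimulus:", stim)  # never reached in this sketch

async def main():
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(lambda loop, ctx: dropped.append(ctx["exception"]))
    task = asyncio.create_task(execute(), name="execute()")
    task.add_done_callback(handle_stimulus)
    await asyncio.sleep(0.01)

asyncio.run(main())
# Doom never reached any handler that could act on it; it was just dropped:
print(type(dropped[0]).__name__)  # Doom
```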

@github-actions
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    18 files  +3   18 suites  +3   0s ⏱️ -6h 36m 59s
    520 tests  -2 701   502 ✔️ -2 635   14 💤 -69   4 ❌ +3
    4 672 runs  -19 145   4 322 ✔️ -18 586   323 💤 -585   27 ❌ +26

For more details on these failures, see this check.

Results for commit 65c78f0. ± Comparison against base commit 69b74f9.

@gjoseph92
Collaborator Author

As you can see, this breaks CI because it kills the pytest process partway through a couple of tests:

distributed/tests/test_client.py::test_scatter_compute_lose
distributed/deploy/tests/test_adaptive.py::test_adapt_quickly

I think this indicates some weakness in how the worker state machine design handles errors in its async tasks.

For test_scatter_compute_lose, it turns out that gather_dep is raising a CancelledError (from get_data_from_worker) when it tries to fetch data from the closing worker. This should probably be a GatherDepNetworkFailureEvent. Instead, the CancelledError is ignored and no stimulus happens. This kinda feels like a potential deadlock.

For test_adapt_quickly, there's some interaction between the worker closing and fail_hard that I'm not understanding yet.

But broadly, since we don't have a structured way of handling exceptions in async tasks in the worker state machine (such as set_exception_handler), it's hard to predict what small changes along this code path will do.
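One possible shape for such structured handling, sketched here under assumptions (this is not distributed's implementation, and run_instruction, handle_event, and fail_hard are made-up names): wrap each async instruction so that every outcome, whether a result, a cancellation, or any BaseException, is turned into an explicit call back into the state machine, leaving nothing to escape through a done-callback.

```python
import asyncio

class Boom(BaseException):
    """Stands in for a BaseException escaping a user task."""

failures = []  # what fail_hard received

async def run_instruction(coro, handle_event, fail_hard):
    """Wrap one async instruction so every outcome is handled explicitly."""
    try:
        stim = await coro
    except asyncio.CancelledError:
        return  # expected only during shutdown
    except BaseException as exc:
        fail_hard(exc)  # e.g. log loudly and shut the worker down
        return
    handle_event(stim)  # the normal path: feed the result back in

async def bad_instruction():
    raise Boom("user task escaped")

async def main():
    await run_instruction(bad_instruction(), print, failures.append)

asyncio.run(main())
print(type(failures[0]).__name__)  # Boom
```

Because the wrapper is itself the task body, there is no window in which an exception can surface only inside a done-callback and get routed to the loop's default handler.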

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Nov 18, 2022
@gjoseph92
Collaborator Author

I think #5997 may be a better approach. Closing for now.

@gjoseph92 gjoseph92 closed this Nov 18, 2022

Development

Successfully merging this pull request may close these issues.

BaseException in task leads to task never completing
