Add thread interrupt logic #4726

Open
martindurant wants to merge 13 commits into dask:main from martindurant:interrupt

Conversation

@martindurant (Member) commented Apr 21, 2021

@martindurant (Member Author)

This shows the basic mechanism working. The following script runs successfully:

import dask.distributed
import time
import threading
client = dask.distributed.Client(processes=False, n_workers=1, threads_per_worker=1)
client.wait_for_workers(1)

def runme(N):
    for i in range(N):
        time.sleep(1)

fut = client.submit(runme, 10000000)
time.sleep(0.01)  # make sure fut is allocated first
fut0 = client.submit(lambda x: x, True)

fut.cancel()
fut0.result()

As coded, this happens whenever an executing key is released for whatever reason, and there is no config for it. Also, if the task in question captures exceptions, it will keep running.

(please ignore print statements)
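For context, the interruption technique under discussion is CPython's PyThreadState_SetAsyncExc, shown here in a standalone ctypes sketch (this is not the PR's code). It also demonstrates the caveat above: a task that captures exceptions will absorb the injected one.

```python
import ctypes
import threading
import time


def async_raise(thread_ident, exc_type):
    """Raise exc_type asynchronously in the thread with the given ident,
    via CPython's PyThreadState_SetAsyncExc."""
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_ident), ctypes.py_object(exc_type)
    )
    if res > 1:
        # More than one thread state was modified: revert, per the CPython docs.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_ulong(thread_ident), None
        )
    return res


swallowed = []


def task_that_captures_exceptions():
    try:
        for _ in range(200):
            time.sleep(0.05)
    except BaseException:
        # A broad handler absorbs the injected exception, which is why
        # such tasks can keep running (or at least exit on their own terms).
        swallowed.append(True)


t = threading.Thread(target=task_that_captures_exceptions)
t.start()
time.sleep(0.2)  # let the task get going
async_raise(t.ident, KeyboardInterrupt)
t.join(timeout=5)
print(swallowed)  # → [True]: the task caught the interrupt
```

The injected exception is only delivered when the target thread next executes bytecode, so a thread blocked in C code (e.g. a long `time.sleep`) is interrupted only once the call returns.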

@mrocklin (Member)

This is cool to see. I know that this is still just a draft status, and I'm sure that this is in your plan, but I would encourage you to write down tests to demonstrate that some of the concerns raised in the issue are handled by this, for example that finally blocks are respected and such.
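A sketch of one such test, assuming the interrupt arrives as an asynchronous exception injected via the C API (the ctypes call below is illustrative, not the PR's code): an injected exception unwinds the stack normally, so finally: blocks still run.

```python
import ctypes
import threading
import time

cleanup_ran = []


def task_with_cleanup():
    try:
        for _ in range(200):
            time.sleep(0.05)
    finally:
        cleanup_ran.append(True)  # must run even when interrupted


t = threading.Thread(target=task_with_cleanup)
t.start()
time.sleep(0.2)  # let the task get going
# Inject SystemExit; threading's default excepthook ignores it, so the
# thread exits quietly after running its finally block.
ctypes.pythonapi.PyThreadState_SetAsyncExc(
    ctypes.c_ulong(t.ident), ctypes.py_object(SystemExit)
)
t.join(timeout=5)
assert cleanup_ran == [True]
```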

@martindurant (Member Author)

@mrocklin: first pass at a pair of tests, but we will still need to be very careful and thorough here. BTW: it took me ages to realise that fut.cancel() is a coroutine needing await.

I would still like some feedback here on whether this is a good idea for the general case, or should be a config, or only exposed via some explicit client call. The latter is hard, since the user may not have a handle to the particular task that is hanging (or just taking longer than expected).

@martindurant (Member Author)

(windows error is in test_str, not obviously related)

@jrbourbeau (Member)

cc @crusaderky

@fjetter (Member) commented Apr 28, 2021

FWIW, I would be comfortable enough with calling the C API directly for this to be merged. The API appears stable since Python 3.7 (https://docs.python.org/3/c-api/init.html#c.PyThreadState_SetAsyncExc).
If we add more tests and document this behaviour, I'd be fine with it. In terms of complexity, this is also justified considering the benefits. If this were to cause trouble down the road, we could think about toggles, but for now I'd like to avoid more options.

I'm wondering, would this be something we could contribute to CPython directly? I understand that we needed to vendor the threadpool for the seceding feature, but every modification raises maintenance costs (or the risk of them).

@alexandervaneck

Hello 👋 I stumbled upon this issue/PR while working with Prefect. (See PrefectHQ/prefect#5043).
I've also tested the proposed PR and it seems to fix this issue, great work @martindurant 🙌

@martindurant : Do you have the intention to finish this PR and make it part of distributed?

Comment on lines +2781 to +2786
    if ts.state == "executing":
        self.executing_count -= 1
        th = [th for th, k in self.active_threads.items() if k == key]
        if th:
            logger.info("Interrupting thread %i for task %s", th[0], key)
            self.executor.interrupt(th[0])
Member

This section should nowadays live in transition_executing_released
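The self.executor.interrupt(...) call in the diff implies an executor with a method beyond the stdlib ThreadPoolExecutor interface (a point raised again further down). A hypothetical sketch of such an executor; the class and method names are illustrative, not the PR's actual vendored implementation:

```python
import ctypes
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InterruptibleThreadPoolExecutor(ThreadPoolExecutor):
    """Hypothetical executor exposing interrupt(); names are illustrative."""

    def interrupt(self, thread_ident, exc_type=KeyboardInterrupt):
        # Ask CPython to raise exc_type asynchronously in the given thread.
        res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_ulong(thread_ident), ctypes.py_object(exc_type)
        )
        if res > 1:
            # Modified more than one thread state: revert, per the CPython docs.
            ctypes.pythonapi.PyThreadState_SetAsyncExc(
                ctypes.c_ulong(thread_ident), None
            )
        return res == 1


ex = InterruptibleThreadPoolExecutor(max_workers=1)
idents = []


def work():
    idents.append(threading.get_ident())
    try:
        for _ in range(200):
            time.sleep(0.05)
        return "finished"
    except KeyboardInterrupt:
        return "interrupted"


fut = ex.submit(work)
while not idents:       # wait until the task is actually running
    time.sleep(0.01)
time.sleep(0.1)
ex.interrupt(idents[0])
print(fut.result())     # → interrupted
ex.shutdown()
```

Because the task catches the injected exception itself, the pool thread survives and can pick up further work, which matches the behaviour discussed above.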

Tied to config variable distributed.worker.thread_auto_interrupt,
assumed False to start (not yet set in config schema).
@martindurant martindurant marked this pull request as ready for review October 15, 2021 19:50
@martindurant (Member Author)

Provisionally ready for review. We need to decide whether this should be the default or not, and how to describe it in the config and docs.

Added distributed.worker.thread_auto_interrupt to config schema
@martindurant (Member Author)

I have made the default False for now (no interrupt) and added the key to the config schema. It is pretty hidden! We could add this to the docs somewhere, or softly launch it by suggesting some users try it when facing long-running released tasks.
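As a sketch of where the key would live (assuming the layout implied by the commit messages; the final schema location may differ), the user-facing config would look something like:

```yaml
# distributed.yaml (user config) -- illustrative sketch, not the merged schema
distributed:
  worker:
    thread_auto_interrupt: false  # default: do not interrupt threads of released tasks
```

Equivalently, it could be toggled at runtime with dask.config.set({"distributed.worker.thread_auto_interrupt": True}).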

Comment on lines +2151 to +2153
    if th:
        logger.info("Interrupting thread %i for task %s", th[0], ts.key)
        self.executor.interrupt(th[0])
Member

I would argue it is not possible for th to be empty. If that is true, there should probably be an else: raise RuntimeError. Although I'm a bit ambivalent, since on the one hand I'd like us to avoid implicit fail cases, but on the other hand we are not raising exceptions anywhere in the transition engine so far.

Member Author

I imagine it's possible for the task to end at the same time as it is released for another reason. If it turns out there is no longer a thread associated, that doesn't seem like an error to me.

Member

Good point. Can we test this?

Member

At the very least, we should add a comment here. It's not straightforward to figure out, and it feels like an important detail.

I'm also wondering now what happens if the thread finishes properly while we're trying to interrupt vs the task raised an exception before and we're trying to interrupt. What will be the final task state if either one happens? (I'm mostly concerned about invalid states. If we rerun anything, I don't mind as long as the state machine isn't messed up)

Member Author

Can we test this?

I don't see a way we can emulate it without horrible race problems.

what happens if the thread finishes properly while we're trying to interrupt

The thread doesn't finish as such; it goes into the loop to pick up more work, which is what this is for. Having said that, the _WorkItem objects don't have a state beyond what's encoded in the corresponding future.

task raised an exception before and we're trying to interrupt

If the task is already in an except block, you would expect to see "During handling of the above exception, another exception occurred:" as the return, but finally: blocks would still run.

If we rerun anything, I don't mind as long as the state machine isn't messed up

Fundamentally, we are cleaning up a task which is set for release, so the output of the task and the state of the corresponding _WorkItem object don't matter.

@martindurant (Member Author)

Anything left here?

@martindurant (Member Author)

ping?

@fjetter (Member) commented Dec 16, 2021

My final concern here is that we're no longer using the stdlib ThreadPoolExecutor interface but require an extended one, which breaks some assumptions about our compatibility; see

    elif "ThreadPoolExecutor" in str(type(e)):
        result = await self.loop.run_in_executor(
            e,
            apply_function,
            function,
            args2,
            kwargs2,
            self.execution_state,
            ts.key,
            self.active_threads,
            self.active_threads_lock,
            self.scheduler_delay,
        )

Discussion around this https://github.com/dask/distributed/pull/5063/files#r670260505

FWIW, we're not consistent with this and I believe we should require a strict isinstance check, e.g.

    if isinstance(executor, ThreadPoolExecutor):
        executor._work_queue.queue.clear()
        executor.shutdown(wait=executor_wait, timeout=timeout)
    else:


Development

Successfully merging this pull request may close these issues.

Allow workers to cancel running task?
