Always return `ws.address` from `_remove_from_processing` by hendrikmakait · Pull Request #6884 · dask/distributed

hendrikmakait · 2022-08-12T14:26:54Z

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2022-08-12T15:48:58Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±0       15 suites ±0 6h 17m 38s ⏱️ - 48m 17s
  3 000 tests ±0   2 911 ✔️ +3     89 💤 ±0 0 ❌ - 3
22 246 runs ±0 21 194 ✔️ +2 1 052 💤 +1 0 ❌ - 3

Results for commit 3ffcf71. ± Comparison against base commit 1d0701b.

♻️ This comment has been updated with latest results.

fjetter · 2022-08-15T13:51:25Z

distributed/scheduler.py

            w = _remove_from_processing(self, ts)

-            ts.erred_on.add(w or worker)  # type: ignore
+            ts.erred_on.add(w)


IIUC this is what motivated your change. However, if _remove_from_processing returns None, we'd use worker instead. Is worker also None in this scenario? If so, is the kwarg even still in use?

Since we encountered a None in ts.erred_on in #6874, there needs to be some scenario where both w and worker_ are None. I dropped the worker kwarg since it's not being used anywhere and will get collected in **kwargs.

I'm a bit worried here about consistency.

The only way I am aware of to trigger this transition is by receiving a task-erred message from a worker

distributed/distributed/scheduler.py

Line 3093 in 1d0701b

"task-erred": self.handle_task_erred,

which is handled by handle_worker which will always have a worker extra kwarg

distributed/distributed/scheduler.py

Line 4888 in 1d0701b

await self.handle_stream(comm=comm, extra={"worker": worker})

distributed/distributed/core.py

Line 842 in 1d0701b

handler(**merge(extra, msg))

i.e. calling handle_task_erred will always have worker as part of msg (proper signatures + mypy would help here)

distributed/distributed/scheduler.py

Lines 4756 to 4757 in 1d0701b

def handle_task_erred(self, key: str, stimulus_id: str, **msg) -> None:

r: tuple = self.stimulus_task_erred(key=key, stimulus_id=stimulus_id, **msg)

which leads us to the actual stimulus handler which I think should verify if the task-erred message actually originated from the worker the task is currently processing on which it doesn't

distributed/distributed/scheduler.py

Lines 4161 to 4190 in 1d0701b

def stimulus_task_erred(

self,

key=None,

worker=None,

exception=None,

stimulus_id=None,

traceback=None,

**kwargs,

):

"""Mark that a task has erred on a particular worker"""

logger.debug("Stimulus task erred %s, %s", key, worker)

ts: TaskState = self.tasks.get(key)

if ts is None or ts.state != "processing":

return {}, {}, {}

if ts.retries > 0:

ts.retries -= 1

return self._transition(key, "waiting", stimulus_id)

else:

return self._transition(

key,

"erred",

stimulus_id,

cause=key,

exception=exception,

traceback=traceback,

worker=worker,

**kwargs,

)

The whole point of this is that if I read all of this correctly, w == worker is always True unless something went wrong earlier.

How worker=None can happen is something I do not understand

cc @crusaderky I believe you've been working on something similar recently?

Ok, the worker=None is coming from

distributed/distributed/scheduler.py

Lines 4308 to 4315 in 1d0701b

if ts.suspicious > self.allowed_failures:

del recommendations[k]

e = pickle.dumps(

KilledWorker(task=k, last_worker=ws.clean()), protocol=4

)

r = self.transition(

k, "erred", exception=e, cause=k, stimulus_id=stimulus_id

)

therefore, the scenario of an empty ts.erred_on comes from tasks that have been transitioned to error because of a KilledWorker exception.

therefore, the scenario of an empty ts.erred_on comes from tasks that have been transitioned to error because of a KilledWorker exception.

That makes sense, #6874 includes a bunch of KilledWorker exceptions.

fjetter · 2022-08-17T10:22:47Z

I think there are a couple of things we should verify here.

We should change stimulus_task_erred to be a no-op if the task-erred is not coming from the worker the task is actually processing on. We're already verifying that the task is still in processing so this check needs to be expanded

distributed/distributed/scheduler.py

Lines 4174 to 4175 in 1d0701b

if ts is None or ts.state != "processing":

return {}, {}, {}

should really be

        if ts is None or ts.state != "processing" or ts.processing_on != worker:
            return {}, {}, {}

it's very likely this is not tested

Either drop the worker argument here or keep the worker, keep the worker in the signature and add an validation assert in transition_processing_erred to ensure everything is in order
I think dropping it all is fine if stimulus_task_erred implements 1.)

…rail

hendrikmakait · 2022-08-17T11:05:09Z

I think there are a couple of things we should verify here.

Done.

I think dropping it all is fine if stimulus_task_erred implements 1.)

I have added the assertion step for now, better safe than sorry. It's odd enough we ran into the None case in the first place.

gjoseph92 · 2022-08-23T18:41:05Z

distributed/scheduler.py


            w = _remove_from_processing(self, ts)
-            if w:
+            if w in self.workers:


I'm a little concerned about this. It feels like another instance of #6392.

In #6614 (comment) I have a driveby refactor fixing a possible stale WorkerState in _remove_from_processing.

If we go ahead and return the address even if we know it's stale, then transition_processing_released here has no ability to verify whether w is stale or not, because it just has an address, not a WorkerState instance. It's possible that a worker left, then reconnected with the same address and port. If that's the case, we'd then be sending a free-keys to the wrong worker instance.

I'm wondering if instead of the approach in this PR, we should go back to the old behavior and instead explicitly handle the case in transition_processing_erred where both w and worker are None (i.e. the KilledWorker case)?

Two thoughts:

EDIT: Look at Withhold root tasks [no co assignment] #6614 (comment) & Always return ws.address from _remove_from_processing #6884 (comment)
IIUC, since we're not awaiting anywhere in

distributed/distributed/scheduler.py

Lines 2261 to 2262 in c15a10e

w = _remove_from_processing(self, ts)

if w in self.workers:

, it should never happen that

distributed/distributed/scheduler.py

Line 7337 in c15a10e

if ws.address not in state.workers: # may have been removed

is True while

distributed/distributed/scheduler.py

Line 2262 in c15a10e

if w in self.workers:

is False. Thus, this piece of code should behave the same way as before. Am I missing something here?

Instead of handling the None case, should we maybe return the entire WorkerState object to ensure that users of _remove_from_processing can check for stale worker states as you described? From your description, it sounds like returning a worker adress might generally not be all too helpful.

Note: I wrote the previous response without looking at the linked comment. That drive-by fix makes sense to me. To solve the issue at hand, I'd return whatever

distributed/distributed/scheduler.py

Line 7333 in c15a10e

ws = ts.processing_on

spits out. This should allow downstream users to differentiate between stale and current worker states.

I guess I don't like downstream users having to differentiate themselves (more points to mess that up). Returning None felt pretty sensible to me in that case. In the rare case where callers need to know the worker we thought it was processing on, even if that worker might be stale—I believe transition_processing_erred is the only case where this applies?—why not retrieve that explicitly first?

gjoseph92 · 2022-08-23T20:04:19Z

We should change stimulus_task_erred to be a no-op if the task-erred is not coming from the worker the task is actually processing on

Did this happen? I'm not seeing it in the diff?

I also disagree that it should be a no-op. We currently support an unexpected worker successfully competing a task:

distributed/distributed/scheduler.py

Lines 1987 to 1993 in c15a10e

    
           if ws != ts.processing_on:  # someone else has this task 
        
               logger.info( 
        
                   "Unexpected worker completed task. Expected: %s, Got: %s, Key: %s", 
        
                   ts.processing_on, 
        
                   ws, 
        
                   key, 
        
               )

so we should also support an unexpected worker encountering an error with a task. (I do think both cases probably indicate something is going wrong and should eventually be removed, I just don't see why we'd be inconsistent about the memory vs erred.)

hendrikmakait · 2022-08-24T05:49:19Z

Did this happen? I'm not seeing it in the diff?

It looks like I forgot to push that commit, it's on my local branch.

I also disagree that it should be a no-op. We currently support an unexpected worker successfully competing a task: [...]
so we should also support an unexpected worker encountering an error with a task.

I think being inconsistent between memory and erred is fine. As you said, in both cases something is probably off. So if we were strict about handling here, we should probably ignore the task in both cases or treat these as errors. With memory, it's basically an optimization to say "Sweet, someone already did the work for us even though they were not supposed to, let's just take it". In the error case,, however, we already know something is off if a worker who isn't supposed to process a task keeps working on it, so something going wrong with the task could be expected and this shouldn't count toward the retry-limit on healthy workers.

hendrikmakait marked this pull request as ready for review August 12, 2022 16:29

fjetter reviewed Aug 15, 2022

View reviewed changes

hendrikmakait added 2 commits August 17, 2022 13:00

Always return worker address from _remove_from_processing for paper t…

b31ad11

…rail

Drop worker keyword arg

3ffcf71

hendrikmakait force-pushed the remove-from-processing-always-returns-worker-address branch from d410146 to 3ffcf71 Compare August 17, 2022 11:00

hendrikmakait requested review from crusaderky, fjetter and gjoseph92 August 17, 2022 12:16

fjetter approved these changes Aug 23, 2022

View reviewed changes

fjetter merged commit 2a2c3bb into dask:main Aug 23, 2022

gjoseph92 reviewed Aug 23, 2022

View reviewed changes

gjoseph92 mentioned this pull request Aug 23, 2022

Withhold root tasks [no co assignment] #6614

Merged

2 tasks

gjoseph92 mentioned this pull request Aug 23, 2022

Alternative handling of killed/unexpected worker in transition_processing_erred #6939

Closed

2 tasks

This was referenced Aug 24, 2022

stimulus_task_erred is no-op when task-erred is not coming from the worker the task is actually processing on #6941

Closed

Improve handling of tasks failing/succeeding on unexpected workers #6956

Open

gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Always return ws.address from _remove_from_processing (dask#6884)

d84e5a8

	def handle_task_erred(self, key: str, stimulus_id: str, **msg) -> None:
	r: tuple = self.stimulus_task_erred(key=key, stimulus_id=stimulus_id, **msg)

	def stimulus_task_erred(
	self,
	key=None,
	worker=None,
	exception=None,
	stimulus_id=None,
	traceback=None,
	**kwargs,
	):
	"""Mark that a task has erred on a particular worker"""
	logger.debug("Stimulus task erred %s, %s", key, worker)

	ts: TaskState = self.tasks.get(key)
	if ts is None or ts.state != "processing":
	return {}, {}, {}

	if ts.retries > 0:
	ts.retries -= 1
	return self._transition(key, "waiting", stimulus_id)
	else:
	return self._transition(
	key,
	"erred",
	stimulus_id,
	cause=key,
	exception=exception,
	traceback=traceback,
	worker=worker,
	**kwargs,
	)

	if ts.suspicious > self.allowed_failures:
	del recommendations[k]
	e = pickle.dumps(
	KilledWorker(task=k, last_worker=ws.clean()), protocol=4
	)
	r = self.transition(
	k, "erred", exception=e, cause=k, stimulus_id=stimulus_id
	)

Uh oh!

Conversation

hendrikmakait commented Aug 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Unit Test Results

Uh oh!

fjetter Aug 15, 2022

Choose a reason for hiding this comment

Uh oh!

hendrikmakait Aug 15, 2022

Choose a reason for hiding this comment

Uh oh!

fjetter Aug 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fjetter Aug 17, 2022

Choose a reason for hiding this comment

Uh oh!

hendrikmakait Aug 17, 2022

Choose a reason for hiding this comment

Uh oh!

fjetter commented Aug 17, 2022

Uh oh!

hendrikmakait commented Aug 17, 2022

Uh oh!

gjoseph92 Aug 23, 2022

Choose a reason for hiding this comment

Uh oh!

hendrikmakait Aug 23, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hendrikmakait Aug 23, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gjoseph92 Aug 23, 2022

Choose a reason for hiding this comment

Uh oh!

gjoseph92 commented Aug 23, 2022

Uh oh!

hendrikmakait commented Aug 24, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hendrikmakait commented Aug 12, 2022 •

edited

Loading

github-actions bot commented Aug 12, 2022 •

edited

Loading

fjetter Aug 17, 2022 •

edited

Loading

hendrikmakait Aug 23, 2022 •

edited

Loading

hendrikmakait Aug 23, 2022 •

edited

Loading