Scheduler task transition tracing by sjperkins · Pull Request #5954 · dask/distributed

sjperkins · 2022-03-17T11:09:02Z

Closes Transition tracing for scheduler task transitions #5849
Tests added / passed
Passes pre-commit run --all-files

github-actions · 2022-03-17T11:48:08Z

Unit Test Results

18 files ±0 | 18 suites ±0 | 10h 15m 31s ⏱️ +20m 12s
2 703 tests ±0 | 2 616 ✔️ -5 | 83 💤 +2 | 4 ❌ +3
24 163 runs ±0 | 22 869 ✔️ -5 | 1 290 💤 +3 | 4 ❌ +2

For more details on these failures, see this check.

Results for commit 7bf0117. ± Comparison against base commit 2ff681c.

♻️ This comment has been updated with latest results.

distributed/client.py

distributed/utils_comm.py

distributed/worker.py

distributed/scheduler.py

sjperkins · 2022-03-22T14:25:08Z

Thanks for the comments @fjetter. I think this is ready for further review.

One observation I'd like to make is that a number of member functions follow this pattern:

def do_something(..., stimulus_id=None):
    stimulus_id = stimulus_id or f"do-something-{time()}"

For example, reschedule: #5954 (comment).

These patterns exist to ensure that a stimulus_id of None is never added to the transition_log. Note that valid stimulus_id's are only enforced if SchedulerState._validate is True: see #5954 (comment). I decided that scheduler stability was more important than an invalid transition_log, but I'm interested in hearing other viewpoints, given that stimulus_id=None is a kwarg in many member functions.
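As a minimal sketch of the trade-off described above (not the code in this PR), the fallback and the optional validation might look like the following. The `transition_log` list, the `record_transition` helper and the `validate` flag are hypothetical stand-ins for the scheduler's own log and `SchedulerState._validate`.

```python
from time import time

# Hypothetical stand-in for the scheduler's transition_log.
transition_log = []


def record_transition(key, start, finish, stimulus_id, validate=False):
    # Strict checking only happens when validation is switched on,
    # mirroring the "only enforced if _validate is True" behaviour.
    if validate:
        assert isinstance(stimulus_id, str) and stimulus_id, "missing stimulus_id"
    transition_log.append((key, start, finish, stimulus_id))


def reschedule(key, stimulus_id=None, validate=False):
    # Fall back to a generated ID so that None never reaches the log.
    stimulus_id = stimulus_id or f"reschedule-{time()}"
    record_transition(key, "processing", "released", stimulus_id, validate=validate)
    return stimulus_id
```

With validation off, a missing stimulus_id passed straight to `record_transition` would be logged as None rather than raising, which is the stability-over-correctness choice discussed above.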

~~A stack-based approach to generating stimulus_id's might be a more robust option for a future PR~~. Edit: Although this may be impractical given the async paradigm.

fjetter

A few minor comments. The biggest thing is that I would like us to try my suggestion of keeping the stimulus_id high level rather than passing it down to every transition method. That would be really nice, I think. I believe a similar thing could be done on the worker side, but I suggest not changing anything about the worker signatures in this PR.

distributed/scheduler.py

fjetter · 2022-03-24T13:40:25Z

distributed/scheduler.py

+            key, finish, *args, stimulus_id=stimulus_id, **kwargs
+        )
        recommendations, client_msgs, worker_msgs = a
        self.send_all(client_msgs, worker_msgs)


I'm wondering if a cleaner approach to this would be to not add stimulus_id to every transition_X_Y method but instead deal with the required mutations here.

The only thing a transition_X_Y method can (should) do with the stimulus_id is attach it to a worker or client message. So why don't we attach it to every worker message here and save ourselves these dirty signatures?

e.g.

def transition_memory_forgotten(self, key):
    ...
    # This is the only place we're actually using the stim ID.
    # _propagate_forgotten only adds it to the worker_msgs.
    # We can add this on a higher level and don't need to pass it
    # down into every method.
    _propagate_forgotten(
        self, ts, recommendations, worker_msgs
    )
    return recommendations, client_msgs, worker_msgs

I haven't verified if this works but it would be much less invasive.
Adding the stim ID could be performed as part of send_all where we're iterating over the messages anyhow.
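A rough sketch of that idea follows, assuming worker messages are held as a dict mapping a worker address to a list of message dicts. The `attach_stimulus_id` helper is hypothetical and is not an existing function in distributed; something like it could be called from `send_all` before the messages go out.

```python
def attach_stimulus_id(msgs_by_recipient, stimulus_id):
    """Return a copy of the messages with stimulus_id filled in where missing.

    msgs_by_recipient is assumed to map a recipient (e.g. a worker address)
    to a list of message dicts.
    """
    stamped = {}
    for recipient, msgs in msgs_by_recipient.items():
        # A stimulus_id already present on a message takes precedence,
        # because the message's own keys are unpacked last.
        stamped[recipient] = [{"stimulus_id": stimulus_id, **msg} for msg in msgs]
    return stamped
```

Because the stamping happens in one place, the individual transition_X_Y methods would not need a stimulus_id parameter just to forward it into their messages.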

I had a look at the 15 transitions in SchedulerState, of which 6 take stimulus_id's, while 9 do not. I think the two numbers are close enough that either approach might be ugly and my vote would be for the ugliness of extra stimulus kwargs in the 15 transition functions.

I was thinking about this a bit more and two other approaches occurred to me. Here's the flavour of the first:

import inspect


class TransitionFunction:
    def __init__(self, fn):
        assert callable(fn)
        self.fn = fn
        self.sig = inspect.signature(fn)

    def stimulus_in_sig(self):
        # True if the wrapped function accepts a stimulus_id argument
        return "stimulus_id" in self.sig.parameters

    def __call__(self, *args, **kw):
        # Drop stimulus_id for transitions that don't accept it
        if not self.stimulus_in_sig():
            kw.pop("stimulus_id", None)
        return self.fn(*args, **kw)


class SchedulerState:
    def __init__(self):
        self._transitions_table = {
            ("released", "waiting"): TransitionFunction(self.transition_released_waiting),
            ("waiting", "released"): TransitionFunction(self.transition_waiting_released),
            # ...
            ("released", "erred"): TransitionFunction(self.transition_released_erred),
        }

The second idea is not yet fully formed but it might be possible to automatically generate and inject stimulus_id's into the distributed.core.Server class handlers. Then, it might be possible to store the stimulus_id's in ContextVars that can be passed through async/sync call frames and inspected at the point where we need the stimulus_id's. This could be combined with Tensorflow's ideas around variable scoping (e.g. see https://www.tensorflow.org/api_docs/python/tf/name_scope)
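To illustrate the ContextVar part of this idea, here is a small sketch under stated assumptions: `STIMULUS_ID`, `handler` and `transition` are hypothetical names, and nothing here reflects how distributed.core.Server handlers are actually wired. It only shows that a value set inside one asyncio task is visible to code called from that task and is isolated from other tasks.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical module-level variable; not part of distributed.
STIMULUS_ID = ContextVar("stimulus_id")


def transition(key):
    # Deep in the call stack: read the ID instead of receiving it as an argument.
    return (key, STIMULUS_ID.get())


async def handler(key, stimulus_id):
    # Each handler runs in its own task, so this set() is local to that task.
    STIMULUS_ID.set(stimulus_id)
    await asyncio.sleep(0)  # yield control so the two handlers interleave
    return transition(key)


async def main():
    # gather() wraps each coroutine in a task with its own copy of the context.
    return await asyncio.gather(
        handler("x", "update-graph-1"),
        handler("y", "task-finished-2"),
    )


results = asyncio.run(main())
```

Each handler sees only the stimulus_id it set, even though both are suspended and resumed on the same event loop.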

It also might be possible to walk the call frames from inspect.currentframe() back to the distributed.core.Server handlers and automatically derive stimulus_id's.

I think the nice thing about this approach is that it removes the need to generate and pass stimulus_id's around the code base: one could simply retrieve an appropriately generated stimulus. On the other hand, the logic might be too complicated and magical.

Pinging @graingert in case there's some problem with this approach and I spend too much time down this rabbit hole.

Stated in a simpler way, I'm thinking of an ExitStack-like construct which:

Is automatically initialised with supplied or generated stimulus_id's at the Server handlers.

Supports overriding of existing stimulus_id's throughout Server sub-classes.

Is safe for use with asyncio (I think contextvars gives us this).

> Stated in a simpler way, I'm thinking of an ExitStack-like construct which

Looks like AsyncExitStack is a possibility here.
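One possible shape for such a construct is sketched below. It uses a plain `contextlib.contextmanager` around a ContextVar rather than AsyncExitStack, and `stimulus_scope` and `current_stimulus_id` are hypothetical names chosen for illustration.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical variable holding the active stimulus_id (None when unset).
_STIMULUS_ID = ContextVar("stimulus_id", default=None)


@contextmanager
def stimulus_scope(stimulus_id):
    # Entering the scope overrides any stimulus_id set by an outer scope.
    token = _STIMULUS_ID.set(stimulus_id)
    try:
        yield stimulus_id
    finally:
        # Leaving the scope restores whatever was active before.
        _STIMULUS_ID.reset(token)


def current_stimulus_id():
    return _STIMULUS_ID.get()
```

A handler could open a scope with a supplied or generated ID, a sub-class could nest a second scope to override it, and the outer value is restored when the inner scope exits.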

distributed/scheduler.py

distributed/worker.py

sjperkins · 2022-04-08T15:08:25Z

Closed in favour of #6095

sjperkins added 4 commits March 16, 2022 17:21

Initial commit

2957f38

Defer to parent stimulus in transition_processing_memory

e2f54a3

Change exception_blame type to TaskState | None

5307c0e

Change exception_blame back to TaskState and make stimulus_id's 'str | None'

c229a65

sjperkins force-pushed the scheduler-task-transition-tracing branch from 30dba93 to c229a65 Compare March 17, 2022 11:11

sjperkins added 9 commits March 17, 2022 15:45

Remove stimulus_id type annotations

d010b2d

Add stimulus_id's in various places

6cfc1ef

Merge branch 'main' into scheduler-task-transition-tracing

2ced7c1

Add stimulus_id's in state mutating handlers

cb15e74

Reintroduce log handlers

21302bd

Merge branch 'main' into scheduler-task-transition-tracing

2cf4a8a

Checkpoint running test suite

1670906

Remove cruft

8c577f8

Add stimulus_id to scheduler transition_log

dabfefe

fjetter reviewed Mar 22, 2022

sjperkins added 5 commits March 22, 2022 13:55

Support worker + client stimuli, stimuli in http

be43e63

assert_worker_story -> assert_story and test scheduler story

45bf7de

Remove stimulus_id's from code paths that don't change TaskState

80ffddf

Merge branch 'main' into scheduler-task-transition-tracing

a246197

Further stimulus_id changes in worker.py

1a72c88

sjperkins commented Mar 22, 2022

sjperkins commented Mar 22, 2022

sjperkins self-assigned this Mar 22, 2022

Add missing stimulus_id to free-keys worker msg

7c30cfa

sjperkins mentioned this pull request Mar 24, 2022

Client story #5987

Closed

3 tasks

fjetter reviewed Mar 24, 2022

sjperkins added 2 commits March 25, 2022 16:01

Review changes

b0fd952

Merge branch 'main' into scheduler-task-transition-tracing

90039fb

sjperkins added 3 commits March 25, 2022 16:41

Fix stray assert_worker_story -> assert_story

3affd07

Used new_stimulus_id in Worker.execute

925c2bd

Revert "Used new_stimulus_id in Worker.execute"

6e58f91

This reverts commit 925c2bd.

sjperkins requested a review from fjetter March 25, 2022 16:30

sjperkins added 2 commits March 31, 2022 14:53

Merge branch 'main' into scheduler-task-transition-tracing

6d6a0ee

Different stimulus_id's success/failure in Worker.execute

7bf0117

sjperkins mentioned this pull request Apr 4, 2022

Support Stimulus ID's in Scheduler with ContextVars #6046

Closed

3 tasks

sjperkins closed this Apr 8, 2022

sjperkins deleted the scheduler-task-transition-tracing branch April 8, 2022 15:20
