Conversation
|
Neat! So, how intense do we want to get here? We currently have a couple of ways in which we record timing data.
Now we're adding a third way: variance-enabled data for compute times. Do we want to roll this into the other two, so that we compute variances of transfer times and such? Do we also want to track things like the number of bytes processed over time? Alternatively, do we want to keep this information on the |
|
The other thing we could do, if you really do want quantiles, would be to bring in crick (cc @jcrist ), but that seems like maybe overkill for this (and I'm not sure what the performance implications would be). Speaking of which, I'm assuming that this is pretty cheap, but can you give a sense of how much cost this adds? We're comparing against an ideal budget of around 200us per task. If this is sub-microsecond then great! If not, then we probably need to do some thinking about it (at least until some Cythonization happens) |
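To make the "maybe overkill" tradeoff concrete, here is a deliberately crude pure-Python sketch of streaming quantile estimation via reservoir sampling. Everything here is illustrative (the class and method names are made up, not crick's API); crick's t-digest is far more accurate and memory-efficient than this.

```python
import random


class ReservoirQuantiles:
    """Crude streaming quantile estimates via reservoir sampling (Algorithm R).

    Keeps a fixed-size uniform sample of everything observed, so memory
    stays bounded no matter how long the stream runs.
    """

    def __init__(self, size=256, seed=0):
        self._size = size
        self._seen = 0
        self._sample = []
        self._rng = random.Random(seed)

    def add(self, x):
        self._seen += 1
        if len(self._sample) < self._size:
            self._sample.append(x)
        else:
            # Replace a random slot with probability size/seen, which keeps
            # the sample uniform over the whole stream seen so far.
            j = self._rng.randrange(self._seen)
            if j < self._size:
                self._sample[j] = x

    def quantile(self, q):
        s = sorted(self._sample)
        idx = min(int(q * len(s)), len(s) - 1)
        return s[idx]


# Feed in 10,000 uniformly spaced "durations" in [0, 1); the estimated
# median should land near 0.5 despite keeping only 256 samples.
rq = ReservoirQuantiles(size=256, seed=0)
for i in range(10000):
    rq.add(i / 10000)
```

The per-update cost is a bounds check plus (occasionally) one random draw, so it is in the same rough cost class as the Welford update, at the price of sampling error in the estimates.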
|
This is really cool to see! @TomAugspurger if you want me to handle the viz part once the measuring is in I'd be happy to take that on |
Barely.

In [1]: from distributed.scheduler import TaskPrefix
   ...:
   ...: prefix = TaskPrefix("foo")
   ...: prefix
   ...:
   ...: duration = 1.5

In [2]: %timeit prefix._update_duration_variance(duration)
952 ns ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For reference, the other things we do here are setting attributes, which is about 1/10th the time, and in-place addition, which is about 1/6th the time:

In [4]: from distributed.scheduler import TaskGroup

In [5]: group = TaskGroup("foo")

In [7]: %timeit group.duration += duration
138 ns ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Having help on the vis part would be most welcome. |
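For scale, putting the 952 ns measurement above against the ~200 µs per-task budget mentioned earlier works out to well under 1% overhead:

```python
update_ns = 952          # measured Welford update cost, nanoseconds
budget_ns = 200 * 1000   # ~200 us ideal per-task scheduler budget

overhead = update_ns / budget_ns
print(f"{overhead:.2%}")  # → 0.48%
```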
|
@TomAugspurger Do you have any thoughts about how this should be visualized? For reference, we are currently tracking/visualizing time per key here: |
|
IMO, just a vertical line at the top of each bar from the mean to +/- one standard deviation (the square root of the tracked variance) would be fine. (Dunno how hard that would be to do in bokeh) |
|
I think we can do that with Whiskers. Do you want me to do that after this PR? |
|
That'd be great! |
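If it helps when picking this up: the whisker endpoints can be computed in plain Python, and a plotting layer such as Bokeh's Whisker annotation would then consume the lower/upper values as columns of its data source. This is only a sketch; the function name is hypothetical.

```python
import statistics


def whisker_bounds(durations):
    """Return (lower, upper) whisker endpoints: mean -/+ one sample
    standard deviation.

    These are the values Bokeh's Whisker annotation would take as the
    "lower" and "upper" columns of a ColumnDataSource, one pair per bar.
    """
    mean = statistics.mean(durations)
    std = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean - std, mean + std


# Example: whiskers for one bar's worth of task durations (seconds).
lower, upper = whisker_bounds([1.2, 0.9, 1.5, 1.1, 1.3])
```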
|
Two comments:
|
|
Yeah, measuring a couple percentiles is going to be much more informative here. |
|
FWIW I ran the benchmarks on my machine (M1 OSX) and am clocking in at ~320ns. This can be reduced by another 10% if we trim a few attribute accesses. Cythonization is also on its way, so performance is probably no longer a big deal for this?

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index cc9b3550..15775183 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -880,13 +880,14 @@ class TaskPrefix:
def _update_duration_variance(self, x):
# Welford rolling variance algorithm
# https://www.johndcook.com/blog/standard_deviation/ for background.
- self._count += 1
- if self._count == 1:
+ self._count = count = self._count + 1
+ if count == 1:
m = x
s = 0.0
else:
- m = self._var_m + (x - self._var_m) / self._count
- s = self._var_s + (x - self._var_m) * (x - m)
+ var_m = self._var_m
+ m = var_m + (x - var_m) / count
+ s = self._var_s + (x - var_m) * (x - m)
self._var_m = m
self._var_s = s

I could imagine many places where uncertainties could improve our decisions (bandwidth and byte-size measurements, adaptive targets, worker objectives, steal ratios, etc.) |
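For anyone who wants to poke at the patched update outside the scheduler, here is a self-contained sketch that mirrors the diff's update step and checks it against the stdlib statistics module. The class name and properties are illustrative, not distributed's API.

```python
import statistics


class Welford:
    """Rolling mean/variance via Welford's algorithm, mirroring the
    update in the diff above.

    See https://www.johndcook.com/blog/standard_deviation/ for background.
    """

    def __init__(self):
        self._count = 0
        self._var_m = 0.0  # running mean
        self._var_s = 0.0  # aggregate of squared deviations

    def _update_duration_variance(self, x):
        # Same shape as the patched method: cache the attribute reads.
        self._count = count = self._count + 1
        if count == 1:
            m = x
            s = 0.0
        else:
            var_m = self._var_m
            m = var_m + (x - var_m) / count
            s = self._var_s + (x - var_m) * (x - m)
        self._var_m = m
        self._var_s = s

    @property
    def mean(self):
        return self._var_m

    @property
    def variance(self):
        # Sample variance; zero until there are at least two observations.
        return self._var_s / (self._count - 1) if self._count > 1 else 0.0


# Quick check against the two-pass stdlib implementations.
data = [1.5, 0.2, 3.3, 2.1, 0.9]
w = Welford()
for x in data:
    w._update_duration_variance(x)
```

The single-pass form matters here because the scheduler only ever sees one duration at a time and cannot afford to keep the full history per prefix.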
Part-1 of #4023: collect the statistics necessary to visualize.
This doesn't actually give us the stats needed for a box plot (25th, 50th, and 75th percentiles), but it is probably good enough for a start.