Add table of metric statistics to Workers tab#4614
charlesbluca wants to merge 24 commits into dask:main
Conversation
A few practical observations:
Thanks for the comments!
I agree; in this context "min" doesn't seem to offer any insight. Could it be useful if implemented as you've described for "mean", where it specifically targets time periods when tasks are processing?
This sounds like a great approach - is there a way we can check whether or not the cluster is currently idle? If not, could we add a nullable attribute like
Sounds good - I was also thinking of controlling this with arguments to the
The scheduler keeps track of which workers are currently underutilized (distributed/scheduler.py, lines 1569 to 1570 in 8942856) and also keeps an (distributed/scheduler.py, line 3128 in 8942856)
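A minimal sketch of the idle-gated aggregation being discussed (hypothetical names, not distributed's actual API): the `is_idle` flag stands in for a lookup against the scheduler's idle-worker tracking mentioned above.

```python
from collections import defaultdict

class IdleAwareMean:
    """Running mean per worker that skips samples taken while the
    worker is idle. Illustrative sketch only."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, worker, value, is_idle):
        if is_idle:
            return  # don't let idle periods drag the mean down
        self.count[worker] += 1
        # incremental mean update: no history needs to be retained
        self.mean[worker] += (value - self.mean[worker]) / self.count[worker]
```

With samples 10 and 20 recorded while busy and a 0 recorded while idle, the mean stays at 15 rather than being pulled toward the idle reading.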
Thanks @jrbourbeau! Shouldn't be too difficult to conditionally update the mean if the worker isn't in
Bokeh's MultiChoice seems like a good option for the dropdown menu, but it overlaps with the data table. There might be a way to expand/collapse the dropdown's container using CSS, or we could orient the dropdown side-by-side with the table. An easier solution might be to use the CheckboxButtonGroup:
Revisiting - do we still think something like this would be of use in the dashboard / perf reports? If I recall correctly, I still need to figure out a reliable way to only track worker metrics when the workers are not idle.
This would probably be a quite significant change to what we're doing. While I 100% agree that the metrics would be more useful this way, I'm hesitant to introduce full timeseries metrics for performance and memory usage reasons. If all of that turns out not to be a big deal, all the better, but I'm skeptical. Considering that we can toggle off unwanted columns to reduce clutter, I think we can merge this regardless and follow up with "better metrics" later on. Since I haven't been involved in this entire discussion yet, though, I would appreciate a second opinion.
Do these metrics qualify as timeseries metrics? I tried to compute max and mean using only the previous and next timestamps so that memory usage / performance hopefully wouldn't be impacted too heavily. I'm not sure how this might change if we were to only look at metrics when the workers are not idle. Thanks for pointing out crick, I'll look into that.
I honestly didn't realize we were already keeping these metrics in memory, see
This creates an unbounded data sink that grows slowly but continuously and swallows a lot of memory over time. We typically use Python deques for this task instead of plain lists to manage the memory footprint and prevent it from growing without limit. That would possibly require the introduction of another config option, so I suggest not intertwining it with this PR. This would of course not be a problem if this data is pruned somewhere else, but I'm not aware of such a mechanism.
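The deque suggestion in a nutshell, with a hypothetical history length:

```python
from collections import deque

# A bounded history: once maxlen samples are stored, appending a new
# sample silently drops the oldest one, so memory use is capped.
MAXLEN = 3  # hypothetical limit; in practice this would come from config
history = deque(maxlen=MAXLEN)
for sample in [1, 2, 3, 4, 5]:
    history.append(sample)
print(list(history))  # the three most recent samples: [3, 4, 5]
```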
That's the typical algorithm for calculating these values over streaming data, so this is perfectly fine. Crick is mostly useful for metrics which are not that easy to calculate in streaming applications. For instance, how would you calculate the median in this situation without always looking at all values? This is where algorithms like TDigest come in (crick).
The short answer is that we'll need to either implement a more complex, probably approximate algorithm or pay the price of always looking at all the data. This is more or less the problem I was mentioning with "I'm hesitant to introduce full timeseries data", since keeping all data and managing it efficiently can be complicated and/or costly. This problem might be out of scope for this dashboard and better deferred to an external application (e.g. Prometheus) built for this kind of thing.
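To illustrate why the median is harder than max/mean in a streaming setting: an exact answer requires retaining the data (the classic two-heap structure below still grows with the stream), which is exactly what t-digest-style sketches such as crick's avoid by approximating with bounded memory. A minimal exact version for comparison:

```python
import heapq

class StreamingMedian:
    """Exact streaming median via two heaps (illustrative only;
    memory grows with the stream, unlike a t-digest sketch)."""

    def __init__(self):
        self.lo = []  # max-heap (stored negated) of the lower half
        self.hi = []  # min-heap of the upper half

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # move the largest of the lower half into the upper half
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # rebalance so lo always has at least as many items as hi
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2
```

Each `add` is O(log n), but the heaps hold every sample ever seen, which is the memory cost being discussed above.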
Sorry for the late follow up here - I'm interested in how that line would cause a continuously increasing sink here? From my understanding of its role in the
@charlesbluca sorry for the late follow up. Yes, I believe you are right and my concerns were unfounded. Sorry for holding this up for so long. Can you rebase, please?
Rebasing shouldn't be necessary; a merge commit would be welcome as well (and may be easier).
I would hold off on committing - I noticed some Bokeh errors in the IPython console while revisiting this. Can post those in a second.
EDIT: Looks like deserialization errors occur when using the checkbox widget. Might need some additional Bokeh support to resolve this, but wondering if we can add a test that would catch something like this, since it went undetected by CI.
After posting about these issues on Bokeh's Discourse, there are a couple of solutions for the serialization errors:
With bokeh/bokeh#11501 merged, there's now a cleaner way to show/hide table columns by using their
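For illustration, a sketch of what that wiring could look like (assumes bokeh >= 2.4, where `TableColumn` picked up a `visible` property via that PR; the widget layout and callback here are hypothetical, not this PR's exact code):

```python
from bokeh.models import (CheckboxButtonGroup, ColumnDataSource,
                          DataTable, TableColumn)

# Toy data standing in for the worker metrics table
source = ColumnDataSource({"cpu": [40, 80], "memory": [1.2, 3.4]})
columns = [TableColumn(field=f, title=f) for f in ("cpu", "memory")]
table = DataTable(source=source, columns=columns)
checkboxes = CheckboxButtonGroup(labels=["cpu", "memory"], active=[0, 1])

def update_visibility(attr, old, new):
    # Flip each column's ``visible`` flag instead of rebuilding the table
    for i, col in enumerate(columns):
        col.visible = i in new

checkboxes.on_change("active", update_visibility)
```

Toggling visibility in place avoids the remove/re-add dance that triggered the deserialization errors above.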
This reverts commit c2c7064.
@quasiben @jacobtomlinson would there be objection to bumping Dask's bokeh dependency to 2.4.2? That's the first version to roll out a fix for the
We could also try just 2.4.0 and make use of the
Unit Test Results: 12 files ±0, 12 suites ±0, 7h 13m 30s ⏱️ +6m 47s. For more details on these failures, see this check. Results for commit c5aff0b. ± Comparison against base commit de94b40. ♻️ This comment has been updated with latest results.
@charlesbluca
I can't think of any reason why we shouldn't bump the minimum version. We should ensure the conda-forge recipe is updated too, though.
Yeah, we'll need to remember to follow up there - for now I've also updated the nightly recipe to reflect this change, and will open up a PR in Dask as I assume we'll also want
Cool - I can look into putting together an MRE for this case. |
My only comment re 2.4.2 is just to note that 2.4.2 is probably the last 2.x release that will be made. I am committed to making sure it can be demonstrated that Dask (in particular) works solidly with 3.0 out of the box, but I could definitely use some help with that testing and evaluation once we start cutting 3.0 release candidates. (So far all my casual testing with Dask and
Unit Test Results: See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 26 files +1, 26 suites +1, 13h 43m 36s ⏱️ -43m 23s. For more details on these failures, see this check. Results for commit e867ded. ± Comparison against base commit 4f0d173. ♻️ This comment has been updated with latest results.
Sorry to bump this after so long - @fjetter, are you still generally okay with the idea of merging in these metrics and iterating on things moving forward?
@fjetter seemed to withdraw his concerns in his last comment here.
@crusaderky, if you agree with that sentiment, I'm happy to take over this review.
@jacobtomlinson I haven't been involved at all. Very happy for you to take over review.
Bumping this since it's been a while - is this still something we think would be useful within the dashboard / performance reports? If so, I've gone ahead and resolved upstream conflicts so we can check whether this breaks anything new in CI.



This PR adds a table of basic statistics (max, min, mean) for the worker metrics (CPU, memory, # of file descriptors, etc.) to the Workers tab of the dashboard:
It also adds a Workers tab to the performance report containing this table, attempting to tackle some of the resource tracking desired in #4494:
Some questions:
- This table is pretty crowded, since I've included (to my knowledge) all dynamic metrics that could be aggregated; are there any metrics or statistics we don't really need to see in this table?
- Are there any statistics we would like to see here, maybe for specific metrics?
- Tests added / passed
- Passes `black distributed` / `flake8 distributed`