
Fix total aggregated metrics in dashboard#3897

Merged
mrocklin merged 4 commits into dask:master from lesteve:fix-dashboard-total-memory-usage
Jun 17, 2020

Conversation

@lesteve
Member

@lesteve lesteve commented Jun 15, 2020

Originally reported in dask/dask-jobqueue#440.

To reproduce:

import webbrowser
from distributed import LocalCluster

cluster = LocalCluster(n_workers=4)
webbrowser.open(cluster.dashboard_link.replace("status", "workers"))

[screenshot: the workers page of the dashboard, showing the aggregated totals row]

The total memory usage is 250MiB, the total memory limit is 16GiB, so the total memory percentage should be 250/16e3 ~= 1.6% and not 6.4% (the memory percentage gets summed across all the workers rather than averaged).
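The arithmetic can be shown with a small standalone sketch (numbers chosen for round results, not the exact figures from the screenshot): summing per-worker percentages multiplies the true total by the number of workers.

```python
# Illustration only, not the dashboard code: four identical workers,
# each using 64 MiB of a 4 GiB limit.
workers = [{"memory": 64 * 2**20, "memory_limit": 4 * 2**30}] * 4

# Wrong: sum the per-worker percentages (what the dashboard did)
summed = sum(100 * w["memory"] / w["memory_limit"] for w in workers)

# Right: total memory used over total memory limit
total = (
    100
    * sum(w["memory"] for w in workers)
    / sum(w["memory_limit"] for w in workers)
)

print(summed, total)  # 6.25 1.5625
```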

I believe memory_percent is the only column affected by this.

I am not sure how to add a test for this and I was not able to find a good source of inspiration in the existing tests, but suggestions are more than welcome!

@mrocklin
Member

cc @jsignell who I think worked on this before

@jsignell
Member

Wow, yeah. I guess it shouldn't be summing that column. CPU is wrong as well, right?

@lesteve
Member Author

lesteve commented Jun 16, 2020

I fixed the computation for cpu, cpu_fraction and memory_percent when workers have different memory_limit values. I used ws.metrics because it felt simpler than using data (one of the problems with data is that the column you need may not have been inserted yet).
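A hedged sketch of the kind of aggregation described here (the function name and dict layout are illustrative, not the actual distributed code): raw quantities are summed, and the derived ratio columns are then recomputed from those sums rather than summed per worker.

```python
# Illustrative sketch, not the actual WorkerTable implementation.
def total_row(metrics):
    """Aggregate per-worker metrics into a single "Total" row."""
    total_limit = sum(m["memory_limit"] for m in metrics)
    total_memory = sum(m["memory"] for m in metrics)
    total_threads = sum(m["nthreads"] for m in metrics)
    return {
        "memory": total_memory,
        "memory_limit": total_limit,
        # equals a memory_limit-weighted average of per-worker percentages
        "memory_percent": 100 * total_memory / total_limit if total_limit else 0.0,
        # per-worker cpu percent is summed, then normalized per thread
        "cpu": sum(m["cpu"] for m in metrics) / total_threads if total_threads else 0.0,
    }

# Two workers with different memory limits: the total is 2 GiB / 16 GiB
metrics = [
    {"memory": 1 * 2**30, "memory_limit": 4 * 2**30, "cpu": 50.0, "nthreads": 1},
    {"memory": 1 * 2**30, "memory_limit": 12 * 2**30, "cpu": 150.0, "nthreads": 3},
]
row = total_row(metrics)
print(row["memory_percent"], row["cpu"])  # 12.5 50.0
```

Recomputing from the sums is what makes the result correct for heterogeneous workers: a naive unweighted average of per-worker percentages would give the wrong answer whenever the memory limits differ.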

Member

@jsignell jsignell left a comment


This looks right to me. Thanks @lesteve for taking this on :)

@lesteve lesteve changed the title from "Fix total memory_percent usage in dashboard" to "Fix total aggregated metrics in dashboard" on Jun 16, 2020
@lesteve
Member Author

lesteve commented Jun 16, 2020

It seems like the CI failures are not related to this PR and should be fixed by #3904:

ImportError: cannot import name 'create_hosts_whitelist' from 'bokeh.server.util'

@mrocklin
Member

Merged and re-triggered CI here

@lesteve lesteve force-pushed the fix-dashboard-total-memory-usage branch from 938a59d to 21c4dce on June 16, 2020 20:36
@lesteve
Member Author

lesteve commented Jun 16, 2020

I am seeing ZeroDivisionError errors when I exit my IPython session. I guess they happen because at the beginning there are no workers so the computation for some columns fails.

Should I try to get rid of these (maybe by catching ZeroDivisionError in the try/except a few lines below, or by testing len(self.scheduler.workers)), or should I ignore them?

Here is the full error:

bokeh.util.tornado - ERROR - Error thrown from periodic callback:
bokeh.util.tornado - ERROR - Traceback (most recent call last):
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/server/session.py", line 67, in _needs_document_lock_wrapper
    result = func(self, *args, **kwargs)
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/server/session.py", line 195, in with_document_locked
    return func(*args, **kwargs)
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/document/document.py", line 1164, in wrapper
    return doc._with_self_as_curdoc(invoke)
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/document/document.py", line 1150, in _with_self_as_curdoc
    return f()
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/document/document.py", line 1163, in invoke
    return f(*args, **kwargs)
  File "/home/lesteve/dev/distributed/distributed/dashboard/components/__init__.py", line 77, in <lambda>
    doc.add_periodic_callback(lambda: update(ref), interval)
  File "/home/lesteve/dev/distributed/distributed/dashboard/components/__init__.py", line 84, in update
    comp.update()
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/core/property/validation.py", line 93, in func
    return input_function(*args, **kwargs)
  File "/home/lesteve/dev/distributed/distributed/dashboard/components/scheduler.py", line 1867, in update
    / len(self.scheduler.workers.values())
ZeroDivisionError: float division by zero
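The guard being discussed could look like this minimal sketch (illustrative helper, not the exact dashboard code): return 0.0 instead of dividing when the scheduler has no workers yet.

```python
# Hedged sketch: avoid ZeroDivisionError in the periodic callback by
# falling back to 0.0 when there are no workers to average over.
def safe_mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

print(safe_mean([]))         # 0.0 rather than ZeroDivisionError
print(safe_mean([1, 2, 3]))  # 2.0
```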

@mrocklin
Member

Yes, I think that we should handle the no-workers case, even if the warnings/logs are delayed.

@lesteve
Member Author

lesteve commented Jun 17, 2020

Yes, I think that we should handle the no-workers case, even if the warnings/logs are delayed.

Fixed. Actually this was causing an HTTP 500 in the dashboard when there were no workers, which I did not realise until I tested my change.

@mrocklin mrocklin merged commit 2407e64 into dask:master Jun 17, 2020
@mrocklin
Member

Great. Thanks @lesteve !

@lesteve lesteve deleted the fix-dashboard-total-memory-usage branch July 24, 2020 14:14
ernestprovo23 added a commit to ernestprovo23/distributed that referenced this pull request Feb 22, 2026
The WorkerTable's Total CPU % was divided by len(workers) instead of
sum(nthreads), producing incorrect values when workers have multiple
threads (e.g., 400% instead of 50%). This aligns the "cpu" column
with the already-correct "cpu_fraction" column, matching the same
fix pattern as PR dask#3897 for memory_percent.

Closes dask#8490
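The commit message above can be illustrated with a minimal sketch (the dict layout is an assumption, not the WorkerTable code): one worker with 8 threads, each thread 50% busy, gives a psutil-style per-worker cpu percent of 400.

```python
# Illustration of the bug described in the commit message, not the
# actual fix: dividing the summed cpu percent by the worker count
# instead of the thread count.
workers = [{"cpu": 400.0, "nthreads": 8}]

wrong = sum(w["cpu"] for w in workers) / len(workers)
right = sum(w["cpu"] for w in workers) / sum(w["nthreads"] for w in workers)
print(wrong, right)  # 400.0 50.0
```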
