
Fix total aggregated metrics in dashboard#3897

Merged
mrocklin merged 4 commits into dask:master from lesteve:fix-dashboard-total-memory-usage
Jun 17, 2020

Conversation

@lesteve
Member

@lesteve lesteve commented Jun 15, 2020

Originally reported in dask/dask-jobqueue#440.

To reproduce:

import webbrowser
from distributed import LocalCluster

cluster = LocalCluster(n_workers=4)
webbrowser.open(cluster.dashboard_link.replace("status", "workers"))

[screenshot: the workers page of the dashboard, showing the aggregated totals row]

The total memory usage is 250MiB, the total memory limit is 16GiB, so the total memory percentage should be 250/16e3 ~= 1.6% and not 6.4% (the memory percentage gets summed across all the workers rather than averaged).
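The arithmetic can be shown with a small standalone sketch (numbers chosen for round results, not the exact figures from the screenshot): summing per-worker percentages multiplies the true total by the number of workers.

```python
# Illustration only, not the dashboard code: four identical workers,
# each using 64 MiB of a 4 GiB limit.
workers = [{"memory": 64 * 2**20, "memory_limit": 4 * 2**30}] * 4

# Wrong: sum the per-worker percentages (what the dashboard did)
summed = sum(100 * w["memory"] / w["memory_limit"] for w in workers)

# Right: total memory used over total memory limit
total = (
    100
    * sum(w["memory"] for w in workers)
    / sum(w["memory_limit"] for w in workers)
)

print(summed, total)  # 6.25 1.5625
```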

I believe memory_percent is the only column affected by this.

I am not sure how to add a test for this and I was not able to find a good source of inspiration in the existing tests, but suggestions are more than welcome!

@mrocklin
Member

cc @jsignell who I think worked on this before

@jsignell
Member

Wow, yeah. I guess it shouldn't be summing that column. CPU is wrong as well, right?

@lesteve
Member Author

lesteve commented Jun 16, 2020

I fixed the computation for cpu, cpu_fraction and memory_percent when workers have different memory_limit values. I used ws.metrics because it felt simpler than using data (one of the problems with data is that the column you need may not have been inserted yet).
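A hedged sketch of the kind of aggregation described here (the function name and dict layout are illustrative, not the actual distributed code): raw quantities are summed, and the derived ratio columns are then recomputed from those sums rather than summed per worker.

```python
# Illustrative sketch, not the actual WorkerTable implementation.
def total_row(metrics):
    """Aggregate per-worker metrics into a single "Total" row."""
    total_limit = sum(m["memory_limit"] for m in metrics)
    total_memory = sum(m["memory"] for m in metrics)
    total_threads = sum(m["nthreads"] for m in metrics)
    return {
        "memory": total_memory,
        "memory_limit": total_limit,
        # equals a memory_limit-weighted average of per-worker percentages
        "memory_percent": 100 * total_memory / total_limit if total_limit else 0.0,
        # per-worker cpu percent is summed, then normalized per thread
        "cpu": sum(m["cpu"] for m in metrics) / total_threads if total_threads else 0.0,
    }

# Two workers with different memory limits: the total is 2 GiB / 16 GiB
metrics = [
    {"memory": 1 * 2**30, "memory_limit": 4 * 2**30, "cpu": 50.0, "nthreads": 1},
    {"memory": 1 * 2**30, "memory_limit": 12 * 2**30, "cpu": 150.0, "nthreads": 3},
]
row = total_row(metrics)
print(row["memory_percent"], row["cpu"])  # 12.5 50.0
```

Recomputing from the sums is what makes the result correct for heterogeneous workers: a naive unweighted average of per-worker percentages would give the wrong answer whenever the memory limits differ.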

Member

@jsignell jsignell left a comment


This looks right to me. Thanks @lesteve for taking this on :)

@lesteve lesteve changed the title from "Fix total memory_percent usage in dashboard" to "Fix total aggregated metrics in dashboard" on Jun 16, 2020
@lesteve
Member Author

lesteve commented Jun 16, 2020

It seems like the CI failures are not related to this PR and should be fixed by #3904:

ImportError: cannot import name 'create_hosts_whitelist' from 'bokeh.server.util'

@mrocklin
Member

Merged and re-triggered CI here

@lesteve lesteve force-pushed the fix-dashboard-total-memory-usage branch from 938a59d to 21c4dce on June 16, 2020 20:36
@lesteve
Member Author

lesteve commented Jun 16, 2020

I am seeing ZeroDivisionError errors when I exit my IPython session. I guess they happen because at the beginning there are no workers so the computation for some columns fails.

Should I try to get rid of these (maybe by catching ZeroDivisionError in the try/except a few lines below, or by testing len(self.scheduler.workers)), or should I ignore them?

Here is the full error:

bokeh.util.tornado - ERROR - Error thrown from periodic callback:
bokeh.util.tornado - ERROR - Traceback (most recent call last):
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/server/session.py", line 67, in _needs_document_lock_wrapper
    result = func(self, *args, **kwargs)
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/server/session.py", line 195, in with_document_locked
    return func(*args, **kwargs)
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/document/document.py", line 1164, in wrapper
    return doc._with_self_as_curdoc(invoke)
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/document/document.py", line 1150, in _with_self_as_curdoc
    return f()
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/document/document.py", line 1163, in invoke
    return f(*args, **kwargs)
  File "/home/lesteve/dev/distributed/distributed/dashboard/components/__init__.py", line 77, in <lambda>
    doc.add_periodic_callback(lambda: update(ref), interval)
  File "/home/lesteve/dev/distributed/distributed/dashboard/components/__init__.py", line 84, in update
    comp.update()
  File "/home/lesteve/miniconda3/lib/python3.7/site-packages/bokeh/core/property/validation.py", line 93, in func
    return input_function(*args, **kwargs)
  File "/home/lesteve/dev/distributed/distributed/dashboard/components/scheduler.py", line 1867, in update
    / len(self.scheduler.workers.values())
ZeroDivisionError: float division by zero
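The guard being discussed could look like this minimal sketch (illustrative helper, not the exact dashboard code): return 0.0 instead of dividing when the scheduler has no workers yet.

```python
# Hedged sketch: avoid ZeroDivisionError in the periodic callback by
# falling back to 0.0 when there are no workers to average over.
def safe_mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

print(safe_mean([]))         # 0.0 rather than ZeroDivisionError
print(safe_mean([1, 2, 3]))  # 2.0
```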

@mrocklin
Member

Yes, I think that we should handle the no-workers case, even if the warnings/logs are delayed.

@lesteve
Member Author

lesteve commented Jun 17, 2020

Yes, I think that we should handle the no-workers case, even if the warnings/logs are delayed.

Fixed. Actually this was causing an HTTP 500 in the dashboard when there were no workers, which I did not realise until I tested my change.

@mrocklin mrocklin merged commit 2407e64 into dask:master Jun 17, 2020
@mrocklin
Member

Great. Thanks @lesteve !

@lesteve lesteve deleted the fix-dashboard-total-memory-usage branch July 24, 2020 14:14
ernestprovo23 added a commit to ernestprovo23/distributed that referenced this pull request Feb 22, 2026
The WorkerTable's Total CPU % was divided by len(workers) instead of
sum(nthreads), producing incorrect values when workers have multiple
threads (e.g., 400% instead of 50%). This aligns the "cpu" column
with the already-correct "cpu_fraction" column, matching the same
fix pattern as PR dask#3897 for memory_percent.

Closes dask#8490
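The commit message above can be illustrated with a minimal sketch (the dict layout is an assumption, not the WorkerTable code): one worker with 8 threads, each thread 50% busy, gives a psutil-style per-worker cpu percent of 400.

```python
# Illustration of the bug described in the commit message, not the
# actual fix: dividing the summed cpu percent by the worker count
# instead of the thread count.
workers = [{"cpu": 400.0, "nthreads": 8}]

wrong = sum(w["cpu"] for w in workers) / len(workers)
right = sum(w["cpu"] for w in workers) / sum(w["nthreads"] for w in workers)
print(wrong, right)  # 400.0 50.0
```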
