Fix total aggregated metrics in dashboard#3897
Conversation
cc @jsignell who I think worked on this before

Wow, yeah. I guess it shouldn't be summing that column. CPU is wrong as well, right?

I fixed the computation for cpu, cpu_fraction and memory_percent when workers have different memory_limit. I used

It seems like the CI failures are not related to this PR and should be fixed by #3904:

Merged and re-triggered CI here
Fix memory_percent as well for workers with different memory_limit values.
I am seeing ZeroDivisionError errors when I exit my IPython session. I guess they happen because at the beginning there are no workers, so the computation for some columns fails. Should I try to get rid of these (maybe by catching ZeroDivisionError in the try/except a few lines below, or by testing for the no-workers case)? Here is the full error:

Yes, I think that we should handle the no-workers case, even if the warnings/logs are delayed.

Fixed. Actually, this was causing an HTTP 500 in the dashboard when there were no workers, which I did not realise until I tested my change.
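A minimal sketch of the no-workers guard discussed above (the helper name is hypothetical, not distributed's actual code):

```python
def safe_percent(numerator, denominator):
    # With no workers registered yet, aggregated totals are zero and a plain
    # division raises ZeroDivisionError (which surfaced as an HTTP 500 in
    # the dashboard), so fall back to 0.0 for the empty-cluster case.
    try:
        return 100.0 * numerator / denominator
    except ZeroDivisionError:
        return 0.0
```

For example, `safe_percent(0, 0)` returns `0.0` instead of raising.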
Great. Thanks @lesteve!
The WorkerTable's Total CPU % was divided by len(workers) instead of sum(nthreads), producing incorrect values when workers have multiple threads (e.g., 400% instead of 50%). This aligns the "cpu" column with the already-correct "cpu_fraction" column, matching the same fix pattern as PR dask#3897 for memory_percent. Closes dask#8490
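The corrected aggregation can be sketched as follows (a minimal illustration; the list-of-dicts shape with "cpu" and "nthreads" keys is an assumption, not distributed's actual worker state):

```python
def total_cpu_percent(workers):
    # workers: assumed list of dicts with "cpu" (a psutil-style percentage,
    # which can exceed 100 for a multi-threaded worker) and "nthreads".
    nthreads = sum(w["nthreads"] for w in workers)
    if nthreads == 0:  # no workers yet
        return 0.0
    # Dividing by len(workers) overstated utilisation; dividing by the
    # total thread count gives a true per-thread average.
    return sum(w["cpu"] for w in workers) / nthreads
```

With a single 8-thread worker reporting 400% CPU, dividing by len(workers) yields 400%, while dividing by sum(nthreads) yields the intended 50%.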
Originally reported in dask/dask-jobqueue#440.
To reproduce:
The total memory usage is 250 MiB, the total memory limit is 16 GiB, so the total memory percentage should be 250/16e3 ~= 1.6% and not 6.4% (the memory percentage gets summed across all the workers rather than averaged). I believe memory_percent is the only column affected by this.

Not sure how to add a test for this and I was not able to find good inspiration in the existing tests, but suggestions are more than welcome!
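The difference between summing per-worker percentages and weighting by memory_limit can be reproduced with a small sketch (the worker numbers are chosen to mirror the report above; the dict shape is an assumption, not distributed's actual worker state):

```python
MiB, GiB = 2**20, 2**30

# Four workers totalling 250 MiB used out of a 16 GiB combined limit.
workers = [{"memory": 62.5 * MiB, "memory_limit": 4 * GiB} for _ in range(4)]

# Buggy: summing each worker's percentage multiplies the result by the
# worker count (~6.1% here).
buggy = sum(100 * w["memory"] / w["memory_limit"] for w in workers)

# Fixed: total usage over total limit (~1.5% here).
fixed = (
    100
    * sum(w["memory"] for w in workers)
    / sum(w["memory_limit"] for w in workers)
)
```

Note the percentages must be weighted by each worker's memory_limit (equivalently, total bytes used over total limit); a plain average of per-worker percentages is only correct when every worker has the same limit.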