Conversation
@@ -34,20 +34,6 @@ def __init__(self, server: Worker):
    def collect(self) -> Iterator[Metric]:
I gutted most of this class away; eventually it will be deleted completely.
Are the metrics you've refactored here actually showing up in a prometheus scrape? Is some magic making that happen?
Yes, they're showing up.
The magic is that they are registered automatically on `prometheus_client.REGISTRY`.
In the final version we'll actually have to use a local registry, so that the scheduler doesn't publish the worker metrics and vice versa.
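A minimal sketch of both the default behaviour and the local-registry idea (the metric name here is hypothetical, not the PR's actual code):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# By default, metrics attach themselves to the global prometheus_client.REGISTRY.
# Passing registry= keeps them on a local registry instead, so that e.g. the
# scheduler process never publishes worker metrics.
worker_registry = CollectorRegistry()

tasks = Counter(
    "dask_worker_tasks",  # hypothetical metric name
    "Number of tasks at the worker",
    ["state"],
    registry=worker_registry,
)
tasks.labels(state="executing").inc()

# Renders only the collectors registered on worker_registry:
print(generate_latest(worker_registry).decode())
```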
| ["state"], | ||
| ) | ||
|
|
||
| compute = Summary( |
Summary metrics replace pairs of `_bytes_total` and `_count_total` Counters.
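For illustration, a sketch of what that means in prometheus_client terms (the metric name is made up):

```python
from prometheus_client import Summary

compute = Summary(
    "dask_worker_compute",  # hypothetical name
    "Time spent computing tasks",
    unit="seconds",
)

# A single observation updates both series that previously needed two Counters:
# dask_worker_compute_seconds_sum   += 0.25
# dask_worker_compute_seconds_count += 1
compute.observe(0.25)
```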
| unit="seconds", | ||
| ) | ||
|
|
||
| latency = Histogram( |
Histogram metrics also embed a Summary; they replace the `_max` metrics and all the crick stuff.
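A sketch of the Histogram semantics being relied on here (name and buckets are hypothetical):

```python
from prometheus_client import Histogram

latency = Histogram(
    "dask_worker_latency",  # hypothetical name
    "Latency of communications with peer workers",
    unit="seconds",
    buckets=[0.001, 0.01, 0.1, 1.0],  # upper bounds; +Inf is added implicitly
)

# Exposes dask_worker_latency_seconds_bucket{le=...} plus the embedded
# Summary series ..._sum and ..._count.
latency.observe(0.042)
```

The quantiles that crick used to estimate client-side can instead be derived server-side, e.g. `histogram_quantile(0.99, rate(dask_worker_latency_seconds_bucket[5m]))` in PromQL.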
| "Current number of bytes worth of data transfers with peer workers", | ||
| ["direction"], | ||
| unit="bytes", | ||
| ) |
Sadly there's no single metric for count+bytes for current metrics, unlike Summary which does the job for total metrics. We could (should?) create a class for it though.
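A sketch of what such a class could look like (entirely hypothetical, not part of this PR):

```python
from prometheus_client import Gauge

class GaugePair:
    """Hypothetical combined metric: a current count plus current bytes."""

    def __init__(self, name, documentation, labelnames=()):
        self.count = Gauge(f"{name}_count", documentation, labelnames)
        self.bytes = Gauge(f"{name}_bytes", documentation, labelnames)

transfers = GaugePair(
    "dask_worker_transfer",  # hypothetical name
    "Current data transfers with peer workers",
    ["direction"],
)
transfers.count.labels(direction="incoming").inc()
transfers.bytes.labels(direction="incoming").inc(1_048_576)
```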
@@ -0,0 +1,80 @@
from __future__ import annotations

from prometheus_client import Counter, Gauge, Histogram, Summary
TODO: `from distributed import prometheus`; in that module, create stub classes for when `prometheus_client` is not available.
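A sketch of that stub-class idea, assuming a hypothetical `distributed/prometheus.py`:

```python
from __future__ import annotations

try:
    from prometheus_client import Counter, Gauge, Histogram, Summary
except ImportError:
    # prometheus_client is an optional dependency: fall back to no-op stubs
    # so that modules defining metrics can always be imported.
    class _NoOpMetric:
        def __init__(self, *args, **kwargs): ...

        def labels(self, *args, **kwargs):
            return self

        def inc(self, amount=1): ...

        def dec(self, amount=1): ...

        def set(self, value): ...

        def observe(self, value): ...

    Counter = Gauge = Histogram = Summary = _NoOpMetric  # type: ignore
```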
I plan to evaluate switching to opentelemetry as brought up by @jacobtomlinson. From what I understand, this would allow us to introduce a hard dependency on the opentelemetry-api only (which contains a bunch of NoOpMetrics). I think this should solve your problem here.
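A sketch of what that would look like with only opentelemetry-api installed (instrument names are hypothetical); `get_meter()` hands back a no-op meter until a metrics SDK is configured:

```python
from opentelemetry import metrics

meter = metrics.get_meter("distributed.worker")  # hypothetical meter name
tasks = meter.create_counter(
    "dask_worker_tasks",  # hypothetical instrument name
    unit="1",
    description="Tasks handled by the worker",
)

# With no SDK wired up this is a silent no-op, so a hard dependency is safe.
tasks.add(1, {"state": "executing"})
```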
@crusaderky I assume the reason all these metrics are defined in one file (as opposed to being defined in the place they're used) is to have one place to deal with the potential import error?
I was a little surprised to see the metrics created in one place and used in another, when they're only used in exactly one place.
No, it's because many metrics are contributed to by different modules.
For example, for `dask_worker_event_loop_blocked`:

- The `disk-write-target` label is set by `worker_state_machine`
- The `disk-write-spill` label is set by `worker_memory`
- The `disk-read-execute` label is set by `worker`
- The `disk-read-get-data` label is set by `worker`
Besides that, it also felt nice to have everything in a self-contained module and to access the metrics through a `metrics.XXX` namespace.
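A sketch of that pattern (the label key and module layout are illustrative only):

```python
from prometheus_client import Counter

# Defined once, in the central metrics module:
event_loop_blocked = Counter(
    "dask_worker_event_loop_blocked",
    "Time during which the event loop was blocked",
    ["cause"],  # hypothetical label key
    unit="seconds",
)

# worker_memory contributes one label value...
event_loop_blocked.labels(cause="disk-write-spill").inc(0.5)
# ...while worker contributes others:
event_loop_blocked.labels(cause="disk-read-execute").inc(0.2)
event_loop_blocked.labels(cause="disk-read-get-data").inc(0.1)
```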
Force-pushed from 94f8b74 to 2634014.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

22 files ±0   22 suites ±0   9h 52m 49s ⏱️ -15m 8s

For more details on these failures, see this check.

Results for commit 2634014. ± Comparison against base commit f830259.
cc @ntabris
Closes #7364
For now, this is just a POC to showcase the look & feel suggested in the parent issue.