Conversation
@@ -34,20 +34,6 @@ def __init__(self, server: Worker):
    def collect(self) -> Iterator[Metric]:
I gutted most of this class away; eventually it will be deleted completely.
Are the metrics you've refactored here actually showing up in a prometheus scrape? Is some magic making that happen?
Yes, they're showing up.
The magic is that they are registered automatically on `prometheus_client.REGISTRY`.
In the final version we'll actually have to use a local registry, so that the scheduler doesn't publish the worker metrics and vice versa.
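A minimal sketch of both the default behaviour and the local-registry idea (the metric name here is hypothetical, not the PR's actual code):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# By default, metrics attach themselves to the global prometheus_client.REGISTRY.
# Passing registry= keeps them on a local registry instead, so that e.g. the
# scheduler process never publishes worker metrics.
worker_registry = CollectorRegistry()

tasks = Counter(
    "dask_worker_tasks",  # hypothetical metric name
    "Number of tasks at the worker",
    ["state"],
    registry=worker_registry,
)
tasks.labels(state="executing").inc()

# Renders only the collectors registered on worker_registry:
print(generate_latest(worker_registry).decode())
```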
| ["state"], | ||
| ) | ||
|
|
||
| compute = Summary( |
Summary metrics replace pairs of `_bytes_total` and `_count_total` Counters.
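For illustration, a sketch of what that means in prometheus_client terms (the metric name is made up):

```python
from prometheus_client import Summary

compute = Summary(
    "dask_worker_compute",  # hypothetical name
    "Time spent computing tasks",
    unit="seconds",
)

# A single observation updates both series that previously needed two Counters:
# dask_worker_compute_seconds_sum   += 0.25
# dask_worker_compute_seconds_count += 1
compute.observe(0.25)
```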
| unit="seconds", | ||
| ) | ||
|
|
||
| latency = Histogram( |
Histogram metrics also embed a Summary; they replace the `_max` metrics and all the crick stuff.
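A sketch of the Histogram semantics being relied on here (name and buckets are hypothetical):

```python
from prometheus_client import Histogram

latency = Histogram(
    "dask_worker_latency",  # hypothetical name
    "Latency of communications with peer workers",
    unit="seconds",
    buckets=[0.001, 0.01, 0.1, 1.0],  # upper bounds; +Inf is added implicitly
)

# Exposes dask_worker_latency_seconds_bucket{le=...} plus the embedded
# Summary series ..._sum and ..._count.
latency.observe(0.042)
```

The quantiles that crick used to estimate client-side can instead be derived server-side, e.g. `histogram_quantile(0.99, rate(dask_worker_latency_seconds_bucket[5m]))` in PromQL.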
| "Current number of bytes worth of data transfers with peer workers", | ||
| ["direction"], | ||
| unit="bytes", | ||
| ) |
Sadly there's no single metric for count+bytes for current metrics, unlike Summary which does the job for total metrics. We could (should?) create a class for it though.
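A sketch of what such a class could look like (entirely hypothetical, not part of this PR):

```python
from prometheus_client import Gauge

class GaugePair:
    """Hypothetical combined metric: a current count plus current bytes."""

    def __init__(self, name, documentation, labelnames=()):
        self.count = Gauge(f"{name}_count", documentation, labelnames)
        self.bytes = Gauge(f"{name}_bytes", documentation, labelnames)

transfers = GaugePair(
    "dask_worker_transfer",  # hypothetical name
    "Current data transfers with peer workers",
    ["direction"],
)
transfers.count.labels(direction="incoming").inc()
transfers.bytes.labels(direction="incoming").inc(1_048_576)
```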
@@ -0,0 +1,80 @@
from __future__ import annotations

from prometheus_client import Counter, Gauge, Histogram, Summary
TODO: `from distributed import prometheus`; in that module, create stub classes for when `prometheus_client` is not available.
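A sketch of that stub-class idea, assuming a hypothetical `distributed/prometheus.py`:

```python
from __future__ import annotations

try:
    from prometheus_client import Counter, Gauge, Histogram, Summary
except ImportError:
    # prometheus_client is an optional dependency: fall back to no-op stubs
    # so that modules defining metrics can always be imported.
    class _NoOpMetric:
        def __init__(self, *args, **kwargs): ...

        def labels(self, *args, **kwargs):
            return self

        def inc(self, amount=1): ...

        def dec(self, amount=1): ...

        def set(self, value): ...

        def observe(self, value): ...

    Counter = Gauge = Histogram = Summary = _NoOpMetric  # type: ignore
```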
I plan to evaluate switching to opentelemetry as brought up by @jacobtomlinson. From what I understand, this would allow us to introduce a hard dependency on the opentelemetry-api only (which contains a bunch of NoOpMetrics). I think this should solve your problem here.
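A sketch of what that would look like with only opentelemetry-api installed (instrument names are hypothetical); `get_meter()` hands back a no-op meter until a metrics SDK is configured:

```python
from opentelemetry import metrics

meter = metrics.get_meter("distributed.worker")  # hypothetical meter name
tasks = meter.create_counter(
    "dask_worker_tasks",  # hypothetical instrument name
    unit="1",
    description="Tasks handled by the worker",
)

# With no SDK wired up this is a silent no-op, so a hard dependency is safe.
tasks.add(1, {"state": "executing"})
```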
@crusaderky I assume the reason all these metrics are defined in one file (as opposed to being defined in the place they're used) is to have one place to deal with the potential import error?
I was a little surprised to see the metrics created in one place and used in another, when they're only used in exactly one place.
No, it's because many metrics are contributed to by different modules.
For example, for `dask_worker_event_loop_blocked`:

- The `disk-write-target` label is set by `worker_state_machine`
- The `disk-write-spill` label is set by `worker_memory`
- The `disk-read-execute` label is set by `worker`
- The `disk-read-get-data` label is set by `worker`
Besides that, it also felt nice to have everything in a self-contained module and to access the metrics through a `metrics.XXX` namespace.
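A sketch of that pattern (the label key and module layout are illustrative only):

```python
from prometheus_client import Counter

# Defined once, in the central metrics module:
event_loop_blocked = Counter(
    "dask_worker_event_loop_blocked",
    "Time during which the event loop was blocked",
    ["cause"],  # hypothetical label key
    unit="seconds",
)

# worker_memory contributes one label value...
event_loop_blocked.labels(cause="disk-write-spill").inc(0.5)
# ...while worker contributes others:
event_loop_blocked.labels(cause="disk-read-execute").inc(0.2)
event_loop_blocked.labels(cause="disk-read-get-data").inc(0.1)
```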
Force-pushed from 94f8b74 to 2634014.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

22 files ±0   22 suites ±0   9h 52m 49s ⏱️ -15m 8s

For more details on these failures, see this check.

Results for commit 2634014. ± Comparison against base commit f830259.
cc @ntabris
Closes #7364
For now, this is just a POC to showcase the look & feel suggested in the parent issue.