Closed
Description
In Prometheus, we export the number of tasks on the worker using
`distributed/distributed/worker_state_machine.py`, lines 3299 to 3324 in f830259:

```python
@property
def task_counts(self) -> dict[TaskStateState | Literal["other"], int]:
    # Actors can be in any state other than {fetch, flight, missing}
    n_actors_in_memory = sum(
        self.tasks[key].state == "memory" for key in self.actors
    )
    out: dict[TaskStateState | Literal["other"], int] = {
        # Key measure for occupancy.
        # Also includes cancelled(executing) and resumed(executing->fetch)
        "executing": len(self.executing),
        # Also includes cancelled(long-running) and resumed(long-running->fetch)
        "long-running": len(self.long_running),
        "memory": len(self.data) + n_actors_in_memory,
        "ready": len(self.ready),
        "constrained": len(self.constrained),
        "waiting": len(self.waiting),
        "fetch": self.fetch_count,
        "missing": len(self.missing_dep_flight),
        # Also includes cancelled(flight) and resumed(flight->waiting)
        "flight": len(self.in_flight_tasks),
    }
    # released | error
    out["other"] = other = len(self.tasks) - sum(out.values())
    assert other >= 0
    return out
```
Erred tasks can be an important indicator that something is going very wrong, but today they are lumped together with released tasks under "other". I just went through a real use case where "other" was very large, and I can't tell what I'm looking at: a huge number of erred tasks, a huge number of released tasks, or a bug in the code that hides a third state?
Solution 1: add a new set to the WorkerStateMachine, containing all erred tasks, just for the purpose of this counting
Solution 2: rewrite the task count as an inc/dec counter whenever there is a transition
Solution 3: #7411 (which is the same as 2, but natively with prometheus_client)
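To make Solution 2 concrete, here is a minimal sketch of a per-state counter that is incremented and decremented on every transition. `TaskStateCounter` and `on_transition` are hypothetical names invented for illustration, not part of `distributed`; the sketch assumes the state machine has a single place where it can report each task's old and new state.

```python
from __future__ import annotations

from collections import Counter


class TaskStateCounter:
    """Hypothetical helper: per-state task counts, updated on each transition."""

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def on_transition(self, start: str | None, finish: str | None) -> None:
        # start=None means the task is newly tracked;
        # finish=None means the task is being forgotten.
        if start is not None:
            self.counts[start] -= 1
            if self.counts[start] == 0:
                # Drop empty states so the exported mapping stays small
                del self.counts[start]
        if finish is not None:
            self.counts[finish] += 1


counter = TaskStateCounter()
for _ in range(3):
    counter.on_transition(None, "released")  # three tasks become known
counter.on_transition("released", "waiting")
counter.on_transition("waiting", "executing")
counter.on_transition("executing", "error")  # first task fails
counter.on_transition("released", "memory")  # second task completes
print(dict(counter.counts))
```

With this approach "error" and "released" are reported as separate entries instead of being inferred as a remainder, so a large "other" bucket can no longer hide which of the two is growing.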