
Metrics cluster working state and last update#7670

Merged
timvisee merged 6 commits into dev from extended-cluster-metrics
Dec 5, 2025

Conversation

@JojiiOfficial
Contributor

@JojiiOfficial JojiiOfficial commented Dec 2, 2025

Depends on #7479

Adds the following two new metric families to the metrics API:

# HELP cluster_last_update_delta time since last update
# TYPE cluster_last_update_delta gauge
cluster_last_update_delta 1.223929901

# HELP cluster_working_state working state of the cluster
# TYPE cluster_working_state gauge
cluster_working_state{state="working"} 1
cluster_working_state{state="stopped"} 0

cluster_working_state always has two series, one with state=working and one with state=stopped.
The active state exports the value 1.
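As a rough illustration of the exposition output described above, the following stdlib-only Python sketch renders both metric families. The renderer itself is hypothetical (it is not Qdrant's actual exporter code); only the metric names and label values come from this PR.

```python
# Illustrative sketch: render the two metric families from this PR in
# Prometheus exposition format. Exactly one cluster_working_state series
# carries the value 1; the other is always 0.

def render_cluster_metrics(last_update: float, working: bool, now: float) -> str:
    lines = [
        "# HELP cluster_last_update_delta time since last update",
        "# TYPE cluster_last_update_delta gauge",
        f"cluster_last_update_delta {now - last_update}",
        "# HELP cluster_working_state working state of the cluster",
        "# TYPE cluster_working_state gauge",
        f'cluster_working_state{{state="working"}} {1 if working else 0}',
        f'cluster_working_state{{state="stopped"}} {0 if working else 1}',
    ]
    return "\n".join(lines) + "\n"

print(render_cluster_metrics(last_update=100.0, working=True, now=101.5))
```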

@JojiiOfficial JojiiOfficial force-pushed the extended-cluster-metrics branch from b016eea to 92751b4 on December 2, 2025 12:26
@timvisee
Member

timvisee commented Dec 2, 2025

cluster_working_state always has two series, one with state=working and one with state=stopped.
The active state exports the value 1.
If the working state is "stopped", an optional error is provided; an empty string indicates no error.
To keep the metrics consistent, an empty error value is always provided, just as the non-active state always exports the value 0.

I'm not sure if this is a conventional way of doing things? Maybe it's better to only report the current one?

cluster_last_update_delta

I suggest renaming this to cluster_last_update_seconds, cluster_last_update_delta_seconds, or cluster_last_update_age_seconds:

https://prometheus.io/docs/practices/naming/

Base automatically changed from metrics-avoid-panic to dev December 2, 2025 14:03
@JojiiOfficial JojiiOfficial force-pushed the extended-cluster-metrics branch from 92751b4 to 621c90c on December 3, 2025 11:12

@JojiiOfficial
Contributor Author

JojiiOfficial commented Dec 3, 2025

I'm not sure if this is a conventional way of doing things? Maybe it's better to only report the current one?

It probably isn't the conventional way, but it seemed more suitable for plotting to me.

Checking with ChatGPT, it seems to confirm my approach:

✅ 1. Is it conventional to represent states like this?

Yes — for multi-state enumerations.

Example (good):
    my_service_state{state="running"} 1
    my_service_state{state="stopped"} 0
    my_service_state{state="error"} 0

⚠️ 2. What about one time series that changes its state label value over time?

    This is **not recommended**.

    Prometheus treats each unique label set as a separate time series.
    If the value of the state label changes (e.g. xy → yz → qq), you are:

    - creating a new time series every time the state changes
    - causing churn, which is bad for performance
    - making long-term trend queries messy
    - losing the ability to track how long each state existed (unless you manually reconstruct it)

Prometheus best practices:
👉 Label values should not have high cardinality or change frequently.

An alternative could be to include the state in the metrics name:

cluster_working_state_working{} 1
cluster_working_state_stopped{error=""} 0

I know ChatGPT could be wrong here so please let me know if you want me to look deeper into the conventions here!

Edit: On second thought, maybe we shouldn't include the error itself, for the same reason. I assume the error, if it exists, can be read somewhere else too.
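The multi-state enum pattern discussed above generalizes to any number of states: keep a fixed set of series and flip exactly one to 1, so the label set never changes and no series churn occurs. A small illustrative sketch (the function name is hypothetical):

```python
# Illustrative sketch of the recommended multi-state enum pattern:
# one series per possible state, exactly one set to 1. Because every
# state is always present, the set of time series stays fixed even
# when the active state changes.

def render_enum_metric(name: str, states: list[str], active: str) -> list[str]:
    if active not in states:
        raise ValueError(f"unknown state: {active!r}")
    return [f'{name}{{state="{s}"}} {1 if s == active else 0}' for s in states]

for line in render_enum_metric("my_service_state", ["running", "stopped", "error"], "running"):
    print(line)
```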

@qdrant qdrant deleted a comment from coderabbitai bot Dec 3, 2025
@timvisee
Member

timvisee commented Dec 4, 2025

Edit: On second thought, maybe we shouldn't include the error itself, for the same reason. I assume the error, if it exists, can be read somewhere else too.

Yeah, that sounds reasonable!

I'm fine with that ChatGPT explanation.

Let's exclude the error message, but just denote the successful and error states. That would be enough to detect 'an error'. And then a user is responsible for looking into the cluster to see the actual error.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
Member

@timvisee timvisee left a comment


One minor change I just noticed. Other than that all good, thanks!


@JojiiOfficial JojiiOfficial force-pushed the extended-cluster-metrics branch from ad788f6 to d3cb8ab on December 5, 2025 12:23
@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
@timvisee timvisee merged commit 55e5702 into dev Dec 5, 2025
15 checks passed
@timvisee timvisee deleted the extended-cluster-metrics branch December 5, 2025 14:47
timvisee pushed a commit that referenced this pull request Dec 18, 2025
* cluster working state and last update in metrics

* Rename metric

* Remove error string

* Use timestamp instead

* Fix prometheus help text

* Change metric type to counter
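The final commits above switch from an age gauge to a timestamp metric exported as a counter, leaving the age computation to the query side (e.g. subtracting the metric from the current time). A hedged sketch of that approach; the metric name used here is an assumption for illustration, not necessarily the one that was merged:

```python
# Hypothetical sketch of the timestamp approach from the final commits:
# export the last-update time as a Unix timestamp instead of an age, so
# the scraper/query side can compute "time since last update" itself.
# Metric name is an assumption, not taken from the merged code.

def render_last_update(last_update_unix: float) -> str:
    return (
        "# HELP cluster_last_update_timestamp_seconds unix timestamp of the last cluster update\n"
        "# TYPE cluster_last_update_timestamp_seconds counter\n"
        f"cluster_last_update_timestamp_seconds {last_update_unix}\n"
    )

print(render_last_update(1733400000.0))
```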
