
Metrics cluster working state and last update#7670

Merged
timvisee merged 6 commits into dev from extended-cluster-metrics
Dec 5, 2025

Conversation

@JojiiOfficial
Contributor

@JojiiOfficial JojiiOfficial commented Dec 2, 2025

Depends on #7479

Adds the following two new metric families to the metrics API:

# HELP cluster_last_update_delta time since last update
# TYPE cluster_last_update_delta gauge
cluster_last_update_delta 1.223929901

# HELP cluster_working_state working state of the cluster
# TYPE cluster_working_state gauge
cluster_working_state{state="working"} 1
cluster_working_state{state="stopped"} 0

cluster_working_state always has two series, one with state=working and one with state=stopped.
The active state exports the value 1.
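As a rough illustration of the exposition output described above, the following stdlib-only Python sketch renders both metric families. The renderer itself is hypothetical (it is not Qdrant's actual exporter code); only the metric names and label values come from this PR.

```python
# Illustrative sketch: render the two metric families from this PR in
# Prometheus exposition format. Exactly one cluster_working_state series
# carries the value 1; the other is always 0.

def render_cluster_metrics(last_update: float, working: bool, now: float) -> str:
    lines = [
        "# HELP cluster_last_update_delta time since last update",
        "# TYPE cluster_last_update_delta gauge",
        f"cluster_last_update_delta {now - last_update}",
        "# HELP cluster_working_state working state of the cluster",
        "# TYPE cluster_working_state gauge",
        f'cluster_working_state{{state="working"}} {1 if working else 0}',
        f'cluster_working_state{{state="stopped"}} {0 if working else 1}',
    ]
    return "\n".join(lines) + "\n"

print(render_cluster_metrics(last_update=100.0, working=True, now=101.5))
```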

@JojiiOfficial JojiiOfficial force-pushed the extended-cluster-metrics branch from b016eea to 92751b4 on December 2, 2025 12:26
@timvisee
Member

timvisee commented Dec 2, 2025

cluster_working_state always has two series, one with state=working and one with state=stopped.
The active state exports the value 1.
If the working state is "stopped", an optional error is provided; an empty string indicates no error.
To keep the metrics consistent, an empty error value is always provided, just as the non-active state always exports the value 0.

I'm not sure if this is a conventional way of doing things? Maybe it's better to only report the current one?

cluster_last_update_delta

I suggest renaming this to cluster_last_update_seconds, cluster_last_update_delta_seconds, or cluster_last_update_age_seconds:

https://prometheus.io/docs/practices/naming/

Base automatically changed from metrics-avoid-panic to dev December 2, 2025 14:03
@JojiiOfficial JojiiOfficial force-pushed the extended-cluster-metrics branch from 92751b4 to 621c90c on December 3, 2025 11:12

@JojiiOfficial
Contributor Author

JojiiOfficial commented Dec 3, 2025

I'm not sure if this is a conventional way of doing things? Maybe it's better to only report the current one?

It probably isn't the conventional way, but it seemed more suitable for plotting to me.

Checking with ChatGPT, it seems to confirm my approach:

✅ 1. Is it conventional to represent states like this?

Yes — for multi-state enumerations.

Example (good):
    my_service_state{state="running"} 1
    my_service_state{state="stopped"} 0
    my_service_state{state="error"} 0

⚠️ 2. What about one time series that changes its state label value over time?

    This is **not recommended**.

    Prometheus treats each unique label set as a separate time series.
    If the value of the state label changes (e.g. xy → yz → qq), you are:

    - creating a new time series every time the state changes
    - causing churn, which is bad for performance
    - making long-term trend queries messy
    - losing the ability to track how long each state existed (unless you manually reconstruct it)

Prometheus best practices:
👉 Label values should not have high cardinality or change frequently.

An alternative could be to include the state in the metrics name:

cluster_working_state_working{} 1
cluster_working_state_stopped{error=""} 0

I know ChatGPT could be wrong here so please let me know if you want me to look deeper into the conventions here!

Edit: On second thought, maybe we shouldn't include the error itself, for the same reason. I assume the error, if it exists, can be read somewhere else too.
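The multi-state enum pattern discussed above generalizes to any number of states: keep a fixed set of series and flip exactly one to 1, so the label set never changes and no series churn occurs. A small illustrative sketch (the function name is hypothetical):

```python
# Illustrative sketch of the recommended multi-state enum pattern:
# one series per possible state, exactly one set to 1. Because every
# state is always present, the set of time series stays fixed even
# when the active state changes.

def render_enum_metric(name: str, states: list[str], active: str) -> list[str]:
    if active not in states:
        raise ValueError(f"unknown state: {active!r}")
    return [f'{name}{{state="{s}"}} {1 if s == active else 0}' for s in states]

for line in render_enum_metric("my_service_state", ["running", "stopped", "error"], "running"):
    print(line)
```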

@qdrant qdrant deleted a comment from coderabbitai bot Dec 3, 2025
@timvisee
Member

timvisee commented Dec 4, 2025

Edit: On second thought, maybe we shouldn't include the error itself, for the same reason. I assume the error, if it exists, can be read somewhere else too.

Yeah, that sounds reasonable!

I'm fine with that ChatGPT explanation.

Let's exclude the error message, but just denote the successful and error states. That would be enough to detect 'an error'. And then a user is responsible for looking into the cluster to see the actual error.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
Member

@timvisee timvisee left a comment


One minor change I just noticed. Other than that all good, thanks!


@JojiiOfficial JojiiOfficial force-pushed the extended-cluster-metrics branch from ad788f6 to d3cb8ab on December 5, 2025 12:23
@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
@timvisee timvisee merged commit 55e5702 into dev Dec 5, 2025
15 checks passed
@timvisee timvisee deleted the extended-cluster-metrics branch December 5, 2025 14:47
timvisee pushed a commit that referenced this pull request Dec 18, 2025
* cluster working state and last update in metrics

* Rename metric

* Remove error string

* Use timestamp instead

* Fix prometheus help text

* Change metric type to counter
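The final commits above switch from an age gauge to a timestamp metric exported as a counter, leaving the age computation to the query side (e.g. subtracting the metric from the current time). A hedged sketch of that approach; the metric name used here is an assumption for illustration, not necessarily the one that was merged:

```python
# Hypothetical sketch of the timestamp approach from the final commits:
# export the last-update time as a Unix timestamp instead of an age, so
# the scraper/query side can compute "time since last update" itself.
# Metric name is an assumption, not taken from the merged code.

def render_last_update(last_update_unix: float) -> str:
    return (
        "# HELP cluster_last_update_timestamp_seconds unix timestamp of the last cluster update\n"
        "# TYPE cluster_last_update_timestamp_seconds counter\n"
        f"cluster_last_update_timestamp_seconds {last_update_unix}\n"
    )

print(render_last_update(1733400000.0))
```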
