Skip to content

Metrics for snapshots#7497

Merged
timvisee merged 8 commits intodevfrom
metrics-running-snapshots
Nov 11, 2025
Merged

Metrics for snapshots#7497
timvisee merged 8 commits intodevfrom
metrics-running-snapshots

Conversation

@JojiiOfficial
Copy link
Contributor

@JojiiOfficial JojiiOfficial commented Nov 5, 2025

Depends on #7482

Adds the following new snapshot metrics:

# HELP snapshot_creation_running amount of snapshot creations that are currently running
# TYPE snapshot_creation_running gauge
snapshot_creation_running{id="benchmark"} 1

# HELP snapshot_recovery_running amount of snapshot recovery operations currently running
# TYPE snapshot_recovery_running gauge
snapshot_recovery_running{id="benchmark"} 0

# HELP snapshot_created_total total amount of snapshots created
# TYPE snapshot_created_total counter
snapshot_created_total{id="benchmark"} 1

Design decision

There are already a few shard level information about shard snapshots in the telemetry API.
However they are only enabled in certain scenarios (if a snapshot manifest is provided) and only for streaming snapshots.
We had to manually add new metrics anyways (like snapshot_creations_total) in order to expose all metrics discussed, so I decided to implement these consistently uncoupled from the existing telemetry data. This also ensures that these changes don't introduce any unexpected behavior or mis-measurements.

As always, I'm open for any suggestions on these decisions and happy about any feedback!

@JojiiOfficial JojiiOfficial marked this pull request as draft November 5, 2025 17:05
@JojiiOfficial JojiiOfficial force-pushed the metrics-running-snapshots branch from b4477f6 to 2642dd7 Compare November 5, 2025 17:16
@@ -0,0 +1,41 @@
use common::types::{DetailsLevel, TelemetryDetail};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This small refactor is not necessary anymore, but I decided to keep it anyways since it declusters the mod.rs file.
If you prefer it to be reverted, let me know!

}

#[derive(Debug, Default)]
pub struct RequestTracker {
Copy link
Contributor Author

@JojiiOfficial JojiiOfficial Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed a bit too late that there was already some sort of ScopeTracker. I decided to keep my implementation because it is located at a better suited place, has tests and is implemented a bit more generic.

/// Collects various telemetry handled by ToC.
#[derive(Default)]
pub(super) struct TocTelemetryCollector {
snapshots: DashMap<String, SnapshotTelemetryCollector>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to let ToC handle these for 2 reasons:

  • Creating new collections from snapshots can't be fully measured inside Collection because this type
    only exists for the second half of the recovery process (archive extracting, collection creation are skipped)
  • To keep snapshot metrics consistent and in one place. (An alternative way would be to only measure snapshot recovery here or in a similar 'global' location).

Additionally we can extend this later for other metrics that must be handled 'globally'.

If you don't agree with 1. and measuring tar extracting, or collection creation isn't important, I can move this to another place like Collection!

@JojiiOfficial JojiiOfficial changed the title [WIP] Metrics for snapshots Metrics for snapshots Nov 7, 2025
@JojiiOfficial JojiiOfficial marked this pull request as ready for review November 7, 2025 10:19
@timvisee timvisee added the release:1.16.0 Pull requests that should be merged for the Qdrant 1.16.0 release. label Nov 10, 2025
@timvisee timvisee self-assigned this Nov 10, 2025
Base automatically changed from metrics-aggregate-page-faults to dev November 11, 2025 12:38
@timvisee timvisee force-pushed the metrics-running-snapshots branch from 155ff4b to 307bcc8 Compare November 11, 2025 12:39
coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@timvisee timvisee force-pushed the metrics-running-snapshots branch from fa8f07a to cdb6421 Compare November 11, 2025 13:34
@qdrant qdrant deleted a comment from coderabbitai bot Nov 11, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Nov 11, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Nov 11, 2025
Comment on lines +37 to +40
let _telemetry_scope_guard = toc
.snapshot_telemetry_collector(&collection_name)
.running_snapshots
.measure_scope();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we hit this, it might still take some time for the snapshot to actually be created due to locks deeper down. But I think it's fine to still count it from here on.

@timvisee timvisee merged commit f6b715b into dev Nov 11, 2025
15 checks passed
@timvisee timvisee deleted the metrics-running-snapshots branch November 11, 2025 14:00
timvisee added a commit that referenced this pull request Nov 14, 2025
* Add atomic counter for running snapshots

* Refactor collection telemetry + export prometheus metric

* Handle collection in ToC + current recovery measurements

* Fix openapi tests

* Fix tests

* Add snapshot metrics for streaming and partial snapshots.

* Fix typo in metrics name

* Adjust metrics names

---------

Co-authored-by: timvisee <tim@visee.me>
@timvisee timvisee mentioned this pull request Nov 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release:1.16.0 Pull requests that should be merged for the Qdrant 1.16.0 release.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants