Conversation
b4477f6 to
2642dd7
Compare
| @@ -0,0 +1,41 @@ | |||
| use common::types::{DetailsLevel, TelemetryDetail}; | |||
There was a problem hiding this comment.
This small refactor is not necessary anymore, but I decided to keep it anyways since it declusters the mod.rs file.
If you prefer it to be reverted, let me know!
| } | ||
|
|
||
| #[derive(Debug, Default)] | ||
| pub struct RequestTracker { |
There was a problem hiding this comment.
I noticed a bit too late that there was already some sort of ScopeTracker. I decided to keep my implementation because it is located at a better suited place, has tests and is implemented a bit more generic.
| /// Collects various telemetry handled by ToC. | ||
| #[derive(Default)] | ||
| pub(super) struct TocTelemetryCollector { | ||
| snapshots: DashMap<String, SnapshotTelemetryCollector>, |
There was a problem hiding this comment.
I decided to let ToC handle these for 2 reasons:
- Creating new collections from snapshots can't be fully measured inside
Collectionbecause this type
only exists for the second half of the recovery process (archive extracting, collection creation are skipped) - To keep snapshot metrics consistent and in one place. (An alternative way would be to only measure snapshot recovery here or in a similar 'global' location).
Additionally we can extend this later for other metrics that must be handled 'globally'.
If you don't agree with 1. and measuring tar extracting, or collection creation isn't important, I can move this to another place like Collection!
155ff4b to
307bcc8
Compare
fa8f07a to
cdb6421
Compare
| let _telemetry_scope_guard = toc | ||
| .snapshot_telemetry_collector(&collection_name) | ||
| .running_snapshots | ||
| .measure_scope(); |
There was a problem hiding this comment.
When we hit this, it might still take some time for the snapshot to actually be created due to locks deeper down. But I think it's fine to still count it from here on.
* Add atomic counter for running snapshots * Refactor collection telemetry + export prometheus metric * Handle collection in ToC + current recovery measurements * Fix openapi tests * Fix tests * Add snapshot metrics for streaming and partial snapshots. * Fix typo in metrics name * Adjust metrics names --------- Co-authored-by: timvisee <tim@visee.me>
Depends on #7482
Adds the following new snapshot metrics:
Design decision
There are already a few shard level information about shard snapshots in the telemetry API.
However they are only enabled in certain scenarios (if a snapshot manifest is provided) and only for streaming snapshots.
We had to manually add new metrics anyways (like
snapshot_creations_total) in order to expose all metrics discussed, so I decided to implement these consistently uncoupled from the existing telemetry data. This also ensures that these changes don't introduce any unexpected behavior or mis-measurements.As always, I'm open for any suggestions on these decisions and happy about any feedback!