-
Notifications
You must be signed in to change notification settings - Fork 19
Add various object store cache Prometheus metrics #519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| self.inner.abort_multipart(location, multipart_id).await | ||
| } | ||
|
|
||
| async fn get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(get is implemented in terms of get_opts nowadays, so just doing get_opts is sufficient)
| ) -> BoxStream<'_, object_store::Result<ObjectMeta>> { | ||
| self.inner.list(prefix) | ||
| let start = Instant::now(); | ||
| let result = self.inner.list(prefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. the p50 latency here is 3.31 microseconds, surely it's not connecting to Minio this quickly (although who knows)
src/object_store/cache.rs
Outdated
| "Re-downloading cache value for {key:?}: {err}" | ||
| ); | ||
|
|
||
| self.metrics.redownload_errors.increment(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"redownload_errors" doesn't seem appropriate here, more like "cache_file_missing", "file_read_error" or "eviction_race".
| warn!("Invalidating cache entry for {key:?}; failed writing to a file: {err}"); | ||
| cache.invalidate(&key).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also use a counter for these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By these I mean counter for cache file write errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, good idea, I missed it somehow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
213dbae to
5445458
Compare
20264b4 to
aa8d012
Compare
[skip ci]
get and put now just call get/put_opts (so does get_range, but we actually override it) and the S3 object store only implements _opts, so delete the basic get/put methods to simplify things.
Use Prometheus conventions for metrics and merge some of them to make them easier to manage, also add basic latencies.
aa8d012 to
28dd821
Compare
Add a bunch of various timing / sizing metrics to the object store cache:
Limitations:
This is based on top of #518 which has some DataFusion MemoryManager metrics
Sample metrics scrape from a TPC-H/DS run: