-
Notifications
You must be signed in to change notification settings - Fork 594
feat(general): Increase maintenance observability #4848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(general): Increase maintenance observability #4848
Conversation
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
aef645b to
e6876a6
Compare
5eec623 to
968d893
Compare
968d893 to
e38ff52
Compare
9d2e527 to
6c0fc67
Compare
3e54516 to
a058028
Compare
julio-lopez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has become too large to review.
@Lyndon-Li please split it into multiple, smaller PRs. A good strategy would be a PR per affected maintenance phases / stat struct.
Please include the reason for making this change in the PR description with an answer to the question: "why is this change needed?"
The naming for a the stat structs and their fields needs to be refined. See inline comments with a subset of the requested changes, we will iterate in the individual PRs to ensure that the names make sense and correspond to the stat being reported. Note, that while we are making an effort upfront to choose good names for these structs, it does not imply that any form of backwards or forward compatibility will be maintained for these fields (whether implicitly or explicitly).
The function comments need to be simplified as well.
Please address the requested changes, and run run the corresponding workflows either locally or in your repository to reduce the number of CI job failures in the changes submitted to the pull requests.
Overall, this is going in the right direction, and the changes should prove useful in the context of maintenance runs and in diagnosing the reported repository corruption cases.
Thank you.
repo/maintenance/cleanup_logs.go
Outdated
| UnusedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)), | ||
| UnusedSize: unusedSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the other two fields are the stats for the preserved or retained logs, then these should be the stats for the "deleted" (or discarded / reclaimed) logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn't use the word "deleted" is that there is a dryRun option, when it is true, nothing will be deleted.
So to better cover the two cases, I just picked "unused".
Let me know your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for pointing that out.
For now, let's go with "deleted". Not sure there is a word that would work for both 'to-be deleted' and 'was actually deleted'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julio-lopez
Could you share your concerns on "unusedxxx"? I think it right works for both 'to-be deleted' and 'was actually deleted', doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does "unusedxxx" reflect what the values are about?
Please split the PR into a PR per struct and its corresponding use case (that is, the maintenance phase where it is used). Then we can see what each of the individual fields mean.
At the moment, this PR is too large to review as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me consider this and continue this discussion in the split PR.
repo/maintenance/cleanup_logs.go
Outdated
| PreservedCount: uint32(deletePosition), | ||
| PreservedSize: preservedSize, | ||
| UnusedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)), | ||
| UnusedSize: unusedSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the other two fields are the stats for the preserved or retained logs, then these should be the stats for the "deleted" (or discarded / reclaimed) logs.
| PreservedCount: uint32(deletePosition), | |
| PreservedSize: preservedSize, | |
| UnusedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)), | |
| UnusedSize: unusedSize, | |
| RetainedCount: uint32(deletePosition), | |
| RetainedSize: preservedSize, | |
| DeletedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)), | |
| DeletedSize: unusedSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason with #4848 (comment) for deleted vs. unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done for RetainedCount and RetainedSize
| "Keeping logs", | ||
| logparam.Int("count", deletePosition), | ||
| logparam.Int64("bytes", totalSize)) | ||
| var unusedSize int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| var unusedSize int64 | |
| var deletedSize int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason with #4848 (comment) for deleted vs. unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements improved maintenance observability by adding structured statistics reporting for maintenance tasks. The changes enable maintenance tasks to return detailed stats that can be displayed by the maintenance info command, providing better visibility into maintenance operations.
Key changes:
- Introduction of a new
maintenancestatspackage with structured statistics types for various maintenance tasks - Modification of maintenance functions to return statistics objects alongside error values
- Integration of stats collection into the maintenance scheduling and reporting infrastructure
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| repo/maintenancestats/types.go | New package defining structured statistics types for all maintenance tasks |
| snapshot/snapshotgc/gc.go | Updates snapshot GC to return structured statistics instead of logging individual metrics |
| repo/maintenance/maintenance_schedule.go | Adds stats collection to the maintenance run reporting infrastructure |
| repo/maintenance/maintenance_run.go | Updates all maintenance task runners to handle and return statistics |
| cli/command_maintenance_info.go | Enhances maintenance info display to show task statistics summaries |
| Multiple test and implementation files | Updates function signatures and call sites to handle new statistics return values |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Sure, will need some time to complete these and will let you know. |
b3d00ca to
63c182a
Compare
63c182a to
f0e7206
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4848 +/- ##
==========================================
+ Coverage 75.86% 76.75% +0.88%
==========================================
Files 470 536 +66
Lines 37301 41249 +3948
==========================================
+ Hits 28299 31659 +3360
- Misses 7071 7527 +456
- Partials 1931 2063 +132 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
One split PR #4890 is created. |
julio-lopez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See additional comment about renaming the field to Extra in the maintenance schedule and making it a slice instead of a pointer.
And then renaming Raw to Data in the new 'Extra' struct.
Please break it down into a PR per data struct so we can look at the use cases for each struct and ensure the names are appropriate. Thanks.
|
And please add unit tests that (un)marshal the new structs, to ensure they are what is expected. |
@julio-lopez Could you explain why we want to use a slice? I think there is always one piece of |
Yes, for now there will be at most one entry per task run. Please make it a slice nevertheless. Thanks |
|
I am closing this PR as all the changes has been included in the splited PRs. |
|
Thanks for your contribution! |
Deliver a message for each maintenance task which could be displayed by maintenance info command