Skip to content

Conversation

@Lyndon-Li
Copy link
Contributor

Deliver a message for each maintenance task which could be displayed by maintenance info command

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch from aef645b to e6876a6 Compare September 28, 2025 07:30
@Lyndon-Li Lyndon-Li requested a review from jkowalski October 11, 2025 11:12
@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch 5 times, most recently from 5eec623 to 968d893 Compare October 13, 2025 05:25
@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch from 968d893 to e38ff52 Compare October 13, 2025 05:52
@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch from 9d2e527 to 6c0fc67 Compare October 14, 2025 06:12
@Lyndon-Li Lyndon-Li requested a review from jkowalski October 14, 2025 06:19
@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch 4 times, most recently from 3e54516 to a058028 Compare October 15, 2025 05:38
Copy link
Collaborator

@julio-lopez julio-lopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has become too large to review.
@Lyndon-Li please split it into multiple, smaller PRs. A good strategy would be a PR per affected maintenance phases / stat struct.

Please include the reason for making this change in the PR description with an answer to the question: "why is this change needed?"

The naming for a the stat structs and their fields needs to be refined. See inline comments with a subset of the requested changes, we will iterate in the individual PRs to ensure that the names make sense and correspond to the stat being reported. Note, that while we are making an effort upfront to choose good names for these structs, it does not imply that any form of backwards or forward compatibility will be maintained for these fields (whether implicitly or explicitly).

The function comments need to be simplified as well.

Please address the requested changes, and run run the corresponding workflows either locally or in your repository to reduce the number of CI job failures in the changes submitted to the pull requests.

Overall, this is going in the right direction, and the changes should prove useful in the context of maintenance runs and in diagnosing the reported repository corruption cases.

Thank you.

Comment on lines 100 to 101
UnusedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)),
UnusedSize: unusedSize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the other two fields are the stats for the preserved or retained logs, then these should be the stats for the "deleted" (or discarded / reclaimed) logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn't use the word "deleted" is that there is a dryRun option, when it is true, nothing will be deleted.
So to better cover the two cases, I just picked "unused".

Let me know your thoughts on this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for pointing that out.
For now, let's go with "deleted". Not sure there is a word that would work for both 'to-be deleted' and 'was actually deleted'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julio-lopez
Could you share your concerns on "unusedxxx"? I think it right works for both 'to-be deleted' and 'was actually deleted', doesn't it?

Copy link
Collaborator

@julio-lopez julio-lopez Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does "unusedxxx" reflect what the values are about?

Please split the PR into a PR per struct and its corresponding use case (that is, the maintenance phase where it is used). Then we can see what each of the individual fields mean.

At the moment, this PR is too large to review as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me consider this and continue this discussion in the split PR.

Comment on lines 98 to 101
PreservedCount: uint32(deletePosition),
PreservedSize: preservedSize,
UnusedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)),
UnusedSize: unusedSize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the other two fields are the stats for the preserved or retained logs, then these should be the stats for the "deleted" (or discarded / reclaimed) logs.

Suggested change
PreservedCount: uint32(deletePosition),
PreservedSize: preservedSize,
UnusedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)),
UnusedSize: unusedSize,
RetainedCount: uint32(deletePosition),
RetainedSize: preservedSize,
DeletedCount: uint32(max(len(allLogBlobs)-deletePosition-1, 0)),
DeletedSize: unusedSize,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason with #4848 (comment) for deleted vs. unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for RetainedCount and RetainedSize

"Keeping logs",
logparam.Int("count", deletePosition),
logparam.Int64("bytes", totalSize))
var unusedSize int64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var unusedSize int64
var deletedSize int64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason with #4848 (comment) for deleted vs. unused.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements improved maintenance observability by adding structured statistics reporting for maintenance tasks. The changes enable maintenance tasks to return detailed stats that can be displayed by the maintenance info command, providing better visibility into maintenance operations.

Key changes:

  • Introduction of a new maintenancestats package with structured statistics types for various maintenance tasks
  • Modification of maintenance functions to return statistics objects alongside error values
  • Integration of stats collection into the maintenance scheduling and reporting infrastructure

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
repo/maintenancestats/types.go New package defining structured statistics types for all maintenance tasks
snapshot/snapshotgc/gc.go Updates snapshot GC to return structured statistics instead of logging individual metrics
repo/maintenance/maintenance_schedule.go Adds stats collection to the maintenance run reporting infrastructure
repo/maintenance/maintenance_run.go Updates all maintenance task runners to handle and return statistics
cli/command_maintenance_info.go Enhances maintenance info display to show task statistics summaries
Multiple test and implementation files Updates function signatures and call sites to handle new statistics return values

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@Lyndon-Li
Copy link
Contributor Author

This PR has become too large to review. @Lyndon-Li please split it into multiple, smaller PRs. A good strategy would be a PR per affected maintenance phases / stat struct.

Please include the reason for making this change in the PR description with an answer to the question: "why is this change needed?"

The naming for a the stat structs and their fields needs to be refined. See inline comments with a subset of the requested changes, we will iterate in the individual PRs to ensure that the names make sense and correspond to the stat being reported. Note, that while we are making an effort upfront to choose good names for these structs, it does not imply that any form of backwards or forward compatibility will be maintained for these fields (whether implicitly or explicitly).

The function comments need to be simplified as well.

Please address the requested changes, and run run the corresponding workflows either locally or in your repository to reduce the number of CI job failures in the changes submitted to the pull requests.

Overall, this is going in the right direction, and the changes should prove useful in the context of maintenance runs and in diagnosing the reported repository corruption cases.

Thank you.

Sure, will need some time to complete these and will let you know.

@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch 2 times, most recently from b3d00ca to 63c182a Compare October 17, 2025 07:36
@Lyndon-Li Lyndon-Li force-pushed the increase-maintenance-observability-2 branch from 63c182a to f0e7206 Compare October 17, 2025 07:44
@codecov
Copy link

codecov bot commented Oct 18, 2025

Codecov Report

❌ Patch coverage is 72.56461% with 138 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.75%. Comparing base (cb455c6) to head (f0e7206).
⚠️ Report is 687 commits behind head on master.

Files with missing lines Patch % Lines
repo/maintenancestats/types.go 61.88% 64 Missing and 13 partials ⚠️
internal/epoch/epoch_manager.go 84.21% 11 Missing and 1 partial ⚠️
repo/maintenance/content_index_to_pack_check.go 10.00% 9 Missing ⚠️
repo/maintenance/blob_gc.go 76.00% 6 Missing ⚠️
repo/maintenance/content_rewrite.go 81.48% 4 Missing and 1 partial ⚠️
snapshot/snapshotgc/gc.go 86.84% 5 Missing ⚠️
cli/command_maintenance_info.go 55.55% 3 Missing and 1 partial ⚠️
repo/content/content_manager_indexes.go 42.85% 3 Missing and 1 partial ⚠️
repo/content/indexblob/index_blob_manager_v0.go 50.00% 4 Missing ⚠️
repo/maintenance/blob_retain.go 75.00% 4 Missing ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4848      +/-   ##
==========================================
+ Coverage   75.86%   76.75%   +0.88%     
==========================================
  Files         470      536      +66     
  Lines       37301    41249    +3948     
==========================================
+ Hits        28299    31659    +3360     
- Misses       7071     7527     +456     
- Partials     1931     2063     +132     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Lyndon-Li
Copy link
Contributor Author

One split PR #4890 is created.
I will keep the current PR open until all the opening comments are resolved.

Copy link
Collaborator

@julio-lopez julio-lopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See additional comment about renaming the field to Extra in the maintenance schedule and making it a slice instead of a pointer.

And then renaming Raw to Data in the new 'Extra' struct.

Please break it down into a PR per data struct so we can look at the use cases for each struct and ensure the names are appropriate. Thanks.

@julio-lopez
Copy link
Collaborator

And please add unit tests that (un)marshal the new structs, to ensure they are what is expected.

@Lyndon-Li
Copy link
Contributor Author

Lyndon-Li commented Oct 21, 2025

See additional comment about renaming the field to Extra in the maintenance schedule and making it a slice instead of a pointer.

@julio-lopez Could you explain why we want to use a slice? I think there is always one piece of Extra for each run, so isn't a pointer enough?

@julio-lopez
Copy link
Collaborator

See additional comment about renaming the field to Extra in the maintenance schedule and making it a slice instead of a pointer.

@julio-lopez Could you explain why we want to use a slice? I think there is always one piece of Extra for each run, so isn't a pointer enough?

Yes, for now there will be at most one entry per task run. Please make it a slice nevertheless. Thanks

@Lyndon-Li
Copy link
Contributor Author

I am closing this PR as all the changes has been included in the splited PRs.

@Lyndon-Li Lyndon-Li closed this Nov 11, 2025
@julio-lopez
Copy link
Collaborator

Thanks for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants