Add repository metadata integrity check API#92373
Add repository metadata integrity check API#92373DaveCTurner wants to merge 13 commits intoelastic:mainfrom
Conversation
9b6c512 to
f861f21
Compare
henningandersen
left a comment
There was a problem hiding this comment.
This looks mostly good to me.
I wonder if we risk a concurrent SLM (or other) deletion of files causing false negatives? And perhaps also a concurrent snapshot creation could cause the same?
I am not entirely sure if we could perhaps disregard such explicit concurrent changes somehow? Happy to see it as a refinement too, but then we need this to be clear in documentation.
server/src/main/java/org/elasticsearch/repositories/blobstore/MetadataVerifier.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Override | ||
| protected void masterOperation(Task task, Request request, ClusterState state, ActionListener<Response> listener) throws Exception { | ||
| // TODO add mechanism to block blob deletions while this is running |
There was a problem hiding this comment.
And yes you are right that this will yield false errors if blobs are deleted out from underneath us. I need advice from @original-brownbear about whether there's a nice way to prevent that. I mean we could make it a stop-the-world thing like repo cleanup but that seems overkill. In principle we should even be able to let snapshotting continue as long as we just held back any finalisation-time deletes. I just worry that this could be a bigger change to the state machine, which IMO isn't worth making any more complex just for this.
There was a problem hiding this comment.
Yeah, I also think modifying the state machine for this seems like the wrong choice.
For deletions, I wonder if we could recheck that any missing blobs are still referenced/expected after finding that they are missing? I am not entirely sure of the order, but I think we delete the reference to the file before deleting the file?
There was a problem hiding this comment.
The number of potential anomalies is huge, and not all of them relate to missing blobs, so it's not clear that we can go back through the list and drop all the spurious ones.
I've made it so that it checks (and reports) the repo generation at the end of the analysis, so that at least you can see that the repo changed underneath us and therefore that some anomalies may be expected. I think this plus some docs about it will be sufficient for now.
There was a problem hiding this comment.
:thu: sounds like a pragmatic solution.
1c96e71 to
a42310e
Compare
a42310e to
f9024eb
Compare
|
Having now used this in anger a bit, I have concluded that reporting problems in logs and the REST response is not the right approach. The volume of information just becomes unmanageable on PiB-scale repositories. I think we should write the results to an index, and it's also super-useful to report some summary information which we could do using the same index. |
|
My WIP implementation has drifted sufficiently far from this branch that I think it best to open a new one at #93735 and close this. |
Relates #52622