Cleanup Concurrent RepositoryData Loading#48329
Cleanup Concurrent RepositoryData Loading#48329original-brownbear merged 5 commits intoelastic:masterfrom original-brownbear:48122
Conversation
The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This lead to accidentally returning an empty repository data for generations >=0 which can never not exist unless the repository is corrupted. In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario and the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N around at a time this lead to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely event but possible of running into this issue in a snapshot create or delete operation on master failover on a repository like S3 which doesn't have the "no overwrites" protection on writing a new index-N. Fixes #48122
|
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
|
I didn't yet get around to a full test run here, so if this doesn't yet go green please hold off reviewing till I can fix the tests I may have missed :) |
| public RepositoryData getRepositoryData() { | ||
| try { | ||
| return getRepositoryData(latestIndexBlobId()); | ||
| } catch (NoSuchFileException ex) { |
There was a problem hiding this comment.
There is no good reason to catch here or below. We break out on generation -1 and return empty data so whenever we fail to find data for gen. >= 0 it's a problem and we mustn't ignore it.
|
Jup, this is not entirely unsurprisingly failing some REST tests: -> WIP EDIT: #46250 fixed the tests here sort of by accident by changing the order of shard and root level meta updates :) -> should be good to review now |
| repositoryData = RepositoryData.snapshotsFromXContent(parser, indexGen); | ||
| } | ||
| return repositoryData; | ||
| } catch (NoSuchFileException ex) { |
There was a problem hiding this comment.
nit: while we are at it, there are unneeded Long.toString() and local variable repositoryData usages
There was a problem hiding this comment.
I can't find it right now, but others disagreed with removing the Long.toString in another PR :( But yea let's clean up the redundant locals :)
|
Thanks Tanguy! |
The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This lead to accidentally returning an empty repository data for generations >=0 which can never not exist unless the repository is corrupted. In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario and the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N around at a time this lead to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely event but possible of running into this issue in a snapshot create or delete operation on master failover on a repository like S3 which doesn't have the "no overwrites" protection on writing a new index-N. Fixes #48122
The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This lead to accidentally returning an empty repository data for generations >=0 which can never not exist unless the repository is corrupted. In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario and the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N around at a time this lead to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely event but possible of running into this issue in a snapshot create or delete operation on master failover on a repository like S3 which doesn't have the "no overwrites" protection on writing a new index-N. Fixes #48122
The loading of
RepositoryDatais not an atomic operation.It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.
I would suggest back porting this all the way since it can theoretically have
repository corrupting effects on S3.
Fixes #48122