Skip to content

Check latest repoData before applying workaround for missing shardGen file#112979

Merged
elasticsearchmachine merged 3 commits intoelastic:mainfrom
ywangd:es-112811-fix
Sep 18, 2024
Merged

Check latest repoData before applying workaround for missing shardGen file#112979
elasticsearchmachine merged 3 commits intoelastic:mainfrom
ywangd:es-112811-fix

Conversation

@ywangd
Copy link
Copy Markdown
Member

@ywangd ywangd commented Sep 17, 2024

It is expected that the old master may attempt to read a shardGen file that is deleted by the new master. This PR checks the latest repo data before applying the workaround (or throwing AssertionError) for missing shardGen files.

Relates: #112337
Resolves: #112811

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR suppresses the assertion.

Resolves: elastic#112811
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.16.0 labels Sep 17, 2024
@ywangd ywangd requested a review from DaveCTurner September 17, 2024 02:56
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team. v9.0.0 labels Sep 17, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm a master failover is an "expected" situation, we can't have a by-default tripping assertion here.

Could we enhance the check to verify that the root blob is unchanged before running the workaround?

@ywangd ywangd changed the title [Test] Suppress missing shardGen assertion in master failover test Check latest repoData before applying workaround for missing shardGen file Sep 18, 2024
@ywangd ywangd added >non-issue and removed >test Issues or PRs that are addressing/adding tests labels Sep 18, 2024
Comment on lines +3896 to +3902
final long latestGeneration = latestIndexBlobId();
final RepositoryData currentRepositoryData = getRepositoryData(latestGeneration);
if (currentRepositoryData.shardGenerations().hasShardGen(new RepositoryShardId(indexId, shardId)) == false) {
// Master has concurrently mutated the shard generation. This can happen when master fails over
// which is "expected". We do not need the following workaround in this case.
throw noSuchFileException;
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to simply compare the last known repo generation against the latest generation by listing root blobs. But it seems last known repo generation is not easily available especially when this runs in the context of shard snapshotting. Since this is a rare case, it feels ok to just read the repo data directly and check whether the shard gen still exists. It's slower but we are on a slow path already more or less. Hopefully that works for you.

That said, BlobStoreRepositoy class has many method related to fetching and caching repository data. Maybe one (some) of them can be reliably leveraged. Please let me know. Thank you!

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Sep 18, 2024

@DaveCTurner Thanks for the feedback! I have updated the PR along the line of your suggestion. It's good for another look.

@ywangd ywangd requested a review from DaveCTurner September 18, 2024 07:06
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, thanks Yang. Couple of small comments.

Comment on lines +3898 to +3899
if (currentRepositoryData.shardGenerations().hasShardGen(new RepositoryShardId(indexId, shardId)) == false) {
// Master has concurrently mutated the shard generation. This can happen when master fails over
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're only checking whether the shard has been totally removed here? If the shard still exists then hasShardGen() should return true but getShardGen() will return a different generation from the one we were trying to load.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I used the wrong method. Got confused by its name. It's now obvious that it does not take a shardGen parameter 🤦
Pushed fe1cf86
Thanks for spotting it.

Comment on lines +3896 to +3897
final long latestGeneration = latestIndexBlobId();
final RepositoryData currentRepositoryData = getRepositoryData(latestGeneration);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines can also fail with an exception - if they do, I'd rather we suppressed the exception they throw and went back to throwing the original noSuchFileException.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point as well. See fe1cf86

@ywangd ywangd requested a review from DaveCTurner September 18, 2024 08:34
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywangd ywangd added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge labels Sep 18, 2024
@elasticsearchmachine elasticsearchmachine merged commit 99b5ed8 into elastic:main Sep 18, 2024
@ywangd ywangd deleted the es-112811-fix branch September 18, 2024 09:32
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 112979

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Sep 19, 2024

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 19, 2024
… file (elastic#112979)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: elastic#112337 Resolves: elastic#112811
(cherry picked from commit 99b5ed8)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2024
… file (#112979) (#113155)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: #112337 Resolves: #112811
(cherry picked from commit 99b5ed8)

# Conflicts:
#	muted-tests.yml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport pending :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. v8.16.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] ConcurrentSnapshotsIT testMasterFailoverOnFinalizationLoop failing

3 participants