Add workaround for missing shard gen blob #112337

Merged
elasticsearchmachine merged 20 commits into elastic:main from DaveCTurner:2024/08/29/workaround-missing-shard-gen
Sep 9, 2024

Conversation

@DaveCTurner
Member

It is currently very painful to recover from a bug which incorrectly
removes a shard-level `index-UUID` blob from the repository. This commit
introduces a fallback mechanism that attempts to reconstruct the missing
data from other blobs in the repository. Needing this mechanism is
certainly still a bug, but in many cases it will allow the repository to
keep working without any need for manual surgery on its contents.
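The fallback works by listing the shard container's blobs and picking out the per-snapshot `snap-<UUID>.dat` files, as the review comments below discuss. Here is a minimal standalone sketch of just the blob-name check; the class name and the sample blob names are hypothetical, and the constants stand in for Elasticsearch's `BlobStoreRepository.SNAPSHOT_PREFIX` and `UUIDs.RANDOM_BASED_UUID_STRING_LENGTH`.

```java
import java.util.List;

// Hypothetical sketch: recognising shard-level snapshot blobs by name.
// A shard snapshot blob is named "snap-" + <22-char random-based UUID> + ".dat".
public class ShardSnapshotBlobNames {
    static final String PREFIX = "snap-";  // BlobStoreRepository.SNAPSHOT_PREFIX in Elasticsearch
    static final String SUFFIX = ".dat";
    static final int UUID_LENGTH = 22;     // UUIDs.RANDOM_BASED_UUID_STRING_LENGTH in Elasticsearch

    // True iff the blob name has exactly the snap-<UUID>.dat shape.
    public static boolean isShardSnapshotBlob(String name) {
        return name.startsWith(PREFIX)
            && name.endsWith(SUFFIX)
            && name.length() == PREFIX.length() + UUID_LENGTH + SUFFIX.length();
    }

    // Extracts the snapshot UUID embedded in the blob name.
    public static String snapshotUuid(String name) {
        return name.substring(PREFIX.length(), PREFIX.length() + UUID_LENGTH);
    }

    public static void main(String[] args) {
        // Sample listing of a shard container: one snapshot blob plus other blobs.
        List<String> blobs = List.of("snap-AbCdEfGhIjKlMnOpQrStUv.dat", "index-0", "__some-data-blob");
        for (String blob : blobs) {
            if (isShardSnapshotBlob(blob)) {
                System.out.println(snapshotUuid(blob)); // prints AbCdEfGhIjKlMnOpQrStUv
            }
        }
    }
}
```

In the real repository code each matching blob is then read with `INDEX_SHARD_SNAPSHOT_FORMAT.read(...)` to rebuild the list of shard snapshots, as the diff excerpts further down show.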
@DaveCTurner DaveCTurner added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs team-discuss Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. v8.16.0 labels Aug 29, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Aug 29, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Contributor

@henningandersen left a comment

Looks good. I wonder if we can add more testing though - also to demonstrate that it does not lead to further corruption.

for (final var shardSnapshotBlobName : shardSnapshotBlobs.keySet()) {
if (shardSnapshotBlobName.startsWith("snap-")
&& shardSnapshotBlobName.endsWith(".dat")
&& shardSnapshotBlobName.length() == "snap-".length() + 22 + ".dat".length()) {
Contributor

Can we make 22 a named constant?

Member Author

sure, see #112353

Member Author

Done in f661720.

return new ElasticsearchException("create-snapshot failed as expected");
}));
});
} else {
Contributor

I wonder if validating the following is relevant:

  1. That we can restore without the missing file?
  2. That taking a new snapshot that includes changes to the shard repairs the situation (I think it would?) for new snapshots, i.e., we no longer need the fallback reading?

Member Author

Yeah actually we already have a test for the behaviour when this file is corrupt, see testSnapshotWithCorruptedShardIndexFile. So in c13297f I reverted back to having this test delete the file again, and made it try more interesting combinations of snapshot creations and restores to make sure it still works as we expect.

Member

IIUC, both cloning and deleting snapshot should also write a new shard generation file and fix the issue. Should we randomly test for them as well?

Contributor

@henningandersen Sep 9, 2024

Curious now why this works for restore - but I suppose we somehow avoid reading that file? (ahh, it is in the code comment, thanks for answering my question before I posted it)

Member Author

> both cloning and deleting snapshot should also write a new shard generation file and fix the issue

++ see b6434df

@ywangd
Member

ywangd commented Aug 30, 2024

This change is much simpler than I expected. Looking great! I have two high level questions:

  1. Does it make sense to have a dedicated API for this so that the procedure is more explicit? Since it is rare, the overhead of calling an API does not seem too bad?
  2. Is it viable to remove the broken shard from affected snapshots? It may require updating many metadata files, including the root index-N file, so it is likely much more work than the current approach. But I wonder whether removal is either safer or more flexible, e.g. what if there are also missing snap-xx files?

@DaveCTurner
Member Author

> Does it make sense to have a dedicated API for this so that the procedure is more explicit? Since it is rare, the overhead of calling an API does not seem too bad?

I considered this and decided I prefer the implicit behaviour proposed here. Although rare, this kind of bug leads to a potentially pageable situation, but there is no sense in paging someone just to execute an API.

> Is it viable to remove the broken shard from affected snapshots? [...] But I wonder whether removal is either safer or more flexible, e.g. what if there are also missing snap-xx files?

In practice this kind of bug affects no existing snapshots anyway, it only blocks creation of new snapshots of the affected shard. Existing snapshots should all be ok. But if there were a missing snap-xx file too then that shard snapshot would fail to restore (at restore time), which I think is no different from what would happen if we rewrote all the metadata to remove the shard snapshot anyway. A missing snap-xx file only blocks restores, it doesn't prevent creation of new snapshots, so the impact is much smaller.

@DaveCTurner DaveCTurner requested a review from ywangd September 8, 2024 09:57
Member

@ywangd left a comment

LGTM

Comment on lines +3851 to +3859
if (shardSnapshotBlobName.startsWith("snap-")
    && shardSnapshotBlobName.endsWith(".dat")
    && shardSnapshotBlobName.length() == "snap-".length() + UUIDs.RANDOM_BASED_UUID_STRING_LENGTH + ".dat".length()) {
    final var shardSnapshot = INDEX_SHARD_SNAPSHOT_FORMAT.read(
        metadata.name(),
        shardContainer,
        shardSnapshotBlobName.substring("snap-".length(), "snap-".length() + UUIDs.RANDOM_BASED_UUID_STRING_LENGTH),
        namedXContentRegistry
Member

Nit: we can replace "snap-" with BlobStoreRepository.SNAPSHOT_PREFIX and maybe extract "snap-".length() as a variable.

Member Author

We're really bad at this today. I opened #112653 to clean this area up more thoroughly.

Comment on lines +3834 to +3839
final var message = Strings.format(
"shard generation [%s] in [%s][%s] not found - falling back to reading all shard snapshots",
generation,
metadata.name(),
shardContainer.path()
);
Member

I think it would be helpful if we could log the index name in the cluster so that it is easier to understand which shard is affected. But unfortunately we don't have the context here. Passing it from all the call sites seems to lead to cascading changes that are not really worthwhile just for logging in edge cases.

Member Author

++ makes sense and not too hard I think, see aef8e7c.

Comment on lines +768 to +769
"--> restoring the snapshot, the repository should not have lost any shard data despite deleting index-N, "
+ "because it uses snap-*.data files and not the index-N to determine what files to restore"
Member

Nit: can we not use index-N for shard generation files? I find it better to reserve that name for the root blob. So maybe index-UUID? Old snapshots do use numeric generations, but that is not the case in this test.

Member Author

++ eb233ff

createSnapshotResponse.getSnapshotInfo().totalShards(),
createSnapshotResponse.getSnapshotInfo().successfulShards()
);
mockLog.assertAllExpectationsMatched();
Member

Should we also check that a new shard generation file is written?

Member Author

++ b6434df

Contributor

@henningandersen left a comment

LGTM.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024
@DaveCTurner DaveCTurner removed the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024
@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024
@elasticsearchmachine elasticsearchmachine merged commit 98ab7f8 into elastic:main Sep 9, 2024
@DaveCTurner DaveCTurner deleted the 2024/08/29/workaround-missing-shard-gen branch September 9, 2024 14:55
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 12, 2024
Since elastic#112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: elastic#112769
elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2024
Since #112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: #112769
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 12, 2024
Since elastic#112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: elastic#112769
(cherry picked from commit e1f7814)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2024
Since #112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: #112769
(cherry picked from commit e1f7814)

# Conflicts:
#	muted-tests.yml
davidkyle pushed a commit that referenced this pull request Sep 12, 2024
Since #112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: #112769
elasticsearchmachine pushed a commit that referenced this pull request Sep 18, 2024
… file (#112979)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: #112337 Resolves: #112811
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 19, 2024
… file (elastic#112979)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: elastic#112337 Resolves: elastic#112811
(cherry picked from commit 99b5ed8)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2024
… file (#112979) (#113155)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: #112337 Resolves: #112811
(cherry picked from commit 99b5ed8)

# Conflicts:
#	muted-tests.yml