Add workaround for missing shard gen blob #112337

Merged
elasticsearchmachine merged 20 commits into elastic:main from DaveCTurner:2024/08/29/workaround-missing-shard-gen
Sep 9, 2024

Conversation

@DaveCTurner
Member

It is currently very painful to recover from a bug which incorrectly
removes a shard-level `index-UUID` blob from the repository. This commit
introduces a fallback mechanism that attempts to reconstruct the missing
data from other blobs in the repository. Needing this mechanism is
certainly still a bug, but in many cases it will allow the repository to
keep working without any need for manual surgery on its contents.
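The fallback works by listing the shard container's blobs and picking out the per-snapshot `snap-<UUID>.dat` files, as the review comments below discuss. Here is a minimal standalone sketch of just the blob-name check; the class name and the sample blob names are hypothetical, and the constants stand in for Elasticsearch's `BlobStoreRepository.SNAPSHOT_PREFIX` and `UUIDs.RANDOM_BASED_UUID_STRING_LENGTH`.

```java
import java.util.List;

// Hypothetical sketch: recognising shard-level snapshot blobs by name.
// A shard snapshot blob is named "snap-" + <22-char random-based UUID> + ".dat".
public class ShardSnapshotBlobNames {
    static final String PREFIX = "snap-";  // BlobStoreRepository.SNAPSHOT_PREFIX in Elasticsearch
    static final String SUFFIX = ".dat";
    static final int UUID_LENGTH = 22;     // UUIDs.RANDOM_BASED_UUID_STRING_LENGTH in Elasticsearch

    // True iff the blob name has exactly the snap-<UUID>.dat shape.
    public static boolean isShardSnapshotBlob(String name) {
        return name.startsWith(PREFIX)
            && name.endsWith(SUFFIX)
            && name.length() == PREFIX.length() + UUID_LENGTH + SUFFIX.length();
    }

    // Extracts the snapshot UUID embedded in the blob name.
    public static String snapshotUuid(String name) {
        return name.substring(PREFIX.length(), PREFIX.length() + UUID_LENGTH);
    }

    public static void main(String[] args) {
        // Sample listing of a shard container: one snapshot blob plus other blobs.
        List<String> blobs = List.of("snap-AbCdEfGhIjKlMnOpQrStUv.dat", "index-0", "__some-data-blob");
        for (String blob : blobs) {
            if (isShardSnapshotBlob(blob)) {
                System.out.println(snapshotUuid(blob)); // prints AbCdEfGhIjKlMnOpQrStUv
            }
        }
    }
}
```

In the real repository code each matching blob is then read with `INDEX_SHARD_SNAPSHOT_FORMAT.read(...)` to rebuild the list of shard snapshots, as the diff excerpts further down show.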
@DaveCTurner DaveCTurner added >enhancement :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs team-discuss Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. v8.16.0 labels Aug 29, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Aug 29, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Contributor

@henningandersen left a comment

Looks good. I wonder if we can add more testing though - also to demonstrate that it does not lead to further corruption.

for (final var shardSnapshotBlobName : shardSnapshotBlobs.keySet()) {
if (shardSnapshotBlobName.startsWith("snap-")
&& shardSnapshotBlobName.endsWith(".dat")
&& shardSnapshotBlobName.length() == "snap-".length() + 22 + ".dat".length()) {
Contributor

Can we make 22 a named constant?

Member Author

sure, see #112353

Member Author

Done in f661720.

return new ElasticsearchException("create-snapshot failed as expected");
}));
});
} else {
Contributor

I wonder if validating the following is relevant:

  1. That we can restore without the missing file?
  2. That taking a new snapshot that includes changes to the shard repairs the situation (I think it would?) for new snapshots, i.e., we no longer need the fallback reading?

Member Author

Yeah actually we already have a test for the behaviour when this file is corrupt, see testSnapshotWithCorruptedShardIndexFile. So in c13297f I reverted back to having this test delete the file again, and made it try more interesting combinations of snapshot creations and restores to make sure it still works as we expect.

Member

IIUC, both cloning and deleting snapshot should also write a new shard generation file and fix the issue. Should we randomly test for them as well?

Contributor

@henningandersen Sep 9, 2024

Curious now why this works for restore - but I suppose we somehow avoid reading that file? (ahh, it is in the code comment, thanks for answering my question before I posted it)

Member Author

> both cloning and deleting snapshot should also write a new shard generation file and fix the issue

++ see b6434df

@ywangd
Member

ywangd commented Aug 30, 2024

This change is much simpler than I expected. Looking great! I have two high level questions:

  1. Does it make sense to have a dedicated API for this so that the procedure is more explicit? Since it is rare, the overhead of calling an API does not seem too bad?
  2. Is it viable to remove the broken shard from affected snapshots? It may require updating many metadata files, including the root index-N file, so it is likely much more work than the current approach. But I wonder whether removal is either safer or more flexible, e.g. what if there are also missing snap-xx files?

@DaveCTurner
Member Author

> Does it make sense to have a dedicated API for this so that the procedure is more explicit? Since it is rare, the overhead of calling an API does not seem too bad?

I considered this and decided I prefer the implicit behaviour proposed here. Although rare, this kind of bug leads to a potentially pageable situation, but there is no sense in paging someone just to execute an API.

> Is it viable to remove the broken shard from affected snapshots? [...] But I wonder whether removal is either safer or more flexible, e.g. what if there are also missing snap-xx files?

In practice this kind of bug affects no existing snapshots anyway, it only blocks creation of new snapshots of the affected shard. Existing snapshots should all be ok. But if there were a missing snap-xx file too then that shard snapshot would fail to restore (at restore time), which I think is no different from what would happen if we rewrote all the metadata to remove the shard snapshot anyway. A missing snap-xx file only blocks restores, it doesn't prevent creation of new snapshots, so the impact is much smaller.

@DaveCTurner DaveCTurner requested a review from ywangd September 8, 2024 09:57
Member

@ywangd left a comment

LGTM

Comment on lines +3851 to +3859
if (shardSnapshotBlobName.startsWith("snap-")
    && shardSnapshotBlobName.endsWith(".dat")
    && shardSnapshotBlobName.length() == "snap-".length() + UUIDs.RANDOM_BASED_UUID_STRING_LENGTH + ".dat".length()) {
    final var shardSnapshot = INDEX_SHARD_SNAPSHOT_FORMAT.read(
        metadata.name(),
        shardContainer,
        shardSnapshotBlobName.substring("snap-".length(), "snap-".length() + UUIDs.RANDOM_BASED_UUID_STRING_LENGTH),
        namedXContentRegistry
Member

Nit: we can replace "snap-" with BlobStoreRepository.SNAPSHOT_PREFIX and maybe extract "snap-".length() as a variable.

Member Author

We're really bad at this today. I opened #112653 to clean this area up more thoroughly.

Comment on lines +3834 to +3839
final var message = Strings.format(
"shard generation [%s] in [%s][%s] not found - falling back to reading all shard snapshots",
generation,
metadata.name(),
shardContainer.path()
);
Member

I think it would be helpful if we could log the index name in the cluster so that it is easier to understand which shard is affected. But unfortunately we don't have the context here. Passing it from all the call sites seems to lead to cascading changes that are not really worthwhile just for logging in edge cases.

Member Author

++ makes sense and not too hard I think, see aef8e7c.

Comment on lines +768 to +769
"--> restoring the snapshot, the repository should not have lost any shard data despite deleting index-N, "
+ "because it uses snap-*.data files and not the index-N to determine what files to restore"
Member

Nit: can we not use index-N for shard generation files? I find it better to reserve that name for the root blob. So maybe index-UUID? Old snapshots do use numeric generations, but that is not the case in this test.

Member Author

++ eb233ff

createSnapshotResponse.getSnapshotInfo().totalShards(),
createSnapshotResponse.getSnapshotInfo().successfulShards()
);
mockLog.assertAllExpectationsMatched();
Member

Should we also check that a new shard generation file is written?

Member Author

++ b6434df

Contributor

@henningandersen left a comment

LGTM.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024
@DaveCTurner DaveCTurner removed the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024
@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2024
@elasticsearchmachine elasticsearchmachine merged commit 98ab7f8 into elastic:main Sep 9, 2024
@DaveCTurner DaveCTurner deleted the 2024/08/29/workaround-missing-shard-gen branch September 9, 2024 14:55
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 12, 2024
Since elastic#112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: elastic#112769
elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2024
Since #112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: #112769
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 12, 2024
Since elastic#112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: elastic#112769
(cherry picked from commit e1f7814)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 12, 2024
Since #112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: #112769
(cherry picked from commit e1f7814)

# Conflicts:
#	muted-tests.yml
davidkyle pushed a commit that referenced this pull request Sep 12, 2024
Since #112337, missing shard gen files are automatically reconstructed
based on the existing shard snapshot files. If the list of shard
snapshot files is complete, it means the repository is effectively not
corrupted. This PR updates the test to account for this situation.

Resolves: #112769
elasticsearchmachine pushed a commit that referenced this pull request Sep 18, 2024
… file (#112979)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: #112337 Resolves: #112811
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Sep 19, 2024
… file (elastic#112979)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: elastic#112337 Resolves: elastic#112811
(cherry picked from commit 99b5ed8)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2024
… file (#112979) (#113155)

It is expected that the old master may attempt to read a shardGen file
that is deleted by the new master. This PR checks the latest repo data
before applying the workaround (or throwing AssertionError) for missing
shardGen files.

Relates: #112337 Resolves: #112811
(cherry picked from commit 99b5ed8)

# Conflicts:
#	muted-tests.yml