-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Partial snapshots when deleted s3 files reappear in bucket #28322
Description
Elasticsearch version: 5.6.5 and 6.1.2
Plugins installed: [discovery-ec2, repository-s3]
JVM version : 1.8
OS version : Ubuntu 16.04
Description of the problem including expected versus actual behavior:
The s3 repository plugin occasionally produces partial snapshots due to throwing an exception while deleting a file from AWS s3 storage. This is a result of s3 only providing eventual consistency for read-after-delete operations. The full documentation on that is here:
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
but the key section says:
A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.
In more concrete terms, I'm using the s3 repository plugin in circumstances where the deleted snapshot files can still show up in the bucket for seconds/minutes after they've been deleted. When the s3 plugin then tries to delete one of these files for the second time, s3 then correctly shows it doesn't exist, and the s3 plugin considers that enough of an error condition to abort that particular part of the snapshot.
Here is a log of such an error (with some redactions). This log shows the problem happens to multiple indices and shards in the snapshot operation, and some of the failures involve a common file in s3.
[2018-01-21T18:01:25,809][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2017.01.17][1]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [pending-index-62] during cleanup
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [pending-index-62] does not exist
at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
... 10 more
[2018-01-21T18:02:35,691][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2017.01.24][0]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [index-62] during cleanup
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [index-62] does not exist
at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
... 10 more
[2018-01-21T18:02:36,148][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2017.01.24][1]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [index-62] during cleanup
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [index-62] does not exist
at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
... 10 more
[2018-01-21T18:02:39,338][WARN ][o.e.s.SnapshotShardsService] [q8fsCAc] [[INDEX_REDACTED-2016.12.21][0]] [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: error deleting index file [pending-index-63] during cleanup
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1149) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1409) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:976) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: Blob [pending-index-63] does not exist
at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:122) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1145) ~[elasticsearch-5.6.5.jar:5.6.5]
... 10 more
[2018-01-21T18:03:40,905][INFO ][o.e.s.SnapshotShardsService] [q8fsCAc] snapshot [my_snapshot:2018-01-21-18:00:02/bEaUAdWaQ9ip3os5ghGvfQ] is done
This log was produced with elasticsearch and repository-s3 version 5.6.5, but I have recreated the problem in 6.1.2 as well. The logic involved has not changed, so this makes sense.
When listBlobs() is called at the start of a snapshot operation,
https://github.com/elastic/elasticsearch/blob/v6.1.2/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1133
s3 can return the files which have already been deleted.
Later, during the finalize phase, it will now attempt to delete all of the blobs that were listed above. The deleteBlob() method will verify that each blob exists, but the blobs that were previously deleted will not individually show as existent. So deleteBlob() will throw an exception.
https://github.com/elastic/elasticsearch/blob/v6.1.2/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L1238
Steps to reproduce:
This depends on unpredictable s3 consistency behaviors, and so may not be easy to reproduce naturally. I will include a patch against 6.1.2 which will allow you to simulate this scenario by sleeping at the key points in the snapshot operation. This will allow you to add and delete the files from s3, to simulate its eventual consistency.
All that's needed to reproduce is to take a snapshot and have an s3 deleted file reappear in the bucket contents at the right time. It will be a partial snapshot, and the above errors will be logged.
If using the patch which will be provided:
- Take a snapshot and watch the logs
- When directed in the log message, add a file starting with
index-to the s3 path noted in the logs. For instance,index-DUMMY.txt(This simulates a file which was previously deleted by the plugin showing up again in the bucket contents.) - Keep watching the logs, and when the listBlobs() operation is complete, they will direct you to remove the file you uploaded. (This simulates that the file, though it was in the bucket contents, will fail the blobExists() check.)
- The snapshot will then fail shortly after.
Suggested fix:
Since this is unlikely to occur for the majority of s3 users, a fix which would not affect the majority of s3 users also seems appropriate. An optional configuration property which lets me filter the listBlobs() result for any blobs that I know I have already deleted occurs to me. Keeping track of that state may present a challenge, so being able to configure more lenient deleteBlob() checks (and thus accepting the risk of the s3 repo being concurrently modified by other processes) would also make sense to me.
Patch to simulate the issue:
diff --git a/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java b/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
index e3ee189..e1398d2 100644
--- a/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
+++ b/core/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
@@ -167,6 +167,8 @@ import static java.util.Collections.unmodifiableMap;
*/
public abstract class BlobStoreRepository extends AbstractLifecycleComponent implements Repository {
+ private static Map<SnapshotId, Boolean> slept = new HashMap<>();
+
private BlobContainer snapshotsBlobContainer;
protected final RepositoryMetaData metadata;
@@ -1130,10 +1132,24 @@ public abstract class BlobStoreRepository extends AbstractLifecycleComponent imp
try {
final Map<String, BlobMetaData> blobs;
try {
+ long sleepMinutes = 3;
+ if(!slept.getOrDefault(snapshotId, false)) {
+ logger.info("Sleeping for {} minutes. Add file starting with 'index-' to path {} now so listBlobs() will see it.", sleepMinutes, blobContainer.path().buildAsString());
+ Thread.sleep(sleepMinutes * 60 * 1000);
+ }
+
blobs = blobContainer.listBlobs();
+
+ if(!slept.getOrDefault(snapshotId, false)) {
+ logger.info("listBlobs() complete. Sleeping for {} minutes. Delete the file from path {} now so it won't be present when the delete is attempted.", sleepMinutes, blobContainer.path().buildAsString());
+ Thread.sleep(sleepMinutes * 60 * 1000);
+ slept.put(snapshotId, true);
+ }
} catch (IOException e) {
throw new IndexShardSnapshotFailedException(shardId, "failed to list blobs", e);
- }
+ } catch (InterruptedException e) {
+ throw new RuntimeException(e);
+ }
long generation = findLatestFileNameGeneration(blobs);
Tuple<BlobStoreIndexShardSnapshots, Integer> tuple = buildBlobStoreIndexShardSnapshots(blobs);