Elasticsearch version (bin/elasticsearch --version):
# rpm -qa |grep elasticsearch
elasticsearch-5.6.2-1.noarch
Plugins installed:
discovery-ec2
repository-s3
x-pack
JVM version (java -version):
# java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
OS version (uname -a if on a Unix-like system):
Fedora 26
Linux 4.12.14-300.fc26.x86_64 #1 SMP Wed Sep 20 16:28:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We have about twenty indexes stuck in a red state after trying to restore a snapshot taken from Elasticsearch 5.4.1 to a brand-new cluster running 5.6.2. For this issue, I will focus on one index, logstash-2017.09.20.
You can see here that the index is in a red state:
# curl -XGET 'localhost:9200/_cluster/health/logstash-2017.09.20?level=shards&pretty'
{
"cluster_name" : "redacted",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 11,
"number_of_data_nodes" : 5,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.60064585575888,
"indices" : {
"logstash-2017.09.20" : {
"status" : "red",
"number_of_shards" : 5,
"number_of_replicas" : 0,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"shards" : {
"0" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"1" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"2" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"3" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
},
"4" : {
"status" : "red",
"primary_active" : false,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1
}
}
}
}
}
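For anyone triaging a cluster in this state, the non-green shards can be pulled straight out of the shard-level health response. This is just a sketch against an abridged copy of the JSON above (only two of the five shards are included):

```python
import json

# Shard-level health for the index, abridged from the output above.
health = json.loads("""
{
  "status": "red",
  "indices": {
    "logstash-2017.09.20": {
      "status": "red",
      "shards": {
        "0": {"status": "green", "primary_active": true, "unassigned_shards": 0},
        "4": {"status": "red", "primary_active": false, "unassigned_shards": 1}
      }
    }
  }
}
""")

# Collect every shard that is not green, i.e. the ones keeping the index red.
problem_shards = [
    (index, shard_id)
    for index, idx in health["indices"].items()
    for shard_id, shard in idx["shards"].items()
    if shard["status"] != "green"
]
print(problem_shards)  # [('logstash-2017.09.20', '4')]
```

Running this against the full response points at shard 4 as the only shard holding the index red.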
You can see the snapshot itself reports that it finished with state SUCCESS:
# curl -XGET 'localhost:9200/_snapshot/my_cool_backup/snapshot_0?pretty'
{
"snapshots" : [
{
"snapshot" : "snapshot_0",
"uuid" : "e_wavyGfTD-SwXC-imkF0g",
"version_id" : 5040199,
"version" : "5.4.1",
"indices" : [
** SNIP **
],
"state" : "SUCCESS",
"start_time" : "2017-09-27T07:00:01.807Z",
"start_time_in_millis" : 1506495601807,
"end_time" : "2017-09-27T08:44:35.377Z",
"end_time_in_millis" : 1506501875377,
"duration_in_millis" : 6273570,
"failures" : [ ],
"shards" : {
"total" : 929,
"failed" : 0,
"successful" : 929
}
}
]
}
Looking at the snapshot status in detail for the example index, you can see that every shard of this index is reported to be in the DONE stage.
$ curl -XGET 'localhost:9200/_snapshot/my_cool_backup/snapshot_0/_status?pretty'
"snapshots" : [
{
"snapshot" : "snapshot_0",
"repository" : "my_cool_backup",
"uuid" : "e_wavyGfTD-SwXC-imkF0g",
"state" : "SUCCESS",
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 929,
"failed" : 0,
"total" : 929
},
"stats" : {
"number_of_files" : 2364,
"processed_files" : 2364,
"total_size_in_bytes" : 15393945691,
"processed_size_in_bytes" : 15393945691,
"start_time_in_millis" : 1506495618226,
"time_in_millis" : 6252967
},
"indices" : {
"logstash-2017.09.20" : {
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 5,
"failed" : 0,
"total" : 5
},
"stats" : {
"number_of_files" : 31,
"processed_files" : 31,
"total_size_in_bytes" : 168664,
"processed_size_in_bytes" : 168664,
"start_time_in_millis" : 1506495678150,
"time_in_millis" : 2401656
},
"shards" : {
"0" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 7,
"processed_files" : 7,
"total_size_in_bytes" : 118135,
"processed_size_in_bytes" : 118135,
"start_time_in_millis" : 1506495720316,
"time_in_millis" : 1949
}
},
"1" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 16,
"processed_files" : 16,
"total_size_in_bytes" : 33918,
"processed_size_in_bytes" : 33918,
"start_time_in_millis" : 1506495722992,
"time_in_millis" : 2804
}
},
"2" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 0,
"processed_files" : 0,
"total_size_in_bytes" : 0,
"processed_size_in_bytes" : 0,
"start_time_in_millis" : 1506498067865,
"time_in_millis" : 11941
}
},
"3" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 4,
"processed_files" : 4,
"total_size_in_bytes" : 8434,
"processed_size_in_bytes" : 8434,
"start_time_in_millis" : 1506495678150,
"time_in_millis" : 1206
}
},
"4" : {
"stage" : "DONE",
"stats" : {
"number_of_files" : 4,
"processed_files" : 4,
"total_size_in_bytes" : 8177,
"processed_size_in_bytes" : 8177,
"start_time_in_millis" : 1506495684287,
"time_in_millis" : 1164
}
}
}
}
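As a sanity check, the per-shard stats above add up exactly to the index-level totals, so the inconsistency is not a bookkeeping error in the stats themselves:

```python
# Per-shard restore stats for logstash-2017.09.20, copied from the
# _status output above: (number_of_files, total_size_in_bytes).
shard_stats = {
    "0": (7, 118135),
    "1": (16, 33918),
    "2": (0, 0),
    "3": (4, 8434),
    "4": (4, 8177),
}

total_files = sum(f for f, _ in shard_stats.values())
total_bytes = sum(b for _, b in shard_stats.values())

# The per-shard numbers sum to the index-level totals (31 files,
# 168664 bytes), so the status API really does believe every shard,
# including the unassigned shard 4, finished its restore.
print(total_files, total_bytes)  # 31 168664
```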
Looking at /_cat/recovery, it also says the recovery is done:
# curl -XGET localhost:9200/_cat/recovery|grep logstash-2017.09.20
logstash-2017.09.20 0 7.9s snapshot done n/a n/a redacted data-03 my_cool_backup snapshot_0 1 1 100.0% 109 1699 1699 100.0% 2911728303 0 0 100.0%
logstash-2017.09.20 1 14.5m snapshot done n/a n/a redacted data-04 my_cool_backup snapshot_0 136 136 100.0% 136 2842065772 2842065772 100.0% 2842065772 0 0 100.0%
logstash-2017.09.20 2 1.7s snapshot done n/a n/a redacted data-00 my_cool_backup snapshot_0 1 1 100.0% 109 1699 1699 100.0% 2889504028 0 0 100.0%
logstash-2017.09.20 3 13.9m snapshot done n/a n/a redacted data-02 my_cool_backup snapshot_0 127 127 100.0% 127 2929823683 2929823683 100.0% 2929823683 0 0 100.0%
But if you try to close the index, it says that it is still being restored:
$ curl -XPOST 'localhost:9200/logstash-2017.09.20/_close?pretty'
{
"error" : {
"root_cause" : [
{
"type" : "remote_transport_exception",
"reason" : "[master-01][redacted:9300][indices:admin/close]"
}
],
"type" : "illegal_argument_exception",
"reason" : "Cannot close indices that are being restored: [[logstash-2017.09.20/crXjrjtwTEqkK6_ITG1HVQ]]"
},
"status" : 400
}
Looking in the logs, it says that it failed to recover the index because a file already exists:
[2017-10-02T19:50:28,790][WARN ][o.e.c.a.s.ShardStateAction] [master-01] [logstash-2017.09.20][4] received shard failed for shard id [[logstash-2017.09.20][4]], allocation id [lW_4BSVGSc6phnI1vLEPWg], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-2017.09.20][4]: Recovery failed on {data-02}{Af43AKvBRf6r-PTr2s9KRg}{O1R6sKwAQK2FyYYmdFLjPA}{redacted}{redacted:9300}{aws_availability_zone=us-west-2c, ml.max_open_jobs=10, ml.enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snapshot_0/e_wavyGfTD-SwXC-imkF0g]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: FileAlreadyExistsException[/var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2017.09.20][4]: Recovery failed on {data-02}{Af43AKvBRf6r-PTr2s9KRg}{O1R6sKwAQK2FyYYmdFLjPA}{redacted}{redacted:9300}{aws_availability_zone=us-west-2c, ml.max_open_jobs=10, ml.enabled=true}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1511) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.6.2.jar:5.6.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:299) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:405) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [snapshot_0/e_wavyGfTD-SwXC-imkF0g]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:993) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1679) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:991) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
Caused by: java.nio.file.FileAlreadyExistsException: /var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:?]
at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434) ~[?:1.8.0_141]
at java.nio.file.Files.newOutputStream(Files.java:216) ~[?:1.8.0_141]
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:413) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:409) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:253) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.RateLimitedFSDirectory.createOutput(RateLimitedFSDirectory.java:40) ~[elasticsearch-5.6.2.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:73) ~[lucene-core-6.6.1.jar:6.6.1 9aa465a89b64ff2dabe7b4d50c472de32c298683 - varunthacker - 2017-08-29 21:54:39]
at org.elasticsearch.index.store.Store.createVerifyingOutput(Store.java:463) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restoreFile(BlobStoreRepository.java:1734) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1676) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:991) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1243) ~[elasticsearch-5.6.2.jar:5.6.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1507) ~[elasticsearch-5.6.2.jar:5.6.2]
... 4 more
And if you look for the file that it says already exists, it is not actually present on the data node:
# ll /var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si
ls: cannot access '/var/lib/elasticsearch/nodes/0/indices/crXjrjtwTEqkK6_ITG1HVQ/4/index/_22g.si': No such file or directory
The only way I have been able to get the cluster out of this hung state is to do a full cluster shutdown and start it back up again. From there I am able to close these red indexes and retry the restore again. When I first encountered this issue, I had ~20 indexes that failed to restore. After retrying to restore these failures with the process above, I was able to get all but seven of them restored. The remaining failures are in the same state.
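The retry workflow described above boils down to two API calls per red index: close it, then restore just that index from the snapshot. A minimal sketch of the requests involved, using the same placeholder host, repository, and snapshot names as the rest of this report (note that, as described, the close call itself fails until after a full cluster restart):

```python
import json

# Placeholders matching the rest of this report.
ES = "http://localhost:9200"
REPO = "my_cool_backup"
SNAPSHOT = "snapshot_0"

def close_url(index):
    # POST to this URL closes the index.
    return "{}/{}/_close".format(ES, index)

def restore_request(indices):
    # POST this body to the _restore endpoint to retry only the
    # indices that failed, rather than the whole snapshot.
    url = "{}/_snapshot/{}/{}/_restore".format(ES, REPO, SNAPSHOT)
    body = json.dumps({"indices": ",".join(indices)})
    return url, body

red_indices = ["logstash-2017.09.20"]
url, body = restore_request(red_indices)
print(close_url("logstash-2017.09.20"))  # http://localhost:9200/logstash-2017.09.20/_close
print(url)
print(body)
```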