When using `path.data` over multiple physical disks, one would expect the system to recover automatically when a disk is removed. Currently, search and indexing requests that hit the missing shards throw exceptions, and no allocation/recovery occurs. The only way to bring the data back online is to restart the node, or to reinsert the original disk with its existing data.
It would be great if Elasticsearch could:
- Automatically recover when disks are removed
- Automatically make use of a newly returned empty disk
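A rough sketch of the kind of per-data-path liveness check that could drive this, assuming a hypothetical `on_path_lost` hook (nothing like this exists in Elasticsearch today; it only illustrates the requested behavior):

```python
import os

def check_data_paths(paths, on_path_lost):
    """Poll each configured path.data entry and report any whose mount
    point has disappeared. In a real node, the hypothetical on_path_lost
    hook would fail the shards on that path so the master could
    reallocate and recover them elsewhere."""
    lost = []
    for path in paths:
        if not os.path.isdir(path):  # mount pulled -> directory is gone
            lost.append(path)
            on_path_lost(path)
    return lost

# Example: one healthy path, one pulled disk
lost = check_data_paths(["/tmp", "/Volumes/KINGSTON"],
                        lambda p: print("lost data path:", p))
print(lost)
```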
Steps to Test / Reproduce:
- Set up `path.data` over 2 disks, and start 2 Elasticsearch nodes locally:

  path.data: ["/Volumes/KINGSTON", "/Volumes/SDCARD"]
- Index some data over 5 shards.
index shard prirep state docs store ip node
test1003 4 r STARTED 2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 2 10.1kb 127.0.0.1 Vindicator
test1003 3 r STARTED 6 24.4kb 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 6 24.5kb 127.0.0.1 Vindicator
test1003 1 r STARTED 10 40.6kb 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 10 45.5kb 127.0.0.1 Vindicator
test1003 2 r STARTED 2 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 2 10.1kb 127.0.0.1 Vindicator
test1003 0 r STARTED 3 10.1kb 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 3 10.1kb 127.0.0.1 Vindicator
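For reference, the setup and indexing steps above can be reproduced with something like the following (index name, document bodies, and doc count are illustrative, and the commands assume a cluster listening on localhost:9200):

```shell
# Create the test index with 5 primaries and 1 replica
curl -XPUT 'localhost:9200/test1003' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'

# Index a few documents
for i in $(seq 1 23); do
  curl -XPOST 'localhost:9200/test1003/doc' -d "{\"field\": \"value $i\"}"
done

# Check shard allocation
curl 'localhost:9200/_cat/shards?v'
```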
- Remove the disk that contains most/all of the data
Exceptions start to appear in the logs:
2016-05-11 11:50:18,961][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
at org.elasticsearch.index.store.Store.stats(Store.java:321)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
... 18 more
[2016-05-11 11:50:26,796][WARN ][monitor.fs ] [Vindicator] Failed to fetch fs stats - returning empty instance
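The failure mode is easy to reproduce outside Elasticsearch: the store-stats refresh walks the shard's index directory (via `Directory.listAll()`), and once the mount is gone the directory listing itself throws. A minimal Python analogue, with an illustrative path:

```python
import os
import tempfile

def estimate_store_size(index_dir):
    """Sum file sizes under a shard index directory, roughly what
    Store$StoreStatsCache.refresh does via Directory.listAll()."""
    total = 0
    for entry in os.scandir(index_dir):  # raises FileNotFoundError once the disk is pulled
        if entry.is_file():
            total += entry.stat().st_size
    return total

# Simulate the pulled disk with a path that no longer exists
missing = os.path.join(tempfile.gettempdir(), "KINGSTON-gone", "indices", "0", "index")
try:
    estimate_store_size(missing)
except FileNotFoundError as exc:
    print("failed to refresh store stats:", exc.filename)
```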
but `_cat/shards` shows everything is OK:
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
- Post a `_refresh`: no change.
- Index some data; the request fails:
{
"error": {
"root_cause": [
{
"type": "index_failed_engine_exception",
"reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
"index_uuid": "01ABN7pTQDCoTa80WMdAvg",
"shard": "1",
"index": "test1003"
}
],
"type": "index_failed_engine_exception",
"reason": "Index failed for [test1003#AVSghrSCuf6DFWq498vy]",
"index_uuid": "01ABN7pTQDCoTa80WMdAvg",
"shard": "1",
"index": "test1003",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_a.cfs\") [slice=_a_Lucene50_0.tim]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
},
"status": 500
}
The logs show the same exception:
[2016-05-11 11:52:26,911][DEBUG][action.admin.indices.stats] [Vindicator] [indices:monitor/stats] failed to execute operation for shard [[[test1003/01ABN7pTQDCoTa80WMdAvg]][0], node[AMr_NWrVSFCuNV-YCOfsVg], [P], s[STARTED], a[id=IMwYwgWrTLCZYa08WJRNvg]]
ElasticsearchException[failed to refresh store stats]; nested: NoSuchFileException[/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index];
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1411)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1396)
at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
at org.elasticsearch.index.store.Store.stats(Store.java:321)
at org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:632)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:137)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:166)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:414)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:393)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:380)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:65)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:468)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:407)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:135)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1417)
at org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1409)
... 18 more
`_cat/shards` still shows all shards as STARTED:
index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator
- Wait 5 minutes, then search some data. No change: the search fails on the missing shards:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 3,
"failed": 2,
"failures": [
{
"shard": 0,
"index": "test1003",
"node": "AMr_NWrVSFCuNV-YCOfsVg",
"reason": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/0/indices/01ABN7pTQDCoTa80WMdAvg/0/index/_0.cfs\") [slice=_0.fdt]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
},
{
"shard": 1,
"index": "test1003",
"node": "wK5mnEIaT82Wz3wdTAjv6Q",
"reason": {
"type": "i_o_exception",
"reason": "Input/output error: NIOFSIndexInput(path=\"/Volumes/KINGSTON/elasticsearch/nodes/1/indices/01ABN7pTQDCoTa80WMdAvg/1/index/_2.cfs\") [slice=_2.fdt]",
"caused_by": {
"type": "i_o_exception",
"reason": "Input/output error"
}
}
}
]
},
"hits": {
"total": 23,
"max_score": 1,
"hits": []
}
}
And `_cat/shards` still reports everything as STARTED:

index shard prirep state docs store ip node
test1003 4 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 4 p STARTED 127.0.0.1 Vindicator
test1003 3 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 3 p STARTED 127.0.0.1 Vindicator
test1003 1 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 1 p STARTED 127.0.0.1 Vindicator
test1003 2 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 2 p STARTED 127.0.0.1 Vindicator
test1003 0 r STARTED 127.0.0.1 Jacqueline Falsworth
test1003 0 p STARTED 127.0.0.1 Vindicator