Elasticsearch version:
# elasticsearch --version
Version: 2.2.0, Build: 8ff36d1/2016-01-27T13:32:39Z, JVM: 1.8.0_72-internal
JVM version:
# java -version
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)
OS version: Debian Jessie on kernel 4.1.3.
Description of the problem including expected versus actual behavior:
Docs say that the primary and replica must carry the same sync_id in order to get instant recovery and avoid costly file copying. I restart 1 node out of 8 and see that most indices recover on the remaining nodes, even though the restarted node rejoined. Week-old indices recover too. To be fair, it did not work for me on 1.7.3 either. #6069 is closed, therefore I'm filing this issue.
Steps to reproduce:
- Restart one node.
- Wait until node rejoins.
Expected: immediate recovery for old indices, translog recovery for active indices. Nice and easy.
Actual: almost all indices (if not all) recover fully; active indices re-copy all data files (terabytes of them). Ingestion suffers from backpressure, people notice delayed indexing and say mean things about you. Sadness and disappointment.
It seems that some indices do not have a sync_id at all. I checked an old index that was recovering: the sync_id field was missing from the commit user_data, and it only appeared after a manual synced flush.
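For reference, here is how I pull the per-copy commit data shown below. This is a sketch assuming the shard-level index stats endpoint (which in 2.x includes the commit user_data); the host and index names are the ones from this report:

```shell
# List the sync_id (or null) of every shard copy of an index,
# using the shard-level stats endpoint of ES 2.x.
curl -s 'http://myhost/myindex-2016.02.29/_stats?level=shards' \
  | jq '.indices[].shards[][].commit.user_data.sync_id'
```

A copy that was never synced-flushed prints `null` here, which is exactly what the "Before" dump below shows.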
Before:
[
{
"routing": {
"state": "STARTED",
"primary": true,
"node": "hOM4Or2fTG-Do4ZkR9jIRQ",
"relocating_node": null
},
"commit": {
"id": "MJ19KOilFLcuYnGni1rE+A==",
"generation": 81,
"user_data": {
"translog_uuid": "OUU730pTTSOGk-07aJEMJw",
"translog_generation": "80"
},
"num_docs": 62732221
},
"shard_path": {
"state_path": "/disk/data6/es/main/main/nodes/0",
"data_path": "/disk/data6/es/main/main/nodes/0",
"is_custom_data_path": false
}
},
{
"routing": {
"state": "STARTED",
"primary": false,
"node": "yNaQ5IGARhGtu5FN8AvGUQ",
"relocating_node": null
},
"commit": {
"id": "4F6/8APNSb40wCqr89bs5g==",
"generation": 82,
"user_data": {
"translog_uuid": "GE4r0UDHTda2aLd-PwJ9Bg",
"translog_generation": "80"
},
"num_docs": 62732221
},
"shard_path": {
"state_path": "/disk/data5/es/main/main/nodes/0",
"data_path": "/disk/data5/es/main/main/nodes/0",
"is_custom_data_path": false
}
}
]
Then I do a manual synced flush:
# curl -X POST -s http://myhost/myindex-2016.02.29/_flush/synced | jq .
{
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"www-nginx-error-2016.03.01": {
"total": 2,
"successful": 2,
"failed": 0
}
}
After:
[
{
"routing": {
"state": "STARTED",
"primary": true,
"node": "hOM4Or2fTG-Do4ZkR9jIRQ",
"relocating_node": null
},
"commit": {
"id": "MJ19KOilFLcuYnGni3ExRw==",
"generation": 82,
"user_data": {
"translog_uuid": "OUU730pTTSOGk-07aJEMJw",
"sync_id": "AVNXkFKDdHUamVU5aCvy",
"translog_generation": "80"
},
"num_docs": 62732221
},
"shard_path": {
"state_path": "/disk/data6/es/main/main/nodes/0",
"data_path": "/disk/data6/es/main/main/nodes/0",
"is_custom_data_path": false
}
},
{
"routing": {
"state": "STARTED",
"primary": false,
"node": "yNaQ5IGARhGtu5FN8AvGUQ",
"relocating_node": null
},
"commit": {
"id": "4F6/8APNSb40wCqr8+yqgQ==",
"generation": 83,
"user_data": {
"translog_uuid": "GE4r0UDHTda2aLd-PwJ9Bg",
"sync_id": "AVNXkFKDdHUamVU5aCvy",
"translog_generation": "80"
},
"num_docs": 62732221
},
"shard_path": {
"state_path": "/disk/data5/es/main/main/nodes/0",
"data_path": "/disk/data5/es/main/main/nodes/0",
"is_custom_data_path": false
}
}
]
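To verify the "After" state quickly, the pasted shard array can be checked with jq: every copy should report one and the same sync_id. A sketch, where `shards.json` is a hypothetical file holding the array above:

```shell
# True when all shard copies in the pasted array share a single sync_id,
# i.e. when a restart should allow instant (file-skipping) recovery.
jq '[.[].commit.user_data.sync_id] | unique | length == 1' shards.json
```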