Cancellation of shard relocation does not work in 2.2.0 #17019

@bobrik

Elasticsearch version:

# elasticsearch --version
Version: 2.2.0, Build: 8ff36d1/2016-01-27T13:32:39Z, JVM: 1.8.0_72-internal

JVM version:

# java -version
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)

OS version: Debian Jessie on kernel 4.1.3.

Description of the problem including expected versus actual behavior:

The docs say that the primary and replica must have the same sync ID in order to achieve immediate recovery and avoid costly relocation. I restart 1 node out of 8 and see that most indices recover on the remaining nodes, even though the restarted node has rejoined. Week-old indices recover too. To be fair, it did not work for me on 1.7.3 either. #6069 is closed, therefore I'm filing this issue.

Steps to reproduce:

  1. Restart one node.
  2. Wait until node rejoins.

Expected: immediate recovery for old indices, translog recovery for active indices. Nice and easy.

Actual: almost all indices (if not all) recover, and active indices recover all data files (terabytes of them). Ingestion suffers from backpressure, people notice delayed indexing and say mean things about you. Sadness and disappointment.

It seems that some indices do not have sync_id. I tried checking sync IDs for old indices that were recovering and the field appeared.

Before:

[
  {
    "routing": {
      "state": "STARTED",
      "primary": true,
      "node": "hOM4Or2fTG-Do4ZkR9jIRQ",
      "relocating_node": null
    },
    "commit": {
      "id": "MJ19KOilFLcuYnGni1rE+A==",
      "generation": 81,
      "user_data": {
        "translog_uuid": "OUU730pTTSOGk-07aJEMJw",
        "translog_generation": "80"
      },
      "num_docs": 62732221
    },
    "shard_path": {
      "state_path": "/disk/data6/es/main/main/nodes/0",
      "data_path": "/disk/data6/es/main/main/nodes/0",
      "is_custom_data_path": false
    }
  },
  {
    "routing": {
      "state": "STARTED",
      "primary": false,
      "node": "yNaQ5IGARhGtu5FN8AvGUQ",
      "relocating_node": null
    },
    "commit": {
      "id": "4F6/8APNSb40wCqr89bs5g==",
      "generation": 82,
      "user_data": {
        "translog_uuid": "GE4r0UDHTda2aLd-PwJ9Bg",
        "translog_generation": "80"
      },
      "num_docs": 62732221
    },
    "shard_path": {
      "state_path": "/disk/data5/es/main/main/nodes/0",
      "data_path": "/disk/data5/es/main/main/nodes/0",
      "is_custom_data_path": false
    }
  }
]

Then I do a manual synced flush:

# curl -X POST -s http://myhost/myindex-2016.02.29/_flush/synced | jq .
{
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "www-nginx-error-2016.03.01": {
    "total": 2,
    "successful": 2,
    "failed": 0
  }
}
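
When flushing many indices at once, reading each response by eye gets tedious. A minimal Python sketch for spotting indices where the synced flush did not succeed on every shard copy, assuming the response has the shape shown above (a top-level `_shards` summary plus one entry per index). The helper name `failed_synced_flushes` is hypothetical and not part of any Elasticsearch client:

```python
def failed_synced_flushes(response):
    """Return {index_name: failed_count} for indices with failed shard copies.

    `response` is the parsed JSON body of a synced flush call, shaped like
    the output above. The "_shards" key is the overall summary, so it is
    skipped; every other key is treated as an index name.
    """
    return {
        name: stats["failed"]
        for name, stats in response.items()
        if name != "_shards" and stats.get("failed", 0) > 0
    }


# The response from this report: nothing failed, so the result is empty.
response = {
    "_shards": {"total": 2, "successful": 2, "failed": 0},
    "www-nginx-error-2016.03.01": {"total": 2, "successful": 2, "failed": 0},
}
print(failed_synced_flushes(response))  # {}
```

An empty result means every shard copy reported success, which matches the output above.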

After:

[
  {
    "routing": {
      "state": "STARTED",
      "primary": true,
      "node": "hOM4Or2fTG-Do4ZkR9jIRQ",
      "relocating_node": null
    },
    "commit": {
      "id": "MJ19KOilFLcuYnGni3ExRw==",
      "generation": 82,
      "user_data": {
        "translog_uuid": "OUU730pTTSOGk-07aJEMJw",
        "sync_id": "AVNXkFKDdHUamVU5aCvy",
        "translog_generation": "80"
      },
      "num_docs": 62732221
    },
    "shard_path": {
      "state_path": "/disk/data6/es/main/main/nodes/0",
      "data_path": "/disk/data6/es/main/main/nodes/0",
      "is_custom_data_path": false
    }
  },
  {
    "routing": {
      "state": "STARTED",
      "primary": false,
      "node": "yNaQ5IGARhGtu5FN8AvGUQ",
      "relocating_node": null
    },
    "commit": {
      "id": "4F6/8APNSb40wCqr8+yqgQ==",
      "generation": 83,
      "user_data": {
        "translog_uuid": "GE4r0UDHTda2aLd-PwJ9Bg",
        "sync_id": "AVNXkFKDdHUamVU5aCvy",
        "translog_generation": "80"
      },
      "num_docs": 62732221
    },
    "shard_path": {
      "state_path": "/disk/data5/es/main/main/nodes/0",
      "data_path": "/disk/data5/es/main/main/nodes/0",
      "is_custom_data_path": false
    }
  }
]
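
The before/after comparison above can be automated. A minimal Python sketch, assuming the shard copies are loaded as a list of dicts shaped like the output above. The helper name `sync_id_status` and its return labels are hypothetical, chosen for illustration. It only checks whether every copy carries the same `sync_id` in its commit `user_data`, which per the docs mentioned earlier is what immediate recovery relies on:

```python
def sync_id_status(copies):
    """Classify a shard's copies by their commit sync_id.

    Returns "missing" if any copy lacks a sync_id (or there are no copies),
    "mismatch" if the copies carry different sync_ids, and "in_sync" if
    all copies share a single sync_id.
    """
    ids = [
        copy.get("commit", {}).get("user_data", {}).get("sync_id")
        for copy in copies
    ]
    if not ids or any(i is None for i in ids):
        return "missing"
    if len(set(ids)) > 1:
        return "mismatch"
    return "in_sync"


# Trimmed versions of the "Before" and "After" output from this report.
before = [
    {"routing": {"primary": True},
     "commit": {"user_data": {"translog_generation": "80"}}},
    {"routing": {"primary": False},
     "commit": {"user_data": {"translog_generation": "80"}}},
]
after = [
    {"routing": {"primary": True},
     "commit": {"user_data": {"sync_id": "AVNXkFKDdHUamVU5aCvy",
                              "translog_generation": "80"}}},
    {"routing": {"primary": False},
     "commit": {"user_data": {"sync_id": "AVNXkFKDdHUamVU5aCvy",
                              "translog_generation": "80"}}},
]

print(sync_id_status(before))  # missing
print(sync_id_status(after))   # in_sync
```

Running a check like this across all old indices before a restart would show which ones are still missing a sync ID.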
