When a shard relocation occurs, in-flight indexing operations wait to acquire a primary permit until the relocation completes. Upon successful relocation, these operations receive a RetryOnPrimaryException, which marks the request as retried. This forces a version lookup during indexing on the new primary, even for operations that could have used the auto-generated ID append-only optimization.
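The interaction between the retry flag and the append-only optimization can be modeled roughly as follows. This is a minimal sketch under stated assumptions: `IndexOp`, `AppendOnlyCheck`, `canSkipVersionLookup`, and `onRetryOnPrimary` are hypothetical names, and the single boolean condition is a simplification rather than Elasticsearch's actual engine logic.

```java
// Hypothetical model of the behavior described above; class, field, and
// method names are illustrative and are NOT Elasticsearch's actual API.
final class IndexOp {
    final boolean autoGeneratedId; // ID was generated server-side, not supplied by the client
    boolean retry;                 // set once the request has been marked as retried

    IndexOp(boolean autoGeneratedId) {
        this.autoGeneratedId = autoGeneratedId;
    }
}

final class AppendOnlyCheck {
    // Simplified rule: an auto-generated ID that was never retried cannot
    // already exist in the index, so the version lookup can be skipped and
    // the document appended directly.
    static boolean canSkipVersionLookup(IndexOp op) {
        return op.autoGeneratedId && !op.retry;
    }

    // Current behavior: any RetryOnPrimaryException marks the request as
    // retried, which disables the append-only path on the new primary.
    static void onRetryOnPrimary(IndexOp op) {
        op.retry = true;
    }
}
```

In this model, an operation with an auto-generated ID starts out eligible for the append-only path and loses that eligibility as soon as it is rerouted after relocation, regardless of whether it ever ran on the old primary.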
For shards with high indexing throughput, relocations can take a while to complete. During this time, many operations queue up, all of which are marked as retried. When they finally execute on the new primary, each must perform a version lookup (requiring at least a terms dictionary lookup), most likely against a cold cache. This can overwhelm the new primary shard.
We should detect this situation and avoid marking these requests as retried: because they were still waiting for a primary permit when the relocation completed, we can safely assume the documents were not indexed on the old primary.
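One possible shape for the proposed check is sketched below. It assumes a hypothetical per-operation flag (`executedOnPrimary`) recording whether the operation actually ran on a primary before being rerouted; `PendingOp`, `RelocationRetry`, and `rerouteAfterRelocation` are illustrative names only, not existing Elasticsearch code, and how such a flag would be tracked in practice is left open.

```java
// Hypothetical sketch of the proposal; all names are illustrative and are
// NOT part of Elasticsearch's actual API.
final class PendingOp {
    final boolean autoGeneratedId; // ID was generated server-side
    boolean retry;                 // forces a version lookup when true
    boolean executedOnPrimary;     // assumed flag: true once the op ran on some primary

    PendingOp(boolean autoGeneratedId) {
        this.autoGeneratedId = autoGeneratedId;
    }
}

final class RelocationRetry {
    static boolean canSkipVersionLookup(PendingOp op) {
        return op.autoGeneratedId && !op.retry;
    }

    // Called when an operation must be rerouted to the new primary after a relocation.
    static void rerouteAfterRelocation(PendingOp op) {
        if (op.executedOnPrimary) {
            // The document may already have been indexed on the old primary,
            // so keep the safe path and force a version lookup.
            op.retry = true;
        }
        // Otherwise the op was only waiting for a permit and never indexed
        // anything, so it keeps the append-only optimization.
    }
}
```

The design choice here is to key the decision on whether the operation executed, rather than on the exception type alone, so that operations that merely queued behind the relocation keep the append-only fast path while genuinely retried ones stay safe.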