Reapply "Add an option to return early from an allocate call" #135589
Merged
elasticsearchmachine merged 3 commits intoelastic:mainfrom Sep 29, 2025
Merged
Conversation
Collaborator
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
ywangd
commented
Sep 29, 2025
Comment on lines
+703
to
+713
| if (routingNodes.getRelocatingShardCount() > 0) { | ||
| // ES-12955: Check routingNodes.getRelocatingShardCount() > 0 in case the first relocation is a THROTTLE. | ||
| // This should rarely happen since in most cases, we don't throttle unless there is an existing relocation. | ||
| // But it can happen in production for frozen indices when the cache is still being prepared. It can also | ||
| // happen in tests because we have decider like RandomAllocationDecider that can randomly return THROTTLE | ||
| // when there is no existing relocation. | ||
| shardBalanced = true; | ||
| } | ||
| if (completeEarlyOnShardAssignmentChange && shardBalanced) { | ||
| return true; | ||
| } |
Member
Author
There was a problem hiding this comment.
This is the fix. See also 70b99de for its commit.
There were two issues:
- The main one is the we should only flag
shardBalancedtotruewhen there is shard movement on the cluster instead of just on the model node. - The second issue is that in production, we can actually have balance being throttled without any prior shard movement due to HasFrozenCacheAllocationDecider. I updated the comment to reflect it.
Member
Author
|
Test failure not related, but reproducible on |
nicktindall
reviewed
Sep 29, 2025
Contributor
nicktindall
left a comment
There was a problem hiding this comment.
LGTM, tricky one to catch
nicktindall
approved these changes
Sep 29, 2025
szybia
added a commit
to szybia/elasticsearch
that referenced
this pull request
Sep 29, 2025
* upstream/main: (22 commits) Fix InternalCategorizationAggregationTests.testReduceRandom (elastic#135533) [DOCS] GeoIP processor: add clarification about using a reverse proxy endpoint (elastic#135534) Move `ProjectRoutingInfo` and related classes (elastic#135586) Refactor IndexAbstractionResolver (elastic#135587) Simplify returnLocalAll handling in ES|QL (elastic#135353) Reapply "Add an option to return early from an allocate call" (elastic#135589) Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#134407 Mute org.elasticsearch.xpack.esql.heap_attack.HeapAttackIT testAggTooManyMvLongs elastic#135585 Mute org.elasticsearch.multiproject.test.XpackWithMultipleProjectsClientYamlTestSuiteIT test {yaml=esql/60_usage/Basic ESQL usage output (telemetry) snapshot version} elastic#135579 Mute org.elasticsearch.search.ccs.KnnVectorQueryBuilderCrossClusterSearchIT testKnnQueryWithCcsMinimizeRoundTripsFalse elastic#135573 Mute org.elasticsearch.xpack.esql.inference.textembedding.TextEmbeddingOperatorTests testSimpleCircuitBreaking elastic#135569 Add telemetry for `TS` command (elastic#135471) Mute org.elasticsearch.cluster.routing.allocation.decider.RestoreInProgressAllocationDeciderTests testCanAllocatePrimaryExistingInRestoreInProgress elastic#135566 allocation: clarify RestoreInProgressAllocationDecider failure message (elastic#132307) [ES|QL] Register AggregateMetricDoubleLiteral (elastic#135054) Validate Logstash pipeline ID when creating. (elastic#135378) Migrate transport versions 8841050 through 8841041 (elastic#135555) Mute org.elasticsearch.search.ccs.SparseVectorQueryBuilderCrossClusterSearchIT testSparseVectorQueryWithCcsMinimizeRoundTripsFalse elastic#135559 Mute org.elasticsearch.action.admin.cluster.stats.SearchUsageStatsTests testToXContent elastic#135558 Testing indices query cache memory stats (elastic#135298) ...
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR reapply #134786 with a bug fix which previously caused many test failures.
Resolves: #135406
Resolves: #135407
Resolves: #135150
Resolves: #135151
Resolves: #135194
Resolves: #135248
Resolves: #135249
Resolves: #135408
Resolves: #135473
Resolves: #135474
The following is the original commit message
Instead of running simulation all the way to balance or no possible movement, this PR adds an option to finish early based on the number of relocating shards. Note this early return mechanism is enabled only when desired balance is in use, i.e. it affects simulation only.
Resolves: ES-12862