
Add thread pool for write coordination #129450

Merged
Tim-Brooks merged 19 commits into elastic:main from Tim-Brooks:write_coordination_pool on Jun 18, 2025

Conversation

@Tim-Brooks (Contributor)

This change adds a thread pool for write coordination to ensure that
bulk coordination does not get stuck on an overloaded primary node.

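The motivation can be sketched with plain `java.util.concurrent` executors (the pool sizes and names here are illustrative, not the actual Elasticsearch `ThreadPool` internals): coordination work runs on its own small pool, so a saturated write pool cannot delay the routing of bulk requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WriteCoordinationSketch {
    // Hypothetical pools; Elasticsearch manages its pools via ThreadPool,
    // plain executors are used here only to illustrate the separation.
    static String coordinateUnderLoad() throws Exception {
        ExecutorService writePool = Executors.newFixedThreadPool(2);
        ExecutorService coordinationPool = Executors.newFixedThreadPool(1);
        try {
            // Saturate the write pool with long-running indexing work.
            for (int i = 0; i < 4; i++) {
                writePool.submit(() -> {
                    try { Thread.sleep(5_000); } catch (InterruptedException ignored) { }
                });
            }
            // Coordination still completes promptly on its dedicated pool.
            Future<String> routed = coordinationPool.submit(() -> "routed bulk request");
            return routed.get(1, TimeUnit.SECONDS);
        } finally {
            writePool.shutdownNow();
            coordinationPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(coordinateUnderLoad());
    }
}
```

With a single shared pool, the `routed.get(1, TimeUnit.SECONDS)` call above would time out, since the coordination task would be queued behind the sleeping write tasks.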
@Tim-Brooks Tim-Brooks requested a review from a team as a code owner June 14, 2025 21:16
@Tim-Brooks added the >non-issue, :Distributed/CRUD (a catch-all label for issues around indexing, updating and getting a doc by id; not search), and v9.1.0 labels on Jun 14, 2025
@Tim-Brooks (Contributor, Author) commented Jun 14, 2025

WIP / opened for CI.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. label Jun 14, 2025
@henningandersen (Contributor) left a comment

Thanks, this makes sense to me. I left a comment, but otherwise I think we can move forward with tests etc.

},
-    executor
+    // Use the appropriate write executor for actual ingest processing
+    isOnlySystem ? systemWriteExecutor : writeExecutor
@henningandersen (Contributor)

I am ok with this, but it seems like for any case where we have ingest processing, we would then still have the coordination happen behind any local write work. At least the PR then avoids some of the wait roundtrips.

We could also decide to have both a system-write-coordination pool and a write-coordination pool and use those here? We can look at this in follow-ups ofc.

@Tim-Brooks (Contributor, Author)

> We could also decide to have both a system-write-coordination pool and a write-coordination pool and use those here? We can look at this in follow-ups ofc.

Yes, I would like to move ingest work to a non-WRITE thread pool in a follow-up, as I think there might be a few things to discuss.
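The executor choice in the diff above can be sketched as follows (the surrounding class and method names are hypothetical; only the ternary mirrors the PR): bulks touching only system indices are processed on the system write pool, everything else on the regular write pool.

```java
import java.util.concurrent.Executor;

class BulkExecutorSelection {
    final Executor writeExecutor;
    final Executor systemWriteExecutor;

    BulkExecutorSelection(Executor writeExecutor, Executor systemWriteExecutor) {
        this.writeExecutor = writeExecutor;
        this.systemWriteExecutor = systemWriteExecutor;
    }

    // Use the appropriate write executor for actual ingest processing:
    // bulks touching only system indices go to the system write pool.
    Executor executorFor(boolean isOnlySystem) {
        return isOnlySystem ? systemWriteExecutor : writeExecutor;
    }
}
```

The point of the comment above is that coordination work dispatched this way still queues behind local ingest processing on the write pools; dedicated coordination pools for both cases are a possible follow-up.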

@github-actions (bot) commented Jun 17, 2025

🔍 Preview links for changed docs:

🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes.

@henningandersen (Contributor) left a comment

LGTM.

This seems intuitive and simple enough that we can merge. But it seems worth keeping an eye on the nightly benchmarks for the following day or two, just to double-check.

safeAwait(startBarrier);
}

private static void fillWriteQueue(ThreadPool threadPool) {
@henningandersen (Contributor)

Should we rename to:

Suggested change:
-private static void fillWriteQueue(ThreadPool threadPool) {
+private static void fillWriteCoordinationQueue(ThreadPool threadPool) {

@@ -532,7 +532,7 @@ public void testShortCircuitShardLevelFailureWithIngestNodeHop() throws Exceptio
}

private static void blockWritePool(ThreadPool threadPool, CountDownLatch finishLatch) {
@henningandersen (Contributor)

Rename to:

Suggested change:
-private static void blockWritePool(ThreadPool threadPool, CountDownLatch finishLatch) {
+private static void blockWriteCoordinationPool(ThreadPool threadPool, CountDownLatch finishLatch) {

@Tim-Brooks Tim-Brooks merged commit 9ac6576 into elastic:main Jun 18, 2025
28 checks passed
kderusso pushed a commit to kderusso/elasticsearch that referenced this pull request Jun 23, 2025
This change adds a thread pool for write coordination to ensure that
bulk coordination does not get stuck on an overloaded primary node.
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
…c#4494)

This PR includes ingestion loads from both the `write_coordination` and `system_write_coordination` thread pools for autoscaling purposes. It also introduces a new setting that can be used to exclude the coordination thread pools in case BWC behaviour is needed. These thread pools are now always sampled for metrics purposes, regardless of whether they are included in the reports for autoscaling.

Resolves: 
Relates: elastic#129450
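The sampled-always / reported-conditionally behaviour described in that follow-up can be sketched like this (the method, map keys matching the pool names, and the boolean flag standing in for the new setting are all illustrative, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

class AutoscalingLoadSketch {
    // Hypothetical toggle mirroring the new setting: pools are always sampled,
    // but the coordination pools may be excluded from the autoscaling report
    // when BWC behaviour is needed.
    static Map<String, Double> reportedLoads(Map<String, Double> sampled, boolean includeCoordination) {
        Map<String, Double> report = new HashMap<>(sampled);
        if (!includeCoordination) {
            report.remove("write_coordination");
            report.remove("system_write_coordination");
        }
        return report;
    }
}
```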