Suppress success callback when failing master task#142042

Merged
DaveCTurner merged 2 commits into elastic:main from DaveCTurner:2026/02/06/MasterService-failure-handling
Feb 9, 2026

Conversation

@DaveCTurner
Member

If the execution of a cluster state update task throws an exception then
all the tasks in the batch must be failed with that exception, even if
some of them have been marked as successfully completed.

@DaveCTurner DaveCTurner requested review from inespot and mhl-b February 6, 2026 18:38
@DaveCTurner DaveCTurner added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. auto-backport Automatically create backport pull requests when merged branch:9.2 branch:9.1 branch:8.19 v9.4.0 branch:9.3 labels Feb 6, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

void onBatchFailure(Exception failure) {
    // if the whole batch resulted in an exception then this overrides any task-level results whether successful or not
    this.failure = Objects.requireNonNull(failure);
    this.onPublicationSuccess = null;
}
@inespot inespot Feb 7, 2026
Contributor

Oki, so before this fix (if I understand the bug correctly), the problematic flow was:

  • We have a batch with task1 and task2. executor.execute(batchExecutionContext) runs
  • task1 succeeds and calls taskContext.success(callback1) which sets onPublicationSuccess to callback1
  • task2 throws new RuntimeException(...)
  • executionResult.onBatchFailure(e) is called for all tasks, i.e. task1 and task2. But task1.onPublicationSuccess is still set to callback1.
  • Because of the error, the old cluster state is returned, so executionResult.onClusterStateUnchanged(newClusterState) is then called on task1, which incorrectly runs onPublicationSuccess.

And the practical consequence of this is that we could for example start a snapshot without tracking it in the cluster state.
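The sequence above can be sketched with a minimal, self-contained model. The class and method names below mirror the discussion but are illustrative only, not the actual MasterService internals:

```java
import java.util.Objects;

// Minimal model of a per-task execution result (illustrative, not the real class).
class TaskResult {
    Runnable onPublicationSuccess;
    Exception failure;

    // Task-level success records a callback to run after publication.
    void success(Runnable callback) {
        this.onPublicationSuccess = Objects.requireNonNull(callback);
    }

    // Batch-level failure overrides any earlier task-level success (the fix);
    // before the fix, onPublicationSuccess was left set here.
    void onBatchFailure(Exception e) {
        this.failure = Objects.requireNonNull(e);
        this.onPublicationSuccess = null;
    }

    // Called when the old cluster state is returned unchanged.
    void onClusterStateUnchanged() {
        if (onPublicationSuccess != null) {
            onPublicationSuccess.run();
        }
    }
}

public class BatchFailureDemo {
    public static void main(String[] args) {
        TaskResult task1 = new TaskResult();
        task1.success(() -> System.out.println("callback1 ran")); // task1 succeeds
        task1.onBatchFailure(new RuntimeException("simulated"));  // then the batch fails
        task1.onClusterStateUnchanged(); // with the fix, callback1 does not run
        System.out.println("suppressed=" + (task1.onPublicationSuccess == null)); // prints suppressed=true
    }
}
```

With the `this.onPublicationSuccess = null;` line removed, the same sequence would print "callback1 ran" — the buggy behaviour described above.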

Contributor

And based on the TaskContext java doc and some quick research, nothing seems designed or should rely on onPublicationSuccess running when the batch fails.

Contributor

If all of the above is correct, the fix looks good to me!

Member Author

nothing seems designed or should rely on onPublicationSuccess running when the batch fails.

Indeed tasks should positively rely on onPublicationSuccess not running when the batch fails. I guess nobody does today or else we'd have spotted this sooner, but that was what I wanted in #141998 and it didn't work as expected, uncovering this bug.

Member Author

We have a batch with task1 and task2.

It's actually a problem with just a single task. The executor can call task1.success() and then throw an exception.

neverCalledAckListener
);
case 6 -> taskContext.onFailure(
new RuntimeException(randomValueOtherThan(expectedExceptionMessage, ESTestCase::randomIdentifier))
Contributor

What is the value of randomizing this message and expectedExceptionMessage? Would it not be simpler/more readable to hardcode them to distinct values?

Member Author

There are well over 200 instances of the literal "simulated" in the codebase, so one must burn at least a few brain cycles wondering exactly which of them assertThat(e.getMessage(), equalTo("simulated")); refers to, whereas by using a variable your IDE will show you what we're expecting to match. And then there's no need to use a specific literal value in both places, so we tell the reader that it doesn't matter by randomizing.

Similarly here we're saying to the reader that we don't care what the message is as long as it's not the expected one. I don't think ESTestCase::randomIdentifier would return expectedExceptionMessage anyway, not even with astronomically small probability, but that point would be lost to the reader if we just called it directly.

Member Author

(I could have said randomIdentifier("not-expected-message-") or something instead here I guess)
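For illustration, the randomization pattern under discussion can be modelled self-containedly. The randomValueOtherThan and randomIdentifier methods below are simplified stand-ins for the ESTestCase helpers, not the real implementations:

```java
import java.util.UUID;
import java.util.function.Supplier;

public class RandomValueDemo {
    // Simplified stand-in for ESTestCase.randomValueOtherThan: resample until
    // the supplier produces a value different from `other`.
    static <T> T randomValueOtherThan(T other, Supplier<T> supplier) {
        T value;
        do {
            value = supplier.get();
        } while (value.equals(other));
        return value;
    }

    // Simplified stand-in for ESTestCase.randomIdentifier.
    static String randomIdentifier() {
        return "id-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String expectedExceptionMessage = randomIdentifier();
        // Randomizing tells the reader the exact value is irrelevant; using
        // randomValueOtherThan says that only "different from expected" matters.
        String unexpected = randomValueOtherThan(expectedExceptionMessage, RandomValueDemo::randomIdentifier);
        System.out.println(unexpected.equals(expectedExceptionMessage)); // prints false
    }
}
```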

@ywangd ywangd left a comment
Member

LGTM

I am curious how you spotted this error. Did you observe it in action or just by reading the code?

@nicktindall nicktindall left a comment
Contributor

LGTM

@ywangd
Member

ywangd commented Feb 9, 2026

I am curious how you spotted this error. Did you observe it in action or just by reading the code?

Nevermind. I now see it is extracted from #141998

@DaveCTurner DaveCTurner merged commit b769727 into elastic:main Feb 9, 2026
35 checks passed
@DaveCTurner DaveCTurner deleted the 2026/02/06/MasterService-failure-handling branch February 9, 2026 08:29
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 9.3, 9.1, 8.19, 9.2

elasticsearchmachine pushed a commit that referenced this pull request Feb 9, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 9, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 10, 2026

Labels

auto-backport Automatically create backport pull requests when merged >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team. v8.19.12 v9.2.6 v9.3.1 v9.4.0

5 participants