Suppress success callback when failing master task#142042

Merged
DaveCTurner merged 2 commits into elastic:main from DaveCTurner:2026/02/06/MasterService-failure-handling
Feb 9, 2026

Conversation

@DaveCTurner
Member

If the execution of a cluster state update task throws an exception then
all the tasks in the batch must be failed with that exception, even if
some of them have been marked as successfully completed.

@DaveCTurner DaveCTurner requested review from inespot and mhl-b February 6, 2026 18:38
@DaveCTurner DaveCTurner added >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. auto-backport Automatically create backport pull requests when merged branch:9.2 branch:9.1 branch:8.19 v9.4.0 branch:9.3 labels Feb 6, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

void onBatchFailure(Exception failure) {
    // if the whole batch resulted in an exception then this overrides any task-level results whether successful or not
    this.failure = Objects.requireNonNull(failure);
    this.onPublicationSuccess = null;
}
@inespot inespot Feb 7, 2026
Contributor

Oki, so before this fix (if I understand the bug correctly), the problematic flow was:

  • We have a batch with task1 and task2. executor.execute(batchExecutionContext) runs
  • task1 succeeds and calls taskContext.success(callback1) which sets onPublicationSuccess to callback1
  • task2 throws new RuntimeException(...)
  • executionResult.onBatchFailure(e) is called for all tasks, i.e. task1 and task2. But task1.onPublicationSuccess is still set to callback1.
  • Because of the error, the old cluster state is returned, so executionResult.onClusterStateUnchanged(newClusterState) is then called on task1, which incorrectly runs onPublicationSuccess.

And the practical consequence of this is that we could for example start a snapshot without tracking it in the cluster state.
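The sequence above can be sketched with a minimal, self-contained model. The class and method names below mirror the discussion but are illustrative only, not the actual MasterService internals:

```java
import java.util.Objects;

// Minimal model of a per-task execution result (illustrative, not the real class).
class TaskResult {
    Runnable onPublicationSuccess;
    Exception failure;

    // Task-level success records a callback to run after publication.
    void success(Runnable callback) {
        this.onPublicationSuccess = Objects.requireNonNull(callback);
    }

    // Batch-level failure overrides any earlier task-level success (the fix);
    // before the fix, onPublicationSuccess was left set here.
    void onBatchFailure(Exception e) {
        this.failure = Objects.requireNonNull(e);
        this.onPublicationSuccess = null;
    }

    // Called when the old cluster state is returned unchanged.
    void onClusterStateUnchanged() {
        if (onPublicationSuccess != null) {
            onPublicationSuccess.run();
        }
    }
}

public class BatchFailureDemo {
    public static void main(String[] args) {
        TaskResult task1 = new TaskResult();
        task1.success(() -> System.out.println("callback1 ran")); // task1 succeeds
        task1.onBatchFailure(new RuntimeException("simulated"));  // then the batch fails
        task1.onClusterStateUnchanged(); // with the fix, callback1 does not run
        System.out.println("suppressed=" + (task1.onPublicationSuccess == null)); // prints suppressed=true
    }
}
```

With the `this.onPublicationSuccess = null;` line removed, the same sequence would print "callback1 ran" — the buggy behaviour described above.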

Contributor

And based on the TaskContext java doc and some quick research, nothing seems designed or should rely on onPublicationSuccess running when the batch fails.

Contributor

If all of the above is correct, the fix looks good to me!

Member Author

nothing seems designed or should rely on onPublicationSuccess running when the batch fails.

Indeed tasks should positively rely on onPublicationSuccess not running when the batch fails. I guess nobody does today or else we'd have spotted this sooner, but that was what I wanted in #141998 and it didn't work as expected, uncovering this bug.

Member Author

We have a batch with task1 and task2.

It's actually a problem with just a single task. The executor can call task1.success() and then throw an exception.

neverCalledAckListener
);
case 6 -> taskContext.onFailure(
new RuntimeException(randomValueOtherThan(expectedExceptionMessage, ESTestCase::randomIdentifier))
Contributor

What is the value of randomizing this message and expectedExceptionMessage? Would it not be simpler/more readable to hardcode them to distinct values?

Member Author

There are well over 200 instances of the literal "simulated" in the codebase, so one must burn at least a few brain cycles wondering exactly which of them assertThat(e.getMessage(), equalTo("simulated")); refers to, whereas by using a variable your IDE will show you what we're expecting to match. And then there's no need to use a specific literal value in both places, so we tell the reader that it doesn't matter by randomizing.

Similarly here we're saying to the reader that we don't care what the message is as long as it's not the expected one. I don't think ESTestCase::randomIdentifier would return expectedExceptionMessage anyway, not even with astronomically small probability, but that point would be lost to the reader if we just called it directly.

Member Author

(I could have said randomIdentifier("not-expected-message-") or something instead here I guess)
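For illustration, the randomization pattern under discussion can be modelled self-containedly. The randomValueOtherThan and randomIdentifier methods below are simplified stand-ins for the ESTestCase helpers, not the real implementations:

```java
import java.util.UUID;
import java.util.function.Supplier;

public class RandomValueDemo {
    // Simplified stand-in for ESTestCase.randomValueOtherThan: resample until
    // the supplier produces a value different from `other`.
    static <T> T randomValueOtherThan(T other, Supplier<T> supplier) {
        T value;
        do {
            value = supplier.get();
        } while (value.equals(other));
        return value;
    }

    // Simplified stand-in for ESTestCase.randomIdentifier.
    static String randomIdentifier() {
        return "id-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String expectedExceptionMessage = randomIdentifier();
        // Randomizing tells the reader the exact value is irrelevant; using
        // randomValueOtherThan says that only "different from expected" matters.
        String unexpected = randomValueOtherThan(expectedExceptionMessage, RandomValueDemo::randomIdentifier);
        System.out.println(unexpected.equals(expectedExceptionMessage)); // prints false
    }
}
```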

@ywangd ywangd left a comment
Member

LGTM

I am curious how you spotted this error. Did you observe it in action or just by reading the code?

@nicktindall nicktindall left a comment
Contributor

LGTM

@ywangd
Member

ywangd commented Feb 9, 2026

I am curious how you spotted this error. Did you observe it in action or just by reading the code?

Nevermind. I now see it is extracted from #141998

@DaveCTurner DaveCTurner merged commit b769727 into elastic:main Feb 9, 2026
35 checks passed
@DaveCTurner DaveCTurner deleted the 2026/02/06/MasterService-failure-handling branch February 9, 2026 08:29
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Feb 9, 2026
@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 9.3, 9.1, 8.19, 9.2

elasticsearchmachine pushed a commit that referenced this pull request Feb 9, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 9, 2026
elasticsearchmachine pushed a commit that referenced this pull request Feb 10, 2026

Labels

auto-backport Automatically create backport pull requests when merged >bug :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team. v8.19.12 v9.2.6 v9.3.1 v9.4.0

5 participants