Suppress success callback when failing master task#142042
Suppress success callback when failing master task#142042DaveCTurner merged 2 commits intoelastic:mainfrom
Conversation
If the execution of a cluster state update task throws an exception then all the tasks in the batch must be failed with that exception, even if some of them have been marked as successfully completed.
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
Hi @DaveCTurner, I've created a changelog YAML for you. |
| void onBatchFailure(Exception failure) { | ||
| // if the whole batch resulted in an exception then this overrides any task-level results whether successful or not | ||
| this.failure = Objects.requireNonNull(failure); | ||
| this.onPublicationSuccess = null; |
There was a problem hiding this comment.
Oki, so before this fix (if I understand the bug correctly), the problematic flow was:
- We have a batch with task1 and task2.
executor.execute(batchExecutionContext)runs - task1 succeeds and calls
taskContext.success(callback1)which setsonPublicationSuccesstocallback1 - task2 throws
new RuntimeException(...) executionResult.onBatchFailure(e)is called for all tasks so task1 and task2. Buttask1.onPublicationSuccessis still set tocallback1.- Because of the error, the old cluster state is returned so
executionResult.onClusterStateUnchanged(newClusterState)is then called on task1 which the incorrectly runsonPublicationSuccess.
And the practical consequence of this is that we could for example start a snapshot without tracking it in the cluster state.
There was a problem hiding this comment.
And based on the TaskContext java doc and some quick research, nothing seems designed or should rely on onPublicationSuccess running when the batch fails.
There was a problem hiding this comment.
If all of the above is correct, the fix looks good to me!
There was a problem hiding this comment.
nothing seems designed or should rely on onPublicationSuccess running when the batch fails.
Indeed tasks should positively rely on onPublicationSuccess not running when the batch fails. I guess nobody does today or else we'd have spotted this sooner, but that was what I wanted in #141998 and it didn't work as expected, uncovering this bug.
There was a problem hiding this comment.
We have a batch with task1 and task2.
It's actually a problem with just a single task. The executor can call task1.success() and then throw an exception.
| neverCalledAckListener | ||
| ); | ||
| case 6 -> taskContext.onFailure( | ||
| new RuntimeException(randomValueOtherThan(expectedExceptionMessage, ESTestCase::randomIdentifier)) |
There was a problem hiding this comment.
What is the value of randomizing this and expectedExceptionMessage error messages? Would it not be simpler/more readable to hardcode them to distinct values?
There was a problem hiding this comment.
There are well over 200 instances of the literal "simulated" in the codebase so one must burn at least a few brain cycles wondering exactly to which of them assertThat(e.getMessage(), equalTo("simulated")); refers, whereas by using a variable your IDE will show you with what we're expecting to match. And then there's no need to use a specific literal value in both places, so we tell the reader that it doesn't matter by randomizing. Similarly here we're saying to the reader that we don't care what the message is as long as it's not the expected one; I don't think ESTestCase::randomIdentifier would return expectedExceptionMessage anyway, not even with astronomically small probability, but that point would be lost to the reader if we just called it directly.
There was a problem hiding this comment.
(I could have said randomIdentifier("not-expected-message-") or something instead here I guess)
ywangd
left a comment
There was a problem hiding this comment.
LGTM
I am curious how you spot this error? Did you observe it in action or just by reading the code?
Nevermind. I now see it is extracted from #141998 |
If the execution of a cluster state update task throws an exception then all the tasks in the batch must be failed with that exception, even if some of them have been marked as successfully completed.
If the execution of a cluster state update task throws an exception then all the tasks in the batch must be failed with that exception, even if some of them have been marked as successfully completed.
If the execution of a cluster state update task throws an exception then all the tasks in the batch must be failed with that exception, even if some of them have been marked as successfully completed.
If the execution of a cluster state update task throws an exception then all the tasks in the batch must be failed with that exception, even if some of them have been marked as successfully completed.
If the execution of a cluster state update task throws an exception then
all the tasks in the batch must be failed with that exception, even if
some of them have been marked as successfully completed.