Skip to content

Ensure cancellation reason visible when task is canceled (#142079)#144629

Merged
michalborek merged 16 commits intoelastic:mainfrom
michalborek:fix-142079
Mar 23, 2026
Merged

Ensure cancellation reason visible when task is canceled (#142079)#144629
michalborek merged 16 commits intoelastic:mainfrom
michalborek:fix-142079

Conversation

@michalborek
Copy link
Copy Markdown
Contributor

@michalborek michalborek commented Mar 20, 2026

Although, the cancellation reason is set in a synchronized block the access to the reason variable is not synchronized so there operation is not atomic.

By removing an additional field (isCancelled) and replacing its usage by comparing the reason to null we can achieve the same behavior without a risk of reason field being null when the task was canceled.

To efficiently compare and set the reason, java.lang.invoke.VarHandle was used.

IMHO this is a genuine bug caught by a test that is currently muted. This PR fixes #142079 issue.

@cla-checker-service
Copy link
Copy Markdown

cla-checker-service bot commented Mar 20, 2026

💚 CLA has been signed

)

Although, the cancellation reason is set in a synchronised block the access
to the variable is not synchronized so there is no atomicity ensured there.
We could synchronize the getters but that would be slow but since reason is always read
after the isCancelled, this operation is safe and doesn't bring additional synchronization cost.
@michalborek michalborek added the Team:Distributed Meta label for distributed team. label Mar 20, 2026
@michalborek michalborek requested a review from szybia March 20, 2026 08:47
@michalborek michalborek marked this pull request as ready for review March 20, 2026 08:53
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:Distributed Meta label for distributed team. labels Mar 20, 2026
@michalborek michalborek added Team:Distributed Meta label for distributed team. and removed >bug labels Mar 20, 2026
@elasticsearchmachine elasticsearchmachine removed the Team:Distributed Meta label for distributed team. label Mar 20, 2026
@michalborek michalborek added the :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. label Mar 20, 2026
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team. and removed needs:triage Requires assignment of a team area label labels Mar 20, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, todays order of assignments is deliberate, according to the javadoc for getReasonCancelled:

May also be null if the task was just cancelled since we don't set the reason and the cancellation flag atomically.

Instead, the expectation is that callers don't rely on the reason being set when the cancelled flag is, instead using something like ensureNotCancelled or notifyIfCancelled which should avoid the race.

Can you instead fix up the caller that's trying to access the reason in a racy way?

@michalborek michalborek marked this pull request as draft March 20, 2026 10:30
The isCancelled field was not necessary to determine whether the task was cancelled.
Equally, we can check whether reason field is null. This change will be slightly less readable
but allows us to remove the synchronized block and simplify the cancel() method.
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you adjust CancellableTasksTests to demonstrate the bug that this fixes (i.e. that getReasonCancelled cannot return null after isCancelled returns true)?

@DaveCTurner DaveCTurner dismissed their stale review March 20, 2026 12:54

All addressed

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @michalborek, I've created a changelog YAML for you.

@michalborek michalborek marked this pull request as ready for review March 20, 2026 14:32
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @michalborek !

@michalborek michalborek removed the request for review from szybia March 20, 2026 16:00
Copy link
Copy Markdown
Contributor

@szybia szybia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work! TIL more about VarHandle and Thread.onSpinWait :)

Comment on lines +633 to +640
while (task.isCancelled() == false) {
Thread.onSpinWait();
}
assertThat(
"toString should consistently render status and reason",
task.toString(),
anyOf(endsWith("reason='null', isCancelled=false}"), endsWith("reason='test-reason', isCancelled=true}"))
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am i right in saying that after the while loop we can only get reason='test-reason', isCancelled=true}?

and within the while loop we can get both

Suggested change
while (task.isCancelled() == false) {
Thread.onSpinWait();
}
assertThat(
"toString should consistently render status and reason",
task.toString(),
anyOf(endsWith("reason='null', isCancelled=false}"), endsWith("reason='test-reason', isCancelled=true}"))
);
while (task.isCancelled() == false) {
assertThat(
"toString should consistently render status and reason",
task.toString(),
anyOf(endsWith("reason='null', isCancelled=false}"), endsWith("reason='test-reason', isCancelled=true}"))
);
}
assertThat(
"toString should consistently render status and reason",
task.toString(),
endsWith("reason='test-reason', isCancelled=true}")
);

basically same comment as David: #144629 (comment)

i guess not critical but IIUC stronger assertion to have

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, the intention was slightly different.

"Until the task isn't cancelled, I don't care. When it gets cancelled, it should be consistent."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah sorry I missed this. We do care while the task is (recently) not cancelled that we do not see an inconsistent result, and once we know the task is definitely cancelled we must not see isCancelled=false again. So yes I'd rather we adjusted it as Szymon suggests.

Copy link
Copy Markdown
Contributor Author

@michalborek michalborek Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I prematurely hit Merge, I've prepared a PR for this: #144716 :)

@michalborek michalborek merged commit 5783103 into elastic:main Mar 23, 2026
36 checks passed
michalborek added a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
)

Although the cancellation reason is set in a synchronized block, access to the reason variable is not synchronized; therefore, the operation is not atomic.

By removing the additional isCancelled field and replacing its usage with a null-check on reason, we achieve the same behavior without the risk of the reason field being null when a task is canceled. Additionally, we can get rid of all the synchronization blocks in this class.

To efficiently compare and set the reason, java.lang.invoke.VarHandle was used. 

IMHO, this is a genuine bug caught by a test that is currently muted. This PR fixes elastic#142079, where a test was randomly failing.
@michalborek michalborek deleted the fix-142079 branch March 24, 2026 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed Meta label for distributed team. v9.4.0

Projects

None yet

4 participants