Skip to content

Conversation

@jkotas
Copy link
Member

@jkotas jkotas commented Dec 29, 2025

QueueUserAPC2 can fail when the machine is under stress. Handle the failure gracefully.

Fixes #122763

@jkotas jkotas requested review from VSadov and Copilot December 29, 2025 06:59
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Dec 29, 2025
QueueUserAPC2 can fail when the machine is under stress. Handle the failure gracefully.

Fixes dotnet#122763
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a crash issue where QueueUserAPC2 can fail when the machine is under stress. The fix gracefully handles the failure by resetting the pending activation flag and returning the actual success status instead of always returning true.

Key Changes

  • Remove assertion that QueueUserAPC2 always succeeds, as it can legitimately fail under stress
  • Add proper error handling to reset m_hasPendingActivation flag when the APC queueing fails
  • Return the actual success status instead of always returning true

@VSadov
Copy link
Member

VSadov commented Dec 29, 2025

I think originally we believed that this API does not fail for transient reasons. - Like if we set arguments correctly, have permissions to use it, etc... then it would always work.

We generally can't "handle" a failure in an API that we use to suspend threads. We can hope that some other mechanism, like hijacking, will work.
We can also hope that the reason for API failure is transient and will not repeat (for this GC or for the next one), but if it is because of shortage of memory, the shortage may not resolve until the GC we are trying to do happens.

I guess we can ignore the failure. Not a lot of options. There is a chance the test will switch from occasional failure to occasional timeout.
One way to find out. Perhaps it is indeed because of stressful environment while running many tests at once and the condition does not happen often or does not last long.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jkotas jkotas merged commit 1018c74 into dotnet:main Dec 29, 2025
98 checks passed
@jkotas jkotas deleted the issue-122763 branch December 29, 2025 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test failure: GC/LargeMemory/Regressions/largearraytest/largearraytest.cmd

2 participants