Skip to content

Route blocking accessibility calls through the watchdog so a hung app can't freeze NVDA#20170

Merged
seanbudd merged 4 commits into
nvaccess:masterfrom
heath-toby:fix/uia-provider-cache-perf
May 26, 2026
Merged

Route blocking accessibility calls through the watchdog so a hung app can't freeze NVDA#20170
seanbudd merged 4 commits into
nvaccess:masterfrom
heath-toby:fix/uia-provider-cache-perf

Conversation

@heath-toby

@heath-toby heath-toby commented May 18, 2026

Copy link
Copy Markdown
Contributor

Link to issue number:

Fixes #20169

Summary of the issue:

When an application stops responding, NVDA's core thread blocks on synchronous cross-process accessibility calls into it. Some of these calls — notably UiaHasServerSideProvider and the in-process Word object-model RPC — are not routed through any mechanism the watchdog can cancel, so NVDA can stay frozen until the application is killed.

Description of user facing changes:

NVDA recovers more quickly when an application stops responding; in particular, switching away from a hung application returns NVDA to responsiveness immediately.

Description of developer facing changes:

A handful of blocking cross-process accessibility calls are now run via watchdog.cancellableExecute.

Description of development approach:

Reworked from the original proposal per @jcsteh's review: rather than introducing a per-call timeout, a circuit breaker, or a new state module, this relies entirely on the existing watchdog, which already centralises hang detection with a dynamically-adjusted timeout (and crucially drops it to MIN_CORE_ALIVE_TIMEOUT when the user switches apps).

The blocking calls the watchdog cannot otherwise cancel are routed through watchdog.cancellableExecute:

  • UIAHandler._isUIAWindowHelper: UiaHasServerSideProvider.
  • NVDAObjects.UIA.kwargsFromSuper: the UIA client element lookups (ElementFromHandleBuildCache / getFocusedElementBuildCache / ElementFromPointBuildCache).
  • winword getTextWithFields: nvdaInProcUtils_winword_getTextInRange.

On cancellation each caller falls back exactly as it already does on the corresponding failure path. cancellableExecute itself is unchanged.

Testing strategy:

ruff and the unit test suite pass. Manual testing against deliberately-hung and real-world unresponsive applications: NVDA recovers, and switching away from the hung application restores responsiveness immediately.

Known issues with pull request:

  • A hung application's own content still cannot be read while it is hung; this only ensures NVDA itself stays cancellable/recoverable.
  • As with any cancellableExecute caller, a cancelled call leaves an abandoned worker thread until the underlying call eventually returns. This is pre-existing behaviour, mitigated by the watchdog not dispatching new work during recovery.
  • Caching the UiaHasServerSideProvider result (raised in review) is intentionally left out of this PR so it can be evaluated separately.

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

@heath-toby heath-toby marked this pull request as ready for review May 19, 2026 00:00
@heath-toby heath-toby requested a review from a team as a code owner May 19, 2026 00:00
@heath-toby heath-toby requested a review from seanbudd May 19, 2026 00:00
@seanbudd seanbudd marked this pull request as draft May 19, 2026 02:51
@seanbudd

Copy link
Copy Markdown
Member

Converting to draft until #20168 is resolved

@jcsteh

jcsteh commented May 21, 2026

Copy link
Copy Markdown
Contributor

Thanks for working on this. A few high level thoughts:

• watchdog.cancellableExecute / CancellableCallThread.execute gain an optional bounded timeout / ccTimeout (default None = existing behaviour, byte-unchanged).

I don't think we should be introducing a configurable timeout here. watchdog is intended to centralise the management of hang detection and dynamically adjusted timeouts.

We deliberately use a longer timeout of 10 seconds (NORMAL_CORE_ALIVE_TIMEOUT) in the first instance because an app can legitimately take some time to respond to a query if it's under heavy load. While a much shorter timeout intuitively seems like it would make things better, it can make things worse on a heavily loaded or low end system if we fire lots of queries, cancel or abandon them before they're complete and then just fire off more queries. We could perhaps discuss changing this to 5 seconds, which I think matches the Windows hung window timeout. However, I think anything less is asking for a whole new class of problems. In addition, I think allowing callers to customise this ultimately increases complexity and introduces inconsistency, and makes problem diagnosis trickier.

Importantly, note that when a user switches apps, we drop the timeout to 0.5 seconds (MIN_CORE_ALIVE_TIMEOUT). Anything that was previously waiting 10 seconds will immediately stop waiting. This means that if a user switches apps in order to interact with another app, we should immediately start responding again, assuming that watchdog supports cancellation of the blocking call. This gives us the best of both worlds: we don't prematurely cancel calls that might just be taking a while, but a user can also get response back immediately when they switch away from the offending app to do something else.

• New dependency-free coreThreadProtection module: a near-zero-cost shared "an app hung recently" signal (noteAppHang / isHungModeActive).

We already have something similar in watchdog.isAttemptingRecovery. I would suggest using that or integrating any new functionality with it.

It's great that calls like UiaHasServerSideProvider are being moved to cancellableExecute. I've long hypothesised that this could improve hangs significantly, since we unfortunately don't have another watchdog mechanism to cancel them. I'm slightly concerned that we could end up growing a whole heap of abandoned background threads, but that should be mitigated somewhat by the fact that we don't execute anything at all if watchdog goes into recovery mode. There are other mitigations we could implement too, but I think it's best to start simpler here and scale up as needed, as increased complexity may introduce new classes of bugs.

• UIAHandler: timedUIAClientCall, a UIA-client jam breaker with exponential backoff (isUIAClientJammed / noteUIAClientJam).

Again, I wonder whether we should just be relying on watchdog here. Rather than introducing new timeouts, we have a simpler rule: execute via watchdog and let watchdog handle the timeouts and cancellation.

@jcsteh

jcsteh commented May 21, 2026

Copy link
Copy Markdown
Contributor

There's an assumption here that the response from UiaHasServerSideProvider will never (or rarely) change. That should usually be the case, but I think there might be cases where it isn't quite that clear; e.g. an app might still be initialising. If we do get this wrong, it really isn't great. At the very least, I'd suggest that this should be split out into a separate PR so it can be evaluated separately and reverted if needed.

@jcsteh

jcsteh commented May 21, 2026

Copy link
Copy Markdown
Contributor

As a practical example of where a short timeout is not necessarily better, I recall when I was first developing watchdog that I did indeed have it set much shorter. However, it caused problems like the focus not being reported when starting an app because apps are understandably a bit sluggish while they're first loading. IMO, NVDA being unresponsive for a few seconds in an unresponsive app is reasonable, as long as the user can switch apps, including accessing the Windows "not responding" dialog. On the other hand, NVDA failing to report things because it didn't wait long enough is not acceptable.

@heath-toby

Copy link
Copy Markdown
Contributor Author

Thanks @jcsteh — this is really helpful, and I agree with the redirect. The watchdog already centralises this far better than the parallel machinery I had added. Reworking #20170 accordingly:

  • Dropping the per-call ccTimeout. Leaning on watchdog's dynamic timeout — especially the drop to MIN_CORE_ALIVE_TIMEOUT on app-switch, which already gives the "user can always escape" guarantee — is the right model. cancellableExecute will be left unchanged.
  • Dropping coreThreadProtection in favour of the existing watchdog.isAttemptingRecovery.
  • Dropping the UIA jam breaker / exponential backoff — same reasoning: route through cancellableExecute and let watchdog own timeouts and cancellation.
  • Keeping the part you were positive about: routing the blocking UIA client calls (UiaHasServerSideProvider, ElementFromHandleBuildCache, etc.) and the Word object-model text call through cancellableExecute so the watchdog can actually cancel them.
  • Splitting the UiaHasServerSideProvider result cache into its own PR so it can be evaluated and reverted independently, per your point about the result not always being stable while an app is still initialising.
  • In the spirit of "start simpler, scale up as needed", I will also drop the central NVDAObject._getPropertyViaCache guard from this PR — the simple "route blocking calls through watchdog" change stands on its own, and the guard can be revisited later if it proves necessary.

I will force-push the simplified branch to this PR shortly. Thanks again for the watchdog context — the app-switch timeout-drop in particular reframes the whole thing.

Reworked per review feedback on nvaccess#20170: when an application stops
responding, route the blocking cross-process calls that the watchdog
cannot otherwise cancel through watchdog.cancellableExecute, so the
watchdog's existing dynamically-adjusted timeout owns cancellation. In
particular that timeout drops to MIN_CORE_ALIVE_TIMEOUT when the user
switches apps, letting the user escape a hung application immediately.

- UIAHandler._isUIAWindowHelper: UiaHasServerSideProvider.
- NVDAObjects.UIA.kwargsFromSuper: the UIA client element lookups.
- winword getTextWithFields: nvdaInProcUtils_winword_getTextInRange.

On cancellation the caller falls back exactly as on the pre-existing
failure paths. No new timeout, breaker or module is introduced; the
watchdog owns hang detection, timeouts and cancellation.

Fixes nvaccess#20169

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@heath-toby heath-toby changed the title Keep NVDA's core responsive when an application hangs (timeout-bounded accessibility calls) Route blocking accessibility calls through the watchdog so a hung app can't freeze NVDA May 21, 2026
@heath-toby heath-toby force-pushed the fix/uia-provider-cache-perf branch from b282423 to 451fcb6 Compare May 21, 2026 10:05
@heath-toby heath-toby marked this pull request as ready for review May 21, 2026 10:06
@jcsteh

jcsteh commented May 21, 2026

Copy link
Copy Markdown
Contributor

Do we have evidence that ElementFromPointBuildCache and getFocusedElementBuildCache do not respect the UIA ConnectionRecoveryBehaviorOptions_Enabled option? UiaHasServerSideProvider can't respect this option because it is a free function, not a method on an IUIAutomation client interface, and I've definitely seen it show up in NVDA frozen stack logs. Have we definitely seen ElementFromPointBuildCache and getFocusedElementBuildCache show up in frozen stacks?

@seanbudd seanbudd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @heath-toby - changes generally look good to me

Comment thread source/NVDAObjects/UIA/__init__.py Outdated
Comment thread source/UIAHandler/__init__.py Outdated
Comment thread source/NVDAObjects/window/winword.py Outdated
Per @jcsteh: the ElementFrom*BuildCache calls are comtypes COM
client-interface methods that the watchdog can already cancel via
CoCancelCall, so routing them through cancellableExecute adds no
cancellability they lack. Only UiaHasServerSideProvider (a free
function, not a client-interface method) and the winword in-process
RPC genuinely need it. Revert the kwargsFromSuper changes; keep just
those two call sites wrapped.

Per @seanbudd: move the watchdog and exceptions imports to module
level.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@heath-toby

Copy link
Copy Markdown
Contributor Author

Good question — and checking my own frozen-stack logs from this work, you're right to push back.

Across the debug logs I gathered while reproducing these hangs, the number of times each call appeared in a frozen-stack dump:

Call Frozen-stack appearances
UiaHasServerSideProvider 37
ElementFromHandleBuildCache 12
getFocusedElementBuildCache 1
ElementFromPointBuildCache 0

But the raw counts aren't really the deciding factor. The more important point is the one you raised: the ElementFrom*BuildCache calls are comtypes COM methods on the IUIAutomation client interface, so the watchdog can already cancel them via CoCancelCall once its timeout elapses. They show up in frozen-stack dumps only because the dump is taken mid-freeze, before the watchdog's timeout fires — not because the watchdog is unable to reach them. UiaHasServerSideProvider, being a free function rather than a client-interface method, is the one case the watchdog genuinely can't otherwise cancel (and, as you note, it can't benefit from ConnectionRecoveryBehavior either). I don't have evidence the *BuildCache methods bypass ConnectionRecoveryBehavior, so wrapping them was redundant.

So I've narrowed the PR accordingly:

  • Reverted the kwargsFromSuper changes entirely — NVDAObjects/UIA/__init__.py is no longer touched.
  • Kept UiaHasServerSideProvider (UIAHandler) and the Word in-process RPC nvdaInProcUtils_winword_getTextInRange (winword) wrapped. Neither is a UIA client-interface method, so neither benefits from ConnectionRecoveryBehavior or the watchdog's existing CoCancelCall cancellation — both genuinely need cancellableExecute.

This keeps things minimal per your "start simpler, scale up as needed" steer. If ElementFromHandleBuildCache does later prove to need it, that can be a separate, evidence-backed change.

I've also moved the watchdog/exceptions imports to module level per @seanbudd's review.

@jcsteh

jcsteh commented May 22, 2026

Copy link
Copy Markdown
Contributor

Across the debug logs I gathered while reproducing these hangs, the number of times each call appeared in a frozen-stack dump:
Call Frozen-stack appearances
UiaHasServerSideProvider 37
ElementFromHandleBuildCache 12
getFocusedElementBuildCache 1
ElementFromPointBuildCache 0

Thanks, that's interesting info to have.

the ElementFrom*BuildCache calls are comtypes COM methods on the IUIAutomation client interface, so the watchdog can already cancel them via CoCancelCall once its timeout elapses.

Actually, I don't think that's true. CoCancelCall can only cancel COM calls for a server which implements a cancellation object. The standard COM marshaler does, so that means you can cancel COM calls to IAccessible, IAccessible2 and friends, Microsoft Office object model, etc. However, UI Automation has separate client and server interfaces and handles the IPC internally via named pipes. From the perspective of COM, UIA client is purely in-process, so it doesn't use the standard COM marshaler and I'm fairly sure it doesn't implement a cancellation object.

All of this is to say that watchdog very likely can't cancel these UIA calls. However, UIA ConnectionRecoveryBehavior should handle this if Microsoft have implemented it correctly. That's the big question: whether they have. Given past UIA core bugs, I don't have high confidence there.

This keeps things minimal per your "start simpler, scale up as needed" steer. If ElementFromHandleBuildCache does later prove to need it, that can be a separate, evidence-backed change.

That may well end up being the case, but yeah, I do believe the start simple, scale up approach is warranted here.

Even without cancellableExecute, there may also be some benefit in adding watchdog.isAttemptingRecovery guards in various places before we call methods. That way, we avoid the overhead of dispatching and then immediately cancelling them (and for UIA, perhaps a timeout, even a short one) when we know the watchdog is going to cancel them anyway. Perhaps we could have a watchdog.executeIfAlive wrapper or similar which immediately raises CallCancelled if isAttemptingRecovery. But again, I would suggest doing any of this in separate, smaller pull requests to mitigate risk and make tracking down any regressions easier (bisection, etc.).

@jcsteh

jcsteh commented May 22, 2026

Copy link
Copy Markdown
Contributor

Note that there are several other direct calls to nvdaInProcUtils methods we may also wish to wrap with cancellableExecute, but again, probably best to do this with evidence.

Comment thread user_docs/en/changes.md Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to keep NVDA responsive when a target application hangs by routing specific blocking cross-process accessibility calls through watchdog.cancellableExecute, allowing the watchdog to cancel the wait and avoid freezing NVDA’s core thread (Fixes #20169).

Changes:

  • Route UiaHasServerSideProvider(hwnd) checks through watchdog.cancellableExecute and treat cancellation as “non-UIA”.
  • Route the WinWord in-proc text retrieval (nvdaInProcUtils_winword_getTextInRange) through watchdog.cancellableExecute and handle watchdog cancellation.
  • Document the user-facing improvement in the changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
user_docs/en/changes.md Adds a changelog entry describing faster recovery from app hangs due to watchdog-cancellable calls.
source/UIAHandler/init.py Wrapes UiaHasServerSideProvider in watchdog.cancellableExecute and handles CallCancelled.
source/NVDAObjects/window/winword.py Wraps nvdaInProcUtils_winword_getTextInRange in watchdog.cancellableExecute and adds cancellation handling/logging.

Comment thread source/NVDAObjects/window/winword.py Outdated
Comment thread source/NVDAObjects/window/winword.py Outdated
…elog

- winword: on CallCancelled don't return [self.text]; that reads
  self._rangeObj.text, another COM call into the same hung Word, which
  re-blocks the core and defeats the cancellation. Return [""] instead.
  (Per @Copilot.)
- winword: downgrade the cancellation log line from debugWarning to debug
  so a persistent hang doesn't flood the log on repeated speech/braille
  updates. (Per @Copilot.)
- changes.md: rephrase the entry in user-facing terms per @seanbudd.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@heath-toby

Copy link
Copy Markdown
Contributor Author

Thanks for the correction on CoCancelCall — you're right, I was conflating standard-marshaler cancellability with UIA's custom IPC (named pipes, not the standard marshaler, no cancellation object). The narrowing still stands on "start simple, scale up with evidence" grounds, and the empirical case for the *BuildCache calls is weak anyway given how rarely they showed up in our frozen stacks. If they do prove to need wrapping later, that can be its own evidence-backed PR.

Noted on the executeIfAlive / pre-dispatch isAttemptingRecovery guard idea — nice optimization to skip the dispatch-then-cancel overhead, and I agree separate small PRs are the safer path for any follow-ups.

Also acknowledged on the other nvdaInProcUtils_* calls — the winword wrap is a good template for them, and I'll watch for which ones actually show up in real hung-app traces before proposing more.

I've also addressed @seanbudd's user-facing changelog rewording and @Copilot's two points on winword.py (the CallCancelled path was indeed re-calling into hung Word via self.text — fixed to return [""] with an explanatory comment; and the log line is now log.debug rather than log.debugWarning to avoid spam during a persistent hang). Pushed as 714ca5d33.

@seanbudd seanbudd requested a review from michaelDCurran May 25, 2026 23:50
@seanbudd seanbudd added merge-early Merge Early in a developer cycle conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. labels May 25, 2026
@dpy013

dpy013 commented May 26, 2026

Copy link
Copy Markdown
Contributor

hi @heath-toby
It is recommended to resolve the conflicts?

@seanbudd seanbudd merged commit 961db0e into nvaccess:master May 26, 2026
39 of 42 checks passed
@github-actions github-actions Bot added this to the 2026.3 milestone May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. merge-early Merge Early in a developer cycle

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Core thread should never block indefinitely on a synchronous accessibility call when an application hangs

5 participants