Route blocking accessibility calls through the watchdog so a hung app can't freeze NVDA by heath-toby · Pull Request #20170 · nvaccess/nvda

heath-toby · 2026-05-18T23:52:39Z

Link to issue number:

Fixes #20169

Summary of the issue:

When an application stops responding, NVDA's core thread blocks on synchronous cross-process accessibility calls into it. Some of these calls — notably UiaHasServerSideProvider and the in-process Word object-model RPC — are not routed through any mechanism the watchdog can cancel, so NVDA can stay frozen until the application is killed.

Description of user facing changes:

NVDA recovers more quickly when an application stops responding; in particular, switching away from a hung application returns NVDA to responsiveness immediately.

Description of developer facing changes:

A handful of blocking cross-process accessibility calls are now run via watchdog.cancellableExecute.

Description of development approach:

Reworked from the original proposal per @jcsteh's review: rather than introducing a per-call timeout, a circuit breaker, or a new state module, this relies entirely on the existing watchdog, which already centralises hang detection with a dynamically-adjusted timeout (and crucially drops it to MIN_CORE_ALIVE_TIMEOUT when the user switches apps).

The blocking calls the watchdog cannot otherwise cancel are routed through watchdog.cancellableExecute:

UIAHandler._isUIAWindowHelper: UiaHasServerSideProvider.
NVDAObjects.UIA.kwargsFromSuper: the UIA client element lookups (ElementFromHandleBuildCache / getFocusedElementBuildCache / ElementFromPointBuildCache).
winword getTextWithFields: nvdaInProcUtils_winword_getTextInRange.

On cancellation each caller falls back exactly as it already does on the corresponding failure path. cancellableExecute itself is unchanged.

Testing strategy:

ruff and the unit test suite pass. Manual testing against deliberately-hung and real-world unresponsive applications: NVDA recovers, and switching away from the hung application restores responsiveness immediately.

Known issues with pull request:

A hung application's own content still cannot be read while it is hung; this only ensures NVDA itself stays cancellable/recoverable.
As with any cancellableExecute caller, a cancelled call leaves an abandoned worker thread until the underlying call eventually returns. This is pre-existing behaviour, mitigated by the watchdog not dispatching new work during recovery.
Caching the UiaHasServerSideProvider result (raised in review) is intentionally left out of this PR so it can be evaluated separately.

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

seanbudd · 2026-05-19T02:51:52Z

Converting to draft until #20168 is resolved

jcsteh · 2026-05-21T06:29:41Z

Thanks for working on this. A few high level thoughts:

• watchdog.cancellableExecute / CancellableCallThread.execute gain an optional bounded timeout / ccTimeout (default None = existing behaviour, byte-unchanged).

I don't think we should be introducing a configurable timeout here. watchdog is intended to centralise the management of hang detection and dynamically adjusted timeouts.

We deliberately use a longer timeout of 10 seconds (NORMAL_CORE_ALIVE_TIMEOUT) in the first instance because an app can legitimately take some time to respond to a query if it's under heavy load. While a much shorter timeout intuitively seems like it would make things better, it can make things worse on a heavily loaded or low end system if we fire lots of queries, cancel or abandon them before they're complete and then just fire off more queries. We could perhaps discuss changing this to 5 seconds, which I think matches the Windows hung window timeout. However, I think anything less is asking for a whole new class of problems. In addition, I think allowing callers to customise this ultimately increases complexity and introduces inconsistency, and makes problem diagnosis trickier.

Importantly, note that when a user switches apps, we drop the timeout to 0.5 seconds (MIN_CORE_ALIVE_TIMEOUT). Anything that was previously waiting 10 seconds will immediately stop waiting. This means that if a user switches apps in order to interact with another app, we should immediately start responding again, assuming that watchdog supports cancellation of the blocking call. This gives us the best of both worlds: we don't prematurely cancel calls that might just be taking a while, but a user can also get response back immediately when they switch away from the offending app to do something else.

• New dependency-free coreThreadProtection module: a near-zero-cost shared "an app hung recently" signal (noteAppHang / isHungModeActive).

We already have something similar in watchdog.isAttemptingRecovery. I would suggest using that or integrating any new functionality with it.

It's great that calls like UiaHasServerSideProvider are being moved to cancellableExecute. I've long hypothesised that this could improve hangs significantly, since we unfortunately don't have another watchdog mechanism to cancel them. I'm slightly concerned that we could end up growing a whole heap of abandoned background threads, but that should be mitigated somewhat by the fact that we don't execute anything at all if watchdog goes into recovery mode. There are other mitigations we could implement too, but I think it's best to start simpler here and scale up as needed, as increased complexity may introduce new classes of bugs.

• UIAHandler: timedUIAClientCall, a UIA-client jam breaker with exponential backoff (isUIAClientJammed / noteUIAClientJam).

Again, I wonder whether we should just be relying on watchdog here. Rather than introducing new timeouts, we have a simpler rule: execute via watchdog and let watchdog handle the timeouts and cancellation.

jcsteh · 2026-05-21T06:37:12Z

There's an assumption here that the response from UiaHasServerSideProvider will never (or rarely) change. That should usually be the case, but I think there might be cases where it isn't quite that clear; e.g. an app might still be initialising. If we do get this wrong, it really isn't great. At the very least, I'd suggest that this should be split out into a separate PR so it can be evaluated separately and reverted if needed.

jcsteh · 2026-05-21T06:42:10Z

As a practical example of where a short timeout is not necessarily better, I recall when I was first developing watchdog that I did indeed have it set much shorter. However, it caused problems like the focus not being reported when starting an app because apps are understandably a bit sluggish while they're first loading. IMO, NVDA being unresponsive for a few seconds in an unresponsive app is reasonable, as long as the user can switch apps, including accessing the Windows "not responding" dialog. On the other hand, NVDA failing to report things because it didn't wait long enough is not acceptable.

heath-toby · 2026-05-21T09:45:24Z

Thanks @jcsteh — this is really helpful, and I agree with the redirect. The watchdog already centralises this far better than the parallel machinery I had added. Reworking #20170 accordingly:

Dropping the per-call ccTimeout. Leaning on watchdog's dynamic timeout — especially the drop to MIN_CORE_ALIVE_TIMEOUT on app-switch, which already gives the "user can always escape" guarantee — is the right model. cancellableExecute will be left unchanged.
Dropping coreThreadProtection in favour of the existing watchdog.isAttemptingRecovery.
Dropping the UIA jam breaker / exponential backoff — same reasoning: route through cancellableExecute and let watchdog own timeouts and cancellation.
Keeping the part you were positive about: routing the blocking UIA client calls (UiaHasServerSideProvider, ElementFromHandleBuildCache, etc.) and the Word object-model text call through cancellableExecute so the watchdog can actually cancel them.
Splitting the UiaHasServerSideProvider result cache into its own PR so it can be evaluated and reverted independently, per your point about the result not always being stable while an app is still initialising.
In the spirit of "start simpler, scale up as needed", I will also drop the central NVDAObject._getPropertyViaCache guard from this PR — the simple "route blocking calls through watchdog" change stands on its own, and the guard can be revisited later if it proves necessary.

I will force-push the simplified branch to this PR shortly. Thanks again for the watchdog context — the app-switch timeout-drop in particular reframes the whole thing.

Reworked per review feedback on nvaccess#20170: when an application stops responding, route the blocking cross-process calls that the watchdog cannot otherwise cancel through watchdog.cancellableExecute, so the watchdog's existing dynamically-adjusted timeout owns cancellation. In particular that timeout drops to MIN_CORE_ALIVE_TIMEOUT when the user switches apps, letting the user escape a hung application immediately. - UIAHandler._isUIAWindowHelper: UiaHasServerSideProvider. - NVDAObjects.UIA.kwargsFromSuper: the UIA client element lookups. - winword getTextWithFields: nvdaInProcUtils_winword_getTextInRange. On cancellation the caller falls back exactly as on the pre-existing failure paths. No new timeout, breaker or module is introduced; the watchdog owns hang detection, timeouts and cancellation. Fixes nvaccess#20169 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jcsteh · 2026-05-21T22:34:23Z

Do we have evidence that ElementFromPointBuildCache and getFocusedElementBuildCache do not respect the UIA ConnectionRecoveryBehaviorOptions_Enabled option? UiaHasServerSideProvider can't respect this option because it is a free function, not a method on an IUIAutomation client interface, and I've definitely seen it show up in NVDA frozen stack logs. Have we definitely seen ElementFromPointBuildCache and getFocusedElementBuildCache show up in frozen stacks?

seanbudd

Thanks @heath-toby - changes generally look good to me

@jcsteh

Per @jcsteh: the ElementFrom*BuildCache calls are comtypes COM client-interface methods that the watchdog can already cancel via CoCancelCall, so routing them through cancellableExecute adds no cancellability they lack. Only UiaHasServerSideProvider (a free function, not a client-interface method) and the winword in-process RPC genuinely need it. Revert the kwargsFromSuper changes; keep just those two call sites wrapped. Per @seanbudd: move the watchdog and exceptions imports to module level. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

heath-toby · 2026-05-22T15:32:35Z

Good question — and checking my own frozen-stack logs from this work, you're right to push back.

Across the debug logs I gathered while reproducing these hangs, the number of times each call appeared in a frozen-stack dump:

Call	Frozen-stack appearances
`UiaHasServerSideProvider`	37
`ElementFromHandleBuildCache`	12
`getFocusedElementBuildCache`	1
`ElementFromPointBuildCache`	0

But the raw counts aren't really the deciding factor. The more important point is the one you raised: the ElementFrom*BuildCache calls are comtypes COM methods on the IUIAutomation client interface, so the watchdog can already cancel them via CoCancelCall once its timeout elapses. They show up in frozen-stack dumps only because the dump is taken mid-freeze, before the watchdog's timeout fires — not because the watchdog is unable to reach them. UiaHasServerSideProvider, being a free function rather than a client-interface method, is the one case the watchdog genuinely can't otherwise cancel (and, as you note, it can't benefit from ConnectionRecoveryBehavior either). I don't have evidence the *BuildCache methods bypass ConnectionRecoveryBehavior, so wrapping them was redundant.

So I've narrowed the PR accordingly:

Reverted the kwargsFromSuper changes entirely — NVDAObjects/UIA/__init__.py is no longer touched.
Kept UiaHasServerSideProvider (UIAHandler) and the Word in-process RPC nvdaInProcUtils_winword_getTextInRange (winword) wrapped. Neither is a UIA client-interface method, so neither benefits from ConnectionRecoveryBehavior or the watchdog's existing CoCancelCall cancellation — both genuinely need cancellableExecute.

This keeps things minimal per your "start simpler, scale up as needed" steer. If ElementFromHandleBuildCache does later prove to need it, that can be a separate, evidence-backed change.

I've also moved the watchdog/exceptions imports to module level per @seanbudd's review.

jcsteh · 2026-05-22T23:05:22Z

Across the debug logs I gathered while reproducing these hangs, the number of times each call appeared in a frozen-stack dump:
Call Frozen-stack appearances
UiaHasServerSideProvider 37
ElementFromHandleBuildCache 12
getFocusedElementBuildCache 1
ElementFromPointBuildCache 0

Thanks, that's interesting info to have.

the ElementFrom*BuildCache calls are comtypes COM methods on the IUIAutomation client interface, so the watchdog can already cancel them via CoCancelCall once its timeout elapses.

Actually, I don't think that's true. CoCancelCall can only cancel COM calls for a server which implements a cancellation object. The standard COM marshaler does, so that means you can cancel COM calls to IAccessible, IAccessible2 and friends, Microsoft Office object model, etc. However, UI Automation has separate client and server interfaces and handles the IPC internally via named pipes. From the perspective of COM, UIA client is purely in-process, so it doesn't use the standard COM marshaler and I'm fairly sure it doesn't implement a cancellation object.

All of this is to say that watchdog very likely can't cancel these UIA calls. However, UIA ConnectionRecoveryBehavior should handle this if Microsoft have implemented it correctly. That's the big question: whether they have. Given past UIA core bugs, I don't have high confidence there.

This keeps things minimal per your "start simpler, scale up as needed" steer. If ElementFromHandleBuildCache does later prove to need it, that can be a separate, evidence-backed change.

That may well end up being the case, but yeah, I do believe the start simple, scale up approach is warranted here.

Even without cancellableExecute, there may also be some benefit in adding watchdog.isAttemptingRecovery guards in various places before we call methods. That way, we avoid the overhead of dispatching and then immediately cancelling them (and for UIA, perhaps a timeout, even a short one) when we know the watchdog is going to cancel them anyway. Perhaps we could have a watchdog.executeIfAlive wrapper or similar which immediately raises CallCancelled if isAttemptingRecovery. But again, I would suggest doing any of this in separate, smaller pull requests to mitigate risk and make tracking down any regressions easier (bisection, etc.).

jcsteh · 2026-05-22T23:11:00Z

Note that there are several other direct calls to nvdaInProcUtils methods we may also wish to wrap with cancellableExecute, but again, probably best to do this with evidence.

Copilot

Pull request overview

This PR aims to keep NVDA responsive when a target application hangs by routing specific blocking cross-process accessibility calls through watchdog.cancellableExecute, allowing the watchdog to cancel the wait and avoid freezing NVDA’s core thread (Fixes #20169).

Changes:

Route UiaHasServerSideProvider(hwnd) checks through watchdog.cancellableExecute and treat cancellation as “non-UIA”.
Route the WinWord in-proc text retrieval (nvdaInProcUtils_winword_getTextInRange) through watchdog.cancellableExecute and handle watchdog cancellation.
Document the user-facing improvement in the changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
user_docs/en/changes.md	Adds a changelog entry describing faster recovery from app hangs due to watchdog-cancellable calls.
source/UIAHandler/init.py	Wrapes `UiaHasServerSideProvider` in `watchdog.cancellableExecute` and handles `CallCancelled`.
source/NVDAObjects/window/winword.py	Wraps `nvdaInProcUtils_winword_getTextInRange` in `watchdog.cancellableExecute` and adds cancellation handling/logging.

@seanbudd

…elog - winword: on CallCancelled don't return [self.text]; that reads self._rangeObj.text, another COM call into the same hung Word, which re-blocks the core and defeats the cancellation. Return [""] instead. (Per @Copilot.) - winword: downgrade the cancellation log line from debugWarning to debug so a persistent hang doesn't flood the log on repeated speech/braille updates. (Per @Copilot.) - changes.md: rephrase the entry in user-facing terms per @seanbudd. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

heath-toby · 2026-05-25T10:18:16Z

Thanks for the correction on CoCancelCall — you're right, I was conflating standard-marshaler cancellability with UIA's custom IPC (named pipes, not the standard marshaler, no cancellation object). The narrowing still stands on "start simple, scale up with evidence" grounds, and the empirical case for the *BuildCache calls is weak anyway given how rarely they showed up in our frozen stacks. If they do prove to need wrapping later, that can be its own evidence-backed PR.

Noted on the executeIfAlive / pre-dispatch isAttemptingRecovery guard idea — nice optimization to skip the dispatch-then-cancel overhead, and I agree separate small PRs are the safer path for any follow-ups.

Also acknowledged on the other nvdaInProcUtils_* calls — the winword wrap is a good template for them, and I'll watch for which ones actually show up in real hung-app traces before proposing more.

I've also addressed @seanbudd's user-facing changelog rewording and @Copilot's two points on winword.py (the CallCancelled path was indeed re-calling into hung Word via self.text — fixed to return [""] with an explanatory comment; and the log line is now log.debug rather than log.debugWarning to avoid spam during a persistent hang). Pushed as 714ca5d33.

dpy013 · 2026-05-26T00:33:44Z

hi @heath-toby
It is recommended to resolve the conflicts?

heath-toby mentioned this pull request May 19, 2026

Keep NVDA responsive when a UIA or MSAA application stops responding #20168

Merged

5 tasks

heath-toby marked this pull request as ready for review May 19, 2026 00:00

heath-toby requested a review from a team as a code owner May 19, 2026 00:00

heath-toby requested a review from seanbudd May 19, 2026 00:00

seanbudd marked this pull request as draft May 19, 2026 02:51

heath-toby changed the title ~~Keep NVDA's core responsive when an application hangs (timeout-bounded accessibility calls)~~ Route blocking accessibility calls through the watchdog so a hung app can't freeze NVDA May 21, 2026

heath-toby force-pushed the fix/uia-provider-cache-perf branch from b282423 to 451fcb6 Compare May 21, 2026 10:05

heath-toby marked this pull request as ready for review May 21, 2026 10:06

seanbudd reviewed May 22, 2026

View reviewed changes

Comment thread source/NVDAObjects/UIA/__init__.py Outdated

Comment thread source/UIAHandler/__init__.py Outdated

Comment thread source/NVDAObjects/window/winword.py Outdated

seanbudd reviewed May 25, 2026

View reviewed changes

Comment thread user_docs/en/changes.md Outdated

seanbudd requested a review from Copilot May 25, 2026 00:03

Copilot started reviewing on behalf of seanbudd May 25, 2026 00:03 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

Comment thread source/NVDAObjects/window/winword.py Outdated

Comment thread source/NVDAObjects/window/winword.py Outdated

seanbudd requested a review from michaelDCurran May 25, 2026 23:50

seanbudd added merge-early Merge Early in a developer cycle conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. labels May 25, 2026

seanbudd approved these changes May 26, 2026

View reviewed changes

Merge branch 'master' into fix/uia-provider-cache-perf

939ff80

seanbudd merged commit 961db0e into nvaccess:master May 26, 2026
39 of 42 checks passed

github-actions Bot added this to the 2026.3 milestone May 26, 2026

Uh oh!

Conversation

heath-toby commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Link to issue number:

Summary of the issue:

Description of user facing changes:

Description of developer facing changes:

Description of development approach:

Testing strategy:

Known issues with pull request:

Code Review Checklist:

Uh oh!

seanbudd commented May 19, 2026

Uh oh!

jcsteh commented May 21, 2026

Uh oh!

jcsteh commented May 21, 2026

Uh oh!

jcsteh commented May 21, 2026

Uh oh!

heath-toby commented May 21, 2026

Uh oh!

jcsteh commented May 21, 2026

Uh oh!

seanbudd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

heath-toby commented May 22, 2026

Uh oh!

jcsteh commented May 22, 2026

Uh oh!

jcsteh commented May 22, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

heath-toby commented May 25, 2026

Uh oh!

dpy013 commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

heath-toby commented May 18, 2026 •

edited

Loading