Route blocking accessibility calls through the watchdog so a hung app can't freeze NVDA#20170
Conversation
|
Converting to draft until #20168 is resolved |
|
Thanks for working on this. A few high level thoughts:
I don't think we should be introducing a configurable timeout here. watchdog is intended to centralise the management of hang detection and dynamically adjusted timeouts. We deliberately use a longer timeout of 10 seconds (NORMAL_CORE_ALIVE_TIMEOUT) in the first instance because an app can legitimately take some time to respond to a query if it's under heavy load. While a much shorter timeout intuitively seems like it would make things better, it can make things worse on a heavily loaded or low end system if we fire lots of queries, cancel or abandon them before they're complete and then just fire off more queries. We could perhaps discuss changing this to 5 seconds, which I think matches the Windows hung window timeout. However, I think anything less is asking for a whole new class of problems. In addition, I think allowing callers to customise this ultimately increases complexity and introduces inconsistency, and makes problem diagnosis trickier. Importantly, note that when a user switches apps, we drop the timeout to 0.5 seconds (MIN_CORE_ALIVE_TIMEOUT). Anything that was previously waiting 10 seconds will immediately stop waiting. This means that if a user switches apps in order to interact with another app, we should immediately start responding again, assuming that watchdog supports cancellation of the blocking call. This gives us the best of both worlds: we don't prematurely cancel calls that might just be taking a while, but a user can also get response back immediately when they switch away from the offending app to do something else.
We already have something similar in watchdog.isAttemptingRecovery. I would suggest using that or integrating any new functionality with it. It's great that calls like UiaHasServerSideProvider are being moved to cancellableExecute. I've long hypothesised that this could improve hangs significantly, since we unfortunately don't have another watchdog mechanism to cancel them. I'm slightly concerned that we could end up growing a whole heap of abandoned background threads, but that should be mitigated somewhat by the fact that we don't execute anything at all if watchdog goes into recovery mode. There are other mitigations we could implement too, but I think it's best to start simpler here and scale up as needed, as increased complexity may introduce new classes of bugs.
Again, I wonder whether we should just be relying on watchdog here. Rather than introducing new timeouts, we have a simpler rule: execute via watchdog and let watchdog handle the timeouts and cancellation. |
|
There's an assumption here that the response from UiaHasServerSideProvider will never (or rarely) change. That should usually be the case, but I think there might be cases where it isn't quite that clear; e.g. an app might still be initialising. If we do get this wrong, it really isn't great. At the very least, I'd suggest that this should be split out into a separate PR so it can be evaluated separately and reverted if needed. |
|
As a practical example of where a short timeout is not necessarily better, I recall when I was first developing watchdog that I did indeed have it set much shorter. However, it caused problems like the focus not being reported when starting an app because apps are understandably a bit sluggish while they're first loading. IMO, NVDA being unresponsive for a few seconds in an unresponsive app is reasonable, as long as the user can switch apps, including accessing the Windows "not responding" dialog. On the other hand, NVDA failing to report things because it didn't wait long enough is not acceptable. |
|
Thanks @jcsteh — this is really helpful, and I agree with the redirect. The watchdog already centralises this far better than the parallel machinery I had added. Reworking #20170 accordingly:
I will force-push the simplified branch to this PR shortly. Thanks again for the watchdog context — the app-switch timeout-drop in particular reframes the whole thing. |
Reworked per review feedback on nvaccess#20170: when an application stops responding, route the blocking cross-process calls that the watchdog cannot otherwise cancel through watchdog.cancellableExecute, so the watchdog's existing dynamically-adjusted timeout owns cancellation. In particular that timeout drops to MIN_CORE_ALIVE_TIMEOUT when the user switches apps, letting the user escape a hung application immediately. - UIAHandler._isUIAWindowHelper: UiaHasServerSideProvider. - NVDAObjects.UIA.kwargsFromSuper: the UIA client element lookups. - winword getTextWithFields: nvdaInProcUtils_winword_getTextInRange. On cancellation the caller falls back exactly as on the pre-existing failure paths. No new timeout, breaker or module is introduced; the watchdog owns hang detection, timeouts and cancellation. Fixes nvaccess#20169 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b282423 to
451fcb6
Compare
|
Do we have evidence that ElementFromPointBuildCache and getFocusedElementBuildCache do not respect the UIA ConnectionRecoveryBehaviorOptions_Enabled option? UiaHasServerSideProvider can't respect this option because it is a free function, not a method on an IUIAutomation client interface, and I've definitely seen it show up in NVDA frozen stack logs. Have we definitely seen ElementFromPointBuildCache and getFocusedElementBuildCache show up in frozen stacks? |
seanbudd
left a comment
There was a problem hiding this comment.
Thanks @heath-toby - changes generally look good to me
Per @jcsteh: the ElementFrom*BuildCache calls are comtypes COM client-interface methods that the watchdog can already cancel via CoCancelCall, so routing them through cancellableExecute adds no cancellability they lack. Only UiaHasServerSideProvider (a free function, not a client-interface method) and the winword in-process RPC genuinely need it. Revert the kwargsFromSuper changes; keep just those two call sites wrapped. Per @seanbudd: move the watchdog and exceptions imports to module level. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Good question — and checking my own frozen-stack logs from this work, you're right to push back. Across the debug logs I gathered while reproducing these hangs, the number of times each call appeared in a frozen-stack dump:
But the raw counts aren't really the deciding factor. The more important point is the one you raised: the So I've narrowed the PR accordingly:
This keeps things minimal per your "start simpler, scale up as needed" steer. If I've also moved the |
Thanks, that's interesting info to have.
Actually, I don't think that's true. CoCancelCall can only cancel COM calls for a server which implements a cancellation object. The standard COM marshaler does, so that means you can cancel COM calls to IAccessible, IAccessible2 and friends, Microsoft Office object model, etc. However, UI Automation has separate client and server interfaces and handles the IPC internally via named pipes. From the perspective of COM, UIA client is purely in-process, so it doesn't use the standard COM marshaler and I'm fairly sure it doesn't implement a cancellation object. All of this is to say that watchdog very likely can't cancel these UIA calls. However, UIA ConnectionRecoveryBehavior should handle this if Microsoft have implemented it correctly. That's the big question: whether they have. Given past UIA core bugs, I don't have high confidence there.
That may well end up being the case, but yeah, I do believe the start simple, scale up approach is warranted here. Even without cancellableExecute, there may also be some benefit in adding |
|
Note that there are several other direct calls to nvdaInProcUtils methods we may also wish to wrap with cancellableExecute, but again, probably best to do this with evidence. |
There was a problem hiding this comment.
Pull request overview
This PR aims to keep NVDA responsive when a target application hangs by routing specific blocking cross-process accessibility calls through watchdog.cancellableExecute, allowing the watchdog to cancel the wait and avoid freezing NVDA’s core thread (Fixes #20169).
Changes:
- Route
UiaHasServerSideProvider(hwnd)checks throughwatchdog.cancellableExecuteand treat cancellation as “non-UIA”. - Route the WinWord in-proc text retrieval (
nvdaInProcUtils_winword_getTextInRange) throughwatchdog.cancellableExecuteand handle watchdog cancellation. - Document the user-facing improvement in the changelog.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| user_docs/en/changes.md | Adds a changelog entry describing faster recovery from app hangs due to watchdog-cancellable calls. |
| source/UIAHandler/init.py | Wrapes UiaHasServerSideProvider in watchdog.cancellableExecute and handles CallCancelled. |
| source/NVDAObjects/window/winword.py | Wraps nvdaInProcUtils_winword_getTextInRange in watchdog.cancellableExecute and adds cancellation handling/logging. |
…elog - winword: on CallCancelled don't return [self.text]; that reads self._rangeObj.text, another COM call into the same hung Word, which re-blocks the core and defeats the cancellation. Return [""] instead. (Per @Copilot.) - winword: downgrade the cancellation log line from debugWarning to debug so a persistent hang doesn't flood the log on repeated speech/braille updates. (Per @Copilot.) - changes.md: rephrase the entry in user-facing terms per @seanbudd. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for the correction on Noted on the Also acknowledged on the other I've also addressed @seanbudd's user-facing changelog rewording and @Copilot's two points on winword.py (the |
|
hi @heath-toby |
Link to issue number:
Fixes #20169
Summary of the issue:
When an application stops responding, NVDA's core thread blocks on synchronous cross-process accessibility calls into it. Some of these calls — notably
UiaHasServerSideProviderand the in-process Word object-model RPC — are not routed through any mechanism the watchdog can cancel, so NVDA can stay frozen until the application is killed.Description of user facing changes:
NVDA recovers more quickly when an application stops responding; in particular, switching away from a hung application returns NVDA to responsiveness immediately.
Description of developer facing changes:
A handful of blocking cross-process accessibility calls are now run via
watchdog.cancellableExecute.Description of development approach:
Reworked from the original proposal per @jcsteh's review: rather than introducing a per-call timeout, a circuit breaker, or a new state module, this relies entirely on the existing watchdog, which already centralises hang detection with a dynamically-adjusted timeout (and crucially drops it to
MIN_CORE_ALIVE_TIMEOUTwhen the user switches apps).The blocking calls the watchdog cannot otherwise cancel are routed through
watchdog.cancellableExecute:UIAHandler._isUIAWindowHelper:UiaHasServerSideProvider.NVDAObjects.UIA.kwargsFromSuper: the UIA client element lookups (ElementFromHandleBuildCache/getFocusedElementBuildCache/ElementFromPointBuildCache).getTextWithFields:nvdaInProcUtils_winword_getTextInRange.On cancellation each caller falls back exactly as it already does on the corresponding failure path.
cancellableExecuteitself is unchanged.Testing strategy:
ruffand the unit test suite pass. Manual testing against deliberately-hung and real-world unresponsive applications: NVDA recovers, and switching away from the hung application restores responsiveness immediately.Known issues with pull request:
cancellableExecutecaller, a cancelled call leaves an abandoned worker thread until the underlying call eventually returns. This is pre-existing behaviour, mitigated by the watchdog not dispatching new work during recovery.UiaHasServerSideProviderresult (raised in review) is intentionally left out of this PR so it can be evaluated separately.Code Review Checklist: