Skip to content

Show dialog to enable image desc once if it is not enabled in settings panel#19243

Merged
seanbudd merged 3 commits into
nvaccess:masterfrom
tianzeshi-study:startImageDescDialog
Nov 21, 2025
Merged

Show dialog to enable image desc once if it is not enabled in settings panel#19243
seanbudd merged 3 commits into
nvaccess:masterfrom
tianzeshi-study:startImageDescDialog

Conversation

@tianzeshi-study

Copy link
Copy Markdown
Contributor

Link to issue number:

Resolves #19097

Summary of the issue:

Users need to open settings panel to enable image description

Description of user facing changes:

If image description is not enabled in settings pannel, it will show a dialog to enable it in current session when press the keyboard shortcut

Description of developer facing changes:

Nonne

Description of development approach:

None

Testing strategy:

Disable image description in settings panel, then press the keyboard shortcut to enable it in a dialog.

Known issues with pull request:

After enabling it requires another key press for recognition, but I think this is acceptable

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

@tianzeshi-study tianzeshi-study requested a review from a team as a code owner November 20, 2025 08:21

@seanbudd seanbudd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread source/gui/_localCaptioner/messageDialogs.py Outdated
@seanbudd seanbudd merged commit 480b087 into nvaccess:master Nov 21, 2025
4 of 5 checks passed
@github-actions github-actions Bot added this to the 2026.1 milestone Nov 21, 2025
@CyrilleB79

Copy link
Copy Markdown
Contributor

Why this choice of a per-session validity?

It's quite strange to validate the dialog box, and then, when you open the settings, the checkbox remains unchecked. It's worth noting that the fact that this parameter is temporary (only for the session) is not mentioned in the new enabling dialog box, what creates more confusion.

The only other parameter that I know which has both a temporary value and a persistent config value (provided you save the config) is the screen curtain. I already find this quite confusing for the screen curtain. But for AI description, I find it more confusing and even useless.

I'd recommend that we drop this temporary setting and that pressing yes just enable the AI description feature in settings. This UX would be similar to Windows History clipboard for example.

If you disagree with me, could you please provide the different use cases / user stories for:

  • AI Image description temporarily enabled, only for the current session
  • AI image description enabled in config

Thanks.

@tianzeshi-study

Copy link
Copy Markdown
Contributor Author

Why this choice of a per-session validity?

I think the configuration should only be modified in the settings panel.
And enabling image description may cause NVDA to start late or use high memory, so it is more appropriate to just stay in the current session.

@seanbudd

Copy link
Copy Markdown
Member

@tianzeshi-study - do our keyboard gestures set the config? I would assume they would

@tianzeshi-study

Copy link
Copy Markdown
Contributor Author

@tianzeshi-study - do our keyboard gestures set the config? I would assume they would

No, they don't. They are also session-only.
Just like in this PR, the changes apply only to the current session, making it convenient to quickly enable or disable image descriptions within that session.

The config can only be set in settings panel.

@tianzeshi-study

Copy link
Copy Markdown
Contributor Author

My initial thought was that the setting in the Settings panel should explicitly say “Enable image descriptions on startup,” rather than the more ambiguous “Enable image descriptions.”

And “Enable image descriptions” should apply only to the current session by default, without modifying the configuration.

@CyrilleB79

Copy link
Copy Markdown
Contributor

With this feature, there are 2 points where an information from the user is required:

  1. Download the model if it is not already on the user's machine. It cannot be done automatically because:
    • the size of the model is important and the user may not want to have NVDA size growing
    • the download time may be important so the model cannot just be downloaded when the script is used for the first time
  2. Enable the feature (i.e. load the model in memory). It cannot be done automatically because:
    • doing this the first time the script is called may delay the answer time when the script is called for the first time
    • doing at NVDA startup delays the startup, even if the model is loaded in a separate thread if I am correct.

This PR was addressing the second point. IMO for this, there was no need to have a dialog which just makes the UX less easy.
Note that I had not in mind either that the script to enable the feature was to enable the feature for the session.

Given that, my suggestion for the best UX is:

  1. Do not speak anymore of enabled / disabled AI recognition feature. The feature should be present in NVDA and considered available as soon as the model has been downloaded
  2. Do not ask the user if they want to load the captioner in memory. Do you really know many applications which ask for your confirmation before loading something in memory?
  3. Since loading the model (the captioner) in memory may take some time (between 1 second and many seconds), give the user the choice, through the checkbox (or better, a 2-item combo-box) in the settings panel, when the captioner should be loaded in memory:
    • at NVDA startup. Pro: the recognition script works immediately. Cons: NVDA startup is delayed.
    • upon first call of the script. Pro: no impact on NVDA startup, and no impact at all if the feature is note used. Cons: The first script call may be delayed; if it is very long, a "Enabling image captioner..." message could be issued.
  4. Remove the script to enable the feature / load the captioner in memory, as well as the confirm dialog of this PR

@tianzeshi-study

Copy link
Copy Markdown
Contributor Author
4. Remove the script to enable the feature / load the captioner in memory, as well as the confirm dialog of this PR

For some advanced users, it may be necessary to unload the model when memory usage becomes too high, in order to ensure that NVDA continues running smoothly without stuttering. Without a toggle script, NVDA’s memory usage would remain permanently high.

@CyrilleB79

Copy link
Copy Markdown
Contributor

For some advanced users, it may be necessary to unload the model when memory usage becomes too high, in order to ensure that NVDA continues running smoothly without stuttering. Without a toggle script, NVDA’s memory usage would remain permanently high.

I do not understand the use case. Moreover if it's mainly for advanced users, does it mean that non-advanced user will never bother to unload the model from memory and thus, experience NVDA stuttering?

Again, unloading software components from memory is something very unusual that I have never seen in any other application. If it's really needed, this should be at least better documented.

@tianzeshi-study

Copy link
Copy Markdown
Contributor Author

I do not understand the use case. Moreover if it's mainly for advanced users, does it mean that non-advanced user will never bother to unload the model from memory and thus, experience NVDA stuttering?

Again, unloading software components from memory is something very unusual that I have never seen in any other application. If it's really needed, this should be at least better documented.

Just like closing a large Microsoft Word document or shutting down a browser tab that is causing the system to lag.

Ordinary users may also need to learn to manage memory usage manually to avoid potential lag. They generally understand that when their computer becomes slow, they need to close something.

@cary-rowen

Copy link
Copy Markdown
Contributor

I completely agree with @CyrilleB79

The current design of the merged Pull Request is fundamentally flawed in two main areas: User Experience (UX) and Design Philosophy.

  • Poor Temporary UX: The "session-only" enablement via a dialog is highly confusing. Users expect any confirmed "Enable" action to be saved persistently, not just for the current session.
  • Anti-Pattern Memory Management: Forcing users to manually unload the AI model (by toggling it off) to free up RAM is an unacceptable, counter-intuitive design choice. Memory management should be an automated function of the application (e.g., through configurable "load on demand" settings), not a manual task for the end-user.
  • Invalid Analogy: The comparison to closing browser tabs to "free memory" is weak. Users primarily close tabs because the content is no longer required, not because they are consciously engaging in memory resource management.

@tianzeshi-study

Copy link
Copy Markdown
Contributor Author
* **Poor Temporary UX:** The "session-only" enablement via a dialog is highly confusing. Users expect any confirmed "Enable" action to be saved persistently, not just for the current session.

* **Anti-Pattern Memory Management:** Forcing users to manually **unload the AI model** (by toggling it off) to free up RAM is an unacceptable, **counter-intuitive** design choice. Memory management should be an automated function of the application (e.g., through configurable "load on demand" settings), not a manual task for the end-user.

* **Invalid Analogy:** The comparison to closing browser tabs to "free memory" is weak. Users primarily close tabs because the content is **no longer required**, not because they are consciously engaging in memory resource management.

First, no one is being forced to unload the model manually — it’s an optional feature, and it doesn’t even have a gesture assigned.
Second, when a computer becomes slow, users naturally choose to close things they no longer need. This is not contradictory.

seanbudd added a commit that referenced this pull request Jan 9, 2026
SaschaCowley pushed a commit that referenced this pull request Jan 11, 2026
Reverts:
- #18475
- #19036
- #19024
- #19055
- #19057
- #19178
- #19243
- #19327
- Partial revert: #19342

### Issues fixed
Fixes #19298 

### Issues reopened
Reopens #16281

### Reason for revert / Can this PR be reimplemented? If so, what is
required for the next attempt

The current implementation of AI image descriptions yields low quality
captions from a 3 year old model (see #19298).
The current implementation also requires using numpy, which hogs RAM,
slows initialization, and increases the weight of the installer.
An attempt was made to convert this to C++ using WinML and Windows ONNX
runtimes as per #18662.
This would have removed numpy, and improved flexibility for using
different models in the future.
Unfortunately, this was not found to be feasible, as ONNX C++ fails to
work via 64bit emulation on ARM
(microsoft/onnxruntime#15403).

This means we have the following options for image descriptions:

1. Continue to use the python onnxruntime, and accept the RAM and
storage hits. Instead, improve the quality of the captioner with better
models such as
[git-base-coco](https://huggingface.co/microsoft/git-base-coco) or
[blip2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
2. Wait until MS builds ARM64EC into C++ ONNX (blocked by
microsoft/onnxruntime#15403)
3. Attempt to build our own fork of ONNX with ARM64EC
4. Build a separate ARM native installer of NVDA, offer as an
alternative to allow for ARM devices to do image descriptions with
numpy.
5. Release the feature on C++ without support for ARM devices.

All of these options require a significant amount of work.
As such, sadly this feature is not ready for a stable release.

Instead this code will be moved to a feature branch, until ONNX C++
matures such as fixing
microsoft/onnxruntime#15403.
Additionally, ONNX C++ runtimes are only available through the
experimental 2.0 version of the Windows App SDK, and requires you to
build your own headers from it.
I think this feature will be blocked until
microsoft/onnxruntime#15403 is implemented and
the 2.0 version of the Windows App SDK becomes stable.
Future re-implementations should also consider using higher quality,
more modern models.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alpha: Prompt to load AI Image captioner when the keystroke is pressed and the captioner is not loaded

4 participants