Support image descriptions using local AI model #18475
… and triggering via keyboard shortcut
Welcome to NVDA! We are glad to see your contribution. When I built the launcher to compare the size of the launcher bundled with ONNXRuntime against the previous launcher, I found that a portable version could be created from it. However, I encountered an error when starting that portable version. Below is the log.
See test results for failed build of commit 99389e1305
It seems that the portable version cannot find numpy as a Python dependency. That's a bit confusing, because onnxruntime depends on numpy and should install it automatically as a dependency.
No, this is related to the py2exe script. NVDA excludes numpy in setup.py.
Yes, you are right, I see it: `# numpy is an optional dependency of comtypes but we don't require it.`
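The failure mode discussed above (a frozen build that lacks a module the feature needs) can be detected before the feature is enabled. The following is only a minimal sketch and not code from this PR: `missing_modules` is a hypothetical helper name, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that the import system cannot locate.

    Hypothetical helper: a build that excludes a dependency (for example
    via the packaging script) would report it here instead of failing
    later with an ImportError.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The names checked here are the dependencies mentioned in the thread.
    absent = missing_modules(["numpy", "onnxruntime"])
    if absent:
        print("Image captioning unavailable, missing:", ", ".join(absent))
    else:
        print("All captioning dependencies were found.")
```

A check like this would let the feature be disabled with a clear message in builds where numpy is excluded, rather than raising at start-up.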
I don't think requiring the user to release the model manually is a good experience. Could it be released automatically in the background instead?
This is not a high-frequency operation, so please do not assign gestures by default.
Keeping the model in memory reduces the time each recognition takes, because the model does not have to be loaded every time. Manually releasing the model is just a way to balance memory footprint against recognition speed; otherwise, an internal timer would be needed to release the model automatically after a period of time. However, it is not known when the user will next request a recognition.
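The timer-based alternative mentioned in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `IdleReleasingModel`, `loader`, and the 300-second default are all hypothetical, and only the standard library's `threading.Timer` is used.

```python
import threading


class IdleReleasingModel:
    """Keep a loaded model in memory and drop it after a period of inactivity."""

    def __init__(self, loader, idle_seconds=300.0):
        self._loader = loader              # hypothetical callable returning a loaded model
        self._idle_seconds = idle_seconds  # assumed default; tune for real workloads
        self._model = None
        self._timer = None
        self._generation = 0               # identifies the most recently started timer
        self._lock = threading.Lock()

    def get(self):
        """Return the model, loading it if needed, and restart the idle countdown."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._restart_timer()
            return self._model

    def release(self):
        """Drop the model immediately (the manual path described in the thread)."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            self._generation += 1          # invalidate any timer that is already firing
            self._model = None

    def is_loaded(self):
        with self._lock:
            return self._model is not None

    def _restart_timer(self):
        # Caller must hold the lock.
        if self._timer is not None:
            self._timer.cancel()
        self._generation += 1
        timer = threading.Timer(self._idle_seconds, self._on_idle, args=(self._generation,))
        timer.daemon = True                # do not keep the process alive for the timer
        timer.start()
        self._timer = timer

    def _on_idle(self, generation):
        with self._lock:
            # Ignore stale timers that fired after a newer get() or release().
            if generation == self._generation:
                self._model = None
                self._timer = None
```

The trade-off raised in the comment remains: any fixed idle period is a guess about when the user will next request a description, so a manual release (or a configurable timeout) may still be wanted.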
Yes, you are right. It seems a button to open the Model Manager should be added to the settings panel instead of a keyboard shortcut.
coding Standards Co-authored-by: Sean Budd <seanbudd123@gmail.com>
coding Standards Co-authored-by: Sean Budd <seanbudd123@gmail.com>
No need to test on command line Co-authored-by: Sean Budd <seanbudd123@gmail.com>
gettext formatting Co-authored-by: Sean Budd <seanbudd123@gmail.com>
No need to test on command line Co-authored-by: Sean Budd <seanbudd123@gmail.com>
See test results for failed build of commit 94f89120cc
…nd downloader; replace optional type to follow coding standards; improve some comment
See test results for failed build of commit 0200bcbd56
Re-opening to trigger a build against the 64bit only branch
Qchristensen
left a comment
Looks good and will be of great interest to many users
Reverts:
- #18475
- #19036
- #19024
- #19055
- #19057
- #19178
- #19243
- #19327
- Partial revert: #19342

### Issues fixed
Fixes #19298

### Issues reopened
Reopens #16281

### Reason for revert / Can this PR be reimplemented? If so, what is required for the next attempt
The current implementation of AI image descriptions yields low quality captions from a 3 year old model (see #19298).
The current implementation also requires using numpy, which hogs RAM, slows initialization, and increases the weight of the installer.

An attempt was made to convert this to C++ using WinML and Windows ONNX runtimes as per #18662.
This would have removed numpy, and improved flexibility for using different models in the future.
Unfortunately, this was not found to be feasible, as ONNX C++ fails to work via 64bit emulation on ARM (microsoft/onnxruntime#15403).

This means we have the following options for image descriptions:
1. Continue to use the python onnxruntime, and accept the RAM and storage hits. Instead, improve the quality of the captioner with better models such as [git-base-coco](https://huggingface.co/microsoft/git-base-coco) or [blip2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
2. Wait until MS builds ARM64EC into C++ ONNX (blocked by microsoft/onnxruntime#15403)
3. Attempt to build our own fork of ONNX with ARM64EC
4. Build a separate ARM native installer of NVDA, offer as an alternative to allow for ARM devices to do image descriptions with numpy.
5. Release the feature on C++ without support for ARM devices.

All of these options require a significant amount of work.
As such, sadly this feature is not ready for a stable release.
Instead this code will be moved to a feature branch, until ONNX C++ matures such as fixing microsoft/onnxruntime#15403.
Additionally, ONNX C++ runtimes are only available through the experimental 2.0 version of the Windows App SDK, and requires you to build your own headers from it.
I think this feature will be blocked until microsoft/onnxruntime#15403 is implemented and the 2.0 version of the Windows App SDK becomes stable. Future re-implementations should also consider using higher quality, more modern models.
Link to issue number:
Resolves #16281
Summary of the issue:
NVDA currently lacks a built‑in, offline image captioning feature. Existing solutions require a reliable internet connection—raising privacy concerns, potential costs, and latency—and many NVDA users (especially in developing regions or on older hardware) have limited connectivity or constrained resources. There is no robust, integrated offline alternative.
Description of user facing changes:
Introduces device‑side image description directly within NVDA, requiring no cloud service.
Adds three global commands (with default shortcuts):
Extends NVDA’s settings panel to enable/disable offline captioning and configure model paths.
Description of developer facing changes:
New `_localCaptioner` module containing:
- `captioner.py`: Core inference engine exposing `generate_caption(image)` for producing text descriptions.
- `panel.py`: NVDA settings integration (lazy or on‑startup model loading, custom path).
- `modelDownloader.py`: CLI tool to download ONNX models.
- `modelManager.py`: GUI for selecting download paths and managing available models.

Uses the Hugging Face `Xenova/vit-gpt2-image-captioning` model in ONNX format (via `onnxruntime`) to balance accuracy, speed, and low resource usage.

Modular design allows for future extension to additional models or formats.
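For readers unfamiliar with how an encoder-decoder captioner turns an image into text, a caption is typically generated one token at a time. The sketch below shows plain greedy decoding with the model calls injected as functions. It is an assumption-laden illustration, not the code in `captioner.py`: `greedy_decode`, `encode`, `decode_step`, and the token-ID parameters are hypothetical names, and the PR does not state which decoding strategy it uses.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    encode: Callable[[object], object],
    decode_step: Callable[[object, Sequence[int]], Sequence[float]],
    image: object,
    start_token: int,
    end_token: int,
    max_length: int = 16,
) -> List[int]:
    """Generate caption token IDs by always picking the highest-scoring next token.

    encode: hypothetical wrapper around the vision encoder (image -> features).
    decode_step: hypothetical wrapper around the text decoder; given the image
        features and the tokens so far, it returns one score per vocabulary entry.
    """
    features = encode(image)          # the encoder runs once per image
    tokens = [start_token]
    for _ in range(max_length):
        scores = decode_step(features, tokens)
        # Index of the largest score is the chosen token ID.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]                 # drop the start token


if __name__ == "__main__":
    # Toy stand-ins for real model calls: a 4-entry vocabulary where the
    # decoder always prefers the token whose ID equals the current length.
    def toy_step(features, tokens):
        return [1.0 if i == len(tokens) else 0.0 for i in range(4)]

    print(greedy_decode(lambda img: img, toy_step, "image", start_token=0, end_token=3))
```

In a real pipeline the two callables would wrap the ONNX encoder and decoder sessions, and the returned IDs would be mapped back to text by the model's tokenizer.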
Description of development approach:
- … `_localCaptioner` self‑contained and compatible with NVDA’s plugin architecture.
- … (`captioner.py`, `modelDownloader.py`) for quick tests and a GUI (`modelManager.py`) for end‑users.
- … (`config.json`) conform to Hugging Face format for easy swapping of models.

Testing strategy:
- **Manual/CLI tests**:
  - … `.venv`, running `python captioner.py` to verify caption generation.
  - … `python modelDownloader.py` and `python modelManager.py` to validate download and GUI workflows.
- **Shortcut verification**: Ensured all three default keybindings trigger the expected actions.
- **Performance**: Measured caption generation time < 5 seconds on representative hardware.
- **Resource cleanup**: Confirmed that releasing the model frees allocated memory.
- **Compatibility**: Tested on Windows with limited CPU/RAM configurations to mimic low‑end hardware.
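The performance figure above could be reproduced with a small timing harness along these lines. This is a sketch only: `time_calls` is a hypothetical helper, the warm-up and repeat counts are assumptions, and the PR does not describe how its timings were actually taken.

```python
import statistics
import time


def time_calls(func, *args, repeats=5, warmup=1):
    """Time repeated calls to func and return the median and maximum in seconds.

    Warm-up calls are excluded so one-off costs (for example the first
    model load) do not distort the steady-state figure.
    """
    for _ in range(warmup):
        func(*args)
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return {"median": statistics.median(durations), "max": max(durations)}


if __name__ == "__main__":
    # Stand-in workload; a real check would pass the captioning function
    # and a test image instead.
    result = time_calls(sum, range(1_000_000))
    print(f"median {result['median']:.4f}s, max {result['max']:.4f}s")
    print("within 5 s budget:", result["max"] < 5.0)
```

Reporting the maximum alongside the median makes it easier to see whether the 5-second budget holds for the slowest run and not only the typical one.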
Known issues with pull request:
Code Review Checklist:
Documentation:
Testing:
UX of all users considered:
API is compatible with existing add‑ons.
Security precautions taken.