Support image descriptions using local AI model #18475

Merged
seanbudd merged 113 commits into nvaccess:try-64bit from tianzeshi-study:captionUseLocalModel on Sep 15, 2025

Conversation

@tianzeshi-study

tianzeshi-study commented Jul 15, 2025

Contributor

Link to issue number:

Resolves #16281

Summary of the issue:

NVDA currently lacks a built‑in, offline image captioning feature. Existing solutions require a reliable internet connection—raising privacy concerns, potential costs, and latency—and many NVDA users (especially in developing regions or on older hardware) have limited connectivity or constrained resources. There is no robust, integrated offline alternative.

Description of user facing changes:

  • Introduces device‑side image description directly within NVDA, requiring no cloud service.

  • Adds three global commands (with default shortcuts):

    • NVDA+Windows+,: Generate a caption for the current image under focus.
    • NVDA+Windows+Shift+,: Release the loaded model and free memory.
    • NVDA+Windows+Ctrl+,: Open the Model Manager GUI to download or manage models.
  • Extends NVDA’s settings panel to enable/disable offline captioning and configure model paths.

Description of developer facing changes:

  • New _localCaptioner module containing:

    • captioner.py: Core inference engine exposing generate_caption(image) for producing text descriptions.
    • panel.py: NVDA settings integration (lazy or on‑startup model loading, custom path).
    • modelDownloader.py: CLI tool to download ONNX models.
    • modelManager.py: GUI for selecting download paths and managing available models.
  • Uses the Hugging Face Xenova/vit-gpt2-image-captioning model in ONNX format (via onnxruntime) to balance accuracy, speed, and low resource usage.

  • Modular design allows for future extension to additional models or formats.

Description of development approach:

  • Modular integration: Keeps _localCaptioner self‑contained and compatible with NVDA’s plugin architecture.
  • Lightweight inference: Leverages ONNXRuntime for fast, local inference without heavy PyTorch or TensorFlow dependencies.
  • Lazy loading: Model is only loaded when first invoked (or at startup, if configured), minimizing initial memory footprint.
  • Dual interfaces: Provides both CLI scripts (captioner.py, modelDownloader.py) for quick tests and a GUI (modelManager.py) for end‑users.
  • Extensible architecture: Configuration files (e.g., config.json) conform to Hugging Face format for easy swapping of models.

Testing strategy:

  • Manual/CLI tests:

    • Activated a .venv and ran python captioner.py to verify caption generation.
    • Ran python modelDownloader.py and python modelManager.py to validate download and GUI workflows.
  • Shortcut verification: Ensured all three default keybindings trigger the expected actions.

  • Performance: Measured caption generation time of under 5 seconds on representative hardware.

  • Resource cleanup: Confirmed that releasing the model frees allocated memory.

  • Compatibility: Tested on Windows with limited CPU/RAM configurations to mimic low‑end hardware.
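
A timing check like the performance measurement above can be reproduced with a small wrapper. This is only a sketch: `timed_call` and `fake_caption` are hypothetical names, and the placeholder function does no real inference.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_caption(image_path: str) -> str:
    """Placeholder for the real captioning call (assumption, not the PR's API)."""
    return f"a placeholder caption for {image_path}"


BUDGET_SECONDS = 5.0  # the "under 5 seconds" figure from the testing notes

caption, seconds = timed_call(fake_caption, "example.png")
within_budget = seconds < BUDGET_SECONDS
```

Replacing `fake_caption` with the real caption function gives a per-image measurement that can be compared against the same budget.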

Known issues with pull request:

  • Captions are currently generated in English only; multi‑language support may be added later.
  • First‑time model download requires an active internet connection.
  • Because generating a caption requires the model to be downloaded first, unit tests and system tests may be difficult to set up.
  • In some complex UI contexts, NVDA may not correctly identify the target image for captioning.
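
Since captioning depends on a prior download, a test or the feature itself could first check that the expected files are present. The sketch below assumes a hypothetical file list; the actual file names used by the model are not stated in this PR description.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical file names, for illustration only.
REQUIRED_FILES = ("config.json", "encoder.onnx", "decoder.onnx")


def missing_model_files(model_dir: Path, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required file names that are not present in model_dir."""
    return [name for name in required if not (model_dir / name).is_file()]
```

An empty result means the model directory looks complete; a non-empty result could be used to skip a test or to prompt the user to download the model.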

Code Review Checklist:

  • Documentation:

    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:

    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:

    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add‑ons.

  • Security precautions taken.

@coderabbitai summary

Comment thread pyproject.toml
@hwf1324

hwf1324 commented Jul 15, 2025

Contributor

Welcome to NVDA!

We are glad to see your contribution to NVDA.

When I tried to build the launcher to compare the file size of the launcher bundled with ONNXRuntime to the previous launcher, I found that a portable version could be created from this launcher. However, I encountered an error when starting the created portable version. Below is the log.

INFO - __main__ (22:34:47.643) - MainThread (21824):
Starting NVDA version source-18475-b4a029d x86
INFO - core.main (22:34:47.730) - MainThread (21824):
Config dir: D:\NVDA\snapshot\pr\18475\userConfig
INFO - config.ConfigManager._loadConfig (22:34:47.733) - MainThread (21824):
Loading config: D:\NVDA\snapshot\pr\18475\userConfig\nvda.ini
INFO - core.main (22:34:48.223) - MainThread (21824):
Windows version: Windows 11 24H2 (10.0.26100.4652) workstation AMD64
INFO - core.main (22:34:48.223) - MainThread (21824):
Using Python version 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:00:00) [MSC v.1938 32 bit (Intel)]
INFO - core.main (22:34:48.223) - MainThread (21824):
Using comtypes version 1.4.6
INFO - core.main (22:34:48.225) - MainThread (21824):
Using configobj version 5.1.0 with validate version 1.0.1
ERROR - braille.getDisplayDrivers (22:34:48.286) - MainThread (21824):
Error while importing braille display driver alva
Traceback (most recent call last):
  File "braille.pyc", line 3869, in getDisplayDrivers
  File "braille.pyc", line 470, in _getDisplayDriver
  File "braille.pyc", line 464, in _getDisplayDriver
  File "importlib\__init__.pyc", line 126, in import_module
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "brailleDisplayDrivers\alva.pyc", line 16, in <module>
  File "globalCommands.pyc", line 75, in <module>
  File "_localCaptioner\__init__.pyc", line 34, in <module>
  File "_localCaptioner\captioner.pyc", line 22, in <module>
ModuleNotFoundError: No module named 'numpy'
ERROR - braille.getDisplayDrivers (22:34:48.306) - MainThread (21824):
Error while importing braille display driver eurobraille
Traceback (most recent call last):
  File "braille.pyc", line 3869, in getDisplayDrivers
  File "braille.pyc", line 470, in _getDisplayDriver
  File "braille.pyc", line 464, in _getDisplayDriver
  File "importlib\__init__.pyc", line 126, in import_module
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "brailleDisplayDrivers\eurobraille\__init__.pyc", line 9, in <module>
  File "brailleDisplayDrivers\eurobraille\driver.pyc", line 20, in <module>
  File "globalCommands.pyc", line 75, in <module>
  File "_localCaptioner\__init__.pyc", line 34, in <module>
  File "_localCaptioner\captioner.pyc", line 22, in <module>
ModuleNotFoundError: No module named 'numpy'
ERROR - braille.getDisplayDrivers (22:34:48.321) - MainThread (21824):
Error while importing braille display driver handyTech
Traceback (most recent call last):
  File "braille.pyc", line 3869, in getDisplayDrivers
  File "braille.pyc", line 470, in _getDisplayDriver
  File "braille.pyc", line 464, in _getDisplayDriver
  File "importlib\__init__.pyc", line 126, in import_module
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "brailleDisplayDrivers\handyTech.pyc", line 29, in <module>
  File "globalCommands.pyc", line 75, in <module>
  File "_localCaptioner\__init__.pyc", line 34, in <module>
  File "_localCaptioner\captioner.pyc", line 22, in <module>
ModuleNotFoundError: No module named 'numpy'
INFO - NVDAHelperLocal (22:34:48.348) - MainThread (21824):
Thread 21824, build\x86\localWin10\oneCoreSpeech.cpp, ocSpeech_initialize, 215:
ocSpeech_initialize

INFO - NVDAHelperLocal (22:34:48.348) - MainThread (21824):
Thread 21824, build\x86\localWin10\oneCoreSpeech.cpp, OcSpeechState::activate, 89:
Activating

INFO - NVDAHelperLocal (22:34:48.407) - MainThread (21824):
Thread 21824, build\x86\localWin10\oneCoreSpeech.cpp, preventEndUtteranceSilence_, 443:
AppendedSilence supported

INFO - synthDriverHandler.setSynth (22:34:48.416) - MainThread (21824):
Loaded synthDriver oneCore
WARNING - mathPres.initialize (22:34:48.421) - MainThread (21824):
MathPlayer 4 not available
INFO - core._setUpWxApp (22:34:48.421) - MainThread (21824):
Using wx version 4.2.2 msw (phoenix) wxWidgets 3.2.6 with six version 1.17.0
INFO - brailleInput.initialize (22:34:48.422) - MainThread (21824):
Braille input initialized
INFO - braille.initialize (22:34:48.422) - MainThread (21824):
Using liblouis version 3.34.0
INFO - braille.initialize (22:34:48.422) - MainThread (21824):
Using pySerial version 3.5
INFO - braille.BrailleHandler._setDisplay (22:34:48.425) - MainThread (21824):
Loaded braille display driver 'noBraille', current display has 0 cells.
INFO - core.main (22:34:48.649) - MainThread (21824):
Java Access Bridge support initialized
INFO - UIAHandler.UIAHandler.MTAThreadFunc (22:34:48.882) - UIAHandler.UIAHandler.MTAThread (30568):
UIAutomation: IUIAutomation6
CRITICAL - __main__ (22:34:49.105) - MainThread (21824):
core failure
Traceback (most recent call last):
  File "nvda.pyw", line 309, in <module>
  File "core.pyc", line 911, in main
  File "_localCaptioner\__init__.pyc", line 34, in <module>
  File "_localCaptioner\captioner.pyc", line 22, in <module>
ModuleNotFoundError: No module named 'numpy'
ERROR - keyboardHandler.internal_keyDownEvent (22:34:49.111) - winInputHook (31268):
internal_keyDownEvent
Traceback (most recent call last):
  File "keyboardHandler.pyc", line 276, in internal_keyDownEvent
  File "inputCore.pyc", line 529, in executeGesture
  File "baseObject.pyc", line 59, in __get__
  File "baseObject.pyc", line 167, in _getPropertyViaCache
  File "inputCore.pyc", line 182, in _get_script
  File "scriptHandler.pyc", line 112, in findScript
  File "scriptHandler.pyc", line 125, in _findScript
  File "scriptHandler.pyc", line 183, in _yieldObjectsForFindScript
  File "globalCommands.pyc", line 75, in <module>
  File "_localCaptioner\__init__.pyc", line 34, in <module>
  File "_localCaptioner\captioner.pyc", line 22, in <module>
ModuleNotFoundError: No module named 'numpy'
ERROR - keyboardHandler.internal_keyDownEvent (22:34:49.123) - winInputHook (31268):
internal_keyDownEvent
Traceback (most recent call last):
  File "keyboardHandler.pyc", line 276, in internal_keyDownEvent
  File "inputCore.pyc", line 529, in executeGesture
  File "baseObject.pyc", line 59, in __get__
  File "baseObject.pyc", line 167, in _getPropertyViaCache
  File "inputCore.pyc", line 182, in _get_script
  File "scriptHandler.pyc", line 112, in findScript
  File "scriptHandler.pyc", line 125, in _findScript
  File "scriptHandler.pyc", line 183, in _yieldObjectsForFindScript
  File "globalCommands.pyc", line 75, in <module>
  File "_localCaptioner\__init__.pyc", line 34, in <module>
  File "_localCaptioner\captioner.pyc", line 22, in <module>
ModuleNotFoundError: No module named 'numpy'

@AppVeyorBot

  • FAIL: Translation comments check. Translation comments missing or unexpectedly included. See build log for more information.
  • PASS: License check.
  • PASS: Unit tests.
  • FAIL: System tests (tags: installer NVDA). See test results for more information.
  • Build (for testing PR): https://ci.appveyor.com/api/buildjobs/ilrmf24qtlap4bcp/artifacts/output/nvda_snapshot_pr18475-37348,99389e13.exe
  • CI timing (mins):
    INIT 0.0,
    INSTALL_START 1.9,
    INSTALL_END 1.2,
    BUILD_START 0.0,
    BUILD_END 28.9,
    TESTSETUP_START 0.0,
    TESTSETUP_END 0.4,
    TEST_START 0.0,
    TEST_END 1.3,
    FINISH_END 0.1

See test results for failed build of commit 99389e1305

@tianzeshi-study

Contributor Author

It seems the portable version cannot find numpy as a Python dependency. That is a bit confusing, because onnxruntime depends on numpy and installs it automatically as a dependency.
Maybe the portable version behaves differently; I'll try to find out what the problem is. Maybe I should add numpy directly to pyproject.toml as a project dependency.
Thank you for your reply!

@hwf1324

hwf1324 commented Jul 15, 2025

Contributor

Maybe portable version has some difference, I'll try to find out what the problem is. Maybe I should directly add numpy to pyproject.toml as project dependency.

No, this is related to the py2exe script. NVDA excludes numpy in setup.py.

@tianzeshi-study

Contributor Author

Maybe portable version has some difference, I'll try to find out what the problem is. Maybe I should directly add numpy to pyproject.toml as project dependency.

No, this is related to the py2exe script. NVDA excludes numpy in setup.py.

Yes, you are right, I see it: # numpy is an optional dependency of comtypes but we don't require it.
Maybe I will remove this exclusion in the next commit.
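
A quick way to confirm which modules are missing from a given environment is to look them up without importing them. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and it is meant for top-level module names such as numpy.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately made up.
report = missing_modules(["json", "no_such_module_for_this_sketch"])
```

Running the same check inside the frozen portable build would show whether numpy was left out by the packaging step rather than by the dependency list.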

@seanbudd seanbudd changed the title generate image caption use local AI model, supports model downloadin… Support image descriptions using local AI model Jul 16, 2025
Comment thread source/_localCaptioner/panel.py Outdated
Comment thread source/_localCaptioner/panel.py Outdated
Comment thread source/_localCaptioner/panel.py Outdated
Comment thread source/_localCaptioner/modelDownloader.py Outdated
Comment thread source/_localCaptioner/modelDownloader.py Outdated
Comment thread source/_localCaptioner/captioner.py Outdated
Comment thread source/_localCaptioner/captioner.py Outdated
Comment thread source/_localCaptioner/modelManager.py Outdated
Comment thread source/_localCaptioner/modelManager.py Outdated
Comment thread source/_localCaptioner/modelDownloader.py Outdated
Comment thread source/_localCaptioner/modelManager.py Outdated
Comment thread source/_localCaptioner/panel.py Outdated
@cary-rowen

Contributor

NVDA+Windows+Shift+,: Release the loaded model and free memory.

I don't think requiring the user to release the model manually is a good experience. Could the model be recycled automatically in the background?

NVDA+Windows+Ctrl+,: Open the Model Manager GUI to download or manage models.

This is not a high-frequency operation, so please do not assign a gesture by default.

@tianzeshi-study

Contributor Author

Need to release manually by the user I don't think this is a good experience. Can this be an automatic recycling design in the background?

This is because keeping the model in memory avoids reloading it for every recognition, which shortens each request. Releasing the model manually is just a way to balance memory footprint against recognition speed. The alternative would be to maintain an internal timer that automatically releases the model after a period of inactivity, but there is no way to know when the user will next request a recognition.
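
For comparison, the timer-based alternative could look roughly like the sketch below. It is not part of this PR: `IdleReleasingResource` is a hypothetical name, the 60-second threshold is arbitrary, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class IdleReleasingResource(Generic[T]):
    """Loads a resource on demand and releases it once it has been idle long enough."""

    def __init__(
        self,
        loader: Callable[[], T],
        idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._resource: Optional[T] = None
        self._last_used = 0.0

    def get(self) -> T:
        # Load if needed and record the time of this use.
        if self._resource is None:
            self._resource = self._loader()
        self._last_used = self._clock()
        return self._resource

    def release_if_idle(self) -> bool:
        """Release the resource if unused for idle_seconds; return True if released."""
        if self._resource is None:
            return False
        if self._clock() - self._last_used >= self._idle_seconds:
            self._resource = None
            return True
        return False


# Demonstration with a fake clock that is advanced by hand.
now = [0.0]
resource = IdleReleasingResource(lambda: object(), idle_seconds=60.0, clock=lambda: now[0])
resource.get()
now[0] = 30.0
released_early = resource.release_if_idle()
now[0] = 90.0
released_late = resource.release_if_idle()
```

Something would still have to call `release_if_idle()` periodically, and the trade-off described above remains: a short threshold frees memory sooner but makes the next recognition pay the load cost again.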

This is not a high-frequency operation, please do not assign gestures by default.

Yes, you are right. It seems better to add a button to the settings panel that opens the Model Manager, instead of a keyboard shortcut.

tianzeshi-study and others added 8 commits July 16, 2025 13:48
coding Standards

Co-authored-by: Sean Budd <seanbudd123@gmail.com>
coding Standards

Co-authored-by: Sean Budd <seanbudd123@gmail.com>
No need to test on command line

Co-authored-by: Sean Budd <seanbudd123@gmail.com>
gettext formatting

Co-authored-by: Sean Budd <seanbudd123@gmail.com>
No need to test on command line

Co-authored-by: Sean Budd <seanbudd123@gmail.com>
@AppVeyorBot

  • FAIL: Translation comments check. Translation comments missing or unexpectedly included. See build log for more information.
  • PASS: License check.
  • PASS: Unit tests.
  • FAIL: System tests (tags: installer NVDA). See test results for more information.
  • Build (for testing PR): https://ci.appveyor.com/api/buildjobs/wh2a3cg5j26v7l67/artifacts/output/nvda_snapshot_pr18475-37378,94f89120.exe
  • CI timing (mins):
    INIT 0.0,
    INSTALL_START 1.5,
    INSTALL_END 1.0,
    BUILD_START 0.0,
    BUILD_END 21.7,
    TESTSETUP_START 0.0,
    TESTSETUP_END 0.4,
    TEST_START 0.0,
    TEST_END 1.0,
    FINISH_END 0.1

See test results for failed build of commit 94f89120cc

@AppVeyorBot

  • FAIL: Translation comments check. Translation comments missing or unexpectedly included. See build log for more information.
  • PASS: License check.
  • PASS: Unit tests.
  • FAIL: System tests (tags: installer NVDA). See test results for more information.
  • Build (for testing PR): https://ci.appveyor.com/api/buildjobs/y0ot19yyxbeoail6/artifacts/output/nvda_snapshot_pr18475-37387,0200bcbd.exe
  • CI timing (mins):
    INIT 0.0,
    INSTALL_START 2.0,
    INSTALL_END 1.0,
    BUILD_START 0.0,
    BUILD_END 25.8,
    TESTSETUP_START 0.0,
    TESTSETUP_END 0.4,
    TEST_START 0.0,
    TEST_END 1.3,
    FINISH_END 0.1

See test results for failed build of commit 0200bcbd56

@seanbudd seanbudd marked this pull request as draft September 12, 2025 05:21
@tianzeshi-study tianzeshi-study marked this pull request as ready for review September 12, 2025 08:29

@seanbudd seanbudd left a comment

Member

Thanks and congrats!

Comment thread pyproject.toml Outdated
@seanbudd seanbudd changed the base branch from master to try-64bit September 15, 2025 00:57
@seanbudd

Member

Re-opening to trigger a build against the 64bit only branch

@seanbudd seanbudd closed this Sep 15, 2025
@seanbudd seanbudd reopened this Sep 15, 2025
Comment thread tests/system/robot/automatedImageDescriptions.py Outdated
Comment thread tests/system/robot/automatedImageDescriptions.py Outdated
Comment thread tests/system/robot/automatedImageDescriptions.py Outdated

@Qchristensen Qchristensen left a comment

Member

Looks good and will be of great interest to many users

Comment thread user_docs/en/changes.md Outdated
@seanbudd seanbudd merged commit e1cef07 into nvaccess:try-64bit Sep 15, 2025
23 checks passed
@github-actions github-actions Bot added this to the 2026.1 milestone Sep 15, 2025
@seanbudd seanbudd mentioned this pull request Oct 10, 2025
15 tasks
seanbudd added a commit that referenced this pull request Jan 9, 2026
SaschaCowley pushed a commit that referenced this pull request Jan 11, 2026
Reverts:
- #18475
- #19036
- #19024
- #19055
- #19057
- #19178
- #19243
- #19327
- Partial revert: #19342

### Issues fixed
Fixes #19298 

### Issues reopened
Reopens #16281

### Reason for revert / Can this PR be reimplemented? If so, what is
required for the next attempt

The current implementation of AI image descriptions yields low quality
captions from a 3-year-old model (see #19298).
The current implementation also requires using numpy, which hogs RAM,
slows initialization, and increases the weight of the installer.
An attempt was made to convert this to C++ using WinML and Windows ONNX
runtimes as per #18662.
This would have removed numpy, and improved flexibility for using
different models in the future.
Unfortunately, this was not found to be feasible, as ONNX C++ fails to
work via 64bit emulation on ARM
(microsoft/onnxruntime#15403).

This means we have the following options for image descriptions:

1. Continue to use the python onnxruntime, and accept the RAM and
storage hits. Instead, improve the quality of the captioner with better
models such as
[git-base-coco](https://huggingface.co/microsoft/git-base-coco) or
[blip2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
2. Wait until MS builds ARM64EC into C++ ONNX (blocked by
microsoft/onnxruntime#15403)
3. Attempt to build our own fork of ONNX with ARM64EC
4. Build a separate ARM native installer of NVDA, offer as an
alternative to allow for ARM devices to do image descriptions with
numpy.
5. Release the feature on C++ without support for ARM devices.

All of these options require a significant amount of work.
As such, sadly this feature is not ready for a stable release.

Instead this code will be moved to a feature branch, until ONNX C++
matures such as fixing
microsoft/onnxruntime#15403.
Additionally, ONNX C++ runtimes are only available through the
experimental 2.0 version of the Windows App SDK, and require you to
build your own headers from it.
I think this feature will be blocked until
microsoft/onnxruntime#15403 is implemented and
the 2.0 version of the Windows App SDK becomes stable.
Future re-implementations should also consider using higher quality,
more modern models.
Labels

conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review.

Development

Successfully merging this pull request may close these issues.

OFFER: SMART IMAGE RECOGNITION

10 participants