Skip to content

Improve unicode normalization#18803

Merged
seanbudd merged 5 commits into
nvaccess:masterfrom
LeonarddeR:moreSplitFixes
Aug 28, 2025
Merged

Improve unicode normalization#18803
seanbudd merged 5 commits into
nvaccess:masterfrom
LeonarddeR:moreSplitFixes

Conversation

@LeonarddeR

Copy link
Copy Markdown
Collaborator

Link to issue number:

Fixes #18722, for real this time.

Summary of the issue:

#18722 wasn't yet fixed, i.e. combining diacritics were still not pronounced correctly when Unicode normalization is on.

Description of user facing changes:

Combining diacritics will no be correctly announced when bound to a character that doesn't result in normalization, such as space.

Description of developer facing changes:

None

Description of development approach:

For normalization per character, we use textUtils.uniscribe.splitAtCharacterBoundaries to split per character and then perform per char normalization. However, combinations like \u0301 (space + acute) are treated as one character, i.e. the acute binds to the preceding space. Normalization doesn't happen in this case, but yet the normalization logic changes the announcement behavior.

This fixes two issues:

  1. An issue in the uniscribe c++ code. We need the start offset, an offset at the end of every character and the end offset of a string to calculate boundaries correctly. If we don't, offsets aren't calculated correctly for the space with acute case.
  2. The issue were pronunciation of characters changes even though no normalization doesn't occur.

These cases are backed with unit tests to ensure the behavior is what we expect.

Testing strategy:

Unit tests

Known issues with pull request:

None

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

Copilot AI review requested due to automatic review settings August 25, 2025 19:44
@LeonarddeR LeonarddeR requested a review from a team as a code owner August 25, 2025 19:44

This comment was marked as outdated.

Comment thread tests/unit/test_textUtils.py Outdated
Comment thread tests/unit/test_textUtils.py Outdated
@LeonarddeR LeonarddeR requested review from Copilot and seanbudd August 26, 2025 17:53

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes Unicode normalization issues where combining diacritics were not pronounced correctly when bound to characters that don't result in normalization (like space + acute accent). The fix addresses both C++ offset calculation issues and Python logic for handling non-normalized character combinations.

Key changes:

  • Fixed C++ character boundary calculation to include start offset (position 0)
  • Updated Python logic to handle cases where normalization doesn't occur but character splitting is needed
  • Added comprehensive unit tests for both the uniscribe splitting function and speech output

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
nvdaHelper/local/textUtils.cpp Fixed character boundary calculation to include start position
source/textUtils/uniscribe.py Corrected offset slicing to match C++ changes
source/speech/speech.py Enhanced normalization logic to handle non-normalized character combinations
tests/unit/test_textUtils.py Added comprehensive tests for character boundary splitting
tests/unit/test_speech.py Added test for space+acute accent pronunciation behavior

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread tests/unit/test_speech.py
Comment thread tests/unit/test_textUtils.py Outdated
Comment thread tests/unit/test_textUtils.py
@seanbudd seanbudd merged commit d80d349 into nvaccess:master Aug 28, 2025
40 checks passed
@github-actions github-actions Bot added this to the 2026.1 milestone Aug 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Combining diacritics are not read when typing/navigating if Unicode normalization is enabled

3 participants