Improve unicode normalization by LeonarddeR · Pull Request #18803 · nvaccess/nvda

LeonarddeR · 2025-08-25T19:44:17Z

Link to issue number:

Fixes #18722, for real this time.

Summary of the issue:

#18722 wasn't yet fixed, i.e. combining diacritics were still not pronounced correctly when Unicode normalization is on.

Description of user facing changes:

Combining diacritics will no be correctly announced when bound to a character that doesn't result in normalization, such as space.

Description of developer facing changes:

None

Description of development approach:

For normalization per character, we use textUtils.uniscribe.splitAtCharacterBoundaries to split per character and then perform per char normalization. However, combinations like \u0301 (space + acute) are treated as one character, i.e. the acute binds to the preceding space. Normalization doesn't happen in this case, but yet the normalization logic changes the announcement behavior.

This fixes two issues:

An issue in the uniscribe c++ code. We need the start offset, an offset at the end of every character and the end offset of a string to calculate boundaries correctly. If we don't, offsets aren't calculated correctly for the space with acute case.
The issue were pronunciation of characters changes even though no normalization doesn't occur.

These cases are backed with unit tests to ensure the behavior is what we expect.

Testing strategy:

Unit tests

Known issues with pull request:

None

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull Request Overview

This PR fixes Unicode normalization issues where combining diacritics were not pronounced correctly when bound to characters that don't result in normalization (like space + acute accent). The fix addresses both C++ offset calculation issues and Python logic for handling non-normalized character combinations.

Key changes:

Fixed C++ character boundary calculation to include start offset (position 0)
Updated Python logic to handle cases where normalization doesn't occur but character splitting is needed
Added comprehensive unit tests for both the uniscribe splitting function and speech output

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
nvdaHelper/local/textUtils.cpp	Fixed character boundary calculation to include start position
source/textUtils/uniscribe.py	Corrected offset slicing to match C++ changes
source/speech/speech.py	Enhanced normalization logic to handle non-normalized character combinations
tests/unit/test_textUtils.py	Added comprehensive tests for character boundary splitting
tests/unit/test_speech.py	Added test for space+acute accent pronunciation behavior

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Improve unicode normalization

dd03cbf

Copilot AI review requested due to automatic review settings August 25, 2025 19:44

LeonarddeR requested a review from a team as a code owner August 25, 2025 19:44

LeonarddeR requested a review from SaschaCowley August 25, 2025 19:44

This comment was marked as outdated.

Sign in to view

seanbudd reviewed Aug 26, 2025

View reviewed changes

Comment thread tests/unit/test_textUtils.py Outdated

Comment thread tests/unit/test_textUtils.py Outdated

LeonarddeR and others added 3 commits August 26, 2025 19:51

Update source/speech/speech.py

3c03d4e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Improve test syntax

e8c1bd9

Merge remote-tracking branch 'origin/master' into moreSplitFixes

068aae5

LeonarddeR requested review from Copilot and seanbudd August 26, 2025 17:53

Copilot AI reviewed Aug 26, 2025

View reviewed changes

Comment thread tests/unit/test_speech.py

Comment thread tests/unit/test_textUtils.py Outdated

accute > acute

e49b876

seanbudd reviewed Aug 27, 2025

View reviewed changes

Comment thread tests/unit/test_textUtils.py

seanbudd approved these changes Aug 27, 2025

View reviewed changes

seanbudd merged commit d80d349 into nvaccess:master Aug 28, 2025
40 checks passed

github-actions Bot added this to the 2026.1 milestone Aug 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve unicode normalization#18803

Improve unicode normalization#18803
seanbudd merged 5 commits into
nvaccess:masterfrom
LeonarddeR:moreSplitFixes

LeonarddeR commented Aug 25, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

LeonarddeR commented Aug 25, 2025

Link to issue number:

Summary of the issue:

Description of user facing changes:

Description of developer facing changes:

Description of development approach:

Testing strategy:

Known issues with pull request:

Code Review Checklist:

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants