Unicode normalization follow up, adding character navigation and several fixes by LeonarddeR · Pull Request #16622 · nvaccess/nvda

LeonarddeR · 2024-05-28T16:26:04Z

Link to issue number:

Summary of the issue:

It has been discussed that normalization would also be helpful for character navigation. There's also an issue where character descriptions and symbol pronunciation didn't work correctly because normalization took place before symbol processing. Furthermore, for the UnicodeNormalizationOffsetConverter used for braille, it was discovered that diffing didn't turn out to be accurate enough.

Description of user facing changes

When normalization is enabled, the speech settings offer an extra option, "Report normalized when navigating by character".
Added global commands for speech and braille normalization (without assigned gestures).
When normalization is enabled, characters will now always be normalized as well.
Braille Unicode normalization is more reliable now.

Description of development approach

Normalization is now always applied to speech, rather than only for object and text info speech. I also changed some helper functions to be able to report normalized when navigating by character.
Added a SuppressUnicodeNormalizationCommand that allows you to suppress global normalization within a speech sequence. This command is used when creating a spelling sequence, because spelling has its own normalization logic now. It also ensures that when spelling a character or providing a character description (i.e. NVDA+. double press), normalization does not occur. It can even be used to disable character normalization altogether if necessary.
Rewrote UnicodeNormalizationOffsetConverter to use a new function in NVDAHelper local that uses Uniscribe to return a list of offsets for character boundaries. This allows us to split a string into glyphs and then apply normalization to each glyph separately, which results in more reliable offset calculation and less complex code.
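The per-glyph approach described above can be sketched in plain Python. This is only an illustration, not the code from this PR: `split_clusters` and `normalize_with_offsets` are hypothetical names, the cluster splitting is a crude stand-in for Uniscribe (a base character plus any combining marks that follow it), NFKC is assumed as the normalization form, and offsets are counted in Python code points rather than UTF-16 units.

```python
import unicodedata


def split_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified stand-in for Uniscribe character boundaries (assumption):
    real segmentation handles many more cases.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Combining mark: attach it to the preceding cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def normalize_with_offsets(text: str, form: str = "NFKC") -> tuple[str, list[int]]:
    """Normalize cluster by cluster and record an offset map.

    Returns the normalized string and, for every offset in the original
    string, the offset at which its cluster starts in the normalized string.
    """
    parts: list[str] = []
    offsets: list[int] = []
    out_pos = 0
    for cluster in split_clusters(text):
        normalized = unicodedata.normalize(form, cluster)
        # Every original offset inside this cluster maps to the cluster start.
        offsets.extend([out_pos] * len(cluster))
        parts.append(normalized)
        out_pos += len(normalized)
    return "".join(parts), offsets
```

For the input `"a\u0301b\u0132"` (decomposed á, b, Ĳ), this returns the string `"\u00e1bIJ"` and the offset list `[0, 0, 1, 2]`. Because each cluster is normalized on its own, the mapping never has to be inferred by diffing the two strings.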

Testing strategy:

Test that á and Ĳ are announced in their normalized form when Unicode normalization is on, and that the word "normalized" is added to the character announcement when that option is enabled. Note that for á, this doesn't always work, as several textInfo implementations (such as Mozilla) are slightly broken when navigating across compositions. I should probably report that.
Test that, even when Unicode normalization is off, á is now announced as "a acute" when navigating by character, thereby improving that behavior as well.
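For readers who want to see what the two test characters look like before and after normalization, here is a small standalone sketch using Python's standard `unicodedata` module. It assumes NFKC as the normalization form and a decomposed á (a followed by a combining acute accent); `describe` is a hypothetical helper, not an NVDA function.

```python
import unicodedata

decomposed_a = "a\u0301"  # "a" followed by COMBINING ACUTE ACCENT
ligature_ij = "\u0132"  # LATIN CAPITAL LIGATURE IJ


def describe(text: str) -> tuple[str, bool]:
    """Return the NFKC form of text and whether normalization changed it."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized, normalized != text


# The decomposed sequence composes into the single code point U+00E1.
print(describe(decomposed_a))
# The ligature expands into the two letters "I" and "J".
print(describe(ligature_ij))
# A plain ASCII letter is left unchanged.
print(describe("a"))
```

The boolean in the result corresponds to the cases where a screen reader could append a "normalized" indication to the character announcement.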

Known issues with pull request:

None known

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

Summary by CodeRabbit

New Features
- Added an option to report normalized characters during character navigation.
- Introduced commands to cycle through Unicode normalization states for braille and speech.
- Added a checkbox in the speech settings panel for reporting normalized characters.
Enhancements
- Improved speech synthesis with a new command to suppress Unicode normalization.
- Enhanced text processing functions to support Unicode normalization.
Tests
- Added unit tests for notifications and speech normalization, including tests for ligatures and decomposed characters.
Documentation
- Updated user guide with information on toggling Unicode normalization and reporting normalized characters.

seanbudd · 2024-05-29T00:07:25Z

Hi @LeonarddeR - just noting that #16616 hasn't been triaged yet.
We're waiting on further discussion and testing in 2024.3 before considering changing the default for this in 2024.4

LeonarddeR · 2024-05-29T05:54:49Z

Thanks for pointing that out. Note that I was aware of the standpoint to delay this PR to 2024.4, but others might not have been.

Adriani90 · 2024-05-30T07:45:35Z

@seanbudd this seems quite stable. Which points do you think still need clarification? If you need more community feedback, this needs to be merged into alpha. I don't think there is anything open regarding speech in #16616. You can also keep this behavior in alpha until you think it is stable enough to bring it into beta. At least this is how we dealt with new features in the past as well; see e.g. cancellable speech, which was enabled only in alpha for a longer period of time until it finally reached the stable version.

CyrilleB79 · 2024-05-30T13:57:27Z

@seanbudd would you accept milestone 2024.3 for this PR if we change the default value to disabled (see #16624 (comment))? This would at least avoid shipping 2024.3 with a buggy feature, closing #16624.

A subsequent PR for 2024.4 could then switch the default value to enabled.

seanbudd · 2024-05-31T00:05:25Z

@CyrilleB79 - yes as requested in #16624 (comment), however @LeonarddeR seems to suggest in #16624 (comment) that this is not possible for whatever reason

LeonarddeR · 2024-05-31T05:05:53Z

@seanbudd I must have misunderstood you then. I thought you were asking in #16624 (comment) whether it would be possible to fix #16624 without changing character navigation behavior. That would be very difficult. If you agree to changing this PR so that normalization is disabled by default again, while character navigation is still changed to normalize as well, that is certainly possible and I can change the PR as requested by @CyrilleB79.

seanbudd · 2024-05-31T05:48:56Z

That would be great yes. The normalization fix can go into 2024.3, but we don't want to turn it on by default until it has had some wider testing in a release

AppVeyorBot · 2024-05-31T09:03:28Z

FAIL: Lint check. See test results for more information.
PASS: Translation comments check.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/xpx9om5592xwryp7/artifacts/output/nvda_snapshot_pr16622-32209,8e0d3c7f.exe
PASS: System tests (tags: installer NVDA).
PASS: Unit tests.
CI timing (mins):
INIT 0.0,
INSTALL_START 1.1,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 12.0,
TESTSETUP_START 0.0,
TESTSETUP_END 0.4,
TEST_START 0.0,
TEST_END 2.4,
FINISH_END 0.2

See test results for failed build of commit 8e0d3c7fdb

AppVeyorBot · 2024-05-31T09:22:40Z

PASS: Translation comments check.
PASS: Unit tests.
PASS: Lint check.
FAIL: System tests (tags: installer NVDA). See test results for more information.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/b78s5cpjpirs2eyj/artifacts/output/nvda_snapshot_pr16622-32210,de6705d0.exe
CI timing (mins):
INIT 0.0,
INSTALL_START 1.1,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 11.4,
TESTSETUP_START 0.0,
TESTSETUP_END 0.4,
TEST_START 0.0,
TEST_END 2.3,
FINISH_END 0.2

See test results for failed build of commit de6705d08d

coderabbitai · 2024-06-06T20:28:41Z

Tip

For best results, initiate chat on the files or code changes.

@LeonarddeR, after reviewing the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets in textUtils.cpp, here are some observations and a proposed refactoring strategy to reduce code duplication:

Observations:

Common Functionality: Both functions utilize the ScriptItemize and ScriptBreak functions from the Uniscribe API to analyze text. The main difference lies in how they handle the results from these functions.
Error Handling and Resource Management: Both functions handle errors similarly and have similar resource management patterns, especially with dynamic memory allocation for SCRIPT_ITEM and SCRIPT_LOGATTR.

Proposed Refactoring:

To reduce duplication, you can abstract the common parts of these functions into helper functions. Here’s a proposed refactoring:

Common Script Itemization:
Create a function to handle the common itemization logic.

bool itemizeScript(const wchar_t* text, int textLength, SCRIPT_ITEM** items, int* numItems) {
    *items = new SCRIPT_ITEM[textLength + 1];
    return ScriptItemize(text, textLength, textLength, nullptr, nullptr, *items, numItems) == S_OK && *numItems > 0;
}

Common Script Breaking:
Create a function to handle the common breaking logic.

bool breakScript(const wchar_t* text, int textLength, SCRIPT_ITEM* items, int numItems, SCRIPT_LOGATTR** logAttrArray) {
    *logAttrArray = new SCRIPT_LOGATTR[textLength];
    // Walk the items back to front: each item ends where the following one starts.
    int nextICharPos = textLength;
    for (int itemIndex = numItems - 1; itemIndex >= 0; --itemIndex) {
        int iCharPos = items[itemIndex].iCharPos;
        int iCharLength = nextICharPos - iCharPos;
        if (ScriptBreak(text + iCharPos, iCharLength, &(items[itemIndex].a), *logAttrArray + iCharPos) != S_OK) {
            return false;
        }
        // The start of this item is the end of the preceding item.
        nextICharPos = iCharPos;
    }
    return true;
}

Refactor Existing Functions:
Utilize these helper functions in calculateCharacterBoundaries and _calculateUniscribeOffsets to reduce duplication and centralize the common logic.

This approach not only reduces duplication but also enhances maintainability by centralizing the error handling and memory management, making the code cleaner and easier to manage.

Let me know if this refactoring aligns with your expectations or if you need further modifications!

Analysis chain

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Extract the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets from textUtils.cpp
ast-grep --lang cpp --pattern $'$_ $_(calculateCharacterBoundaries, _calculateUniscribeOffsets)($$$) {
  $$$
}' --file textUtils.cpp

Length of output: 327

Script:

#!/bin/bash
# Correctly extract the implementations of calculateCharacterBoundaries and _calculateUniscribeOffsets from textUtils.cpp
ast-grep --lang cpp --pattern $'$_ $_(calculateCharacterBoundaries, _calculateUniscribeOffsets)($$$) {
  $$$
}'

Length of output: 112

Script:

#!/bin/bash
# Fetch the entire content of textUtils.cpp
cat textUtils.cpp

Length of output: 65

Script:

#!/bin/bash
# Search for the textUtils.cpp file in the repository to find the correct path
fd textUtils.cpp

Length of output: 49

Script:

#!/bin/bash
# Fetch the entire content of textUtils.cpp from the correct path
cat nvdaHelper/local/textUtils.cpp

Length of output: 4852

LeonarddeR · 2024-06-06T20:50:05Z

@coderabbitai I didn't like your suggestions much. I refactored it in the last commit.

coderabbitai · 2024-06-06T20:52:48Z

@LeonarddeR, I've reviewed the latest changes in textUtils.cpp from your last commit. It looks like you've made significant improvements by abstracting the common logic into the _getLogAttrArray function, which simplifies the handling of script itemization and breaking. This refactoring enhances the modularity and readability of the code.

Using std::vector for managing SCRIPT_ITEM and SCRIPT_LOGATTR arrays is a robust choice as it automatically manages memory and makes the code cleaner and safer by avoiding manual memory management.

If there's anything specific you'd like feedback on or further improvements you're considering, feel free to let me know! Great work on these refinements.

Analysis chain

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Fetch the latest changes made to the textUtils.cpp file in the last commit
git diff HEAD~1 -- nvdaHelper/local/textUtils.cpp

Length of output: 3463

seanbudd · 2024-06-06T23:57:41Z

@LeonarddeR - is this ready for re-review?

I'd encourage any change of policy/usage of feature flags to be moved to a separate discussion.
Right now they are encouraged for general settings where appropriate, and are used throughout the braille settings, as well as in paragraph nav and other general settings.

LeonarddeR · 2024-06-07T05:24:12Z

@seanbudd Yes, I think the unit tests, including some edge cases, seem to prove that this works as advertised.

Qchristensen

UserGuide looks good.

Adriani90 · 2024-06-08T00:22:49Z

@seanbudd

> I'd encourage any change of policy/usage of feature flags to be moved to a separate discussion.
> Right now they are encouraged for general settings where appropriate, and are used throughout the braille settings, as well as in paragraph nav and other general settings.

This is because people didn't pay attention to the actual purpose of feature flags when reviewing the corresponding PRs. I think it makes sense to clearly stop encouraging the use of these feature flags outside of the advanced settings panel, and to state their actual purpose in the policy docs.
Many users find this feature flag approach outside of the advanced settings to be quite bad UX.

seanbudd · 2024-06-11T02:58:57Z

is this ready for re-review?

LeonarddeR · 2024-06-11T05:23:30Z

Yes, sorry. Forgot to mark it as ready

…symbol definition, the symbol replacement is spoken (#16950)

Fixup for #16622

Summary of the issue:
When Unicode normalization of a character (e.g. ·) resulted in a character that had a symbol definition (e.g. ·, middle dot), the symbol definition wasn't applied to the normalized character. This resulted in NVDA speaking nothing or only the word "normalized".

Description of user facing changes
NVDA will now properly speak the · character (Greek Ano Teleia) as middle dot when normalizing. This also applies to other characters where normalization results in a character that's part of the symbol dictionary.

Description of development approach
When normalizing a character, ensure it is passed through characterProcessing.processSpeechSymbol.
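The order of operations in this fixup (normalize first, then look the result up in the symbol dictionary) can be illustrated with a small standalone sketch. It does not call NVDA's characterProcessing module: `SYMBOLS` and `speak_character` are hypothetical stand-ins, NFKC is assumed as the normalization form, and the exact wording of the "normalized" suffix is an assumption.

```python
import unicodedata

# Hypothetical stand-in for a symbol dictionary (assumption, not NVDA data).
SYMBOLS = {"\u00b7": "middle dot"}


def speak_character(char: str) -> str:
    """Normalize a character, then apply any symbol replacement to the result."""
    normalized = unicodedata.normalize("NFKC", char)
    # Look up the *normalized* character, so that a symbol definition for the
    # normalization result is honored.
    replacement = SYMBOLS.get(normalized, normalized)
    if normalized != char:
        return f"{replacement} normalized"
    return replacement
```

With this sketch, `speak_character("\u0387")` (Greek Ano Teleia, which normalizes to U+00B7) returns `"middle dot normalized"`, while `speak_character("\u00b7")` returns `"middle dot"`.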

LeonarddeR added 5 commits May 28, 2024 18:18

SPeech: Enable unicode normalization by default

bcdf4c0

Add input gestures

82e2bd7

Speak normalized on character nav

b7844cb

Fix unit test

9d6d799

Lint

7d09e8d

LeonarddeR mentioned this pull request May 28, 2024

Normalization of unicode cahracter: allow excluding the symbols in the symbols.dic file from the normalization #16624

Closed

LeonarddeR changed the title ~~SPeech Unicode normalization: Enable by default and normalize character navigation~~ Speech Unicode normalization: Enable by default and normalize character navigation May 29, 2024

Merge remote-tracking branch 'origin/master' into normalizationFollowUp

b4d74d3

LeonarddeR added this to the 2024.4 milestone May 29, 2024

LeonarddeR added 4 commits May 31, 2024 09:23

Merge remote-tracking branch 'origin/master' into normalizationFollowUp

438d217

Update user guide and change default

50f6087

Fix announcing ligatures

f94c55f

Updates to speech

b35ebde

Fix suppression

19f698a

LeonarddeR changed the title ~~Speech Unicode normalization: Enable by default and normalize character navigation~~ Speech Unicode normalization: normalize character navigation May 31, 2024

LeonarddeR added 3 commits June 1, 2024 09:43

Last fixups, opefully

13c2168

Use walrus

5f54f77

Slightly expand mixed test

b56c369

LeonarddeR changed the title ~~Speech Unicode normalization: normalize character navigation~~ Unicode normalization follow up, adding character navigation and several fixes Jun 1, 2024

LeonarddeR and others added 2 commits June 1, 2024 11:22

Add a SequenceMatcher monkey patch

e9faa2b

Better assertions

b854b76

Get rid of duplicated code

d25042b

LeonarddeR marked this pull request as ready for review June 7, 2024 05:22

Qchristensen approved these changes Jun 7, 2024

seanbudd reviewed Jun 7, 2024

seanbudd marked this pull request as draft June 7, 2024 06:58

Apply suggestions from code review

6d801d1

LeonarddeR added 4 commits June 8, 2024 13:38

Add processText doc string

afd2aec

Add _getSpellingSpeechWithoutCharMode docstring

50e35bf

Add return type

7c3fd6f

Import Generator

5bf722f

LeonarddeR marked this pull request as ready for review June 11, 2024 05:22

seanbudd approved these changes Jun 11, 2024

seanbudd merged commit 6b366fe into nvaccess:master Jun 11, 2024

LeonarddeR mentioned this pull request Aug 3, 2024

Ensure that when character normalization results in a character with symbol definition, the symbol replacement is spoken #16950

Merged

5 tasks

This was referenced Sep 28, 2024

Set default input and translation tables according to NVDA's language #17222

Merged

Update translations from Crowdin #17233

Merged

coderabbitai Bot mentioned this pull request Oct 8, 2024

Update translations from Crowdin #17261

Merged

coderabbitai Bot mentioned this pull request Oct 15, 2024

When spelling by line, ensure that only normalized characters report as such #17295

Merged

5 tasks

coderabbitai Bot mentioned this pull request Jan 8, 2025

Add ability to customize automatic update channels for add-ons #17597

Merged

5 tasks

LeonarddeR mentioned this pull request Aug 16, 2025

Combining diacritics are not read when typing/navigating if Unicode normalization is enabled #18722

Closed

LeonarddeR deleted the normalizationFollowUp branch August 23, 2025 06:27

7 participants
