Use uniscribe to calculate character offsets where allowed by michaelDCurran · Pull Request #10550 · nvaccess/nvda

michaelDCurran · 2019-11-26T21:54:14Z

Link to issue number:

None.

Summary of the issue:

When moving by character in an NVDA virtualBuffer, each and every Unicode code point is treated as its own character, even if it is visually combined with another code point to create one composite character. Examples are:

é: e with an acute
❤️: red heart with a variation selector
8⃣: number 8 inside a keycap
Similarly, when moving by character in Notepad with the arrow keys, NVDA only reads the first code point in composite characters.
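
To make the problem concrete, here is a small sketch in plain Python (not NVDA code) that lists the code points behind each example. The samples are written with explicit escapes, and the e-acute is assumed to be in its decomposed form (e followed by a combining acute); the `SAMPLES` and `describe` names are purely illustrative.

```python
import unicodedata

# Each sample uses explicit escapes so the code points are unambiguous.
# Assumption: e-acute is in decomposed form (base letter + combining mark).
SAMPLES = {
    "e with acute": "e\u0301",
    "red heart": "\u2764\ufe0f",
    "keycap 8": "8\u20e3",
}


def describe(text):
    """Return a (hex code point, Unicode name) pair for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for label, text in SAMPLES.items():
    # Every sample is visually one character but is made of two code points.
    print(label, len(text), describe(text))
```

Each sample has a length of 2, which is why stepping one code point at a time splits what looks like a single character.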

Description of how this pull request fixes the issue:

Just like how we use the Windows uniscribe library to calculate word offsets in some places, use it to also calculate character offsets.
This involved:

Abstracted the code from nvdaHelperLocal's calculateWordOffsets into _calculateUniscribeOffsets, which takes a unit argument of either character or word. For word, the code is as it was. For character, it widens the offsets until hitting a code point marked with fCharStop. calculateCharacterOffsets has been added, which just calls _calculateUniscribeOffsets with a unit of character.
Abstracted all the uniscribe code from OffsetsTextInfo._getWordOffsets into a new _getUniscribeOffsets method, which is now also used in _getCharacterOffsets.
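
The widening step for the character unit can be sketched in plain Python as follows. This is only an illustration of the idea, not the C++ code in textUtils.cpp: `widen_to_char_stops` is a hypothetical name, and the list of booleans stands in for the per-position fCharStop flags that uniscribe would supply.

```python
def widen_to_char_stops(char_stops, offset):
    """Return (start, end) of the character containing offset.

    char_stops[i] is True when index i begins a new character. It is a
    stand-in for the fCharStop flag; real flags would come from uniscribe.
    """
    if not 0 <= offset < len(char_stops):
        raise IndexError("offset out of range")
    start = offset
    # Walk left until reaching an index marked as a character stop.
    while start > 0 and not char_stops[start]:
        start -= 1
    end = offset + 1
    # Walk right until the next character stop, or the end of the text.
    while end < len(char_stops) and not char_stops[end]:
        end += 1
    return start, end


# For "ae\u0301b", the combining acute at index 2 is not a character stop.
flags = [True, True, False, True]
for index in range(len(flags)):
    print(index, widen_to_char_stops(flags, index))
```

With these flags, offsets 1 and 2 both widen to the same range, so the base letter and its accent are reported as one character.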

Testing performed:

Using the example characters above in a virtualBuffer, moved through them with the arrow keys, making sure that they are all treated as single composite characters.
Using the example characters above in a Notepad document, moved through them with the left and right arrow keys, making sure that NVDA announces the entire composite character.

Known issues with pull request:

NVDA now matches the behaviour of Notepad and other standard edit controls, which includes treating acutes, variation selectors and some other modifiers as part of the previous symbol. However, complex emoji built from multiple Unicode characters joined with the U+200D (zero width joiner) symbol are still not treated as one single composite character. If we did treat them that way, NVDA would differ from Windows' own standard edit control behaviour.
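
For reference, here is what such a joined sequence looks like at the code point level. This is a plain Python sketch with one sample sequence chosen for illustration (woman + zero width joiner + personal computer); the UTF-16 count assumes offsets are measured in 16-bit units, as in Windows wide strings.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# WOMAN + ZWJ + PERSONAL COMPUTER, commonly rendered as a single glyph.
technologist = "\U0001F469" + ZWJ + "\U0001F4BB"

# The parts on either side of the joiner are complete emoji on their own.
parts = technologist.split(ZWJ)

# Number of UTF-16 code units (assumption: offsets count 16-bit units).
utf16_units = len(technologist.encode("utf-16-le")) // 2

print(len(technologist), utf16_units, [f"U+{ord(p):04X}" for p in parts])
```

The sequence is three code points (five UTF-16 units), and per the note above it is still stepped through as separate characters.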

Change log entry:

Bug fixes:

NVDA will now treat certain composite Unicode characters such as e-acute as one single character when moving through text.

LeonarddeR · 2019-11-27T04:37:35Z

I recall trying this before for something, but then decided not to go this way because it was slower. It might help to do some performance tests, e.g. moving 4000 characters forward at once. I'm pretty sure the placeMarkers add-on uses that logic.

michaelDCurran · 2019-11-27T06:15:07Z

Some quick benchmarking, with the World War I Wikipedia article loaded in Firefox and the review cursor at the top of the document in browse mode:

```
import time
# Work on a copy of the review position.
r=review.copy()
t=time.time()
r.move(textInfos.UNIT_CHARACTER,4000)
# Elapsed time in seconds.
time.time()-t
```

Runs seem to take between 0.9 and 1.5 seconds both with this change and without it. In other words, both the new and old code seem to be affected quite significantly by other things in the environment (which is not surprising, as it is a loop that runs 4000 times), and the added usage of uniscribe does not seem to slow things down as far as I can tell.

It is also worth noting that _getCharacterOffsets always fetched the text for the current line. The only difference is the actual uniscribe call.

I accept that there is usage in the wild, such as the placeMarkers add-on, that calls move with a large number. However, with any other text API (UIA, other object models etc.) this call would probably be much, much worse. Still, if we do notice a performance decrease in real usage, we of course should take this into consideration.
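
Since single runs vary this much, one way to steady the comparison is to repeat the measurement and keep the fastest run. The helper below is a generic sketch, not something used in this pull request; `best_of` and `sample_workload` are hypothetical names, and the workload is only a placeholder for the move call.

```python
import time


def best_of(func, repeats=5):
    """Run func several times and return the shortest duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(durations)


def sample_workload():
    # Placeholder workload standing in for the real call being measured.
    sum(range(100_000))


print(best_of(sample_workload))
```

When timing a move, each repeat would need a fresh copy of the text position, because a move changes the object it is called on.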

LeonarddeR · 2019-11-27T06:22:48Z

Thanks for these benchmarks, that proves that this is really a valuable change after all. I will review it codewise later today.

michaelDCurran · 2019-11-27T06:52:42Z

I did a second test, once my machine had stopped installing a Windows update in the background :p
This time I compared runs with and without the change, both doing a move of 20000 characters:
Without the change: 5.2 seconds
With the change: 5.7 seconds
That is an increase of about 1.1 times. So yes, if the move is very large (like 20000), then the difference is noticeable.
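
The arithmetic behind those two figures, worked through in Python. The numbers are the ones reported above; the per-character cost assumes the extra time is spread evenly across the 20000 moves.

```python
without_change = 5.2  # seconds for a 20000-character move
with_change = 5.7     # seconds for the same move with uniscribe
characters = 20000

# Relative slowdown of the new code.
ratio = with_change / without_change

# Extra time per character, in microseconds (assumes an even spread).
extra_per_char_us = (with_change - without_change) / characters * 1e6

print(round(ratio, 3), round(extra_per_char_us, 1))
```

That works out to a slowdown of roughly 10%, or about 25 microseconds of extra work per character moved.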

michaelDCurran · 2019-11-27T22:34:47Z

@LeonarddeR I believe I have addressed all your review actions. When abstracting _calculateUniscribeOffsets in textUtils.cpp, I still duplicated the two basic for loops that walk the offsets; otherwise the code would have become very hard to read, with fWordStop somehow being swapped for fCharStop dynamically based on the unit.
Also, regarding that comment about uniscribe being broken without the two extra chars added: I'm not sure whether this is still true, but as there are many operating system variations to test on, I'd prefer not to address that at the moment. The move-by-word logic should not have changed; the code has just been moved a bit.

LeonarddeR

Thanks, looks awesome now!

LeonarddeR · 2019-12-03T18:19:38Z

I'm afraid that this pull request introduces off-by-n errors in braille.TextInfoRegion.getTextInfoForBraillePos. However, this is pretty difficult to avoid. We should somehow be able to decouple braille positions from characters.
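
A toy illustration of how such a drift can arise. It makes no claim about how getTextInfoForBraillePos is actually implemented; it only assumes some code counts one offset per character while a character may now span several offsets. The `offset_after_move` helper and the sample text are hypothetical.

```python
def offset_after_move(cluster_lengths, count):
    """Offset reached after moving `count` characters forward from offset 0."""
    return sum(cluster_lengths[:count])


# "ae\u0301b" seen as three characters: "a", "e" + combining acute, "b".
text = "ae\u0301b"
cluster_lengths = [1, 2, 1]

# A position computed by assuming one offset per character.
naive_offset = 2
# The offset actually reached when character moves respect whole clusters.
actual_offset = offset_after_move(cluster_lengths, 2)

print(naive_offset, actual_offset, repr(text[actual_offset]))
```

Here the two offsets differ by one, and the gap grows with every multi-code-point character that precedes the target position.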

michaelDCurran · 2019-12-03T21:40:27Z

This is of course not the only place where characterOffsets is overridden by an API that can return offset bounds wider than 1. Should I however revert this for now as we don't want to make this worse?

LeonarddeR · 2019-12-04T08:18:22Z

No, no need to revert this. At least for uniscribe, we can disable it at the TextInfo object level before moving by character.

michaelDCurran added 3 commits November 26, 2019 14:52

OffsetsTextInfo._getCharacterOffsets: use uniscribe where possible to…

b845492

… calculate the bounds for a character. This allows us to treat something like e-acute as one character.

Add copyright header to textUtils.cpp

13b917d

Fix linting issues.

df68917

michaelDCurran requested a review from LeonarddeR November 26, 2019 21:54

Restore some accidentally removed code from OffsetsTextInfo._getChara…

88bd8c9

…cterOffsets, allowing unit tests to pass again.

LeonarddeR suggested changes Nov 27, 2019

michaelDCurran added 2 commits November 28, 2019 08:27

nvdaHelperLocal's textUtils.cpp: abstract out code to avoid duplicate…

2e3b8fa

… code in both calculateWordOffsets and calculateCharacterOffsets.

Address review actions.

83469c1

LeonarddeR approved these changes Nov 28, 2019

michaelDCurran merged commit 1045d2d into master Nov 28, 2019

nvaccessAuto added this to the 2019.3 milestone Nov 28, 2019

michaelDCurran added a commit that referenced this pull request Nov 28, 2019

Update changes file for pr #10550

0b518ae

This was referenced Apr 25, 2020

Keys which produce a compound character spoken as separate typed characters #1428

Closed

Better support for handling compound characters in languages such as Korean and Tamil #2791

Closed

michaelDCurran merged 6 commits into master from uniscribeCharacterOffsets
