Offsets textInfos: treat 32 bit unicode characters consuming two UTF-16 code units as one character instead of two by LeonarddeR · Pull Request #8953 · nvaccess/nvda

LeonarddeR · 2018-11-16T14:27:30Z

The major part of this code is authored by @jcsteh, many thanks for that.

Link to issue number:

Summary of the issue:

When arrowing by character in Notepad and Firefox over emoji such as 🤦, the emoji isn't read when CLDR processing is on. This is because these emoji consist of two UTF-16 code units and therefore cover two offsets in a textInfo.
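To illustrate the summary above, here is a minimal Python 3 sketch (not part of the pull request) that encodes the 🤦 emoji (U+1F926) as UTF-16 and shows the two code units it occupies. The variable names are chosen for illustration only.

```python
# Illustrative only: shows why one emoji covers two UTF-16 offsets.
emoji = "\U0001F926"  # 🤦 FACE PALM

# Little-endian UTF-16 without a BOM; each code unit is 2 bytes.
units = emoji.encode("utf-16-le")
unit_count = len(units) // 2

# Split the encoded bytes into the two 16-bit code units.
high = int.from_bytes(units[0:2], "little")
low = int.from_bytes(units[2:4], "little")

print(unit_count)            # 2
print(hex(high), hex(low))   # 0xd83e 0xdd26
```

The first unit falls in the high (leading) surrogate range U+D800 to U+DBFF and the second in the low (trailing) range U+DC00 to U+DFFF, which is why an offset-based textInfo sees two positions for one visible character.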

Description of how this pull request fixes the issue:

For UTF-16 surrogate pairs, treat the 32 bit character as one character. When getting character offsets, return two offsets instead of one for these characters. This has the following advantages:

- Moving by character in virtual buffers is now similar to moving by character in Notepad, Firefox and other offset-based textInfo controls.
- Moving by character in Notepad etc. will now report the complete character for UTF-16 surrogate pairs, instead of only one UTF-16 code unit.

Testing performed:

Tested in Firefox, Notepad and Firefox virtual buffers with the 🤦 emoji.

Known issues with pull request:

There might be performance issues, but as I pointed out in #8782 (comment), we're doing something similar when reporting spelling errors.

Change log entry:

Bug fixes
- When moving by character in Notepad or in browse mode, 32 bit emoji characters consisting of two UTF-16 code units (such as 🤦) will now read properly. (Emojis Do Not Speak when Arrowing by Character #8782)

jcsteh

AppVeyor failed to build this, perhaps because it tried before you changed the base branch. You may need to push this again with a different hash to trigger a build. git commit --amend --no-edit and then git push -f might do it.

If it's not too much trouble, a small unit test would be good here. Cases to verify include expanding from a normal character (no surrogates), expanding from a leading surrogate, expanding from a trailing surrogate, expanding from a leading surrogate when the next character is not a trailing surrogate, expanding from a trailing surrogate when the previous character isn't a leading surrogate. Note that you can use the tests.unit.textProvider module to help with this.

jcsteh · 2018-11-23T02:56:47Z

+			return offset, offset + 2
+		elif offset > 0 and LOW_SURROGATE_FIRST <= char <= LOW_SURROGATE_LAST:
+			# Low (trailing) surrogate; previous offset is also part of this character.
+			return offset - 1, offset + 1


One thing I didn't consider when I wrote this code is what happens for invalid pairs like trailing then trailing, leading then leading, leading then non-surrogate, etc. This won't cause exceptions, but it seems that most implementations treat them as individual characters in this case. For example, U+D800 U+D800 is treated as two separate characters by both Notepad and Firefox.

I think we can get around this by fetching three offsets: the one before, the current one and the one after. With some cleverness, this shouldn't be too tedious:

if offset > 0:
	chars = self._getTextRange(offset - 1, offset + 2)
	# Slicing avoids the need to check length. If invalid, it'll be the empty string.
	prevChar = chars[0:1]
	curChar = chars[1:2]
	nextChar = chars[2:3]
else:
	chars = self._getTextRange(offset, offset + 2)
	prevChar = u""
	# Slicing avoids the need to check length. If invalid, it'll be the empty string.
	curChar = chars[0:1]
	nextChar = chars[1:2]
if HIGH_SURROGATE_FIRST <= curChar <= HIGH_SURROGATE_LAST and LOW_SURROGATE_FIRST <= nextChar <= LOW_SURROGATE_LAST:
	...

You get the idea.

Note that because prevChar will be empty if we're at offset 0, you don't have to check offset > 0 again; any checks against prevChar will be False.
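A self-contained sketch of the idea described above, written for Python 3 and assuming the text is available as a str in which every element is one UTF-16 code unit (as on a narrow Python 2 build). The function name `get_character_offsets` and the helpers are hypothetical; this is not the code that was merged.

```python
HIGH_SURROGATE_FIRST = "\uD800"
HIGH_SURROGATE_LAST = "\uDBFF"
LOW_SURROGATE_FIRST = "\uDC00"
LOW_SURROGATE_LAST = "\uDFFF"


def is_high(char):
    # An empty string compares lower than any surrogate, so this is False for "".
    return HIGH_SURROGATE_FIRST <= char <= HIGH_SURROGATE_LAST


def is_low(char):
    return LOW_SURROGATE_FIRST <= char <= LOW_SURROGATE_LAST


def get_character_offsets(units, offset):
    """Return (start, end) of the character that contains offset.

    units: str where each element is a single UTF-16 code unit (assumption).
    """
    # Slicing never raises; an out-of-range slice is the empty string.
    prev_char = units[offset - 1:offset] if offset > 0 else ""
    cur_char = units[offset:offset + 1]
    next_char = units[offset + 1:offset + 2]
    if is_high(cur_char) and is_low(next_char):
        # Valid pair, offset is on the leading surrogate.
        return offset, offset + 2
    if is_low(cur_char) and is_high(prev_char):
        # Valid pair, offset is on the trailing surrogate.
        return offset - 1, offset + 1
    # Non-surrogate, or an unpaired surrogate treated as its own character.
    return offset, offset + 1


text = "a\ud83e\udd26b"  # "a", the two halves of U+1F926, "b"
print(get_character_offsets(text, 1))  # (1, 3)
print(get_character_offsets(text, 2))  # (1, 3)
```

With this shape, invalid sequences such as two leading surrogates in a row fall through to the last return and are reported as single-offset characters, matching the Notepad and Firefox behaviour mentioned in the review comment.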

LeonarddeR · 2018-11-23T13:02:18Z

@jcsteh: Thanks for your review. I've addressed your comments and will provide unit tests over the weekend.

LeonarddeR · 2018-11-23T20:25:47Z

Unit tests provided.

jcsteh

Nice! Thanks.

jcsteh · 2018-11-23T21:11:56Z

@@ -0,0 +1,131 @@
+# -*- coding: UTF-8 -*-


Per the Contributing guide, please remove the UTF-8 BOM. This won't break here because xgettext isn't used on unit tests, but best to be consistent.
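For reference, a small sketch of how a UTF-8 BOM can be detected and stripped from raw file contents. The helper name `strip_utf8_bom` is hypothetical and not part of NVDA; only the standard library `codecs.BOM_UTF8` constant is used.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"# -*- coding: UTF-8 -*-\n"
print(strip_utf8_bom(with_bom))  # b'# -*- coding: UTF-8 -*-\n'
```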

jcsteh · 2018-11-23T21:13:40Z

+
+	def test_nonSurrogateForward(self):
+		obj = BasicTextProvider(text="abc")
+		ti = obj.makeTextInfo(Offsets(0, 0)) # Range at a


I'm concerned that the "Range at a" comment might be misleading, as this is a collapsed range at this point. Maybe the comment could say "Starting at a" or "Range collapsed at a" or similar? Ditto for all the other "Range at" comments below.

LeonarddeR · 2018-11-24T09:00:07Z

I converted the file to UTF-8 without BOM, but now NP++ tends to open the file in ANSI mode by default. Strange behaviour, since it's certainly UTF-8.

jcsteh · 2018-11-24T09:09:22Z

Ug. Perhaps it's confused by the (intentionally) invalid sequences involving surrogates?

LeonarddeR · 2018-11-24T18:30:34Z

@jcsteh: I wonder what the offsets based textInfos will do as soon as we switch to Python 3. In Python 3 unicode strings, a 32 bit unicode character is treated as one character in a string.

Python 2:

>>> len(u"👍👍👍")
6

Python 3:

>>> len(u"👍👍👍")
3

It is likely that this code will break in a major way on Python 3, especially for cases where getTextRange will get the requested offsets based on storyText, such as in simple edit controls. storyText will be 3 characters long, whereas storyLength will be at least 6. Furthermore, things like pointFromOffset and offsetFromPoint will most likely break.
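The mismatch can be made concrete with a short Python 3 sketch. Both helper names below are hypothetical and only illustrate the kind of conversion that would be needed between UTF-16 offsets reported by a control and indices into a Python 3 str.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to represent text."""
    return sum(2 if ord(char) > 0xFFFF else 1 for char in text)


def str_index_from_utf16_offset(text, utf16_offset):
    """Map a UTF-16 code unit offset to the index of the str character containing it."""
    units = 0
    for index, char in enumerate(text):
        width = 2 if ord(char) > 0xFFFF else 1
        if utf16_offset < units + width:
            return index
        units += width
    # Offsets at or past the end map to the end of the string.
    return len(text)


text = "\U0001F44D\U0001F44D\U0001F44D"  # three thumbs-up emoji
print(len(text))           # 3
print(utf16_length(text))  # 6
print(str_index_from_utf16_offset(text, 3))  # 1
```

A linear scan like this is simple but costs time proportional to the offset on every call, which is one reason a different representation might be preferable for long texts.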

jcsteh · 2018-11-25T00:32:44Z

Indeed; this is absolutely not Python 3 compatible. I think we're going to need to have a way to fetch text as UTF-16 byte arrays, do the work with those and then convert to strings only when returning text for presentation.
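One possible shape for that approach, sketched in Python 3 under the assumption that the text is held as little-endian UTF-16 bytes and that offsets count 16-bit code units. The function names are hypothetical; the `surrogatepass` error handler is a standard Python codec option that lets lone surrogates survive encoding and decoding.

```python
ENCODING = "utf-16-le"
BYTES_PER_UNIT = 2


def encode_utf16(text):
    # surrogatepass keeps unpaired surrogates instead of raising an error.
    return text.encode(ENCODING, errors="surrogatepass")


def get_text_range(buffer, start, end):
    """Return the str for UTF-16 code unit offsets [start, end) of buffer."""
    chunk = buffer[start * BYTES_PER_UNIT:end * BYTES_PER_UNIT]
    return chunk.decode(ENCODING, errors="surrogatepass")


buffer = encode_utf16("a\U0001F926b")
print(len(buffer) // BYTES_PER_UNIT)         # 4
print(get_text_range(buffer, 0, 1))          # a
print(get_text_range(buffer, 1, 3) == "\U0001F926")  # True
```

Offset arithmetic stays in code units throughout, and conversion to str happens only at the boundary where text is returned for presentation.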

LeonarddeR · 2018-11-28T13:51:01Z

@michaelDCurran: All review comments by @jcsteh have been addressed, so this is ready for another look. I updated the changes file while at it.

tspivey · 2018-11-29T23:20:20Z

I can't get the unicode value of these extended characters.
STR:

Paste 🤦 into notepad and move to it.
Press read character 3 times.

I expect to get the unicode value of the character, but I only get the emoji repeated again.

Also, the 🤦 emoji is still treated as 2 separate characters in the new issue edit box in Firefox, but not in browse mode, Notepad, etc.

LeonarddeR · 2018-11-30T08:52:03Z

> I can't get the unicode value of these extended characters.

Ugh. This should be fixed by #8995.
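For context, the code point of a character stored as a surrogate pair can be recovered with the standard UTF-16 decoding formula. The sketch below is only an illustration of that arithmetic with hypothetical function names; it does not claim to show how #8995 implements the fix.

```python
def code_point_from_surrogates(high, low):
    """Combine a leading and a trailing surrogate into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def describe(high, low):
    """Return the decimal ordinal and hex value as a tuple."""
    code_point = code_point_from_surrogates(high, low)
    return code_point, hex(code_point)


print(describe(0xD83E, 0xDD26))  # (129318, '0x1f926')
```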

> Also, the 🤦 emoji is still treated as 2 separate characters in the new issue edit box in Firefox, but not in browse mode, Notepad, etc.

You're right. NVDA is getting the character offsets from Firefox directly using IA2. @jcsteh: Thoughts on this matter? We could consider relying on the default getCharacterOffsets for IA2 or just Firefox. I'm afraid the former might be more accurate, as this bug also occurs in Chrome.

jcsteh · 2018-11-30T08:55:59Z

The correct solution is for us to fix this in Firefox. I think changing this for IA2 would be bad for LibreOffice at least, as I believe there may be fields that only have a single character stop but consume several offsets. I'm not certain, though.

LeonarddeR · 2018-11-30T09:13:53Z

> The correct solution is for us to fix this in Firefox.

Agreed, but Chrome also needs a fix.

> I think changing this for IA2 would be bad for LibreOffice at least, as I believe there may be fields that only have a single character stop but consume several offsets. I'm not certain, though.

On the other hand, LibreOffice also fails now and only returns one offset for two-offset characters. I've also seen horrible mistakes regarding word offsets in LibreOffice, e.g. type "1.2.3." in Writer and navigate through it with ctrl+left/right arrow. LibreOffice word offsets returned by IA2 tend to behave like Uniscribe.

OffsetsTextInfo: When retrieving characters, support Unicode code poi…

5d47421

…nts beyond 16 bits.

LeonarddeR requested a review from jcsteh November 16, 2018 14:27

LeonarddeR changed the base branch from offsetsUnicodeBeyond16 to master November 16, 2018 14:41

jcsteh requested changes Nov 23, 2018

Leonard de Ruijter added 3 commits November 23, 2018 08:42

Use constants for surrogates

9109f7a

Merge remote-tracking branch 'origin/master' into offsetsUnicodeBeyond16

6a66a84

Review actions based on code snippet by @jcsteh

ccdbfa2

Unit tests

d8a15c5

jcsteh approved these changes Nov 23, 2018

Leonard de Ruijter added 2 commits November 24, 2018 09:53

UTF-8 without BOM for tests

1faccf0

Move range comments

e3da174

LeonarddeR mentioned this pull request Nov 26, 2018

Python 3: ctypes.c_wchar has a fixed size of 2 bytes whereas py3 strings have variable character length #8981

Closed

Leonard de Ruijter added 2 commits November 28, 2018 14:48

Merge remote-tracking branch 'origin/master' into offsetsUnicodeBeyond16

8743c12

Update changes

b2ac54b

LeonarddeR requested a review from michaelDCurran November 28, 2018 13:50

michaelDCurran approved these changes Nov 29, 2018

michaelDCurran merged commit 8f77dc9 into nvaccess:master Nov 29, 2018

nvaccessAuto added this to the 2018.4 milestone Nov 29, 2018

LeonarddeR mentioned this pull request Nov 30, 2018

Speak ordinal and hex value of 32 bit unicode characters when pressing review current character three times #8995

Merged

This was referenced Dec 8, 2018

Braille routing accuracy breaks in text with emoji #9034

Closed

Firefox: moving by word does not say punctuation immediately proceeding words #3337

Closed

LeonarddeR mentioned this pull request Dec 19, 2018

Move method in textInfo objects is slow when moving by many characters #9093

Open
