Skip to content

Offsets textInfos: treat 32 bit unicode characters consuming two UTF-16 code units as one character instead of two#8953

Merged
michaelDCurran merged 9 commits into
nvaccess:masterfrom
BabbageCom:offsetsUnicodeBeyond16
Nov 29, 2018
Merged

Offsets textInfos: treat 32 bit unicode characters consuming two UTF-16 code units as one character instead of two#8953
michaelDCurran merged 9 commits into
nvaccess:masterfrom
BabbageCom:offsetsUnicodeBeyond16

Conversation

@LeonarddeR

Copy link
Copy Markdown
Collaborator

The major part of this code is authored by @jcsteh, many thanks for that.

Link to issue number:

Fixes #8782

Summary of the issue:

When arrowing by character in notepad and firefox for emoji such as 🤦, this emoji isn't read when CLDR processing is on. This is because these emoji consist of two UTF-16 code units and therefore cover two offsets in a textInfo.

Description of how this pull request fixes the issue:

For UTF-16 surrogate pairs, treat the 32 bit character as one character. When getting character offsets, return two offsets instead of one for these characters. This has the following advantages:

  • Moving by character in virtual buffers is now similar to moving by character in notepad, firefox and other offset based textInfo controls.
  • Moving by character in Notepad etc. will now report the complete character for UTF-16 surrogate pairs, instead of only the one UTF-16 code point.

Testing performed:

Tested in Firefox, Notepad and Firefox virtual buffers with the 🤦 emoji.

Known issues with pull request:

There might be performance issues, but as I pointed out in #8782 (comment), we're doing something similar when reporting spelling errors.

Change log entry:

@LeonarddeR LeonarddeR requested a review from jcsteh November 16, 2018 14:27
@LeonarddeR LeonarddeR changed the base branch from offsetsUnicodeBeyond16 to master November 16, 2018 14:41

@jcsteh jcsteh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AppVeyor failed to build this, perhaps because it tried before you changed the base branch. You may need to push this again with a different hash to trigger a build. git commit --amend --no-edit and then git push -f might do it.

If it's not too much trouble, a small unit test would be good here. Cases to verify include expanding from a normal character (no surrogates), expanding from a leading surrogate, expanding from a trailing surrogate, expanding from a leading surrogate when the next character is not a trailing surrogate, expanding from a trailing surrogate when the previous character isn't a leading surrogate. Note that you can use the tests.unit.textProvider module to help with this.

return offset, offset + 2
elif offset > 0 and LOW_SURROGATE_FIRST <= char <= LOW_SURROGATE_LAST:
# Low (trailing) surrogate; previous offset is also part of this character.
return offset - 1, offset + 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I didn't consider when I wrote this code is what happens for invalid pairs like trailing then trailing, leading then leading, leading then non-surrogate, etc. This won't cause exceptions, but it seems that most implementations treat them as individual characters in this case. For example, U+d800 U+d800 (��) is treated as two separate characters by both Notepad and Firefox.

I think we can get around this by fetching three offsets: the one before, the current one and the one after. With some cleverness, this shouldn't be too tedious:

if offset > 0:
	chars = self._getTextRange(offset - 1, offset + 2)
	# Slicing avoids the need to check length. If invalid, it'll be the empty string.
	prevChar = chars[0:1]
	curChar = chars[1:2]
	nextchar = chars[2:3]
else:
	chars = self._getTextRange(offset, offset + 2)
	prevChar = u""
	# Slicing avoids the need to check length. If invalid, it'll be the empty string.
	curchar = chars[0:1]
	nextChar = chars[1:2]
if HIGH_SURROGATE_FIRST <= curChar <= HIGH_SURROGATE_LAST and LOW_SURROGATE_FIRST <= nextChar <= LOW_SURROGATE_LAST:
... You get the idea.

Note that because prevChar will be empty if we're at offset 0, you don't have to check offset > 0 again; any checks against prevChar will be False.

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

@jcsteh: Thanks for your review. I've addressed your comments and will provide unit tests over the weekend.

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

Unit tests provided.

@jcsteh jcsteh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks.

Comment thread tests/unit/test_textInfos.py Outdated
@@ -0,0 +1,131 @@
# -*- coding: UTF-8 -*-

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Contributing guide, please remove the UTF-8 BOM. This won't break here because xgettext isn't used on unit tests, but best to be consistent.

Comment thread tests/unit/test_textInfos.py Outdated

def test_nonSurrogateForward(self):
obj = BasicTextProvider(text="abc")
ti = obj.makeTextInfo(Offsets(0, 0)) # Range at a

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that the "Range at a" comment might be misleading, as this is a collapsed range at this point. Maybe the comment could say "Starting at a" or "Range collapsed at a" or similar? Ditto for all the other "Range at" comments below.

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

I converted the file to UTF-8 without BOM, but now NP++ tends to open the file in ANSY mode by default. Strange behaviour, since it's certainly UTF-8

@jcsteh

jcsteh commented Nov 24, 2018 via email

Copy link
Copy Markdown
Contributor

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

@jcsteh: I wonder what the offsets based textInfos will do as soon as we switch to python 3. In python 3 unicode strings, 32 bit unicode characters are treated as one character in a string.

Python 2:

>>> len(u"👍👍👍")
6

Python 3:

>>> len(u"� � � ")
3

It is likely that this code will break in a major way on python 3, especially for cases where getTextRange will get the requested offsets based on storyText, such as in simple edit controls. storyText will be 3 characters long, whereas storyLength will be at least 6. Furthermore, things like pointFromOffset and offsetFromPoint will most likely break.

@jcsteh

jcsteh commented Nov 25, 2018 via email

Copy link
Copy Markdown
Contributor

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

@michaelDCurran: All review comments by @jcsteh have been addressed, so this is ready for another look. I updated the changes file while at it.

@michaelDCurran michaelDCurran merged commit 8f77dc9 into nvaccess:master Nov 29, 2018
@nvaccessAuto nvaccessAuto added this to the 2018.4 milestone Nov 29, 2018
@tspivey

tspivey commented Nov 29, 2018

Copy link
Copy Markdown
Collaborator

I can't get the unicode value of these extended characters.
STR:

  1. Paste 🤦 into notepad and move to it.
  2. Press read character 3 times.

I expect to get the unicode value of the character, but I only get the emoji repeated again.

Also, the 🤦 emoji is still treated as 2 separate characters in the new issue edit box in Firefox, but not in browse mode, Notepad, etc.

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

I can't get the unicode value of these extended characters.

Ugh. This should be fixed by #8995.

Also, the 🤦 emoji is still treated as 2 separate characters in the new issue edit box in Firefox, but not in browse mode, Notepad, etc.

You're right. NVDA is getting the character offsets from Firefox directly using IA2. @jcsteh: Thoughts on this matter? We could consider relying on the default getCharacterOffsets for IA2 or just Firefox. I'm afraid the former might be more accurate, as this also bugs in Chrome.

@jcsteh

jcsteh commented Nov 30, 2018 via email

Copy link
Copy Markdown
Contributor

@LeonarddeR

Copy link
Copy Markdown
Collaborator Author

The correct solution is for us to fix this in Firefox.

Agreed, but Chrome also needs a fix.

I think changing this for IA2 would be bad for LibreOffice at least, as I believe there may be fields that only have a single character stop but consume several offsets. I'm not certain, though.

On the other hand, LibreOffice also fails now and only returns one offset for two offset characters. I've also seen horrible mistakes regarding word offsets in LibreOffice, i.e. type "1.2.3." in Writer and navigate through that with ctrl+left/right arrow. LibreOffice word offsets returned by IA2 tend to behave like uniscribe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emojis Do Not Speak when Arrowing by Character

5 participants