Originally posted in #8953 (comment)
In Python 3 strings, a Unicode character that takes 32 bits in UTF-16 (a code point outside the Basic Multilingual Plane, stored as a surrogate pair) is treated as one character.
Steps to reproduce
Python 2:
Python 3:
Expected problems
It is likely that code involving offset-based textInfos will break in a major way on Python 3, especially where getTextRange resolves the requested offsets against storyText, such as in simple edit controls. In the example above, storyText will be 3 characters long, whereas storyLength will be at least 6. This will result in broken behavior when reading through text in Notepad.
On Windows, C wide characters (c_wchar) are 2 bytes in size. On Python 2, each character in a unicode string is also 2 bytes, so offsets match. In Python 3, however, the storage width per character is variable, and every code point counts as one character regardless of how many UTF-16 code units it needs.
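As a rough illustration of the mismatch, the sketch below assumes a string of three emoji (code points outside the Basic Multilingual Plane) and compares the Python 3 character count with the number of UTF-16 code units a wide-character API would count. The sample string and the helper name `utf16_length` are hypothetical and not part of NVDA.

```python
# Assumption: three emoji stand in for the text of the edit control.
text = "\U0001F600\U0001F601\U0001F602"


def utf16_length(s: str) -> int:
    """Return the number of 16-bit code units needed to store s as UTF-16."""
    # utf-16-le writes no byte order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le", errors="surrogatepass")) // 2


py3_length = len(text)            # counts code points
wide_length = utf16_length(text)  # counts UTF-16 code units
print(py3_length, wide_length)
```

Any offset obtained from a wide-character API would have to be translated before it is used to index a Python 3 string like this one.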
This is a lovely article on how Python does Unicode:
In Python 3.3 and later, the internal storage of Unicode is now dynamic and chosen on a per-string basis. Here’s how it works:
- Python parses source code on the assumption that it’s UTF-8.
- When it needs to create string objects, Python determines the highest code point in the string, and looks at the size of the encoding needed to store that code point as-is.
- Python then chooses that encoding — which will be one of latin-1, UCS-2, or UCS-4 — to store the string.
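The per-string storage described above can be observed indirectly. The following is a minimal sketch that assumes CPython 3.3 or later, where `sys.getsizeof` reflects the internal buffer; it measures how many bytes one extra character adds to a string. The helper name `bytes_per_character` is hypothetical, and the result is an implementation detail rather than guaranteed behavior.

```python
import sys


def bytes_per_character(ch: str, n: int = 100) -> int:
    """Estimate the storage width of ch by growing a string by one character."""
    # Both strings contain only ch, so they share the same internal encoding;
    # the size difference is the width of a single stored character.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


for sample in ("a", "\u20ac", "\U0001F600"):
    print(hex(ord(sample)), bytes_per_character(sample))
```

On CPython this is expected to report 1, 2 and 4 bytes for the three samples, matching latin-1, UCS-2 and UCS-4 storage.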
Discussion
@jsteh said in #8953 (comment): I think we're going to need to have a way to fetch text as UTF-16 bytes arrays, do the work with those and then convert to strings only when returning text for presentation.
This makes sense. However, when we convert a wchar array to a Python bytes array, we will have to handle string termination ourselves.
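A minimal sketch of that idea, assuming the text arrives as raw UTF-16-LE bytes and that offsets are expressed in 16-bit code units. The function names `truncate_at_null` and `text_for_range` are hypothetical and only illustrate the approach; they are not existing NVDA APIs.

```python
def truncate_at_null(raw: bytes) -> bytes:
    """Cut a UTF-16-LE buffer at the first 16-bit null code unit."""
    # Step two bytes at a time so a zero byte inside a code unit
    # (for example the high byte of "a") is not mistaken for the terminator.
    for i in range(0, len(raw) - 1, 2):
        if raw[i] == 0 and raw[i + 1] == 0:
            return raw[:i]
    # No terminator found: keep everything except a dangling odd byte.
    return raw[: len(raw) - (len(raw) % 2)]


def text_for_range(raw: bytes, start: int, end: int) -> str:
    """Decode the code units in [start, end) to a str for presentation."""
    # Offsets are in UTF-16 code units, so each offset maps to two bytes.
    return raw[start * 2 : end * 2].decode("utf-16-le", errors="surrogatepass")


# Assumed input: "a", one emoji, "b", a null terminator, then leftover bytes.
buffer = "a\U0001F600b".encode("utf-16-le") + b"\x00\x00" + b"junk"
story = truncate_at_null(buffer)
print(len(story) // 2, text_for_range(story, 1, 3))
```

All offset arithmetic stays in code units, and a Python string is only created at the point where text is returned.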