Steps to reproduce:
import textUtils; textUtils.UnicodeNormalizationOffsetConverter('ratie \nen/of rapportage?\xa0\r\n·Structurele taken\xa0of \nscripts\xa0die\xa0periodiek moeten \ndraaien\xa0en\xa0waar\xa0controle\xa0/\xa0logging\xa0omheen \nmoet?\xa0\r\n\xa0\r\nZo ja,\xa0waarmee dient er rekening gehouden te \nworden? Benoem\xa0ook\xa0i')
Actual behavior:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "P:\A11y\nvda\source\textUtils.py", line 439, in __init__
self.computedStrToEncodedOffsets, self.computedEncodedToStrOffsets = self._calculateOffsets()
^^^^^^^^^^^^^^^^^^^^^^^^
File "P:\A11y\nvda\source\textUtils.py", line 496, in _calculateOffsets
assert normalizedBuffer == normalizedPart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
Expected behavior:
No error
System configuration
NVDA installed/portable/running from source:
Running from source. Installed and portable copies don't hit the assertion, but fail later on in the braille module.
NVDA version:
Master
Additional context
I have tracked this down to behavior in difflib:
- difflib.ndiff uses difflib.Differ under the hood, which itself uses difflib.SequenceMatcher.
- difflib.SequenceMatcher has a parameter autojunk that defaults to True. Neither ndiff nor Differ allows overriding it.
- When autojunk is True, the diff gets scrambled: characters are reported as removed and then inserted again further on, which breaks our logic for handling the diff.
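To make the effect easier to see in isolation, here is a minimal sketch outside NVDA. The sample text is hypothetical, not the string from the steps to reproduce. It relies on the documented behavior that the autojunk heuristic only applies when the second sequence has at least 200 items, and then treats items making up more than roughly 1% of it as "popular".

```python
import difflib
import unicodedata

# Hypothetical sample: 240 characters with a no-break space after every word.
original = "woord\xa0" * 40
# NFKC turns every \xa0 into a plain space.
normalized = unicodedata.normalize("NFKC", original)

with_autojunk = difflib.SequenceMatcher(None, original, normalized)
without_autojunk = difflib.SequenceMatcher(None, original, normalized, autojunk=False)

# bpopular holds the characters the heuristic dropped from the match index.
print(sorted(with_autojunk.bpopular))
print(sorted(without_autojunk.bpopular))

# Compare how each matcher describes the same pair of strings.
ops_on = with_autojunk.get_opcodes()
ops_off = without_autojunk.get_opcodes()
print(len(ops_on), len(ops_off))
```

With autojunk disabled, every word should come back as an equal block and every no-break space as a one-character replace, which is the shape the offset logic expects.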
Possible solution
I'm working on a new implementation based on SequenceMatcher directly. The nice thing about SequenceMatcher is that it offers:
- More context: a diff only offers info about insert/delete/equal, whereas SequenceMatcher also provides replacements. Therefore we can distinguish between modifier reordering and character normalization
- Direct info about offset ranges.
import unicodedata, difflib
difflib.SequenceMatcher(None, "Ééijo", unicodedata.normalize("NFKC", "Ééijo"), False).get_opcodes()
> Returns: [('replace', 0, 5, 0, 4), ('equal', 5, 6, 4, 5)]
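The string literal above may not survive copy and paste, because the opcodes shown imply a six-character input. The sketch below spells the input out with explicit escapes, assuming it was E and e each followed by a combining acute accent (U+0301), the ij ligature (U+0133), and o. The map_offsets helper is hypothetical and deliberately coarse. It only illustrates how the opcode ranges could be turned into an offset mapping, and is not the planned implementation.

```python
import difflib
import unicodedata

# Assumed input: decomposed accents plus the ij ligature, six code points.
original = "E\u0301e\u0301\u0133o"
normalized = unicodedata.normalize("NFKC", original)  # five code points

matcher = difflib.SequenceMatcher(None, original, normalized, autojunk=False)
opcodes = matcher.get_opcodes()
print(opcodes)


def map_offsets(opcodes):
    """Hypothetical helper: map each original offset to a normalized offset."""
    mapping = {}
    for tag, i1, i2, j1, j2 in opcodes:
        for i in range(i1, i2):
            if tag == "equal":
                # Equal ranges map one to one.
                mapping[i] = j1 + (i - i1)
            else:
                # Coarse choice: point the whole changed range at its start.
                mapping[i] = j1
    return mapping


print(map_offsets(opcodes))
```

A real converter would need to split the replace range per character, but the ranges it needs are already present in the opcodes.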