Braille unicode normalization unreliable for longer texts with many non-breaking spaces and repeating characters #16640

@LeonarddeR

Steps to reproduce:

import textUtils; textUtils.UnicodeNormalizationOffsetConverter('ratie \nen/of rapportage?\xa0\r\n·Structurele taken\xa0of \nscripts\xa0die\xa0periodiek moeten \ndraaien\xa0en\xa0waar\xa0controle\xa0/\xa0logging\xa0omheen \nmoet?\xa0\r\n\xa0\r\nZo ja,\xa0waarmee dient er rekening gehouden te \nworden? Benoem\xa0ook\xa0i')

Actual behavior:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "P:\A11y\nvda\source\textUtils.py", line 439, in __init__
    self.computedStrToEncodedOffsets, self.computedEncodedToStrOffsets = self._calculateOffsets()
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "P:\A11y\nvda\source\textUtils.py", line 496, in _calculateOffsets
    assert normalizedBuffer == normalizedPart
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Expected behavior:

No error

System configuration

NVDA installed/portable/running from source:

Running from source. Installed and portable copies don't hit the assertion, but fail later on in the braille module.

NVDA version:

Master

Additional context

I have tracked this down to behavior in difflib:

  1. difflib.ndiff uses difflib.Differ under the hood, which itself uses difflib.SequenceMatcher
  2. difflib.SequenceMatcher has an autojunk parameter that defaults to True. Neither ndiff nor Differ offers a way to override it.
  3. When autojunk is True, the diff gets scrambled: characters are reported as removed and then inserted again further on, which breaks our logic for handling the diff.
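
The effect described above can be reproduced in isolation. The following is a minimal sketch, not NVDA code: the sample text is hypothetical and chosen only because it is longer than 200 characters and highly repetitive, which is when the autojunk heuristic documented for difflib.SequenceMatcher treats frequent items as "popular" and stops using them to seed matches.

```python
import difflib
import unicodedata

# Hypothetical sample: 240 characters, with a non-breaking space after every word.
original = "woord\xa0" * 40
# NFKC turns each non-breaking space into a plain space.
normalized = unicodedata.normalize("NFKC", original)


def equal_chars(autojunk: bool) -> int:
    """Count the characters that SequenceMatcher reports as unchanged."""
    matcher = difflib.SequenceMatcher(None, original, normalized, autojunk=autojunk)
    return sum(
        i2 - i1
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag == "equal"
    )


with_autojunk = difflib.SequenceMatcher(None, original, normalized, autojunk=True)
without_autojunk = difflib.SequenceMatcher(None, original, normalized, autojunk=False)

# bpopular holds the items the heuristic excluded from match seeding.
print("popular with autojunk:", sorted(with_autojunk.bpopular))
print("popular without autojunk:", sorted(without_autojunk.bpopular))
print("equal chars with autojunk:", equal_chars(True))
print("equal chars without autojunk:", equal_chars(False))
```

With autojunk enabled, far fewer characters are reported as equal, so unchanged text ends up inside large replace blocks instead of being matched one to one.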

Possible solution

I'm working on a new implementation based on SequenceMatcher directly. The nice thing about SequenceMatcher is that it offers:

  1. More context: a diff only offers info about insert/delete/equal, whereas SequenceMatcher also provides replacements. Therefore we can distinguish between modifier reordering and character normalization
  2. Direct info about offset ranges.

For example, with autojunk disabled:

import difflib
import unicodedata
difflib.SequenceMatcher(None, "Ééijo", unicodedata.normalize("NFKC", "Ééijo"), autojunk=False).get_opcodes()
# Returns: [('replace', 0, 5, 0, 4), ('equal', 5, 6, 4, 5)]
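
To illustrate how the opcode ranges could drive an offset mapping, here is a rough sketch. It is not the implementation being worked on: the function name map_offsets, its return shape, and the clamping rule for replaced ranges are all assumptions made for this example.

```python
import difflib
import unicodedata


def map_offsets(original: str, form: str = "NFKC") -> tuple[str, list[int]]:
    """Hypothetical helper: map each offset in `original` to an offset in its normalized form."""
    normalized = unicodedata.normalize(form, original)
    # autojunk=False avoids the popularity heuristic on long, repetitive input.
    matcher = difflib.SequenceMatcher(None, original, normalized, autojunk=False)
    mapping = [0] * len(original)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "insert" opcodes have i1 == i2, so they contribute no original offsets.
        for i in range(i1, i2):
            if tag == "equal":
                # Unchanged run: offsets advance in lockstep.
                mapping[i] = j1 + (i - i1)
            else:
                # "replace" or "delete": clamp to the last normalized offset
                # of the range (an assumption made for this sketch).
                mapping[i] = min(j1 + (i - i1), max(j2 - 1, j1))
    return normalized, mapping


# "e" + combining acute accent, space, "a", non-breaking space, "b"
sample = "e\u0301 a\xa0b"
normalized_sample, offsets = map_offsets(sample)
print(repr(normalized_sample), offsets)
```

Both characters of the decomposed "é" map to the single composed character, while the non-breaking space maps to the plain space that replaces it.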

Labels

  * component/braille
  * p3 (https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority)
  * rare / intermittent bug (cannot be easily reproduced, bug happens intermittently)
  * triaged (Has been triaged, issue is waiting for implementation.)
