Braille unicode normalization unreliable for longer texts with many non-breaking spaces and repeating characters #16640

### Steps to reproduce:
```Python
import textUtils; textUtils.UnicodeNormalizationOffsetConverter('ratie \nen/of rapportage?\xa0\r\n·Structurele taken\xa0of \nscripts\xa0die\xa0periodiek moeten \ndraaien\xa0en\xa0waar\xa0controle\xa0/\xa0logging\xa0omheen \nmoet?\xa0\r\n\xa0\r\nZo ja,\xa0waarmee dient er rekening gehouden te \nworden? Benoem\xa0ook\xa0i')
```

### Actual behavior:
```
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "P:\A11y\nvda\source\textUtils.py", line 439, in __init__
    self.computedStrToEncodedOffsets, self.computedEncodedToStrOffsets = self._calculateOffsets()
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "P:\A11y\nvda\source\textUtils.py", line 496, in _calculateOffsets
    assert normalizedBuffer == normalizedPart
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
```

### Expected behavior:
No error

### System configuration
#### NVDA installed/portable/running from source:
Running from source. Installed and portable copies don't hit the assertion, but fail later on in the braille module.

#### NVDA version:
Master

### Additional context
I have tracked this down to behavior in difflib:
1. `difflib.ndiff` uses `difflib.Differ` under the hood, which itself uses `difflib.SequenceMatcher`
2. `difflib.SequenceMatcher` has a parameter `autojunk` that defaults to `True`. Neither `ndiff` nor `Differ` allows overriding it
3. When `autojunk` is `True` and the sequence is at least 200 items long, characters that repeat often are treated as junk. The diff then gets scrambled: characters are reported as removed and then inserted again further on, which breaks our logic for handling the diff
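
The effect of `autojunk` can be shown in isolation with a minimal sketch. The two strings below are contrived for illustration and are not taken from NVDA. They only need to be at least 200 characters long and to repeat one character (here a non-breaking space) in more than roughly 1% of positions, which is the heuristic documented for `difflib`.

```python
import difflib

NBSP = "\xa0"
# Contrived inputs: both exceed 200 characters and consist almost entirely of NBSP.
original = "A" + NBSP * 250
changed = NBSP * 250 + "B"

# autojunk defaults to True when it is not passed explicitly.
with_autojunk = difflib.SequenceMatcher(None, original, changed)
without_autojunk = difflib.SequenceMatcher(None, original, changed, autojunk=False)

# Characters flagged as "popular" are left out of the matcher's lookup index.
print(with_autojunk.bpopular)
print(without_autojunk.bpopular)

# Compare what each matcher reports for the shared run of 250 NBSPs.
print(with_autojunk.get_opcodes())
print(without_autojunk.get_opcodes())
```

With the default settings the matcher reports a single `replace` spanning both strings, while `autojunk=False` finds the 250 character `equal` run with one `delete` before it and one `insert` after it.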

### Possible solution
I'm working on a new implementation based on `SequenceMatcher` directly. The nice thing about `SequenceMatcher` is that it offers:

1. More context: a diff only offers info about insert/delete/equal, whereas `SequenceMatcher` also reports `replace` opcodes. Therefore we can distinguish between modifier reordering and character normalization.
2. Direct info about offset ranges.

```Python
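import difflib, unicodedata
text = "E\u0301e\u0301\u0133o"  # decomposed "É" and "é", then the "ĳ" ligature and "o"
normalized = unicodedata.normalize("NFKC", text)
print(difflib.SequenceMatcher(None, text, normalized, autojunk=False).get_opcodes())
# Prints: [('replace', 0, 5, 0, 4), ('equal', 5, 6, 4, 5)]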
```
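
To illustrate how the offset ranges in the opcodes could be turned into an offset map, here is a rough sketch. The function name `map_offsets` and its behavior are hypothetical and are not the NVDA implementation. In particular, it maps every character inside a `replace` or `delete` range to the start of the corresponding normalized range, which is coarser than what a real converter would need.

```python
import difflib
import unicodedata


def map_offsets(text: str, form: str = "NFKC") -> tuple[str, list[int]]:
    """Hypothetical helper: return the normalized text plus, for every
    offset in the original text, an offset into the normalized text."""
    normalized = unicodedata.normalize(form, text)
    # autojunk=False avoids the "popular character" heuristic described above.
    matcher = difflib.SequenceMatcher(None, text, normalized, autojunk=False)
    mapping = [0] * len(text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            if tag == "equal":
                # Unchanged characters map one to one.
                mapping[i] = j1 + (i - i1)
            else:
                # "replace"/"delete": assumption for this sketch, point at
                # the start of the normalized range.
                mapping[i] = j1
    return normalized, mapping


normalized, mapping = map_offsets("E\u0301e\u0301\u0133o")
print(normalized, mapping)
```

An `insert` opcode covers no characters of the original text, so it needs no entry in this direction; the reverse map would have to handle it.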
