Add pyphen-based hyphenation abstraction layer by LeonarddeR · Pull Request #20145 · nvaccess/nvda

LeonarddeR · 2026-05-16T07:18:08Z

Link to issue number:

Part of #17010. Split out from #19916 at reviewer request — this is the first of three PRs. The braille text-wrap refactor and syllable-aware wrap mode follow in separate PRs.

Summary of the issue:

NVDA lacks a locale-aware hyphenation API. One is needed to implement syllable-boundary braille text wrapping, so that long words can be broken at linguistically correct positions rather than always at the raw display edge.

Description of user facing changes:

None. This PR adds internal infrastructure only; no behaviour changes for users.

Description of developer facing changes:

New public function in textUtils.hyphenation:

def getHyphenPositions(text: str, locale: str) -> tuple[int, ...]

Returns the character offsets within text at which a hyphen may be inserted for the given locale. Returns an empty tuple for locales without a pyphen dictionary, logging a debug message once per locale per map lifetime, so callers can fall back cleanly without raising.
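
As an illustration of how a caller might consume the returned offsets, here is a minimal sketch. The helper `splitAtPositions` and the example offsets are hypothetical; they are chosen by hand and are not pyphen output or code from this PR.

```python
def splitAtPositions(text: str, positions: tuple[int, ...]) -> list[str]:
    """Split text into chunks at the given ascending character offsets."""
    chunks: list[str] = []
    start = 0
    for pos in positions:
        chunks.append(text[start:pos])
        start = pos
    chunks.append(text[start:])
    return chunks


# Hypothetical offsets, for illustration only.
examplePositions = (2, 6)
chunks = splitAtPositions("hyphenation", examplePositions)

# An empty tuple (e.g. a locale without a dictionary) leaves the text whole,
# so the caller can fall back to its default behaviour without special cases.
unsplit = splitAtPositions("hyphenation", ())
```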

A py2exe hook in source/setup.py bundles pyphen's hyph_*.dic files into dist/pyphenDictionaries/ and rewrites pyphen's dictionary lookup path at freeze time so the dictionaries are accessible in frozen builds.
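
The AST rewrite could look roughly like the sketch below. It is not the hook from this PR: `ORIGINAL_SOURCE` is a made-up stand-in for the library source (not pyphen's real code), `DictionariesPathTransformer` and `rewriteSource` are hypothetical names, and resolving the directory relative to `sys.prefix` is an assumption about where a frozen build keeps its files.

```python
import ast

# Hypothetical stand-in for the library source; NOT pyphen's real code.
ORIGINAL_SOURCE = '''
try:
    dictionaries = find_packaged_dictionaries()
except ImportError:
    dictionaries = None
'''


class DictionariesPathTransformer(ast.NodeTransformer):
    """Replace a try block that assigns `dictionaries` with a fixed path."""

    def visit_Try(self, node: ast.Try) -> ast.AST:
        assignsTarget = any(
            isinstance(stmt, ast.Assign)
            and any(
                isinstance(target, ast.Name) and target.id == "dictionaries"
                for target in stmt.targets
            )
            for stmt in node.body
        )
        if not assignsTarget:
            # Keep walking so nested try blocks are still visited.
            return self.generic_visit(node)
        # Assumption: the dictionaries live in a folder next to the frozen app.
        replacement = ast.parse(
            "dictionaries = Path(sys.prefix) / 'pyphenDictionaries'"
        ).body[0]
        return ast.copy_location(replacement, node)


def rewriteSource(source: str) -> str:
    """Return the source with the dictionaries lookup rewritten."""
    tree = DictionariesPathTransformer().visit(ast.parse(source))
    # The rewritten assignment needs these names, so inject the imports.
    tree.body[0:0] = ast.parse("import sys\nfrom pathlib import Path").body
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


rewritten = rewriteSource(ORIGINAL_SOURCE)
```

Rewriting the parsed tree rather than patching the text with string replacement keeps the change robust against whitespace and formatting differences in the library source.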

Description of development approach:

LocaleDataMap (already used in NVDA for locale-aware character processing) handles locale fallback and caching. The _pyphenFactory function deliberately rejects region-subtag fallbacks — e.g. it will not silently serve en dictionaries for an en_US lookup — delegating that fallback logic to LocaleDataMap so region matching stays consistent with the rest of NVDA's locale handling.
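
The division of labour described above can be sketched as follows. This is a simplified model, not NVDA's `LocaleDataMap` or the real `_pyphenFactory`: `_AVAILABLE`, `exactFactory` and `FallbackMap` are hypothetical stand-ins, and the "dictionary" is just a string.

```python
from typing import Callable, Optional

# Hypothetical set of locales that have a dictionary.
_AVAILABLE = {"en", "nl_NL"}


def exactFactory(locale: str) -> str:
    """Return a stand-in dictionary for an exact locale match only."""
    if locale not in _AVAILABLE:
        # No silent region fallback here; the map decides what to try next.
        raise LookupError(locale)
    return f"dict:{locale}"


class FallbackMap:
    """Simplified stand-in for a locale map with fallback and caching."""

    def __init__(self, factory: Callable[[str], str]) -> None:
        self._factory = factory
        self._cache: dict = {}

    def fetch(self, locale: str) -> Optional[str]:
        if locale in self._cache:
            return self._cache[locale]
        # Try the full locale first, then the bare language code.
        candidates = [locale]
        if "_" in locale:
            candidates.append(locale.split("_")[0])
        result = None
        for candidate in candidates:
            try:
                result = self._factory(candidate)
                break
            except LookupError:
                continue
        # Misses are cached too, so repeated lookups stay cheap.
        self._cache[locale] = result
        return result


localeMap = FallbackMap(exactFactory)
```

Keeping the factory strict means there is exactly one place that decides how `en_US` degrades to `en`.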

Testing strategy:

Unit tests cover getHyphenPositions for a known locale (en_US — returns a non-empty tuple of valid positions) and an unknown locale (returns () without raising, idempotent on repeated calls).

Manual testing: confirmed scons.bat dist produces dist/pyphenDictionaries/ containing only hyph_*.dic files.

Known issues with pull request:

None.

Code Review Checklist:

Documentation:
- Change log N/A
- User Documentation N/A
- Developer / Technical Documentation (docstrings added)
- Context sensitive help for GUI changes N/A
Testing:
- Unit tests (added tests/unit/test_hyphenation.py)
- System (end to end) tests N/A
- Manual testing (frozen build verified)
UX of all users considered:
- Braille: foundation for future feature, no observable change yet
- Localization: graceful degradation for unsupported locales
API is compatible with existing add-ons.
Security precautions taken.

Introduces textUtils.hyphenation with getHyphenPositions() — a locale-aware API over the pyphen library. Includes py2exe hook to freeze pyphen dictionaries outside library.zip for frozen builds. Part of nvaccess#17010

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds hyphenation support to NVDA via the pyphen library, including a new utility module, a py2exe build hook to package pyphen dictionaries alongside the frozen executable, and unit tests.

Changes:

- New textUtils.hyphenation module exposing getHyphenPositions backed by a LocaleDataMap cache of Pyphen instances.
- A py2exe hook that copies hyph_*.dic files next to the executable and AST-rewrites pyphen's dictionaries assignment so it resolves at runtime in the frozen build.
- Adds pyphen to project dependencies and unit tests for the new module.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Adds `pyphen` runtime dependency. |
| source/setup.py | Adds `_PyphenTransformer` and `_hook_pyphen` py2exe hook to bundle and relocate pyphen dictionaries. |
| source/textUtils/hyphenation.py | New module providing `getHyphenPositions` with locale-aware caching via `LocaleDataMap`. |
| tests/unit/test_hyphenation.py | Unit tests covering known and unknown locale behaviour. |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

- Add type hints to `_PyphenTransformer.visit_Try`
- Call `generic_visit` on non-matching Try nodes so nested Try nodes are traversed
- Inject `from pathlib import Path` into the rewritten pyphen module

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

seanbudd

Thanks @LeonarddeR

Part of #17010. Split out from #19916 at reviewer request; this is the second of three PRs. The pyphen abstraction layer was shipped in #20145; the syllable-aware wrap mode follows in a separate PR.

Summary of the issue:

The braille word-wrap setting is a single boolean that gives users no control over how words are broken at the display edge. When a word is cut mid-way there is also no visual indication that the word continues on the next row.

Description of user facing changes:

The boolean "word wrap" checkbox in the braille settings has been replaced with a Text wrap combo box with three choices:
- Off: Wrap at the raw edge of the display, cutting words in the middle if necessary. No visual indication that a word was cut.
- Show mark when words are cut: Wrap at the raw edge, but whenever a word is cut mid-way, replace the last cell of the row with a continuation mark (braille dots 7-8) so the reader knows the word continues on the next row.
- At word boundaries: Prefer breaking at spaces. If no space fits on the row, fall back to cutting the word and showing the continuation mark.

Existing config profiles with the old setting are automatically upgraded.

Description of developer facing changes:

- BrailleTextWrapFlag feature flag enum added to config.featureFlagEnums with members DEFAULT, NONE, MARK_WORD_CUTS, AT_WORD_BOUNDARIES.
- Config schema bumped v22 → v23; old wordWrap boolean is deprecated and bridged bidirectionally to textWrap via _linkDeprecatedValues, so add-ons reading or writing the old key keep working (with a deprecation warning).
- CONTINUATION_SHAPE = 0xC0 (dots 7-8) constant added to braille.
- _WindowRowPositions frozen dataclass added to braille to hold the start/end buffer positions and continuation-mark flag for each row of the braille window, replacing the previous anonymous tuple.

Description of development approach:

The continuation mark is unified: it consistently means "a word was cut here" regardless of mode, so readers get a predictable signal.
BrailleBuffer._calculateWindowRowBufferOffsets is extended to implement all three modes. Each entry in _windowRowBufferOffsets is a _WindowRowPositions instance whose showContinuationMark field records whether that row needs a continuation mark. BrailleBuffer._get_windowBrailleCells reads that flag to insert the mark. BrailleBuffer._set_windowEndPos short-circuits space-seeking for NONE and MARK_WORD_CUTS modes (backwards scroll alignment).
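
To make the three modes concrete, here is a simplified sketch that works on plain text with one character per cell. It is not NVDA's `_calculateWindowRowBufferOffsets`: `WrapMode`, `RowPositions`, `isCut` and `calculateRows` are hypothetical names, and the handling of edge cases (such as a row that no longer cuts a word once a cell is reserved for the mark) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class WrapMode(Enum):
    # Hypothetical stand-in for the three modes described above.
    NONE = auto()
    MARK_WORD_CUTS = auto()
    AT_WORD_BOUNDARIES = auto()


@dataclass(frozen=True)
class RowPositions:
    start: int
    end: int
    showContinuationMark: bool


def isCut(text: str, pos: int) -> bool:
    """True when ending a row before index pos would divide a word."""
    return 0 < pos < len(text) and text[pos - 1] != " " and text[pos] != " "


def calculateRows(text: str, width: int, mode: WrapMode) -> list[RowPositions]:
    if width < 2:
        raise ValueError("width must leave room for a continuation mark")
    rows: list[RowPositions] = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        mark = False
        if isCut(text, end) and mode is WrapMode.AT_WORD_BOUNDARIES:
            # Prefer the last space on this row, if there is one.
            lastSpace = text.rfind(" ", start, end)
            if lastSpace != -1:
                end = lastSpace + 1
        if isCut(text, end) and mode is not WrapMode.NONE:
            # The mark takes the last cell, so one less character fits.
            end -= 1
            mark = isCut(text, end)
        rows.append(RowPositions(start, end, mark))
        start = end
    return rows


sample = "hello wonderful"
rowsNone = calculateRows(sample, 8, WrapMode.NONE)
rowsMark = calculateRows(sample, 8, WrapMode.MARK_WORD_CUTS)
rowsWord = calculateRows(sample, 8, WrapMode.AT_WORD_BOUNDARIES)
```

In all modes the flag travels with the row positions, so the code that renders cells only has to check `showContinuationMark` and does not need to know which mode produced the row.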

Closes #17010

Follow-up for #20146 and #20145. This is the last of three PRs replacing #19916.

Summary of the issue:

Word wrap is sometimes pretty aggressive, especially on shorter braille displays. The previous two PRs added the text wrap infrastructure and continuation marks; this PR adds the final mode that splits long words at syllable boundaries using hyphenation dictionaries.

Description of user facing changes:

A fourth option, At word or syllable boundaries, is added to the Text wrap combo box in braille settings. Like "At word boundaries", it avoids splitting words mid-way, but when a word is too long to fit on the display it additionally tries to split at a syllable boundary (using hyphenation dictionaries from the pyphen library) so less of the word spills onto the next row. NVDA marks the split with the continuation mark (braille dots 7-8). For locales without a pyphen dictionary, the mode falls back cleanly to word-boundary behaviour without any error.

Description of developer facing changes:

- BrailleTextWrapFlag.AT_WORD_OR_SYLLABLE_BOUNDARIES member added to config.featureFlagEnums.
- Region._languageIndexes (dict[int, str]) tracks language-span boundaries within a braille region. It is populated during _addFieldText and _addTextWithFields when format fields carry a language attribute or when field text is in a different language than the surrounding content.
- Region._getLanguageAtPos(pos) looks up the language at a raw-text offset using a bisect on the (always-ascending) keys of _languageIndexes.
- BrailleBuffer._getLanguageAtBufferPos(pos) delegates to the region that owns that braille cell.
- louisHelper.getTableLanguage(table) queries louis.getTableInfo for the "language" key and normalises the result, providing the default language for a region when no format-field language is known.

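
The bisect-based language lookup can be illustrated with a small standalone function. This is a sketch under assumptions, not the method from the PR: `getLanguageAtPos` is a hypothetical free function, the `default` parameter stands in for the table language, and the sample spans are invented.

```python
import bisect


def getLanguageAtPos(languageIndexes: dict[int, str], pos: int, default: str) -> str:
    """Return the language of the span that contains pos.

    languageIndexes maps the start offset of each span to its language.
    Keys are assumed to have been inserted in ascending order.
    """
    starts = list(languageIndexes)
    # Find the last span start that is less than or equal to pos.
    index = bisect.bisect_right(starts, pos) - 1
    if index < 0:
        # pos lies before the first recorded span.
        return default
    return languageIndexes[starts[index]]


# Invented spans: English text with a French phrase from offset 12 to 29.
languageIndexes = {0: "en", 12: "fr", 30: "en"}
```

Because only span starts are stored, the mapping stays small even for long regions, and each lookup costs a binary search rather than a scan.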
Description of development approach:

When AT_WORD_OR_SYLLABLE_BOUNDARIES is selected and a word straddles a row boundary, _calculateWindowRowBufferOffsets already finds the last space before the display edge. This PR adds a second pass: it looks up the full word (from that space to the next space), retrieves the language at the word's braille position, and calls textUtils.hyphenation.getHyphenPositions (introduced in #20145) to obtain candidate hyphen offsets. It then iterates the candidates from the end (closest to the display edge) and picks the first that falls within the current row, updating end accordingly and setting showContinuationMark.

Language tracking in Region ensures that the correct pyphen dictionary is selected even when a braille region contains multilingual content (e.g. a paragraph with inline foreign phrases).
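
The candidate selection in the second pass might be sketched like this. The function `pickSyllableSplit` is hypothetical, the offsets are invented, and two simplifications are assumed: text offsets map one-to-one onto braille cells, and one cell at the end of the row is reserved for the continuation mark.

```python
from typing import Optional


def pickSyllableSplit(
    hyphenPositions: tuple[int, ...],
    wordStart: int,
    rowEnd: int,
) -> Optional[int]:
    """Return the buffer offset of the best split, or None if none fits.

    hyphenPositions are ascending offsets relative to the start of the word.
    """
    # Assumption: the last cell of the row is needed for the continuation mark.
    limit = rowEnd - 1
    # Walk from the end so the split closest to the display edge wins.
    for offset in reversed(hyphenPositions):
        splitPos = wordStart + offset
        if splitPos <= limit:
            return splitPos
    return None


# Invented example: a word starting at offset 6 with candidate splits
# after its 3rd and 6th characters.
candidates = (3, 6)
```

When the function returns None, the caller can keep the word-boundary result it already computed, which matches the fallback described for locales without a dictionary.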

LeonarddeR added 2 commits May 15, 2026 09:14

feat: add pyphen-based hyphenation abstraction layer

a21cd25

Remove issue number prefixes from test docstrings

997e0ff

LeonarddeR mentioned this pull request May 16, 2026

Add braille text wrap modes with continuation marks #20146

Merged

5 tasks

LeonarddeR marked this pull request as ready for review May 16, 2026 07:41

LeonarddeR requested a review from a team as a code owner May 16, 2026 07:41

LeonarddeR requested review from Copilot and seanbudd May 16, 2026 07:41

Copilot AI reviewed May 16, 2026

Comment thread pyproject.toml Outdated

Comment thread source/textUtils/hyphenation.py

Comment thread source/setup.py Outdated

Comment thread source/setup.py

LeonarddeR mentioned this pull request May 16, 2026

Hyphenate Braille using pyphen #19916

Closed

12 tasks

Potential fix for pull request finding

f4b4725

seanbudd reviewed May 18, 2026

Comment thread source/setup.py

Comment thread source/setup.py Outdated

LeonarddeR and others added 3 commits May 18, 2026 20:05

Fixup

d44a0de

Merge remote-tracking branch 'origin/master' into pyphen-abstraction

6b3aaab

LeonarddeR mentioned this pull request May 18, 2026

Add hook for pyphen py2exe/py2exe#242

Merged

Ensure textUtils is bundled, otherwise pyphen will be ignored

7cebc58

seanbudd added the conceptApproved label (similar to 'triaged' for issues: PR accepted in theory, implementation needs review) May 18, 2026

seanbudd approved these changes May 19, 2026

seanbudd merged commit ad48676 into nvaccess:master May 19, 2026
35 of 39 checks passed

github-actions Bot added this to the 2026.3 milestone May 19, 2026

LeonarddeR mentioned this pull request May 20, 2026

Add syllable-boundary braille text wrap using pyphen hyphenation #20186

Merged

5 tasks

seanbudd merged 7 commits into nvaccess:master from LeonarddeR:pyphen-abstraction
