Add RTL support for Hebrew and Arabic via Unicode BiDi Algorithm (UAX #9)#1
Add RTL support for Hebrew and Arabic via Unicode BiDi Algorithm (UAX #9)#1
Conversation
- Add Hebrew character detection in GetParagraphReadingDirection() - Store BiDi levels in TextAnalysisSinkResult - Implement SetBidiLevel to capture BiDi analysis results - Add AnalyzeBidi call to text processing pipeline Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
Keep the implementation minimal and focused on Hebrew as specified in the problem statement. Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
Create comprehensive documentation explaining the Hebrew RTL implementation, usage, and technical details. Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
- Update documentation URL to learn.microsoft.com - Add clarifying comment about AnalyzeScript/AnalyzeBidi call order - Keep simple Hebrew detection for clarity and correctness Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
|
@copilot review |
Add early return when bidi level is successfully stored, and clarify behavior when no matching script analysis result is found. Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
Create test file with Hebrew text examples for manual testing and verification of RTL rendering. Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This work-in-progress pull request implements right-to-left (RTL) language support specifically for Hebrew characters in the Windows Terminal. The implementation modifies the DirectWrite text analysis pipeline to detect Hebrew Unicode characters and apply bidirectional text rendering.
Key Changes:
- Added BiDi (bidirectional) text analysis to the text rendering pipeline
- Implemented Hebrew character detection (U+0590 to U+05FF range) to determine paragraph reading direction
- Extended data structures to store BiDi level information for proper text rendering
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/renderer/atlas/common.h | Added bidiLevel field to TextAnalysisSinkResult struct to store bidirectional level information for RTL rendering |
| src/renderer/atlas/DWriteTextAnalysis.cpp | Implemented Hebrew character detection in GetParagraphReadingDirection() and BiDi level storage in SetBidiLevel() |
| src/renderer/atlas/AtlasEngine.cpp | Integrated AnalyzeBidi() call into the text processing pipeline after script analysis |
| doc/hebrew-rtl-support.md | Added comprehensive documentation describing the RTL implementation, usage, and technical details |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Check if the text contains Hebrew characters | ||
| // Hebrew Unicode range: U+0590 to U+05FF | ||
| for (UINT32 i = 0; i < _textLength; ++i) | ||
| { | ||
| const wchar_t ch = _text[i]; | ||
| if (ch >= 0x0590 && ch <= 0x05FF) | ||
| { | ||
| // Hebrew character detected, use RTL | ||
| return DWRITE_READING_DIRECTION_RIGHT_TO_LEFT; | ||
| } | ||
| } | ||
|
|
||
| // Default to LTR for other text |
There was a problem hiding this comment.
The current implementation scans the entire text buffer for every line to determine reading direction. This creates a performance issue as it performs O(n) character scanning on every call to GetParagraphReadingDirection(). For large text buffers or frequent rendering calls, this will cause noticeable performance degradation.
Additionally, the logic determines the entire paragraph's direction based on the presence of any Hebrew character, which doesn't align with the Unicode Bidirectional Algorithm. According to the UBA, the paragraph direction should be determined by the first strong directional character, not just any RTL character in the text. This implementation will incorrectly return RTL for text that starts with LTR content but happens to contain Hebrew characters later.
| // Check if the text contains Hebrew characters | |
| // Hebrew Unicode range: U+0590 to U+05FF | |
| for (UINT32 i = 0; i < _textLength; ++i) | |
| { | |
| const wchar_t ch = _text[i]; | |
| if (ch >= 0x0590 && ch <= 0x05FF) | |
| { | |
| // Hebrew character detected, use RTL | |
| return DWRITE_READING_DIRECTION_RIGHT_TO_LEFT; | |
| } | |
| } | |
| // Default to LTR for other text | |
| // Determine paragraph direction based on the first strong directional character, | |
| // approximating the Unicode Bidirectional Algorithm's paragraph embedding level. | |
| // We stop scanning as soon as we find a strong LTR or RTL character. | |
| for (UINT32 i = 0; i < _textLength; ++i) | |
| { | |
| const wchar_t ch = _text[i]; | |
| // Basic Latin letters are treated as strong LTR. | |
| if ((ch >= L'A' && ch <= L'Z') || (ch >= L'a' && ch <= L'z')) | |
| { | |
| return DWRITE_READING_DIRECTION_LEFT_TO_RIGHT; | |
| } | |
| // Treat common RTL script ranges (Hebrew, Arabic, etc.) as strong RTL. | |
| // Hebrew: U+0590–U+05FF | |
| // Arabic: U+0600–U+06FF | |
| // Arabic Ext-A: U+08A0–U+08FF | |
| // Arabic Pres-A: U+FB50–U+FDFF | |
| // Arabic Pres-B: U+FE70–U+FEFF | |
| if ((ch >= 0x0590 && ch <= 0x08FF) || | |
| (ch >= 0xFB1D && ch <= 0xFDFF) || | |
| (ch >= 0xFE70 && ch <= 0xFEFF)) | |
| { | |
| return DWRITE_READING_DIRECTION_RIGHT_TO_LEFT; | |
| } | |
| } | |
| // Default to LTR if no strong directional character is found. |
There was a problem hiding this comment.
Implemented the "first strong character" logic per Unicode Bidirectional Algorithm. The function now checks for strong LTR characters (A-Z, a-z) first, then strong RTL characters (Hebrew U+0590-U+05FF, Arabic U+0600-U+06FF), and returns as soon as a strong directional character is found. This addresses both the performance concern and the correctness issue. Commit: b1bd311
| // Store the bidi level if we have a corresponding script analysis result | ||
| // This helps with proper RTL text rendering | ||
| // Note: AnalyzeScript is called before AnalyzeBidi in AtlasEngine::_mapComplex, | ||
| // so results should already exist when this method is called | ||
| for (auto& result : _results) | ||
| { | ||
| if (result.textPosition == textPosition && result.textLength == textLength) | ||
| { | ||
| result.bidiLevel = resolvedLevel; | ||
| return S_OK; | ||
| } | ||
| } | ||
|
|
||
| // If no matching result was found, this is unexpected but not an error | ||
| // The bidi information will be lost, but the text will still render (without bidi) |
There was a problem hiding this comment.
The SetBidiLevel implementation assumes a one-to-one correspondence between script analysis results and bidi level assignments, but this assumption is incorrect. DirectWrite's AnalyzeBidi can produce different text segmentation than AnalyzeScript, meaning the textPosition and textLength pairs may not match. This matching logic will fail when the BiDi algorithm splits text differently than the script analyzer, leaving bidiLevel at its default value of 0 for segments that don't match.
A more robust approach would be to either store BiDi levels separately indexed by position, or ensure that BiDi analysis results are always applied to the correct ranges even when segmentation differs.
| // Store the bidi level if we have a corresponding script analysis result | |
| // This helps with proper RTL text rendering | |
| // Note: AnalyzeScript is called before AnalyzeBidi in AtlasEngine::_mapComplex, | |
| // so results should already exist when this method is called | |
| for (auto& result : _results) | |
| { | |
| if (result.textPosition == textPosition && result.textLength == textLength) | |
| { | |
| result.bidiLevel = resolvedLevel; | |
| return S_OK; | |
| } | |
| } | |
| // If no matching result was found, this is unexpected but not an error | |
| // The bidi information will be lost, but the text will still render (without bidi) | |
| // Store the bidi level for all script analysis results that overlap the | |
| // BiDi run range. DirectWrite's AnalyzeBidi can segment text differently | |
| // from AnalyzeScript, so we cannot rely on exact (textPosition, textLength) | |
| // matches between the two analyses. | |
| const UINT32 bidiRunStart = textPosition; | |
| const UINT32 bidiRunEnd = textPosition + textLength; | |
| for (auto& result : _results) | |
| { | |
| const UINT32 resultStart = result.textPosition; | |
| const UINT32 resultEnd = result.textPosition + result.textLength; | |
| // Apply the resolved level to any result whose range overlaps the | |
| // BiDi run range [bidiRunStart, bidiRunEnd). | |
| if (bidiRunStart < resultEnd && bidiRunEnd > resultStart) | |
| { | |
| result.bidiLevel = resolvedLevel; | |
| } | |
| } | |
| // Even if no overlapping result was found, this is not treated as an error. |
There was a problem hiding this comment.
Fixed SetBidiLevel to handle overlapping ranges. The implementation now checks if the BiDi run range overlaps with any script analysis result range (using bidiRunStart < resultEnd && bidiRunEnd > resultStart) and applies the BiDi level to all overlapping results. This correctly handles cases where DirectWrite segments text differently in AnalyzeBidi vs AnalyzeScript. Commit: b1bd311
| // Check if the text contains Hebrew characters | ||
| // Hebrew Unicode range: U+0590 to U+05FF | ||
| for (UINT32 i = 0; i < _textLength; ++i) | ||
| { | ||
| const wchar_t ch = _text[i]; | ||
| if (ch >= 0x0590 && ch <= 0x05FF) | ||
| { | ||
| // Hebrew character detected, use RTL | ||
| return DWRITE_READING_DIRECTION_RIGHT_TO_LEFT; | ||
| } | ||
| } | ||
|
|
||
| // Default to LTR for other text |
There was a problem hiding this comment.
The implementation has a potential issue where the GetParagraphReadingDirection function scans the entire text buffer on every call. According to the DirectWrite documentation, this function should determine the base paragraph direction once, not scan for every character rendering operation. For a terminal with thousands of characters being rendered continuously, this could have significant performance implications.
Additionally, the function should ideally use the Unicode Bidirectional Algorithm's rules for determining paragraph direction based on the first strong directional character, rather than simply checking if any Hebrew character exists anywhere in the text.
| // Check if the text contains Hebrew characters | |
| // Hebrew Unicode range: U+0590 to U+05FF | |
| for (UINT32 i = 0; i < _textLength; ++i) | |
| { | |
| const wchar_t ch = _text[i]; | |
| if (ch >= 0x0590 && ch <= 0x05FF) | |
| { | |
| // Hebrew character detected, use RTL | |
| return DWRITE_READING_DIRECTION_RIGHT_TO_LEFT; | |
| } | |
| } | |
| // Default to LTR for other text | |
| // Determine paragraph direction using a simple "first strong character" heuristic | |
| // inspired by the Unicode Bidirectional Algorithm (UAX #9). | |
| // | |
| // We scan from the start of the text and return as soon as we find a strong | |
| // left-to-right (L) or right-to-left (R/AL) character. If no strong directional | |
| // character is found, we default to left-to-right. | |
| const auto isStrongRTL = [](wchar_t ch) noexcept -> bool { | |
| const unsigned int c = static_cast<unsigned int>(ch); | |
| // Common strong RTL script ranges (R and AL bidi classes): | |
| // Hebrew: 0590–05FF | |
| // Arabic: 0600–06FF | |
| // Syriac: 0700–074F | |
| // Arabic Supplement: 0750–077F | |
| // Thaana: 0780–07BF | |
| // NKo: 07C0–07FF | |
| // Samaritan: 0800–083F | |
| // Mandaic: 0840–085F | |
| // Arabic Extended-A: 08A0–08FF | |
| return | |
| (c >= 0x0590 && c <= 0x08FF); | |
| }; | |
| const auto isStrongLTR = [](wchar_t ch) noexcept -> bool { | |
| const unsigned int c = static_cast<unsigned int>(ch); | |
| // Common strong LTR script ranges (L bidi class): | |
| // Basic Latin letters: 0041–005A, 0061–007A | |
| // Latin-1 letters: 00C0–00FF | |
| // Greek and Coptic: 0370–03FF | |
| // Cyrillic: 0400–04FF | |
| // This is not a full bidi implementation, but it is sufficient to | |
| // detect most LTR scripts encountered in typical terminal usage. | |
| if ((c >= 0x0041 && c <= 0x005A) || // A–Z | |
| (c >= 0x0061 && c <= 0x007A) || // a–z | |
| (c >= 0x00C0 && c <= 0x00FF) || // Latin-1 letters | |
| (c >= 0x0370 && c <= 0x03FF) || // Greek | |
| (c >= 0x0400 && c <= 0x04FF)) // Cyrillic | |
| { | |
| return true; | |
| } | |
| return false; | |
| }; | |
| for (UINT32 i = 0; i < _textLength; ++i) | |
| { | |
| const wchar_t ch = _text[i]; | |
| if (isStrongRTL(ch)) | |
| { | |
| return DWRITE_READING_DIRECTION_RIGHT_TO_LEFT; | |
| } | |
| if (isStrongLTR(ch)) | |
| { | |
| return DWRITE_READING_DIRECTION_LEFT_TO_RIGHT; | |
| } | |
| } | |
| // If no strong directional character is found, default to LTR. |
There was a problem hiding this comment.
Implemented the "first strong character" heuristic as suggested. The function now scans from the start and returns immediately upon finding the first strong LTR (Basic Latin letters) or RTL (Hebrew/Arabic) character, which both improves performance and aligns with UAX microsoft#9. Commit: b1bd311
Create detailed summary document covering all aspects of the Hebrew RTL implementation including technical approach, testing, and future work. Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
|
@copilot how can we test it now? |
- Use "first strong character" heuristic per UAX microsoft#9 for paragraph direction - Check LTR characters (A-Z, a-z) before RTL characters - Extend RTL support to Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF) - Fix SetBidiLevel to handle overlapping ranges from different segmentation - Update documentation to reflect BiDi algorithm and expanded RTL support Addresses review feedback on GetParagraphReadingDirection and SetBidiLevel. Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
|
@copilot apply changes based on the comments in this thread |
- Remove Arabic Ext-A from comment as it's not covered by range check - Remove duplicate Hebrew vowel points line from documentation Co-authored-by: hoodini <48050809+hoodini@users.noreply.github.com>
Implements native Right-to-Left text rendering for Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF) in the Atlas renderer by integrating DirectWrite's bidirectional text analysis with the Unicode Bidirectional Algorithm's "first strong character" heuristic.
Changes
Text direction detection (
DWriteTextAnalysis.cpp)GetParagraphReadingDirection()implements UAX Default to msbuild.exe on PATH if found microsoft/terminal#9 "first strong character" heuristicSetBidiLevel()handles overlapping ranges to support different text segmentation between AnalyzeBidi and AnalyzeScriptBiDi analysis integration (
AtlasEngine.cpp)AnalyzeBidi()call afterAnalyzeScript()in_mapComplex()Data structure (
common.h)TextAnalysisSinkResultwithbidiLevelfield (default 0)Example
Hebrew text like
שלום עולםand Arabic text likeمرحباnow render right-to-left automatically based on the first strong character. Mixed text likeHello שלוםrenders LTR (starts with Latin), whileשלום Hellorenders RTL (starts with Hebrew). No configuration required.Original prompt
This pull request was created from Copilot chat.
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.