Enhanced text layout: links, thoughts and discussion

I've been looking at implementing some alternative hanging punctuation code, as envisioned in https://github.com/koreader/koreader/issues/2844#issuecomment-464483142.

I figured I may need, in lvtextfm.cpp, some alternative methods for laying out lines and spacing words (but just that, not redesigning the whole thing!). So I began looking at related topics like line breaking, bidirectional text and proper CJK text layout, at least to see how they could fit in, and to avoid taking some wrong directions early that would prevent working on these additional features later.

Sadly, I know nothing about languages and writing systems other than western ones... So I have tons of questions for CJK and RTL readers, which I may ask later in this issue, if some are willing to help with that :)
I have no personal use for all that, as I only read western scripts, but these are quite interesting topics :)
I sometimes think that it could be simple to get these right by just using the appropriate third-party libraries. But at other moments, I feel that even the libraries won't do all that correctly, and that much manual tweaking may be needed, possibly per language... So, it ends up feeling like opening a can of worms...

Terminology:
**CJK** = Chinese, Japanese and Korean
**RTL** = Right To Left (Arabic, Persian, Hebrew... scripts)
**LTR** = Left To Right (Latin, western languages, CJK...)
**Bidi** = Bidirectional text (LTR and RTL mixed)
For now, I'm just cutting and pasting and organizing my accumulation of URLs and thoughts.

#### Unicode text layout references and algorithms:
http://www.unicode.org/reports/tr14/ UAX#14: Unicode Line Breaking Algorithm
&nbsp; http://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt reference file
&nbsp; http://jkorpela.fi/unicode/linebr.html Unicode line breaking rules: explanations and criticism
https://www.unicode.org/reports/tr29/ UAX#29: Unicode Text Segmentation
http://www.unicode.org/reports/tr9/ UAX#9: Unicode Bidirectional Algorithm
http://www.unicode.org/reports/tr11/ UAX#11: East Asian Width
https://www.w3.org/TR/jlreq/ Requirements for Japanese Text Layout
https://www.w3.org/TR/clreq/ Requirements for Chinese Text Layout
https://w3c.github.io/typography/ International text layout and typography index (links)
https://unicode.org/cldr/utility/breaks.jsp Unicode Utilities (to test algorithms output)

https://drafts.csswg.org/css-text-3/ CSS take on all that enhanced typography
&nbsp; Appendices D, E and F give some insight about _writing systems_ and the importance of the `lang=` attribute

#### Sites with valuable information about foreign scripts, languages, typography and chars
https://r12a.github.io/scripts/ **Wonderful and complete descriptions of each script, usage, layout**
&nbsp; https://r12a.github.io/scripts/phrases Sample phrases in various scripts
&nbsp; https://r12a.github.io/scripts/tutorial/summaries/wrapping Sample phrases for testing wrapping
http://www.alanwood.net/unicode/index.html Dated, but very complete
http://jkorpela.fi/chars/index.html Characters and encodings
&nbsp; http://jkorpela.fi/chars/spaces.html http://jkorpela.fi/dashes.html
https://jrgraphix.net/research/unicode.php Unicode Character Ranges
&nbsp; https://unicode.org/charts/
&nbsp; http://unifoundry.com/unifont/index.html large single image of the full unicode planes

#### Line breaking & justification
https://en.wikipedia.org/wiki/Line_wrap_and_word_wrap
https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages
https://www.w3.org/International/articles/css3-text/ CSS and International Text (line breaking and text alignment)
https://www.w3.org/International/articles/typography/justification Approaches to full justification
http://w3c.github.io/i18n-drafts/articles/typography/linebreak.en Approaches to line breaking
https://www.w3.org/TR/2003/CR-css3-text-20030514/#justification-prop CSS justification options, describing the various ways to justify appropriately for some scripts

https://github.com/bramstein/typeset/ TeX line breaking algorithm in JavaScript
https://wiki.mozilla.org/Gecko:Line_Breaking Mozilla documentation about line breaking (obsolete? mention it should switch to UAX#14) implemented in https://github.com/mozilla-services/services-central-legacy/blob/master/intl/lwbrk/src/nsJISx4501LineBreaker.cpp

#### Hanging punctuation / Optical margin alignment

https://en.wikipedia.org/wiki/Hanging_punctuation
https://en.wikipedia.org/wiki/Optical_margin_alignment
https://askfrance.me/q/comment-bien-choisir-saillie-pour-les-lettres-et-la-ponctuation-hors-36070225344
https://helpx.adobe.com/fr/photoshop/using/formatting-paragraphs.html#specify_hanging_punctuation_for_roman_fonts
https://french.stackexchange.com/questions/1432/whats-hanging-punctuation-in-french
https://drafts.csswg.org/css-text/#hanging https://www.w3.org/TR/css-text-3/#hanging-punctuation-property There is support in CSS, but it's very limited and targeted at CJK

Relevant commit about its implementation in crengine: 3ffe69441 (extended to other ideograph by 81bbb8d6c 59377ba89).

I figured we could have both CJK hanging punctuation and western optical margin alignment handled the same way, by pushing into the margin, for each candidate glyph, a percentage of its width.
So, hanging punctuation in CJK can go fully into the margin, because the fixed-width glyphs have a good amount of blank space, so in the end the space visibly taken in the margin is smaller than an ideograph width. For western, non-fixed-width glyphs (punctuation), we would use a smaller percentage. Some suggestions and discussions at:
https://www.w3.org/Mail/flatten/index?subject=Amending+hanging-punctuation+for+Western+typography&list=www-style
https://source.contextgarden.net/tex/context/base/mkiv/font-imp-quality.lua hanging punctuation percentage by char
https://lists.w3.org/Archives/Public/www-style/2011Apr/0276.html
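
To make the "percentage of the glyph width" idea concrete, here is a minimal sketch. The function name `hangingWidth`, the table `kHangPercent` and all the percentages in it are assumptions made up for illustration: they are neither crengine code nor values taken from the ConTeXt file linked above.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical per-character hanging ratios, in percent of the glyph width.
// Illustrative values only: real ones would need tuning per font and script.
static const std::unordered_map<char32_t, int> kHangPercent = {
    {U'.', 70}, {U',', 70}, {U'-', 50}, {U'\u201D', 50},  // western: partial hang
    {U'\u3001', 100}, {U'\u3002', 100},                   // ideographic comma and full stop: full hang
};

// Returns how many pixels of the glyph may protrude into the margin.
// Characters not in the table do not hang at all.
int hangingWidth(char32_t ch, int glyphWidth) {
    assert(glyphWidth >= 0);
    auto it = kHangPercent.find(ch);
    if (it == kHangPercent.end())
        return 0;
    return glyphWidth * it->second / 100;
}
```

With such a table, CJK and western text go through the same code path, and only the ratio differs.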

#### BIDI / RTL:
https://www.w3.org/International/articlelist#direction
&nbsp; https://www.w3.org/International/questions/qa-html-dir Q/A
&nbsp; https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
&nbsp; https://www.w3.org/International/articles/inline-bidi-markup/index.en for inline elements
&nbsp; https://www.w3.org/International/questions/qa-html-dir.en for block elements
http://www.i18nguy.com/markup/right-to-left.html
https://www.mobileread.com/forums/showpost.php?p=3828770&postcount=406 sample-persian-book.epub with screenshots of the expected result

Unrelated to crengine, but to check if we want to make the UI RTL:
&nbsp; https://labs.spotify.com/2019/04/15/right-to-left-the-mirror-world/
&nbsp; https://material.io/design/usability/bidirectionality.html UI

#### Various articles about the text layout process
https://www.unicodeconference.org/presentations/S5T2-Röttsches-Esfahbod.pdf Text rendering in Chrome (by HarfBuzz author)
https://simoncozens.github.io/fonts-and-layout/ Some (unfinished book) about text layout.
http://litherum.blogspot.com/2015/02/end-to-end-tour-of-text-rendering.html
&nbsp; http://litherum.blogspot.com/2013/11/complex-text-handling-in-webkit-part-1.html Encoding
&nbsp; http://litherum.blogspot.com/2013/11/complex-text-handling-in-webkit-part-2.html Fonts
&nbsp; http://litherum.blogspot.com/2014/02/complex-text-handling-in-webkit-part-3.html Codepoint to Glyph
&nbsp; http://litherum.blogspot.com/2014/04/complex-text-handling-in-webkit-part-3.html Line breaking
&nbsp; http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-5.html Bidi
&nbsp; http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-5_22.html Run Layout
&nbsp; http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-7.html Width Calculations
&nbsp; http://litherum.blogspot.com/2015/04/complex-glyph-positioning.html
&nbsp; http://litherum.blogspot.com/2015/07/knuth-plass-line-breaking-algorithm.html
&nbsp; http://litherum.blogspot.com/2015/10/vertical-text.html
&nbsp; http://litherum.blogspot.com/2017/05/relationship-between-glyphs-and-code.html

#### Available libraries that could help with that

For illustration, there is a Lua module that provides the full text rendering stack and uses many of these libraries. It is interesting to look at, as it's readable :) and may be the only small, complete full stack I found that shows the order in which things should be done.
https://luapower.com/tr unibreak, fribidi in lua
```https://github.com/fribidi/fribidi/issues/30``` interesting Q/A between the author and the HB people
https://github.com/luapower/tr/blob/master/tr_research.txt some short notes on these same topics

There is also this Lua layout engine, which has "just enough" wrappers, and has many specific tweaks per language:
https://github.com/simoncozens/sile/ (see justenoughharfbuzz.c, languages/fr.lua...)
&nbsp; https://github.com/Yoxem/sile/commits/master w.i.p. chinese zh.lua adapted from ja.lua
https://github.com/michal-h21/luatex-harfbuzz-shaper

##### utf8proc
https://github.com/JuliaStrings/utf8proc
http://juliastrings.github.io/utf8proc/doc/
Provides helpers for Unicode categorization (but a bit limited, as it does not provide all of them, like the unicode script - we can't use it to detect if some char is Chinese or Korean).
https://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character gist in 1st answer gives a simple implementation for detecting script
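
As a rough idea of what a home-made check could look like, here is a small sketch based on a few Unicode block ranges. It is an approximation under stated assumptions: blocks are not the same thing as the Unicode Script property, only a handful of blocks are covered, and the names `Script` and `roughScript` are hypothetical.

```cpp
#include <cassert>

enum class Script { Han, Hiragana, Katakana, Hangul, Other };

// Classifies a code point by block range. This only approximates the
// Unicode Script property, and ignores the less common extension blocks.
Script roughScript(char32_t c) {
    assert(c <= 0x10FFFF);  // not a valid Unicode code point otherwise
    if ((c >= 0x4E00 && c <= 0x9FFF) ||   // CJK Unified Ideographs
        (c >= 0x3400 && c <= 0x4DBF))     // CJK Unified Ideographs Extension A
        return Script::Han;
    if (c >= 0x3040 && c <= 0x309F) return Script::Hiragana;
    if (c >= 0x30A0 && c <= 0x30FF) return Script::Katakana;
    if ((c >= 0xAC00 && c <= 0xD7A3) ||   // Hangul Syllables
        (c >= 0x1100 && c <= 0x11FF))     // Hangul Jamo
        return Script::Hangul;
    return Script::Other;
}
```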

##### harfbuzz
https://github.com/harfbuzz/harfbuzz
We already use it for font shaping in kerning "best" mode.
It can also provide useful things like direction and script detection of what we throw at it (https://harfbuzz.github.io/harfbuzz-hb-buffer.html), so it may complement utf8proc for some Unicode categorisation. (It includes UCDN https://harfbuzz.github.io/utilities-ucdn.html, so we get additional functions for free.)

##### libunibreak
https://github.com/adah1972/libunibreak implements UAX#14 and UAX#29
&nbsp; https://luapower.com/libunibreak https://github.com/luapower/libunibreak Lua wrapper
&nbsp; ```https://github.com/HOST-Oman/libraqm/pull/76``` open PR to use libunibreak in libraqm
&nbsp; ```https://github.com/adah1972/libunibreak/issues/16``` word breaks is less obvious
This works only on the text nodes in logical order, and could be used in crengine src/lvtextfm.cpp [copyText()](https://github.com/koreader/crengine/blob/a845c2874d245473661b7e1dc166cd9598b4eb5f/crengine/src/lvtextfm.cpp#L822-L961) to set/unset `LCHAR_ALLOW_WRAP_AFTER`, trusting it and removing our explicit check for [isCJKIdeograph()](https://github.com/koreader/crengine/blob/a845c2874d245473661b7e1dc166cd9598b4eb5f/crengine/src/lvtextfm.cpp#L1887-L1892) in [processParagraph()](https://github.com/koreader/crengine/blob/a845c2874d245473661b7e1dc166cd9598b4eb5f/crengine/src/lvtextfm.cpp#L1912-L2311) (and in other places).
I initially thought our check for `isCJKIdeograph()` was wrong, as it allows breaks after any Korean glyph (which are like syllables), while Korean has words separated by spaces, so we should break on spaces like in western scripts. But it looks like Korean, even if it has spaced words, allows a line break in the middle of such a word. So we're probably already fine with Korean.
libunibreak accepts a language parameter, but it's only used to add a few line breaking rules specific to that language, mostly related to quotes (list in https://github.com/adah1972/libunibreak/blob/master/src/linebreakdef.c).
So, I discovered that German strangely closes on left angle quotation marks, and opens on right angle quotation marks :) (so, I guess what I put in #237 might give strange results on German text, unless German doesn't use spaces on both sides and only French does).
Anyway, I'd like us not to have to detect the document or text segment language, nor to have it provided by the frontend, to keep things simple. Dunno if that's a viable wish.

Some discussion about reshaping because of line breaks, and some unsafe_to_break flag that could/should complement our "is_ligature_tail" flag:
&nbsp; ```https://github.com/harfbuzz/harfbuzz/issues/1463#issuecomment-505592189```
&nbsp; ```https://github.com/linebender/skribo/issues/4```

We may also need to pass HB_BUFFER_FLAG_BOT / HB_BUFFER_FLAG_EOT  to HarfBuzz for specific shaping at Begin/End Of Text (=paragraph).

Other code using libunibreak:
https://github.com/geometer/FBReader/blob/master/zlibrary/text/src/area/ZLTextParagraphBuilder.cpp FBReader
https://git.enlightenment.org/core/efl.git/tree/src/lib/evas/canvas/evas_object_textblock.c enlightenment

Note: when a word is followed by multiple spaces, libunibreak sets the allowed break on the last space. crengine will want it on the first space: the others should be marked as collapsed spaces and be put at the beginning of the next word, where they will be ignored.
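
A minimal sketch of that post-processing step, so the intent is clear. It does not call libunibreak: `breakAfter` is assumed to be a per-char "break allowed after" array already derived from the line breaker's output, and `CharFlags` and `moveBreakToFirstSpace` are hypothetical names, not crengine's actual flags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct CharFlags {
    bool allowWrapAfter = false;  // a line may end after this char
    bool collapsedSpace = false;  // extra space, to be ignored at layout time
};

// For each run of spaces, moves the allowed break (if any) to the first
// space of the run and marks the following spaces as collapsed.
std::vector<CharFlags> moveBreakToFirstSpace(const std::u32string& text,
                                             const std::vector<bool>& breakAfter) {
    assert(text.size() == breakAfter.size());
    std::vector<CharFlags> flags(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != U' ') {
            flags[i].allowWrapAfter = breakAfter[i];
            ++i;
            continue;
        }
        // Find the end of this run of spaces: [i, j)
        std::size_t j = i;
        bool runAllowsBreak = false;
        while (j < text.size() && text[j] == U' ') {
            runAllowsBreak = runAllowsBreak || breakAfter[j];
            ++j;
        }
        flags[i].allowWrapAfter = runAllowsBreak;
        for (std::size_t k = i + 1; k < j; ++k)
            flags[k].collapsedSpace = true;
        i = j;
    }
    return flags;
}
```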

##### fribidi
https://github.com/fribidi/fribidi fribidi (implements UAX#9)
This works only on the text buffer in logical order, and fills another buffer (lUint32, so as large as the text buffer) from which we can get the bidi level of each char (levels are needed because English can be detected as embedded in some Arabic, which is itself part of some English paragraph...). It could be used in crengine src/lvtextfm.cpp [copyText()](https://github.com/koreader/crengine/blob/a845c2874d245473661b7e1dc166cd9598b4eb5f/crengine/src/lvtextfm.cpp#L822-L961) to set that level on each char.
We would then need, in [measureText()](https://github.com/koreader/crengine/blob/a845c2874d245473661b7e1dc166cd9598b4eb5f/crengine/src/lvtextfm.cpp#L1114-L1265), to split on bidi level changes to get a new text segment to measure (like we do when there is a font change), as a single text node can have both Latin and Hebrew in it, and harfbuzz expects its buffer to have a single direction and script.
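
The splitting itself is simple. A minimal sketch, assuming we already have one bidi level per char (as a bidi library would give us); `LevelRun` and `splitByBidiLevel` are hypothetical names, and a real version would also split on font and script changes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LevelRun {
    std::size_t start;   // index of the first char of the run
    std::size_t length;  // number of chars in the run
    std::uint8_t level;  // bidi level shared by all chars (odd means RTL)
};

// Groups consecutive chars having the same bidi level into runs,
// each of which could then be measured and shaped on its own.
std::vector<LevelRun> splitByBidiLevel(const std::vector<std::uint8_t>& levels) {
    std::vector<LevelRun> runs;
    for (std::size_t i = 0; i < levels.size(); ++i) {
        if (!runs.empty() && runs.back().level == levels[i])
            ++runs.back().length;
        else
            runs.push_back({i, 1, levels[i]});
    }
    assert(runs.empty() == levels.empty());
    return runs;
}
```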

After that, I guess, line breaking should be tweaked (in processParagraph(), and maybe in addLine()): when processing a text line in logical order and splitting it into words, it should re-order the words according to the bidi level of their originating text segment...
We've seen that harfbuzz already handles RTL in each individual word and renders it correctly (not so nicely the way we use it currently, see below). And according to ```https://github.com/fribidi/fribidi/issues/30```, there are quite a few less things to do about bidi when we use harfbuzz.
So, it looks to me like we should indeed split lines in logical text order and, as harfbuzz renders an RTL word correctly, we just have to re-order the words.
https://github.com/fribidi/linear-reorder/blob/master/linear-reorder.c provides a generic algo. It looks like we could put a crengine `formatted_word_t` in that `run_t` to have them re-ordered. Dunno if that's as simple as that :)
(After that, there may be even more complicated things to have text selection and highlighting work with bidi and RTL...)
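
To give an idea of the word re-ordering, here is a minimal sketch of UAX#9 rule L2 (reverse, from the highest level down to the lowest odd level, every contiguous sequence at that level or higher) applied to whole words instead of chars. It is not the linear-reorder.c code linked above, and `Word` is just a hypothetical stand-in for a `formatted_word_t` carrying its bidi level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Word {
    int id;     // stand-in for the word's real content
    int level;  // bidi level of the text segment the word came from
};

// Re-orders the words of one line from logical to visual order.
void reorderWordsVisually(std::vector<Word>& line) {
    int maxLevel = 0;
    int minOddLevel = -1;
    for (const Word& w : line) {
        assert(w.level >= 0);
        maxLevel = std::max(maxLevel, w.level);
        if (w.level % 2 == 1 && (minOddLevel < 0 || w.level < minOddLevel))
            minOddLevel = w.level;
    }
    if (minOddLevel < 0)
        return;  // no RTL word on this line: logical order is visual order
    for (int level = maxLevel; level >= minOddLevel; --level) {
        std::size_t i = 0;
        while (i < line.size()) {
            if (line[i].level < level) { ++i; continue; }
            // Reverse the contiguous sequence [i, j) of words at this level or higher
            std::size_t j = i;
            while (j < line.size() && line[j].level >= level) ++j;
            std::reverse(line.begin() + i, line.begin() + j);
            i = j;
        }
    }
}
```

This only covers the ordering of words on a line that has already been broken in logical order. Each word's own glyph order is left to the shaper.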

Our current harfbuzz implementation ("best") is a bit buggy with text more complex than just western with ligatures.
I thought that it did render RTL words correctly, but even that is not done well: the measurements are all messed up (we don't process clusters correctly if there is some decomposed Unicode), and the way we use the fallback font (no harfbuzz re-shaping, using the main codepoint for all chars that are part of the cluster) gives wrong results.

And there are cases where the bidi algo doesn't say anything, like this reordering of soft hyphens (and so, should we hyphenate LTR in bidi text, where the hyphen may end up in the middle of the line? :)
http://unicode.org/pipermail/unicode/2014-April/thread.html#353 Bidi reordering of soft hyphen

http://www.staroceans.org/myprojects/vlc/modules/text_renderer/freetype/text_layout.c one of the rare examples of the use of `fribidi_reorder_line`, which I guess we'll have to use.

One interesting solution to re-shaping with fallback fonts is how it was done in Chrome:
https://lists.freedesktop.org/archives/harfbuzz/2015-October/005168.html font fallback in Chrome
https://chromium.googlesource.com/chromium/src/+/9f6a2b03ccb7091804f173b70b5facff7dffbd61%5E%21/#F8 chrome improved shaping
See also minikin Layout.cpp code below.

We may also need [freetype rebuilt against harfbuzz](https://github.com/koreader/crengine/pull/230#issuecomment-450808049). 

##### libraqm
https://github.com/HOST-Oman/libraqm
http://gtk.10911.n7.nabble.com/pango-vs-libraqm-td94839.html
_raqm does not do font fallback and line breaking currently, nor does it do font enumeration.  Raqm is designed to add to applications that otherwise have a very simplistic view of text rendering. Ie. they use FreeType and a single font to render single-line text (think, movie subtitles...)._

##### pango
https://developer.gnome.org/pango/stable/
https://github.com/GNOME/pango pango
&nbsp; https://gist.github.com/bert/262331/ sample usage
Pango and libraqm provide higher level functions. They do the full pipeline (unicode preprocess, shaping, bidi, linebreaking, rendering).
But we can't use their high level functions because they don't do as much as crengine (vertical text alignment, inline images, floats), so if we were to use them, we'd need to provide small segments, and we may as well do that with the lower level libraries. Or skip all the crengine services (fonts management, text drawing) and use it instead, and have to re-implement all the crengine higher level functions that pango does not provide. Not my plan :)
Pango has dependencies on glib and fontconfig, which does not look like fun.

The most interesting stuff in pango is in https://github.com/GNOME/pango/blob/master/pango/break.c, where it implements UAX#14 and UAX#29, like libunibreak, but in one single pass, with some additional tweaks for Arabic and Indic scripts (dunno if libunibreak does that as well or not).
Also in pango-layout.c [justify_words()](https://github.com/GNOME/pango/blob/f578a7dd599b842b29595ba86a8e3cdf04e9f472/pango/pango-layout.c#L5665-L5762): for justification, it does as crengine does: it expands spaces. And [if there is not a single one, it switches to adjusting letter spacing](https://github.com/GNOME/pango/blob/f578a7dd599b842b29595ba86a8e3cdf04e9f472/pango/pango-layout.c#L5749-L5758) (which crengine does not do).
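
A minimal sketch of that "expand spaces, else fall back to letter spacing" strategy. This is not pango's code (which works on glyph items): `justifyExtraAdvances` is a hypothetical helper working on plain chars, and only U+0020 is treated as an expandable space here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns, for each char of the line, the extra advance (in pixels) to add
// after it so that the line grows by extraWidth in total.
std::vector<int> justifyExtraAdvances(const std::u32string& line, int extraWidth) {
    assert(extraWidth >= 0);
    std::vector<int> extra(line.size(), 0);
    if (line.size() < 2)
        return extra;  // nothing to stretch

    // Preferred slots: spaces, never the last char of the line.
    std::vector<std::size_t> slots;
    for (std::size_t i = 0; i + 1 < line.size(); ++i)
        if (line[i] == U' ') slots.push_back(i);
    // Fallback: no space at all, so use letter spacing after every char but the last.
    if (slots.empty())
        for (std::size_t i = 0; i + 1 < line.size(); ++i) slots.push_back(i);

    const int count = static_cast<int>(slots.size());
    const int base = extraWidth / count;
    const int remainder = extraWidth % count;
    for (int k = 0; k < count; ++k)
        extra[slots[k]] = base + (k < remainder ? 1 : 0);  // leftover pixels go to the first slots
    return extra;
}
```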

##### Other developments/discussions
https://raphlinus.github.io/rust/skribo/text/2019/02/27/text-layout-kickoff.html work towards a rust library
&nbsp; https://gitlab.redox-os.org/redox-os/rusttype/issues/2

Text rendering/Font fallback in Chrome and other browsers
https://chromium.googlesource.com/chromium/src/+/master/third_party/blink/renderer/platform/fonts/README.md
```https://gist.github.com/CrendKing/c162f5a16507d2163d58ee0cf542e695```

minikin is the library used in Android for text layout with harfbuzz. It's quite tough to find some authoritative master version, because there are many divergent ones... (and the latest Android one does not include some changes provided by the HarfBuzz author, which are available in some other branches or forks). Here are a few links (the interesting file is Layout.cpp):
https://android.googlesource.com/platform/frameworks/minikin/ minikin main repo
https://dl.khadas.com/test/github/frameworks/minikin/libs/minikin/Layout.cpp
https://github.com/abarth/minikin
https://github.com/flutter/engine/blob/master/third_party/txt/src/minikin/Layout.cpp
https://source.codeaurora.org/quic/la/platform/frameworks/minikin with changes from HarfBuzz author
https://github.com/CyanogenMod/android_frameworks_minikin/blob/cm-12.0/libs/minikin/Layout.cpp with changes from Harfbuzz author
https://medium.com/mindorks/deep-dive-in-android-text-and-best-practices-part-1-6385b28eeb94 minikin (android text layout)

#### CJK (horizontal) layout

@frankyifei said in https://github.com/koreader/koreader/issues/2844#issuecomment-464493642:
> I would say the original design of crengine did not consider CJK layout. So the code here just makes it work for CJK. The code tries to squeeze the space between characters and it is not the best way for CJK. There are better ways like squeezing punctuation marks to make the lines look even. But it is very hard to implement on current code. Traditional Chinese does not have punctuation, and characters should fill every line with identical spaces. now the language is quite westernized so many people are get used to other layout similar to European languages.
Without rewriting lvtextform, it is very difficult to change things and add features. Is it possible to use pango here?

It looks like there is nothing special for CJK in pango. If a paragraph is pure CJK, it would do as crengine does, spacing each CJK char, which would result in a nice regular grid of ideographs.
If there is a single Latin letter, punctuation or normal space, the grid would be broken (like it is in the Wikipedia ZH I use for testing CJK).
It's up to us to do the right thing in lvtextfm.cpp. A few ideas:
- [alignLine()](https://github.com/koreader/crengine/blob/a845c2874d245473661b7e1dc166cd9598b4eb5f/crengine/src/lvtextfm.cpp#L1271-L1343) uses a single flag `LTEXT_WORD_CAN_ADD_SPACE_AFTER` that is set on a space or a CJK ideograph. We could have multiple such flags, with different levels of priority, so that a western space is preferred to a CJK ideograph, and a CJK punctuation is preferred to a CJK non-punctuation.
- I don't know how important the appearance of a grid of fixed-width CJK glyphs is (I feel like it should be :). If it is, we could think of aligning each new CJK char to the next grid-set pixel on a line, so if a western word happens in CJK, there would be a little more spacing after it (or on each side) so that the next CJK glyph is pushed to start on that specific grid pixel.
- I have the feeling that Korean (which seems to use normal western spaces and not the CJK full-width space) does not care about that grid, so that would have to be triggered depending on the detected Unicode script.
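
A minimal sketch of the first idea (multiple priority levels for where space can be added). The `SpacePriority` values, `distributeByPriority` and the `maxPerSlot` cap are all assumptions for illustration, not existing crengine flags or parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical priorities for adding space after a word, highest value first.
enum SpacePriority {
    NoSpace = 0,              // e.g. the last word of the line
    AfterCjkIdeograph = 1,
    AfterCjkPunctuation = 2,
    AfterWesternSpace = 3
};

// Gives extraWidth to the slots of highest priority first, at most
// maxPerSlot pixels each; what is left cascades to the next priority.
std::vector<int> distributeByPriority(const std::vector<SpacePriority>& slots,
                                      int extraWidth, int maxPerSlot) {
    assert(extraWidth >= 0 && maxPerSlot > 0);
    std::vector<int> added(slots.size(), 0);
    for (int prio = AfterWesternSpace; prio >= AfterCjkIdeograph && extraWidth > 0; --prio) {
        std::vector<std::size_t> idx;
        for (std::size_t i = 0; i < slots.size(); ++i)
            if (slots[i] == prio) idx.push_back(i);
        if (idx.empty())
            continue;
        const int count = static_cast<int>(idx.size());
        const int give = std::min(extraWidth, maxPerSlot * count);
        for (int k = 0; k < count; ++k)
            added[idx[k]] = give / count + (k < give % count ? 1 : 0);
        extraWidth -= give;
    }
    return added;  // any width still left could not be placed within the cap
}
```

The cap is what makes the lower priorities kick in: without it, a single western space in a CJK line would absorb all the extra width.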

Anyway, Pango looks like it does none of that.

A question: with a pure CJK ideographs paragraph, and some line ending with two (or three) "CJK right punctuation" chars, both overflowing the available width, how should it be dealt with?
If there were only one, it could be made to hang in the right margin, and the grid would be fine.
But with two? Would both be hanging in the right margin? Or would they be pushed to the next line, along with the previous regular char, so making a hole in the grid at the far right of the previous line (or breaking the grid if we justify that line, as it seems crengine would do)?
What's the proper way to handle that?

Note: there may be some stuff to fix in crengine to also consider UNICODE_NO_BREAK_SPACE when expanding/shrinking spaces for justification (pango has in break.c: `attrs[i].is_expandable_space = (0x0020 == wc || 0x00A0 == wc);`).

#### Vertical text layout

Low interest, because it looks so much more complicated.
Pinging @xelxebar, who [showed interest in all that](https://github.com/koreader/koreader/issues/4353#issuecomment-486912374) about vertical text.
Just some questions, because I have no idea how that should work.
I guess it's the whole block element that makes a vertical text section. What's the effect of a `<BR>`? Going back to the top of the next vertical line? What should happen when there are more `<BR>` than the max number of vertical lines that can fit in the available width?
I naively thought vertical text could be easily sized:
- Get the width of an ideographic space, add the interline space to it, and divide the available width by that to get the number of vertical lines to lay out along the width.
- Count the number of glyphs, and divide it by the number of vertical lines to get the number of glyphs per vertical line.
- Multiply that by some ideographic glyph height to get the height of the block.
- When drawing such a block, switch to some specific drawing code to lay glyph by glyph along that grid.

No idea about how variable font family/style/size, vertical-align, and inline images would work with that :)
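
The naive sizing steps above, written down as a small sketch. All names are hypothetical, and it assumes every glyph is the same fixed-size ideograph, which is precisely what mixed fonts, sizes or inline images would break.

```cpp
#include <cassert>

struct VerticalBlockSize {
    int lines;          // number of vertical lines fitting in the available width
    int glyphsPerLine;  // glyphs stacked in each vertical line
    int height;         // resulting height of the block
};

// Naive estimate of the size of a vertical text block made only of
// fixed-size ideographs.
VerticalBlockSize estimateVerticalBlock(int availableWidth, int ideographWidth,
                                        int interlineSpace, int glyphHeight,
                                        int glyphCount) {
    assert(ideographWidth > 0 && interlineSpace >= 0);
    assert(glyphHeight > 0 && glyphCount >= 0);
    int lines = availableWidth / (ideographWidth + interlineSpace);
    if (lines < 1)
        lines = 1;  // always lay out at least one vertical line
    const int perLine = (glyphCount + lines - 1) / lines;  // rounded up
    return {lines, perLine, perLine * glyphHeight};
}
```

For example, with 600 px of width, 24 px wide ideographs and 6 px between lines, this gives 20 vertical lines, and 205 glyphs would need 11 glyphs per line.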

How is that supposed to work for long paragraphs that would span multiple pages? Should scroll mode be aware of how page mode has cut the blocks, or can it lay out text on some possibly infinite vertical length? How do browsers (that don't have pages) do it?

----
If having a go at it, the first thing would possibly be to fix harfbuzz rendering of embedded RTL, which means implementing full bidi support...
Should this possibly expensive new stuff be used when selecting kerning mode "best" (which is the only one where we use harfbuzz correctly, so a requisite), or would we need a "bestest" switch which, in addition to harfbuzz, would trigger the use of the probably expensive bidi processing?
Or some additional gTextRenderingFlag to enable or not the use of any of the new features (like done for enhanced block rendering)?
I fear starting all that because of the spaghetti mess it will be with so many `#ifdef USE_FRIBIDI` and `#ifdef USE_LIBUNIBREAK` if we want crengine to still be able to compile and work without all these... Or a single `#ifdef USE_ENHANCED_TEXT_LIBRARIES` (which should include `USE_HARFBUZZ`?)
