
bump crengine: multiple fallback fonts#6090

Merged
poire-z merged 1 commit into koreader:master from poire-z:bump_crengine
Apr 25, 2020

Contributor

poire-z commented Apr 24, 2020

Includes koreader/crengine#339 :

  • Simplify libunibreak includes
  • Text: fix read/write outside array bounds
  • lvtextfm: dont adjust space after initial quotation mark/dash (rework)
  • Fonts: allow providing and using multiple fallback fonts

Users can set their preferred fallback font, which will be supplemented with a few of our shipped fonts for maximum coverage.
Ref #5277 (comment). Closes #5277.
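
To illustrate the idea of a fallback chain described above, here is a minimal sketch. It is not crengine's actual code: the `Font` struct, `buildChain` and `pickFont` are hypothetical names, and coverage is modelled as a plain set of code points rather than a real font query.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded font: a name plus the code points it covers.
struct Font {
    std::string name;
    std::set<char32_t> coverage;
    bool hasGlyph(char32_t cp) const { return coverage.count(cp) > 0; }
};

// Build the lookup chain: main font first, then the user's preferred fallback
// (pass nullptr if none is set), then the shipped fonts. A font already in
// the chain (same name) is not added twice.
std::vector<Font> buildChain(const Font& mainFont, const Font* userFallback,
                             const std::vector<Font>& shipped) {
    std::vector<Font> chain{mainFont};
    auto addUnique = [&chain](const Font& f) {
        for (const Font& known : chain)
            if (known.name == f.name) return;
        chain.push_back(f);
    };
    if (userFallback != nullptr) addUnique(*userFallback);
    for (const Font& f : shipped) addUnique(f);
    return chain;
}

// Index of the first font in the chain that has a glyph for cp, or -1 if
// none does (the caller would then have nothing better than .notdef).
int pickFont(const std::vector<Font>& chain, char32_t cp) {
    assert(!chain.empty());
    for (std::size_t i = 0; i < chain.size(); ++i)
        if (chain[i].hasGlyph(cp)) return static_cast<int>(i);
    return -1;
}
```

In this sketch the user's choice always comes before the shipped fonts, so it wins whenever it covers the character, and the shipped fonts only fill the remaining gaps.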



Contributor Author

poire-z commented Apr 24, 2020

@WaseemAlkurdi: for EPUB book text, what would be the preferred font for Arabic? Noto Sans Arabic UI (sans, a bit rounded and bold) or FreeSerif (thinner, straight, looks more serious to me :) )
edit: comparison at #5545 (comment)

I don't know if we should really include Noto Sans Arabic UI and Devanagari UI among the fallbacks, as these would be the first fonts these users see their text in.
(although maybe these are more correct/tailored to these languages than the all-in-one FreeSerif)
(or maybe I'm old school in preferring Serif... After Comic Sans, I guess every Sans UI font looks serious enough)

@WaseemAlkurdi
Contributor

@poire-z Long time no see!
FreeSerif is a good font, but it's very thin and its Arabic characters are aesthetically unpleasant ... a better choice for a serif font with Arabic coverage would be Noto Naskh Arabic. I know it goes against the small rule of sticking to our current font set, but it's less than 500 KB in size, so it won't be that heavy to include ... This would also mean that we won't have to use Noto Sans Arabic UI as a UI fallback font for the engine.
Then, we could have Arabic UI and Devanagari UI as fallback fonts, which would pretty much cover it.

(although maybe these are more correct/tailored to these languages than the all-in-one FreeSerif)

FreeSerif is quite the jack of all trades, master of none ... it sure is universal, but its individual characters don't look that nice (the letters are especially thin on E Ink)

(or maybe I'm old school in preferring Serif... After Comic Sans, I guess every Sans UI font looks serious enough)

To be honest, serif fonts are more book-like than sans :-)

Contributor Author

poire-z commented Apr 24, 2020

Thanks for the feedback.
But for today, with our current set of fonts, would you prefer that we go with Noto Sans Arabic UI or FreeSerif?
(We can discuss updating fonts later, it will take more time.)

Contributor Author

poire-z commented Apr 24, 2020

Asking in passing:
I don't know if you can appreciate Urdu drawn with Noto Nastaliq, but this is what we would get with KOReader (on the left) vs Firefox (on the right, which might not be using Noto Nastaliq but another font):
image

It's quite different (less continuity in KOReader). But I don't know if it's still correct or really messed up :)
Any thoughts?

Frenzie added this to the 2020.05 milestone Apr 25, 2020
Contributor Author

poire-z commented Apr 25, 2020

Just copying and pasting crengine's HarfBuzz drawing debug output, in case some nastaliq expert comes around some day (or someone who at least knows the letter names :) and can find out where/why/how the offsets might be wrong).

For these first 2 sentences (from the Urdu section in https://r12a.github.io/scripts/phrases)
ضابطہ لسانی عدمیت ، W3C
عالمگیر ویب کو حقیقی طور پر عالمگیر بنانا

With crengine, with split on space disabled, so HB gets each full line and context as a single run:
image

line 2 words
DTHB >>> drawTextString 4555b3c0 len 3 is_rtl=0 [Noto Nastaliq Urdu]
DTHB g0 c0(=t:57) [0 .notdef]   advance=(24,0)
DTHB g1 c1(=t:33) [0 .notdef]   advance=(24,0)
DTHB g2 c2(=t:43) [0 .notdef]   advance=(24,0)
DTHB ---
DTHB g0 c0: g0 c0 notdef, g1 c1 -, g2 c2 -, [...]
DTHB ### drawing past notdef with fallback font 0>3  => 0 > 3
DTHB >>> drawTextString 4555b3c0 len 3 is_rtl=0 [Noto Sans CJK SC]
DTHB g0 c0(=t:57) [38 ] advance=(23,0)
DTHB g1 c1(=t:33) [14 ] advance=(14,0)
DTHB g2 c2(=t:43) [24 ] advance=(17,0)
DTHB ---
DTHB g0 c0: all found, regular g0>1: 38(x=302+0,w=23)
DTHB g1 c1: all found, regular g1>2: 14(x=325+0,w=14)
DTHB g2 c2: regular g2>3: 24(x=339+1,w=17)
DTHB ### drawn past notdef > X+= 54
[...]
DTHB >>> drawTextString 4555b370 len 20 is_rtl=1 [Noto Nastaliq Urdu]
DTHB g0 c19(=t:20) [3 space]    advance=(4,0)
DTHB g1 c18(=t:60c) [cb CommaArabic]  advance=(10,0)
DTHB g2 c17(=t:20) [3 space]    advance=(4,0)
DTHB g3 c16(=t:62a) [10 TwoDotsAboveNS] advance=(0,0)   offset=(16,-9)
DTHB g4 c16(=t:62a) [e6 BehxFin] advance=(32,0)
DTHB g5 c15(=t:6cc) [13 TwoDotsBelowNS] advance=(0,0)   offset=(7,4)
DTHB g6 c15(=t:6cc) [171 BehxMed.inT2outT2]     advance=(8,0)   offset=(0,7)
DTHB g7 c14(=t:645) [3d7 sp0]   advance=(0,0)
DTHB g8 c14(=t:645) [1da MeemIni.outT2] advance=(14,0)  offset=(0,10)
DTHB g9 c13(=t:62f) [eb DalFin] advance=(10,0)
DTHB g10 c12(=t:639) [3d7 sp0]  advance=(0,0)
DTHB g11 c12(=t:639) [23c AinIni]     advance=(11,0)
DTHB g12 c11(=t:20) [3 space]   advance=(4,0)
DTHB g13 c10(=t:6cc) [143 YehxFin.inD2alt]      advance=(14,0)
DTHB g14 c9(=t:646) [f OneDotAboveNS] advance=(0,0)     offset=(-1,3)
DTHB g15 c9(=t:646) [3d7 sp0]   advance=(0,0)
DTHB g16 c9(=t:646) [146 BehxIni.outD2Y]        advance=(6,0)   offset=(0,10)
DTHB g17 c8(=t:627) [e4 AlefFin] advance=(7,0)
DTHB g18 c7(=t:633) [33c SeenMed.inT2outT1]     advance=(15,0)
DTHB g19 c6(=t:644) [3d7 sp0]   advance=(0,0)
DTHB g20 c6(=t:644) [19f LamIni.outT2]  advance=(8,0)   offset=(0,6)
DTHB g21 c5(=t:20) [3 space]    advance=(4,0)
DTHB g22 c4(=t:6c1) [12c HehFin.wide] advance=(9,0)
DTHB g23 c3(=t:637) [131 TahMed.inD1outS1]      advance=(1,0)
DTHB g24 c2(=t:628) [12 OneDotBelowNS]  advance=(0,0)   offset=(14,0)
DTHB g25 c2(=t:628) [3d7 sp0]   advance=(0,0)
DTHB g26 c2(=t:628) [21a BehxIni.outD1] advance=(18,0)  offset=(0,1)
DTHB g27 c1(=t:627) [e4 AlefFin] advance=(7,0)
DTHB g28 c0(=t:636) [f OneDotAboveNS] advance=(0,0)     offset=(13,-4)
DTHB g29 c0(=t:636) [3d7 sp0]   advance=(0,0)
DTHB g30 c0(=t:636) [1e2 SadIni] advance=(21,0)
DTHB ---
DTHB g0 c19: all found, regular g0>1: 3(x=356+0,w=4)
DTHB g1 c18: all found, regular g1>2: cb(x=360+2,w=10)
DTHB g2 c17: all found, regular g2>3: 3(x=370+0,w=4)
DTHB g3 c16: all found, regular g3>5: 10(x=374+11,w=0) e6(x=374+1,w=32)
DTHB g5 c15: all found, regular g5>7: 13(x=406+2,w=0) 171(x=406+-1,w=8)
DTHB g7 c14: all found, regular g7>9: 3d7(x=414+0,w=0) 1da(x=414+-1,w=14)
DTHB g9 c13: all found, regular g9>10: eb(x=428+-2,w=10)
DTHB g10 c12: all found, regular g10>12: 3d7(x=438+0,w=0) 23c(x=438+-2,w=11)
DTHB g12 c11: all found, regular g12>13: 3(x=449+0,w=4)
DTHB g13 c10: all found, regular g13>14: 143(x=453+0,w=14)
DTHB g14 c9: all found, regular g14>17: f(x=467+-4,w=0) 3d7(x=467+0,w=0) 146(x=467+-1,w=6)
DTHB g17 c8: all found, regular g17>18: e4(x=473+3,w=7)
DTHB g18 c7: all found, regular g18>19: 33c(x=480+-2,w=15)
DTHB g19 c6: all found, regular g19>21: 3d7(x=495+0,w=0) 19f(x=495+-1,w=8)
DTHB g21 c5: all found, regular g21>22: 3(x=503+0,w=4)
DTHB g22 c4: all found, regular g22>23: 12c(x=507+0,w=9)
DTHB g23 c3: all found, regular g23>24: 131(x=516+-2,w=1)
DTHB g24 c2: all found, regular g24>27: 12(x=517+11,w=0) 3d7(x=517+0,w=0) 21a(x=517+-2,w=18)
DTHB g27 c1: all found, regular g27>28: e4(x=535+3,w=7)
DTHB g28 c0: regular g28>31: f(x=542+10,w=0) 3d7(x=542+0,w=0) 1e2(x=542+-2,w=21)

line 1 words
DTHB >>> drawTextString 454a47a0 len 41 is_rtl=1 [Noto Nastaliq Urdu]
DTHB g0 c40(=t:627) [102 AlefFin.narrow]        advance=(6,0)
DTHB g1 c39(=t:646) [f OneDotAboveNS] advance=(0,0)     offset=(4,-11)
DTHB g2 c39(=t:646) [3d7 sp0]   advance=(0,0)
DTHB g3 c39(=t:646) [11a BehxIni.A]   advance=(6,0)
DTHB g4 c38(=t:627) [e4 AlefFin] advance=(7,0)
DTHB g5 c37(=t:646) [f OneDotAboveNS] advance=(0,0)     offset=(5,-11)
DTHB g6 c37(=t:646) [16f BehxMed.inT2outT1]     advance=(7,0)
DTHB g7 c36(=t:628) [12 OneDotBelowNS]  advance=(0,0)   offset=(3,0)
DTHB g8 c36(=t:628) [3d8 sp1]   advance=(0,0)
DTHB g9 c36(=t:628) [12d BehxIni.outT2] advance=(6,0)   offset=(0,4)
DTHB g10 c35(=t:20) [3 space]   advance=(4,0)
DTHB g11 c34(=t:631) [11e RehFin]     advance=(10,0)
DTHB g12 c33(=t:6cc) [13 TwoDotsBelowNS]        advance=(0,0)   offset=(1,-2)
DTHB g13 c33(=t:6cc) [17f BehxMed.inS1outS1]    advance=(2,0)   offset=(0,4)
DTHB g14 c32(=t:6af) [1c8 GafMed.outS1] advance=(9,0)   offset=(0,4)
DTHB g15 c31(=t:645) [339 MeemMed.inD2outT1]    advance=(8,0)   offset=(0,6)
DTHB g16 c30(=t:644) [3d7 sp0]  advance=(0,0)
DTHB g17 c30(=t:644) [1b7 LamIni.outD2MM]       advance=(5,0)   offset=(0,15)
DTHB g18 c29(=t:627) [e4 AlefFin]     advance=(7,0)
DTHB g19 c28(=t:639) [3d7 sp0]  advance=(0,0)
DTHB g20 c28(=t:639) [23c AinIni]     advance=(11,0)
DTHB g21 c27(=t:20) [3 space]   advance=(4,0)
DTHB g22 c26(=t:631) [149 RehFin.inD4B] advance=(11,0)
DTHB g23 c25(=t:67e) [14 ThreeDotsDownBelowNS]  advance=(0,0)   offset=(0,2)
DTHB g24 c25(=t:67e) [3d7 sp0]  advance=(0,0)
DTHB g25 c25(=t:67e) [2c4 BehxIni.outD4]        advance=(4,0)   offset=(0,10)
DTHB g26 c24(=t:20) [3 space]   advance=(4,0)
DTHB g27 c23(=t:631) [ec RehSep] advance=(11,0)
DTHB g28 c22(=t:648) [16c WawFin.cut] advance=(7,0)
DTHB g29 c21(=t:637) [3d7 sp0]  advance=(0,0)
DTHB g30 c21(=t:637) [161 TahIni.outT3] advance=(17,0)  offset=(0,6)
DTHB g31 c20(=t:20) [3 space]   advance=(4,0)
DTHB g32 c19(=t:6cc) [113 YehxFin]    advance=(14,0)
DTHB g33 c18(=t:642) [10 TwoDotsAboveNS]        advance=(0,0)   offset=(4,3)
DTHB g34 c18(=t:642) [237 FehxMed.inT3outD2Y]   advance=(6,0)   offset=(0,10)
DTHB g35 c17(=t:6cc) [13 TwoDotsBelowNS]        advance=(0,0)   offset=(6,10)
DTHB g36 c17(=t:6cc) [18a BehxMed.inT2outT3]    advance=(8,0)   offset=(0,12)
DTHB g37 c16(=t:642) [10 TwoDotsAboveNS]        advance=(0,0)   offset=(6,10)
DTHB g38 c16(=t:642) [22d FehxMed.inT3outT2]    advance=(9,0)   offset=(0,16)
DTHB g39 c15(=t:62d) [3d7 sp0]  advance=(0,0)
DTHB g40 c15(=t:62d) [1f2 HahIni.outT3] advance=(14,0)  offset=(0,19)
DTHB g41 c14(=t:20) [3 space]   advance=(4,0)
DTHB g42 c13(=t:648) [16d WawFin.inD2]  advance=(10,0)
DTHB g43 c12(=t:6a9) [3d7 sp0]  advance=(0,0)
DTHB g44 c12(=t:6a9) [1a6 KafIni.outD2] advance=(7,0)   offset=(0,9)
DTHB g45 c11(=t:20) [3 space]   advance=(4,0)
DTHB g46 c10(=t:628) [12 OneDotBelowNS] advance=(0,0)   offset=(16,-4)
DTHB g47 c10(=t:628) [100 BehxFin.soft] advance=(32,0)
DTHB g48 c9(=t:6cc) [13 TwoDotsBelowNS] advance=(0,0)   offset=(-1,2)
DTHB g49 c9(=t:6cc) [3d7 sp0]   advance=(0,0)
DTHB g50 c9(=t:6cc) [12e BehxIni.outT2B]        advance=(4,0)   offset=(0,8)
DTHB g51 c8(=t:648) [f6 WawSep] advance=(11,0)
DTHB g52 c7(=t:20) [3 space]    advance=(4,0)
DTHB g53 c6(=t:631) [11e RehFin] advance=(10,0)
DTHB g54 c5(=t:6cc) [13 TwoDotsBelowNS] advance=(0,0)   offset=(1,-2)
DTHB g55 c5(=t:6cc) [17f BehxMed.inS1outS1]     advance=(2,0)   offset=(0,4)
DTHB g56 c4(=t:6af) [1c8 GafMed.outS1]  advance=(9,0)   offset=(0,4)
DTHB g57 c3(=t:645) [339 MeemMed.inD2outT1]     advance=(8,0)   offset=(0,6)
DTHB g58 c2(=t:644) [3d7 sp0]   advance=(0,0)
DTHB g59 c2(=t:644) [1b7 LamIni.outD2MM]        advance=(5,0)   offset=(0,15)
DTHB g60 c1(=t:627) [e4 AlefFin] advance=(7,0)
DTHB g61 c0(=t:639) [3d7 sp0]   advance=(0,0)
DTHB g62 c0(=t:639) [23c AinIni] advance=(11,0)
DTHB ---
DTHB g0 c40: all found, regular g0>1: 102(x=237+3,w=6)
DTHB g1 c39: all found, regular g1>4: f(x=243+1,w=0) 3d7(x=243+0,w=0) 11a(x=243+-2,w=6)
DTHB g4 c38: all found, regular g4>5: e4(x=249+3,w=7)
DTHB g5 c37: all found, regular g5>7: f(x=256+2,w=0) 16f(x=256+-2,w=7)
DTHB g7 c36: all found, regular g7>10: 12(x=263+0,w=0) 3d8(x=263+0,w=0) 12d(x=263+-1,w=6)
DTHB g10 c35: all found, regular g10>11: 3(x=269+0,w=4)
DTHB g11 c34: all found, regular g11>12: 11e(x=273+-6,w=10)
DTHB g12 c33: all found, regular g12>14: 13(x=283+-4,w=0) 17f(x=283+-1,w=2)
DTHB g14 c32: all found, regular g14>15: 1c8(x=285+-1,w=9)
DTHB g15 c31: all found, regular g15>16: 339(x=294+-2,w=8)
DTHB g16 c30: all found, regular g16>18: 3d7(x=302+0,w=0) 1b7(x=302+-2,w=5)
DTHB g18 c29: all found, regular g18>19: e4(x=307+3,w=7)
DTHB g19 c28: all found, regular g19>21: 3d7(x=314+0,w=0) 23c(x=314+-2,w=11)
DTHB g21 c27: all found, regular g21>22: 3(x=325+0,w=4)
DTHB g22 c26: all found, regular g22>23: 149(x=329+-2,w=11)
DTHB g23 c25: all found, regular g23>26: 14(x=340+-5,w=0) 3d7(x=340+0,w=0) 2c4(x=340+-5,w=4)
DTHB g26 c24: all found, regular g26>27: 3(x=344+0,w=4)
DTHB g27 c23: all found, regular g27>28: ec(x=348+-2,w=11)
DTHB g28 c22: all found, regular g28>29: 16c(x=359+0,w=7)
DTHB g29 c21: all found, regular g29>31: 3d7(x=366+0,w=0) 161(x=366+-1,w=17)
DTHB g31 c20: all found, regular g31>32: 3(x=383+0,w=4)
DTHB g32 c19: all found, regular g32>33: 113(x=387+0,w=14)
DTHB g33 c18: all found, regular g33>35: 10(x=401+-1,w=0) 237(x=401+-1,w=6)
DTHB g35 c17: all found, regular g35>37: 13(x=407+1,w=0) 18a(x=407+-1,w=8)
DTHB g37 c16: all found, regular g37>39: 10(x=415+1,w=0) 22d(x=415+-1,w=9)
DTHB g39 c15: all found, regular g39>41: 3d7(x=424+0,w=0) 1f2(x=424+-1,w=14)
DTHB g41 c14: all found, regular g41>42: 3(x=438+0,w=4)
DTHB g42 c13: all found, regular g42>43: 16d(x=442+0,w=10)
DTHB g43 c12: all found, regular g43>45: 3d7(x=452+0,w=0) 1a6(x=452+-1,w=7)
DTHB g45 c11: all found, regular g45>46: 3(x=459+0,w=4)
DTHB g46 c10: all found, regular g46>48: 12(x=463+13,w=0) 100(x=463+1,w=32)
DTHB g48 c9: all found, regular g48>51: 13(x=495+-6,w=0) 3d7(x=495+0,w=0) 12e(x=495+-1,w=4)
DTHB g51 c8: all found, regular g51>52: f6(x=499+0,w=11)
DTHB g52 c7: all found, regular g52>53: 3(x=510+0,w=4)
DTHB g53 c6: all found, regular g53>54: 11e(x=514+-6,w=10)
DTHB g54 c5: all found, regular g54>56: 13(x=524+-4,w=0) 17f(x=524+-1,w=2)
DTHB g56 c4: all found, regular g56>57: 1c8(x=526+-1,w=9)
DTHB g57 c3: all found, regular g57>58: 339(x=535+-2,w=8)
DTHB g58 c2: all found, regular g58>60: 3d7(x=543+0,w=0) 1b7(x=543+-2,w=5)
DTHB g60 c1: all found, regular g60>61: e4(x=548+3,w=7)
DTHB g61 c0: regular g61>63: 3d7(x=555+0,w=0) 23c(x=555+-2,w=11)

Some possibly related issues elsewhere:
https://bugzilla.mozilla.org/show_bug.cgi?id=761442
https://bugzilla.mozilla.org/show_bug.cgi?id=1562733
https://bugs.freedesktop.org/show_bug.cgi?id=89992

Contributor

WaseemAlkurdi commented Apr 25, 2020

@poire-z

Thanks for the feedback.
But for today, with our current set of fonts, would you prefer that we go with Noto Sans Arabic UI or FreeSerif?
(We can discuss updating fonts later, it will take more time.)

You're welcome!
It's a tough call ... but I'd prefer Noto Sans Arabic UI. A font that's "not book-like" is better than a font that might make users think something is "broken".

(We can discuss updating fonts later, it will take more time.)

Cool, I'll be waiting. But basically, that would be the only addition needed for Arabic fonts: a good serif font (Noto Naskh Arabic) and a good sans font (Noto Sans Arabic UI).

I don't know if you can appreciate Urdu drawn with Noto Nastaliq, but this is what we would get with KOReader (on the left) vs Firefox (on the right, which might not be using Noto Nastaliq but another font):

There's definitely something wrong in there. Though Urdu uses a script akin to "cursive" (nastaliq), it's still part of the Arabic family of scripts. And I can't help but appreciate your eyesight, having spotted the lack of continuity. The letters, which normally "hold hands", are all over the place in relation to the "line".
It's "legible", but with strain ... and that's only because the Firefox rendering was there.

Don't know if this would help, but I can feel that this bug is related to another very minor bug in KOReader (more of a nitpick than a bug), where the Arabic diacritics (in both menus and rendering) are drawn too "low", enough for them to mix with the dots above the letters and sometimes with the letters themselves depending on the context. Initially, I ignored this, but now I feel that this might have something to do with the issue at hand, therefore I brought it up.
Quick example: try this word:

الشّبكة

This is the word الشبكة with the (optionally written, obligatorily pronounced) shadda ّ over the ش (connected form: ـشـ ).
(edit: diacritic might be too small to notice, might help to zoom in)

In KOReader, the diacritic would be "mixed" or "joined" with the letter itself ... it only needs some padding.

Just copying and pasting crengine's HarfBuzz drawing debug output, in case some nastaliq expert comes around some day (or someone who at least knows the letter names :) and can find out where/why/how the offsets might be wrong).

To understand nastaliq , one has to understand the etymology of the word. It's a portmanteau of naskh (literally "copy", meaning "copy script" here, an example of which is the passage in simple script you copied above) and ta'liq , meaning "hanging". The letters "hang" from their edge and "cascade" as opposed to copy script where they "flow" along the same line (hoping my explanation makes any sense).
However, I'm not sure how computers process nastaliq (as pre-drawn glyphs or processing on the fly) ... perhaps a speaker of Urdu or Farsi can help here.

The issue is very visible once you have that in mind. The first word from the right ضابطہ has nothing wrong. The first letter from the right in the next word, لسانی , has the letter لـ hanging alone, separated from the rest of the word.
The third word, عدمیت , has the ميـ part broken off from the rest of the word.
The next sentence is literally full of letters that hang either above or below the line, almost too many to list :-(

If there's anything else you want, you can as always ask me and I'll explain! :D

Contributor

WaseemAlkurdi commented Apr 25, 2020

This picture highlights the difference between naskh and nastaliq.
The nastaliq here is intentionally "likened" to the two examples of naskh (and "uglified", aesthetically speaking, with that simplification) to make the difference more obvious ...
Hope it helps!
image

Contributor Author

poire-z commented Apr 25, 2020

Thank you for the constant cultural insights :)

Yep, I got that Nastaliq was decrescendo, and I see that we cut the flow to start again from above within the same word, and that we should not.

In KOReader, the diacritic would be "mixed" or "joined" with the letter itself ... it only needs some padding.
However, I'm not sure how computers process nastaliq (as pre-drawn glyphs or processing on the fly) ... perhaps a speaker of Urdu or Farsi can help here.

Well, for all this, we don't do anything specific: we trust what HarfBuzz does and tells us (and HarfBuzz trusts what the font does and tells it), so a bug on our side would be us giving it bad input, or badly processing its output.

For example, for drawing this: image
HarfBuzz tells us (we are still drawing left to right, from the Arabic end to the start):

g=glyphnum c=logical char order (=t:char codepoint) [glyph index + nickname] advance offset
DTHB g0 c40(=t:627) [102 AlefFin.narrow]        advance=(6,0)
DTHB g1 c39(=t:646) [f OneDotAboveNS] advance=(0,0)     offset=(4,-11)
DTHB g2 c39(=t:646) [3d7 sp0]   advance=(0,0)
DTHB g3 c39(=t:646) [11a BehxIni.A]   advance=(6,0)
DTHB g4 c38(=t:627) [e4 AlefFin] advance=(7,0)
DTHB g5 c37(=t:646) [f OneDotAboveNS] advance=(0,0)     offset=(5,-11)
DTHB g6 c37(=t:646) [16f BehxMed.inT2outT1]     advance=(7,0)
DTHB g7 c36(=t:628) [12 OneDotBelowNS]  advance=(0,0)   offset=(3,0)
DTHB g8 c36(=t:628) [3d8 sp1]   advance=(0,0)
DTHB g9 c36(=t:628) [12d BehxIni.outT2] advance=(6,0)   offset=(0,4)
DTHB g10 c35(=t:20) [3 space]   advance=(4,0)

It's with the advance and offset that it tells us where to draw each glyph: for the OneDotAboveNS glyph, it tells us to draw it at x+4 (or +5) and y-11 from the normal x/y position, and not to advance (advance=(0,0)), so the next glyph is drawn at the same x/y.
I guess it gets that from some inner rules, and from some tables/hints in the font.
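
The advance/offset bookkeeping described above can be sketched as follows. This is only an illustration under assumptions, not crengine's drawing loop: `Glyph`, `Point` and `placeGlyphs` are hypothetical names, the values are taken as already converted to pixels, and y keeps the shaper's own orientation (mapping it to screen rows is a separate step).

```cpp
#include <cassert>
#include <vector>

// One shaped glyph as printed in the DTHB log (values already in pixels).
struct Glyph {
    int x_advance;
    int x_offset;
    int y_offset;
};

struct Point {
    int x;
    int y;
};

// Walk the glyphs in visual order. Each glyph is placed at pen + offset; only
// the advance moves the pen, so a mark with advance 0 shares its pen position
// with the glyph that follows it.
std::vector<Point> placeGlyphs(const std::vector<Glyph>& glyphs, int pen_x, int pen_y) {
    std::vector<Point> out;
    out.reserve(glyphs.size());
    for (const Glyph& g : glyphs) {
        out.push_back({pen_x + g.x_offset, pen_y + g.y_offset});
        pen_x += g.x_advance;  // the offset never affects the pen
    }
    return out;
}
```

Fed with the first four glyphs of the log above (advances 6, 0, 0, 6, with the dot at offset (4,-11)), the dot lands at pen+4 horizontally while the base glyph BehxIni.A is drawn at the unchanged pen position.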

So, if you think these dots are too close to the letters, maybe it's the font: Noto Sans Arabic "UI" may have this kind of rule to make it more "UI", so that it does not overflow menu items? :)
Do you get the same feeling with a more proper Arabic book font?

I think it should be the same for nastaliq: HarfBuzz or the font should know where a word starts and ends and how long it is, and decide to decrease the y offset (as we still draw left to right) as we progress in drawing the word, so each glyph would be a bit higher than the previous one. Depending on the length of the word, it might either start with a big (low) y offset, or use a smaller offset step. Dunno, that's just speculation :)

And I don't know why this is broken for us.
For this really broken one: image (it should rise from left to right as in Firefox: image)

We get (if I got that right, not sure):

DTHB g31 c20(=t:20) [3 space]   advance=(4,0)
DTHB g32 c19(=t:6cc) [113 YehxFin]    advance=(14,0)
DTHB g33 c18(=t:642) [10 TwoDotsAboveNS]        advance=(0,0)   offset=(4,3)
DTHB g34 c18(=t:642) [237 FehxMed.inT3outD2Y]   advance=(6,0)   offset=(0,10)
DTHB g35 c17(=t:6cc) [13 TwoDotsBelowNS]        advance=(0,0)   offset=(6,10)
DTHB g36 c17(=t:6cc) [18a BehxMed.inT2outT3]    advance=(8,0)   offset=(0,12)
DTHB g37 c16(=t:642) [10 TwoDotsAboveNS]        advance=(0,0)   offset=(6,10)
DTHB g38 c16(=t:642) [22d FehxMed.inT3outT2]    advance=(9,0)   offset=(0,16)
DTHB g39 c15(=t:62d) [3d7 sp0]  advance=(0,0)
DTHB g40 c15(=t:62d) [1f2 HahIni.outT3] advance=(14,0)  offset=(0,19)
DTHB g41 c14(=t:20) [3 space]   advance=(4,0)

So, increasing y offsets (0, 10, 12, 16, 19), which make it go down (drawing these glyphs left to right)...

It's a tough call ... but I'd prefer Noto Sans Arabic UI. A font that's "not book-like" is better than a font that might make users think something is "broken".
Cool, I'll be waiting. But basically, that would be the only addition needed for Arabic fonts: a good serif font (Noto Naskh Arabic) and a good sans font (Noto Sans Arabic UI).

OK, so let's go with Arabic UI as the Arabic fallback for books (as in this PR).
About fonts, I guess most people add their own preferred fonts, and maybe few are using our shipped Noto Sans or Noto Serif.
Would Noto Naskh Arabic be fine/enough for most Arabic readers? Or does everybody have some kind of native preferred Arabic font (Baghdad, Nadeem, Scheherazade, as I have seen), so that few would really stick with a standard Noto Naskh Arabic?

Member

NiLuJe commented Apr 25, 2020

Yep, the UI variants are specifically designed for constrained vertical space, so it's a likely interpretation ;).

Contributor Author

poire-z commented Apr 25, 2020

DTHB g31 c20(=t:20) [3 space]   advance=(4,0)
DTHB g32 c19(=t:6cc) [113 YehxFin]    advance=(14,0)
DTHB g33 c18(=t:642) [10 TwoDotsAboveNS]        advance=(0,0)   offset=(4,3)
DTHB g34 c18(=t:642) [237 FehxMed.inT3outD2Y]   advance=(6,0)   offset=(0,10)
DTHB g35 c17(=t:6cc) [13 TwoDotsBelowNS]        advance=(0,0)   offset=(6,10)
DTHB g36 c17(=t:6cc) [18a BehxMed.inT2outT3]    advance=(8,0)   offset=(0,12)
DTHB g37 c16(=t:642) [10 TwoDotsAboveNS]        advance=(0,0)   offset=(6,10)
DTHB g38 c16(=t:642) [22d FehxMed.inT3outT2]    advance=(9,0)   offset=(0,16)
DTHB g39 c15(=t:62d) [3d7 sp0]  advance=(0,0)
DTHB g40 c15(=t:62d) [1f2 HahIni.outT3] advance=(14,0)  offset=(0,19)
DTHB g41 c14(=t:20) [3 space]   advance=(4,0)

Same section with LD_LIBRARY_PATH=...koreader/libs ~/hb-shape --output-format=text --text-file=urdu.txt NotoNastaliqUrdu-Regular.ttf:

[<glyph name or index>=<glyph cluster index within input>
  @<horizontal displacement>,<vertical displacement>
  +<horizontal advance>,<vertical advance>|...]
space=20+270
YehxFin=19+1067
TwoDotsAboveNS=18@265,263+0
FehxMed.inT3outD2Y=18@0,778+455
TwoDotsBelowNS=17@470,739+0
BehxMed.inT2outT3=17@0,922+619
TwoDotsAboveNS=16@479,762+0
FehxMed.inT3outT2=16@0,1194+668
sp0=15+0
HahIni.outT3=15@0,1423+1096
space=14+270

The vertical displacements increase in the same way as with my code (cre or xtext).

@WaseemAlkurdi: just to be sure I'm not messing up my Arabic reading :) can you confirm that in this:
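
For comparing such output against the DTHB log, a small parser for one item of the text format quoted in the legend above might look like the sketch below. `ShapedGlyph` and `parseItem` are hypothetical helper names, not part of HarfBuzz; only the horizontal advance is read, and splitting a full line of items is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// One item of the text output: name=cluster[@x_offset,y_offset]+x_advance
struct ShapedGlyph {
    std::string name;
    int cluster = 0;
    int x_offset = 0;
    int y_offset = 0;
    int x_advance = 0;
};

// Parse one item such as "FehxMed.inT3outD2Y=18@0,778+455" or "space=20+270".
// Returns false (leaving out untouched) when the item does not have that shape.
bool parseItem(const std::string& item, ShapedGlyph& out) {
    const std::size_t eq = item.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    const std::size_t plus = item.find('+', eq);
    if (plus == std::string::npos) return false;
    const std::size_t at = item.find('@', eq);
    try {
        ShapedGlyph g;
        g.name = item.substr(0, eq);
        g.cluster = std::stoi(item.substr(eq + 1));  // stops at '@' or '+'
        if (at != std::string::npos && at < plus) {
            // Optional displacement part: @x,y
            const std::size_t comma = item.find(',', at);
            if (comma == std::string::npos || comma > plus) return false;
            g.x_offset = std::stoi(item.substr(at + 1, comma - at - 1));
            g.y_offset = std::stoi(item.substr(comma + 1, plus - comma - 1));
        }
        g.x_advance = std::stoi(item.substr(plus + 1));
        out = g;
        return true;
    } catch (const std::exception&) {
        return false;  // a numeric field was missing or malformed
    }
}
```

The values it returns are in font units, as in the output above, so they are only comparable to the pixel values in the DTHB log in terms of sign and ordering.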
حقیقی image
the letters involved, in the visual LTR order that we see here, are somehow named as printed above:
Final Yehx + Fehx + Behx + Fehx + Hah initial

Contributor Author

poire-z commented Apr 25, 2020

Uh !!!
Just thought I'd give it a try (in case I messed up my arithmetic :) with:

                             buf->Draw(x + item->origin_x + FONT_METRIC_TO_PX(glyph_pos[i].x_offset),
-                                      y + _baseline - item->origin_y + FONT_METRIC_TO_PX(glyph_pos[i].y_offset),
+                                      y + _baseline - item->origin_y - FONT_METRIC_TO_PX(glyph_pos[i].y_offset),
                                       item->bmp,

and... (KOReader | Firefox):
image

which is a lot better ! :)

Just to be sure I don't have to + > - the x_offset, tell me this is bad, the dots should be misplaced:
image

Member

NiLuJe commented Apr 25, 2020

I do remember that some stuff is top-to-bottom while other is bottom-to-top where metrics/bbox are involved, but I don't recall the details OTOH :s.

(i.e., if oversight there was, I'd only expect it to be on the y axis).

Contributor Author

poire-z commented Apr 25, 2020

Yep, https://www.freetype.org/freetype2/docs/glyphs/glyphs-2.html

The grid is always oriented like the traditional mathematical two-dimensional plane, i.e., the X axis goes from the left to the right, and the Y axis from bottom to top.

I guess I didn't think much about it: everything must get an offset_y of 0 from HB in our Latin world, so we were just using the font origin_y, which had the correct sign.
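
The sign issue can be shown in isolation with the sketch below. It mirrors the arithmetic of the one-line diff earlier in this thread, but the function names and the sample numbers are illustrative assumptions, not crengine code.

```cpp
#include <cassert>

// Screen row for a glyph bitmap, with the screen's y axis pointing down.
// y_offset_px is the shaper's offset converted to pixels, in the font
// convention where y points up, hence the subtraction.
int glyphScreenY(int y, int baseline, int origin_y, int y_offset_px) {
    return y + baseline - origin_y - y_offset_px;
}

// The pre-fix arithmetic, kept only for comparison: a positive (upward)
// offset moved the glyph down the screen.
int glyphScreenYBeforeFix(int y, int baseline, int origin_y, int y_offset_px) {
    return y + baseline - origin_y + y_offset_px;
}
```

With the y offsets 10, 12, 16, 19 from the log above, the first function yields decreasing screen rows (the word rises from left to right), while the second yields increasing ones (the word sinks), which matches the broken rendering.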

Contributor Author

poire-z commented Apr 25, 2020

OK, let's merge this one.
I'll fix/bump offset_y in crengine and xtext tomorrow, which might solve the issue of diacritics too close to the word that @WaseemAlkurdi mentioned.

(xtext does not need any fixing, textwidget.lua and textboxwidget.lua do, that's where we use y_offset)

poire-z merged commit 7d83a0c into koreader:master Apr 25, 2020
poire-z deleted the bump_crengine branch April 25, 2020 18:20
@WaseemAlkurdi
Contributor

@poire-z

You've literally fixed it!

Just to be sure I don't have to + > - the x_offset, tell me this is bad, the dots should be misplaced:

And exactly, the dots are misplaced! (For instance, look at the second line, last word from the left. The dots should be shifted about one letter to the right)

@WaseemAlkurdi: just to be sure I'm not messing up my Arabic reading :) can you confirm that in this:
حقیقی image
the letters involved, in the visual LTR order that we see here, are somehow named as printed above:
Final Yehx + Fehx + Behx + Fehx + Hah initial

Again, this is correct! :-)

Would Noto Naskh Arabic be fine/enough for most Arabic readers? Or does everybody have some kind of native preferred Arabic font (Baghdad, Nadeem, Scheherazade, as I have seen), so that few would really stick with a standard Noto Naskh Arabic?

Naskh fonts are what is used in all (paper) books, the Arabic parallel of a serif font (not a serif font itself, but the typographical equivalent of serif). Therefore, it is pretty much enough for Arabic, as most books are printed in it.
Above and beyond that, it's only two things: personal taste in fonts (users would install fonts) and nastaliq for Farsi and Urdu (these two can be written with naskh, but the use of nastaliq is more widespread there than with Arabic or Kurdish. It's not an issue at all now with your fixes)

Contributor

ptrm commented May 1, 2020

Thanks a lot for embedded lang tags support! Now I will try to make Calibre detect foreign languages during conversion :D

Btw, does the tag mean an HTML tag, or also an HTML attribute (e.g. <blockquote lang="fr">)?

Edit: having read the long press description, I assume they're just meta tags ;>

Member

Frenzie commented May 1, 2020

@ptrm I think you're looking for #6069 and/or koreader/crengine#337

Parse and store values from lang= attributes, so we can
propagate a TextlangCfg object to all calls dealing with
text, which will allow to:

Contributor Author

poire-z commented May 1, 2020

That might be explained in a bit more detail in the "typography" PRs: #6069 and #6072.
So, if you don't set a "default" typography language, it will try to guess it from the book metadata lang tag and use that (or the "fallback" typography language you have set, if none could be guessed).
Then, if you enable "Respect embedded lang tags" (enabled by default), it will adjust to any <blockquote lang="fr"> it meets (if disabled, it won't, and will use the main typography lang for all text, ignoring their lang tags).
When respecting them, it will use the French hyphenation dict and line breaking rules (and won't prevent a single letter at end of line for these blocks), even if your book is Polish.
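
The selection order described above can be summed up in a short sketch. The function names and the use of empty strings for "not set" are assumptions made for illustration; this is not the code from the typography PRs.

```cpp
#include <cassert>
#include <string>

// Main typography language for a book: the user's "default" if set, else the
// language guessed from the book metadata, else the user's "fallback".
// An empty string means "not set" / "nothing guessed".
std::string mainTypographyLang(const std::string& userDefault,
                               const std::string& bookMetadataLang,
                               const std::string& userFallback) {
    if (!userDefault.empty()) return userDefault;
    if (!bookMetadataLang.empty()) return bookMetadataLang;
    return userFallback;
}

// Language used for one block of text: its own lang= attribute when
// "Respect embedded lang tags" is enabled and the attribute is present,
// otherwise the book's main typography language.
std::string langForBlock(const std::string& mainLang,
                         const std::string& blockLangAttr,
                         bool respectEmbeddedLangTags) {
    if (respectEmbeddedLangTags && !blockLangAttr.empty()) return blockLangAttr;
    return mainLang;
}
```

For a Polish book with no default set, `mainTypographyLang("", "pl", "en")` gives "pl", and a `<blockquote lang="fr">` inside it resolves to "fr" only when the embedded tags are respected.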

Btw, any feedback with the added typography rules for polish? No bad side effects?

having read the long press description, I assume they're just meta tags ;>

Damned :( We were so bad at describing the feature then ?! :)

Contributor

ptrm commented May 1, 2020

That might be explained in a bit more detail in the "typography" PRs: #6069 and #6076.
So, if you don't set a "default" typography language, it will try to guess it from the book metadata lang tag and use that (or the "fallback" typography language you have set, if none could be guessed).
Then, if you enable "Respect embedded lang tags" (enabled by default), it will adjust to any <blockquote lang="fr"> it meets (if disabled, it won't, and will use the main typography lang for all text, ignoring their lang tags).
When respecting them, it will use the French hyphenation dict and line breaking rules (and won't prevent a single letter at end of line for these blocks), even if your book is Polish.

That's what I wanted, yes. You mean it's already working :D ? Because to test it I would need some semi-automatic work on my epubs, most publishers don't care about adding lang attributes ;)

Btw, any feedback with the added typography rules for polish? No bad side effects?

So far I'm awestruck with the multifallbacks, paging through books with math and phonetic notations without changing any settings :D But Polish books look cool so far (though they're all published with &nbsp;s after one-letter words, so testing is harder this way ;)

having read the long press description, I assume they're just meta tags ;>

Damned :( We were so bad at describing the feature then ?! :)

The plural "tags" misled me. Now I know it refers to all potentially opened books, but I thought it referred to many tags in one book ;) Hard to avoid, I think.

Contributor Author

poire-z commented May 4, 2020

Well, we're not as good as Firefox :/

KOReader | Firefox:

image

Even with more interline space:
image

Source : https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
:)

edit: a bit better with NotoSans as the first fallback font:
image
or with only NotoSerif
image
NotoSans only with no added interline space:
image

NotoSansCJK, FreeSans and FreeSerif all are bad at that.
But NotoSansCJK has many (other) issues, so I feel like pushing NotoSans or NotoSerif first (but they feel lighter than FreeSans vs my preferred font).
But we don't know if the user's preferred main font is a Sans or a Serif... So it's a bit hard to impose one.

NiLuJe added a commit to NiLuJe/koreader-fonts that referenced this pull request Jun 7, 2020
re koreader/koreader#6090 (comment)

https://github.com/googlefonts/noto-fonts @ 0fa1dfabd6e3746bb7463399e2813f60d3f1b256

(i.e., hinted, not the Phase III WIP builds).

Linked issue (may be closed by merging this pull request): cannot display some unicode characters