Skip to content

fix cjk fullwidth character display#1632

Closed
novload wants to merge 4 commits intoMudlet:developmentfrom
novload:fix-cjk-fullwidth-display
Closed

fix cjk fullwidth character display#1632
novload wants to merge 4 commits intoMudlet:developmentfrom
novload:fix-cjk-fullwidth-display

Conversation

@novload
Copy link
Copy Markdown

@novload novload commented Apr 30, 2018

Fix when a line contains full-width character, some characters may be overlap other characters.

before:
before

after:
after

@novload novload requested a review from a team as a code owner April 30, 2018 22:18
@novload novload requested a review from a team April 30, 2018 22:18
@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 1, 2018

This is an interesting approach to solving the problem but it seems to be based around wchar_t whose size is implementation dependent - specifically can this algorithm handle non-BMP Unicode code-points because the underlying Qt QString type is UTF-16 and not ucs2 - does it handle the case when the QChar passed to (int) mk_wcwidth_cjk(...) is a High or Low surrogate?

I would note that I have a pending PR that also partially works in this area which actually measures the advance for each grapheme drawn (and which thus allows the use of proportional fonts as that may be necessary should there not be a mono-spaced font with the wanted glyph to represent a grapheme) - which works quite well for non-BMP and/or combining diacriticals, but does produce an output must less uniformly spaced than this one does!

I would like to hold off on directly using this PR as is - but I would like to investigate more closely to see if it is possible to find a way to access this half-width/full-width directly from the QChar class - although superficially it seems that the East Asian Width property (30) not one that is accessible currently.

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 1, 2018

@novload thanks so much for the contribution! 👍

@SlySven is this something you will get to soon then? You have a lot of other PRs open and in-process, so perhaps @novload could adjust this one meanwhile to handle those other edgecases and get it merged, since his produces correct uniform output already.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 1, 2018

Humm, the algorithm dates back to 2007 and only goes up to Unicode 5.0 (currently it is at 10.0) so there are codes that are not covered. Also it examines only a single QChar at a time and I can confirm it breaks on Non-BMP characters (which need two QChars to convey the more than 2^16 Unicode code-points)
fix_cjk_1
fix_cjk_2
This is the current development code showing the same things:
development_1
development_2
I've quickly rebased the prototype code onto the current development and get this:
prototype_1
prototype_2
One important factor is that by limiting the font selection to monspace fonts it is impossible to choose a font that actually supports East Asian writing systems(scripts) - so this code allows the selection of proportional fonts and to filter them by Script:
prototype_3
So that the Phukka sample looks like:
prototype_2_chinese_font

I think the limit to @novload solution is that it is working on a per QChar basis and using effectively a massive lookup table to say whether to use 1 or 2 spaces for each of them - the algorithm is the one referred to in the first answer to this S.E. question "How to know the preferred display width in columns of Unicode characters?" and the author of the algorithm does point out that it is experimental here.

What I think is the best solution is what I am doing (measuring the width of each glyph - which can comprise a number of QChars) but to reproduce the effect that this proposal does is to then compare that width to a threshold of 150% of the (int) QFontMetrics::averageCharWidth() and then use either 1 or 2 times that averageCharWidth rather than the actual width. This depends on the hopefully not too unreasonable expectation that any glyph will have a non-zero width and can be drawn centred in a space of one or two to reproduce a mono (or actually duo-spaced) appearance - however I now realise that TAB characters will have to exceptionally bypass this process.

@novload novload closed this May 2, 2018
@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 2, 2018

@SlySven I'm not sure what is the action you're proposing?

@vadi2 vadi2 reopened this May 2, 2018
@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 2, 2018

👍

* @param nextChar [out] character unicode
* @return number of character of grapheme
*/
inline int TTextEdit::getNextGraphemeSize(const QString& str, int startOffset, /* out */ uint *character)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too simplistic - you will get a more thorough solution with a QTextBoundaryFinder set to hunt for a QTextBoundaryFinder::Grapheme type boundary - it is what I use...! :neckbeard:

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

{
int charWidth = mk_wcwidth_cjk(unicode) == 2 ? 2 : 1;

bool isBold = charStyle.flags & TCHAR_BOLD;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this is all going to clash rather with #1633 ... 🤷‍♂️

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 3, 2018

I'm not sure that code "Markus Kuhn -- 2007-05-26 (Unicode 5.0)" is any good ten years later because it considers each USC2 (i.e. BMP only) code individually. @novload has since pushed up another commit that might work to add support for characters beyond that - i.e that use a Surrogate pair (of QChars) to carry it in the QString - but I haven't inspected it yet to see the effectiveness of it. I want to see how it works with arbitrary (and possibly multiple) combining diacriticals; in that case, although the diacriticals do not add anything to the overall glyph width, but the presence of more than one (e.g. in the Thai language which may use two onto a base glyph) might throw off lesser quality code; also there are more than one set of combining characters to worry about (at least in the latest Unicode 10) see Wikipedia on Combining character for more details.

I was thinking of taking the underlying premise of the OP - only using 1 or 2 spaces for each glyph but using a single const static QHash<quint32, quint8> where the key is the Unicode Codepoint and quint is a code to represent the East-Asian width and to write a simple parser to read a copy of the latest version of EastAsianWidth.txt from the Unicode organisation (and to store the resultant table in the Mudlet home directory for later reloading) and then use the largest width from any of the code-points in the grapheme/glyph to provide a width for the whole glyph - with an exemption to bypass the tab character because that can be of arbitrary width depending on where it is drawn in the line of text (and we cannot control where the tab stops are with the QPainter::drawText(...) methods).

<aside>The most recent output from my attempt on this area of the code (which just tries to round the measured width for the glyph when it is painted onto a dummy QPixMap background) and is not yet shown here was not as lined up as I had hoped because the widths that I was seeing from the proportional font chosen because it supported the Simplified Chinese writing system (none of the Deja Vu/Bitstream etc. fonts do BTW) were not all wide enough to meet the threshold (1.5 x QPainter::fontMetrics.averageWidth()) that I was using to force them to be drawn on a 2 x QPainter::fontMetrics.averageWidth() space as opposed to a 1 x one.</aside>

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 3, 2018

Ah, this seems to be the document that sort of clarifies how the duospaced stuff is worked out - at least it explains some of it and what the East Asian Width classification of Unicode code-points means: http://www.unicode.org/reports/dtr11.html .

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 3, 2018

@SlySven I'm not clear what would you like @novload to do besides switch TTextEdit::getNextGraphemeSize to use QTextBoundaryFinder::Grapheme - is there anything else?

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 4, 2018

@novload could you merge latest development in? That will fix the macOS Travis build failing.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 4, 2018

The code that @novload is using is Markus Kuhn's wcwidth.c and seems to be dated 2007-05-26 (Unicode 5.0), and I think, is likely to be out of date in the ranges it uses to say whether something should use one or two characters {my Qt 5.9 is using Unicode 8.0 and the current standard is actually 10.0}. I have been experimenting with trying to use something that is generated from the most up to date data which is held in the Unicode organisation's EastAsianWidth.txt file - this property is described as informative. That file details 480 odd ranges (some of only a single code point) in a deliberately parsable format and assigns one of the following values to each codepoint:

  • F - Full width (Japanese "zenkaku" 1 Em) - always 2 characters width
  • H - Half width (Japanese "hankaku" 1/2 Em) - always 1 character width
  • N - Neutral - (Not in East Asian baliwick so treat as 1 character wide!)
  • Na - Narrow - implicitly 1 characters width (comes from a 1-byte encoding in Shift-JIS - has W counterpart
  • W - Wide - implicitly 2 characters width - may have a H counterpart
  • A - Ambiguous - could be 1 or 2 characters in width, the characters are in the "compatibility" zone but are not explicitly marked H or F and which to use depends on whether they are being used in an East-Asian context or not.

So far I have a method that uses a couple of QRegularExpressions and can extract the first and last codepoint in each range. However I am still trying to come up with an algorithm that, given a Unicode codepoint, can return one of the above as a value in a fast and compact manner. A QHash<quint32, quin8> key being codepoint (could alternatively be a ucs4), value being a coded number and the smallest type usable I believe is one solution. However such a table will need to hold 0x10FFFF which is 1,114,111 values - which is going to be fast but NOT compact...

In practice I'd also use a QCache with, say, around 2^8 spaces for each Host between a central shared table and each profile's use of it. I'd allow a binary (bool) option for each profile and use it to allow the user to say whether 'A' codepoints are drawn in a half or one Em-space (or perhaps 1 or 2 times QFontMetrics::averageWidth()). Ultimately then, each grapheme would be drawn in the widest value that each codepoint would produce when put through such a look up - with the important exception being the tab character which would have to be handled separately - i.e. we'd print it on-screen and then measure the width it took and update the positioning details for the following characters as needed.

I think this represents the best way to get this PR into the code-base - I'll try to get a PR onto @novload repo ASAP - by 2018/05/07 12:00 UTC Monday!)

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 4, 2018

Hm, just checking, you are not re-implementing parts of libicu, are you? :)

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 5, 2018

Thanks so much for doing the update @novload 😄

@SlySven anything else needed for this? I know you'd like to extend the support to other languages as well, but just having cjk support in to begin with would be really nice.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 7, 2018

Sorry but I haven't got to this by my self imposed deadline - gimme another 12 hours?

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 7, 2018

Right, there is an #include <QTextBondaryFinder> missing from the top of /src/TTextEdit.cpp and there are ranges of characters that are not being spaced correctly - I'm not sure yet whether this is because they are not included in the right test ranges in that port of wcwide.c or because they are too new, though, given that the main offender here is U+23AF {Horizontal Line Extension} er, got that wrong it was U+2500 {Box Drawing Light Horizonatl} and that has been in since Unicode 3.21.1apparently, I do not think the latter is the case:
fix_cjk_3

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 7, 2018

Right - this PR's scope is cjk characters, do they all work?

Happy to have a follow-up PR that extends the scope, but let's not scope creep here

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 7, 2018

Ah, I think I have an idea what is going on, we are now using mk_wcswidth_cjk about which the original coder says (with my emphasis):

The following functions [mk_wcwidth_cjk and mk_wcswidth_cjk] are the same as mk_wcwidth() and mk_wcswidth(), except that spacing characters in the East Asian Ambiguous (A) category as defined in Unicode Technical Report #11 have a column width of 2. This variant might be useful for users of CJK legacy encodings who want to migrate to UCS without changing the traditional terminal character-width behaviour. It is not otherwise recommended for general use.

Guess what! The box drawing characters are amongst those character that are of Unicode East-Asian Width catagory "A" - Ambiguous, that are listed at the start of (int)) mk_wcwidth_cjk(uint ucs).

The trouble is that the current code is permanently set into the above mode - what I will attempt to do is make the use of the _cjk variants dependent on the discovery of other non-ambiguous wide characters in any line and use that to control whether to use them in, initially for testing purpose, in that line.

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 7, 2018

Ok. What do you think of libicu?

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 7, 2018

I am trying not to - I guess it might be included internally somewhere in the Qt libraries but I'm just going to keep coding away - I think I can get the EAWidth reliably now (shamelessly using the binary(?) search to search in an array of ranges of Unicode codepoint and a copy of the current widths texts that I have converted via a spreadsheet into a static table (actually a static const QVector<unicodeProperties> mUnicodeRanges) where unicodeProperties is:

struct unicodeProperties {
    uint_fast32_t firstCodepoint;
    uint_fast32_t lastCodepoint;
    EAWidthType width;
    QChar::Category catagory;
};

and EAWidthType is:

enum class EAWidthType : uint_fast8_t {
    widthUnknown = 0x0,
    WidthAmbiguous = 0x1,
    WidthFull = 0x2,
    WidthHalf = 0x4,
    WidthNarrow = 0x8,
    WidthNeutral = 0x10,
    WidthTab = 0x20,  // Special handling needed
    WidthWide = 0x40
};

I'm just trying to weld that into the code so that each grapheme use 2 spaces if there is a WidthFull OR WidthWide OR (we are in an East Asian context AND there is WidthAmbiguous) Codepoint in the grapheme and using 1 space otherwise (except for tabs obviously). This should cover CJK as well as it covers Latinate ones - which the PR current doesn't do so good on, as it is breaking on those box drawing and other ambiguous width codepoints because it is forced, as it were, into the CJK mode. As far as I can see the current PR code sort of considers the first codepoint in the grapheme but not the remaining ones which will be an issue if they are what forces the grapheme to need to be a wide one. I am getting there, slowly, I think.

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 8, 2018

It does not seem to me like it's a good idea to write our own Unicode library... it's far more complicated to telnet and you'd like to switch away from our homegrown telnet code.

How maintainable will this be by others? I don't understand half of the concepts here, I don't know Unicode that intimately and I haven't spent time on it. If there's a bug, we'll all be dependent on you as the only one who can really solve it, nobody else can help. Plus, performance - libicu is almost literally used by everything in the world - so they must have ironed out the speed of it pretty darn well.

Have a think about this, it might not be such a good thing.

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 11, 2018

@SlySven what do we need to do to push this along?

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 11, 2018

Why can't we automate it? Majority of people will not fully understand what the message means and what problem is it describing.

It is not clear, exactly, when it will be needed - so rather than trying to do it automatically and getting it wrong, leave it up to a setting that can be changed if needed.

⌚️ I've put up a PR on @novload Repository: https://github.com/novload/Mudlet/pull/1 to do this - after they have git pulled in recent changes to the development branch here into their branch all but the last two commits there should go away and then they can merge that PR into their branch so that it will show up here. The added control takes effect retroactively on the display to change the drawing of the ambiguous width glyphs... 😀

I (or someone else) will need to check and update the ranges of characters in the added C code from Markus Kuhn from 2007-05-26 (Unicode 5.0) to bring it up to date.

@SlySven SlySven assigned novload and unassigned SlySven May 12, 2018
@SlySven SlySven added the Waiting for other PR Merge This PR is awaiting another PR to merge firstly. label May 12, 2018
@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 12, 2018

Anyone playing on CJK will have to enable this before they'll see stuff correctly, isn't that the case?

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 12, 2018

I think that is the case but that is independent of the Server encoding (well, apart from the fact that the whole ambiguous width characters is related to the transcoding of encodings such as Shift-JIS to Unicode) - it is more likely to be related to the locale - but that is not available yet! 😊

The effects do seem to be reversible - change the control that I am proposing in the PR to novload's repository and the display changes to match the setting including content already displayed (though word wrapping might be an issue - but that is already iffy for non-ASCII characters) - unlike the Server encoding where if it is wrong the display is garbage and cannot be fixed - and the errors do show up as spacing/layout issues rather than making the content unreadable.

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 12, 2018

Mudlet needs to work out of the box - and not displaying the characters correct is not working out of the box. How can we automate it?

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 12, 2018

We might be able to when we allow/respond to users' locales. But fundamentally I do not think we can get this right in advance for every case - that is why the characters that give this spacing problem are labelled as having an ambiguous East-Asian width, in some settings they should be drawn as half-width/one space wide and in other contexts as full-width/two spaces wide. It is only a problem I think for situations where the text is being displayed in a duo-spaced mode rather than the conventional (these days) proportional manner - unfortunately that hits our application because we/MUDs are trying to emulate an old-school mono-spaced terminal.

Getting it wrong means the layout is not perfect - as the previous samples above show with some-things being draw with the wrong spacing. {Note this is not where texts are being drawn on top of each other - that is just bad painting code that is our fault} - but the messed up display, well, it is generally annoying rather than fatal to the application...

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 12, 2018

Why can't we tie this to the encoding used? Won't work for utf-8, but it'll work for the more selective ones, no?

Locale is not a reliable measure - a Chinese user in US could have their locale set to English and want to play a Chinese MUD back home.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 12, 2018

Well if they set their locale to en-US they are saying they want the language used to be American but their name could still be written in Traditional Chinese characters (Oh, that would make it very hard for US Keyboard users to type kill <Chinese player's name> as they wouldn't have the input method to enter the name! 🤣)

How would you propose mapping encoding to narrow/wide width for ambiguous EAWidth characters? Don't forget that GB18030 supports all the characters that UTF-8 does but it does not necessarily imply that using that encoding should be mapped to a wide width for those ambiguous characters IMHO. Can @novload and/or @markx offer any input on this?

Oh, further places where this sort of thing has cropped up:

In summery - I do not think we can assume anything about which users need which setting and we should just say - "If the text from the MUD does not look to be laid out correctly - and you are using an East Asian [what ever that means!] MUD you may get a better layout with this option checked."

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 12, 2018

If a person picked GB18030, it is very likely they will be playing with CJK characters, and it is extremely likely they would want them rendered correctly.

So, when the user picks that and related encodings, this option should be auto-enabled.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 12, 2018

Found another case which might be interesting reading: Extended-width characters only display in a single column

Okay, I can accept that that might be a reasonable default but I would not want to rigidly lock it into place - let me see how that can be done (I think I can tristate the checkbox option).

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 12, 2018

Don't need a tristate - just enable the option by default when that encoding is selected. The user can still disable it manually afterwards if that's incorrect.

Tristate would be complicated to work with :(

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 13, 2018

No it would not and it is actually a nice UI way to do it - I mean using the third (the tri-) state to allow automatic selection:

  • Qt::partiallyChecked (default) "automatic" setting: use "wide" width for ambiguous characters when the encoding is GBK or GB18030; use "narrow" otherwise. (Using the encoding for now, until we get locale control in for changeable GUI language.)
  • Qt::unchecked "manual" setting: use 'narrow'" width for ambiguous characters whatever the encoding (but probably will be used for Western MUDS).
  • Qt::checked "manual setting use 'wide' width for ambiguous characters whatever the encoding (but probably will be used for EastAsian MUDS).

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 13, 2018

Fair enough, lets have a look.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 13, 2018

OK the tri-state control is in place on my PR to @novload 's development branch - we now need https://github.com/novload/Mudlet/pull/1 to be merged into that branch - then it will show up here.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 15, 2018

Something went wrong at novload's repo end (my best guess is that they thought they had don't something wrong and reverted using the GitHub tools the merge of my PR into their development branch which is what this PR is) - it seems that GitHub then forgets about the reverted change and neither the wanted commit 78deaba or the immediately post-revert commit (now showing on the repo as revert-1-fix-cjk-fullwidth-display) ae57742 show up here in this PR even though the novoload development branch - which is currently at the first of these - is supposedly what should be here, now. 😖 - the small print above says "Add more commits by pushing to the fix-cjk-fullwidth-display branch on novload/Mudlet - but there is NOT such a branch there."

To fix this I am going to copy the commits into a new PR against my repository - the only change I plan to do is to add the * Copyright (C) 2018 by ????? ????? - novload@outlook.com * lines into the files that they have touched - I just need @novload to tell me here or on their repo what goes into the "????? ?????" bit should be...? 😀

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 15, 2018 via email

@novload
Copy link
Copy Markdown
Author

novload commented May 16, 2018

My name is Huadong Qi.
@SlySven Thank you.

@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 16, 2018

Ah, it sounds like you are a speaker of one of the CJK (East Asian) Languages!

We really need good users of those languages to help get Mudlet 4.0 ready for I18n and we are currently talking about which languages we can do our selves on Discord {This is a permanent invite link to the mudlet-development channel.}

@SlySven SlySven removed the Waiting for other PR Merge This PR is awaiting another PR to merge firstly. label May 16, 2018
@SlySven
Copy link
Copy Markdown
Member

SlySven commented May 16, 2018

Closed because of a problem in combining code from two places - which have been copied and combined at #1668 - so please go there.

Huadong Qi (please advise me, should I say "Huadong" or "Qi" or "novload" as a short name? Which one is best?)

I am "Stephen"...

Can you look at PR 1668 as well as @vadi2 and say if you think it is 👍 or 👎 ? 😄

@SlySven SlySven closed this May 16, 2018
@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 22, 2018

@novload your improvement is in: 37eeb41

It'll be available in the next Mudlet release (in a couple of weeks now). Thanks so much for your contribution, hope to see more! :)

@vadi2
Copy link
Copy Markdown
Member

vadi2 commented May 28, 2018

@SlySven doing some profiling and look at who shows up:

selection_303

Looks like you were right, we have access to HarfBuzz already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants