fix cjk fullwidth character display#1632
fix cjk fullwidth character display#1632novload wants to merge 4 commits intoMudlet:developmentfrom novload:fix-cjk-fullwidth-display
Conversation
|
This is an interesting approach to solving the problem but it seems to be based around I would note that I have a pending PR that also partially works in this area which actually measures the advance for each grapheme drawn (and which thus allows the use of proportional fonts as that may be necessary should there not be a mono-spaced font with the wanted glyph to represent a grapheme) - which works quite well for non-BMP and/or combining diacriticals, but does produce an output must less uniformly spaced than this one does! I would like to hold off on directly using this PR as is - but I would like to investigate more closely to see if it is possible to find a way to access this half-width/full-width directly from the |
|
@novload thanks so much for the contribution! 👍 @SlySven is this something you will get to soon then? You have a lot of other PRs open and in-process, so perhaps @novload could adjust this one meanwhile to handle those other edgecases and get it merged, since his produces correct uniform output already. |
|
Humm, the algorithm dates back to 2007 and only goes up to Unicode 5.0 (currently it is at 10.0) so there are codes that are not covered. Also it examines only a single I think the limit to @novload solution is that it is working on a per What I think is the best solution is what I am doing (measuring the width of each glyph - which can comprise a number of |
|
@SlySven I'm not sure what is the action you're proposing? |
|
👍 |
src/TTextEdit.cpp
Outdated
| * @param nextChar [out] character unicode | ||
| * @return number of character of grapheme | ||
| */ | ||
| inline int TTextEdit::getNextGraphemeSize(const QString& str, int startOffset, /* out */ uint *character) |
There was a problem hiding this comment.
I think this is too simplistic - you will get a more thorough solution with a QTextBoundaryFinder set to hunt for a QTextBoundaryFinder::Grapheme type boundary - it is what I use...! ![]()
| { | ||
| int charWidth = mk_wcwidth_cjk(unicode) == 2 ? 2 : 1; | ||
|
|
||
| bool isBold = charStyle.flags & TCHAR_BOLD; |
There was a problem hiding this comment.
BTW this is all going to clash rather with #1633 ... 🤷♂️
|
I'm not sure that code "Markus Kuhn -- 2007-05-26 (Unicode 5.0)" is any good ten years later because it considers each USC2 (i.e. BMP only) code individually. @novload has since pushed up another commit that might work to add support for characters beyond that - i.e that use a Surrogate pair (of I was thinking of taking the underlying premise of the OP - only using 1 or 2 spaces for each glyph but using a single <aside>The most recent output from my attempt on this area of the code (which just tries to round the measured width for the glyph when it is painted onto a dummy QPixMap background) and is not yet shown here was not as lined up as I had hoped because the widths that I was seeing from the proportional font chosen because it supported the Simplified Chinese writing system (none of the Deja Vu/Bitstream etc. fonts do BTW) were not all wide enough to meet the threshold (1.5 x |
|
Ah, this seems to be the document that sort of clarifies how the duospaced stuff is worked out - at least it explains some of it and what the East Asian Width classification of Unicode code-points means: http://www.unicode.org/reports/dtr11.html . |
|
@novload could you merge latest |
|
The code that @novload is using is Markus Kuhn's wcwidth.c and seems to be dated 2007-05-26 (Unicode 5.0), and I think, is likely to be out of date in the ranges it uses to say whether something should use one or two characters {my Qt 5.9 is using Unicode 8.0 and the current standard is actually 10.0}. I have been experimenting with trying to use something that is generated from the most up to date data which is held in the Unicode organisation's EastAsianWidth.txt file - this property is described as informative. That file details 480 odd ranges (some of only a single code point) in a deliberately parsable format and assigns one of the following values to each codepoint:
So far I have a method that uses a couple of In practice I'd also use a I think this represents the best way to get this PR into the code-base - I'll try to get a PR onto @novload repo ASAP - by 2018/05/07 12:00 UTC Monday!) |
|
Hm, just checking, you are not re-implementing parts of libicu, are you? :) |
|
Sorry but I haven't got to this by my self imposed deadline - gimme another 12 hours? |
|
Right - this PR's scope is cjk characters, do they all work? Happy to have a follow-up PR that extends the scope, but let's not scope creep here |
|
Ah, I think I have an idea what is going on, we are now using
Guess what! The box drawing characters are amongst those character that are of Unicode East-Asian Width catagory "A" - Ambiguous, that are listed at the start of The trouble is that the current code is permanently set into the above mode - what I will attempt to do is make the use of the |
|
Ok. What do you think of libicu? |
|
I am trying not to - I guess it might be included internally somewhere in the Qt libraries but I'm just going to keep coding away - I think I can get the EAWidth reliably now (shamelessly using the binary(?) search to search in an array of ranges of Unicode codepoint and a copy of the current widths texts that I have converted via a spreadsheet into a static table (actually a and I'm just trying to weld that into the code so that each grapheme use 2 spaces if there is a |
|
It does not seem to me like it's a good idea to write our own Unicode library... it's far more complicated to telnet and you'd like to switch away from our homegrown telnet code. How maintainable will this be by others? I don't understand half of the concepts here, I don't know Unicode that intimately and I haven't spent time on it. If there's a bug, we'll all be dependent on you as the only one who can really solve it, nobody else can help. Plus, performance - libicu is almost literally used by everything in the world - so they must have ironed out the speed of it pretty darn well. Have a think about this, it might not be such a good thing. |
|
@SlySven what do we need to do to push this along? |
It is not clear, exactly, when it will be needed - so rather than trying to do it automatically and getting it wrong, leave it up to a setting that can be changed if needed. ⌚️ I've put up a PR on @novload Repository: https://github.com/novload/Mudlet/pull/1 to do this - after they have I (or someone else) will need to check and update the ranges of characters in the added C code from Markus Kuhn from 2007-05-26 (Unicode 5.0) to bring it up to date. |
|
Anyone playing on CJK will have to enable this before they'll see stuff correctly, isn't that the case? |
|
I think that is the case but that is independent of the Server encoding (well, apart from the fact that the whole ambiguous width characters is related to the transcoding of encodings such as Shift-JIS to Unicode) - it is more likely to be related to the locale - but that is not available yet! 😊 The effects do seem to be reversible - change the control that I am proposing in the PR to novload's repository and the display changes to match the setting including content already displayed (though word wrapping might be an issue - but that is already iffy for non-ASCII characters) - unlike the Server encoding where if it is wrong the display is garbage and cannot be fixed - and the errors do show up as spacing/layout issues rather than making the content unreadable. |
|
Mudlet needs to work out of the box - and not displaying the characters correct is not working out of the box. How can we automate it? |
|
We might be able to when we allow/respond to users' locales. But fundamentally I do not think we can get this right in advance for every case - that is why the characters that give this spacing problem are labelled as having an ambiguous East-Asian width, in some settings they should be drawn as half-width/one space wide and in other contexts as full-width/two spaces wide. It is only a problem I think for situations where the text is being displayed in a duo-spaced mode rather than the conventional (these days) proportional manner - unfortunately that hits our application because we/MUDs are trying to emulate an old-school mono-spaced terminal. Getting it wrong means the layout is not perfect - as the previous samples above show with some-things being draw with the wrong spacing. {Note this is not where texts are being drawn on top of each other - that is just bad painting code that is our fault} - but the messed up display, well, it is generally annoying rather than fatal to the application... |
|
Why can't we tie this to the encoding used? Won't work for utf-8, but it'll work for the more selective ones, no? Locale is not a reliable measure - a Chinese user in US could have their locale set to English and want to play a Chinese MUD back home. |
|
Well if they set their locale to How would you propose mapping encoding to narrow/wide width for ambiguous EAWidth characters? Don't forget that GB18030 supports all the characters that UTF-8 does but it does not necessarily imply that using that encoding should be mapped to a wide width for those ambiguous characters IMHO. Can @novload and/or @markx offer any input on this? Oh, further places where this sort of thing has cropped up:
In summery - I do not think we can assume anything about which users need which setting and we should just say - "If the text from the MUD does not look to be laid out correctly - and you are using an East Asian [what ever that means!] MUD you may get a better layout with this option checked." |
|
If a person picked GB18030, it is very likely they will be playing with CJK characters, and it is extremely likely they would want them rendered correctly. So, when the user picks that and related encodings, this option should be auto-enabled. |
|
Found another case which might be interesting reading: Extended-width characters only display in a single column Okay, I can accept that that might be a reasonable default but I would not want to rigidly lock it into place - let me see how that can be done (I think I can tristate the checkbox option). |
|
Don't need a tristate - just enable the option by default when that encoding is selected. The user can still disable it manually afterwards if that's incorrect. Tristate would be complicated to work with :( |
|
No it would not and it is actually a nice UI way to do it - I mean using the third (the tri-) state to allow automatic selection:
|
|
Fair enough, lets have a look. |
|
OK the tri-state control is in place on my PR to @novload 's development branch - we now need https://github.com/novload/Mudlet/pull/1 to be merged into that branch - then it will show up here. |
|
Something went wrong at novload's repo end (my best guess is that they thought they had don't something wrong and reverted using the GitHub tools the merge of my PR into their development branch which is what this PR is) - it seems that GitHub then forgets about the reverted change and neither the wanted commit 78deaba or the immediately post-revert commit (now showing on the repo as revert-1-fix-cjk-fullwidth-display) ae57742 show up here in this PR even though the novoload development branch - which is currently at the first of these - is supposedly what should be here, now. 😖 - the small print above says "Add more commits by pushing to the fix-cjk-fullwidth-display branch on novload/Mudlet - but there is NOT such a branch there." To fix this I am going to copy the commits into a new PR against my repository - the only change I plan to do is to add the |
|
Novload what name do you want to use for credits?
…On Tue, 15 May 2018, 4:58 pm Stephen Lyons, ***@***.***> wrote:
Something went wrong at novload's repo end (my best guess is that they
thought they had don't something wrong and reverted using the GitHub tools
the merge of my PR into their development branch which is what this PR is)
- it seems that GitHub then forgets about the reverted change and neither
the wanted commit 78deaba
<78deaba>
or the immediately post-revert commit (now showing on the repo as
*revert-1-fix-cjk-fullwidth-display*) ae57742
<ae57742>
show up here in this PR even though the novoload development branch - which
is currently at the first of these - is supposedly what should be here,
now. 😖 - the small print above says "Add more commits by pushing to the
*fix-cjk-fullwidth-display* branch on *novload/Mudlet* - but there is NOT
such a branch there."
To fix this I am going to copy the commits into a new PR against my
repository - the only change I plan to do is to add the * Copyright (C)
2018 by ????? ????? - ***@***.*** * lines into the files that
they have touched - I just need @novload <https://github.com/novload> to
tell me here or on their repo what goes into the "????? ?????" bit should
be...? 😀
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#1632 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAGxjDoVGNdFDu2SHFl1Y6Ru5GM_gsGGks5tyu0hgaJpZM4TtTe2>
.
|
|
My name is Huadong Qi. |
|
Ah, it sounds like you are a speaker of one of the CJK (East Asian) Languages! We really need good users of those languages to help get Mudlet 4.0 ready for I18n and we are currently talking about which languages we can do our selves on Discord {This is a permanent invite link to the mudlet-development channel.} |
|
Closed because of a problem in combining code from two places - which have been copied and combined at #1668 - so please go there. Huadong Qi (please advise me, should I say "Huadong" or "Qi" or "novload" as a short name? Which one is best?) I am "Stephen"... Can you look at PR 1668 as well as @vadi2 and say if you think it is 👍 or 👎 ? 😄 |
|
@SlySven doing some profiling and look at who shows up: Looks like you were right, we have access to HarfBuzz already. |










Fix when a line contains full-width character, some characters may be overlap other characters.
before:

after:
