Enhance: add text transcoding#969
Conversation
Found during testing on the Spanish Realms of Legends MUD that we did not support ISO-8859-1 (Latin-1) encoding but could if we made a small change to the way we process the incoming text bytes in the TBuffer::translateToPlainText(...) method. Instead of appending each byte as a char onto TBuffer::mMudline - which works for (7-bit) ASCII characters we can convert the upper 128 code characters to the "correct" QChars by going via a QChar::fromLatin1(...) intermediate step. It was then found possible to extend this process for other simple encoding tables by recording the differences between the ISO 8859-1 and various other encodings. Included in this commit is support for: * ISO 8859-1 * ISO 8859-2 * ISO 8859-3 * ISO 8859-4 * ISO 8859-10 * ISO 8859-15 * ISO 8859-16 * WINDOWS-1250 * WINDOWS-1251 * WINDOWS-1252 The above (and UTF-8) can be selected from the profile preferences and will be saved on a per profile basis once set and used when the profile is used again. UTF-8 is manually decoded by examining the first byte to determine how many more bytes are expected for it and then checking all of the indicated bytes for validity. Checks are included to guard against invalid code points (beyond U+10FFFF) and "overlong" sequences. During testing against some data from a real MUD it was found that the UTF-8 BOM sequence although not required or recommended for this encoding was being converted to no QChars at all, technically the position in a stream where it occurs could replace it with some form of joiner or it could actually be discarded however a script or trigger might be interested in it so the equivalent QChar is manually reinserted into the decoded stream - this is also useful as the code structure wants to insert a TChar instance into the corresponding data storage every time around the parsing loop for a single byte (for all but UTF-8) encoding or Utf-8 byte sequence for a codepoint (for UTF-8). NOTE THAT THE DECODER WILL HANDLE NON-BMP BYTE SEQUENCES AND DECODES THEM TO TWO QChars (a High surrogate followed by a Low one) AS QT DOES (It is based on QStrings that are Utf-16). To allow code that follows this stage to keep a one-to-one match between QChars in the text data and TChars that record the rendering aspects (colour, font styling) a duplicate TChar is added after the normal one for non-BMP codepoints. I feel this will help to port the existing code over to support such non-BMP codepoints...! Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
Found missing braces was allowing some code to execute when the item it was trying to reference did not exist. Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
|
Something broke on encoding for me, but just once... when I turned on replay logging and read through the help file again, it was fine. Broken - boxes and Asian characters are obviously not part of Cyrillic: Not changing anything and just reading the help file again was fine, so it was a transient error... Didn't have replay logging going at the time, so it's just something we'll have to watch out for. |
|
Managed to get the corruption recorded: 27-04-2017#18-31-03.dat.zip |
vadi2
left a comment
There was a problem hiding this comment.
It works! I'll help you with the tooltip and will do more testing on this...
src/ctelnet.cpp
Outdated
| // Report on main console the change in ALL cases, as this can change | ||
| // things for the user (e.g. theoretically what characters are | ||
| // acceptable to the MUD Server). | ||
| QString message = tr("[ ALERT ] - Encoding/decoding of MUD Server traffic set to: \"%1\"...") |
There was a problem hiding this comment.
Ellipsis is used to indicate there is more that more needs to happen before the action is complete, which isn't the case here.
How about this simpler text: Game text encoding/decoding set to \"%1\". ?
There was a problem hiding this comment.
We might have to clarify that this applies to text to and from the MUD server {and to feedTriggers(...) although the latter is not protected against Utf-8 byte sequences being split but than it doesn't need it}. Might "Game text" be too ambiguous?
In hindsight - agree about "..." not being appropriate.
There was a problem hiding this comment.
I don't think we need that clarification - it's an option the users set and they expect everything to just "work", so no need to spell out that everything.
There was a problem hiding this comment.
Also, what if we use transcoding to mean encoding/decoding? That's more appropriate and less wordy.
There was a problem hiding this comment.
Well to my ears "transcoding" seems to be like "transforming" one code into another, which is what is happening I suppose; but separating it into encoding / decoding helps to emphasis that two separate operations are happening on traffic (i.e. data) flowing in two separate directions.
Of course, we perhaps do not want to be too specific if someone asks "Well what is it being transformed into here in Mudlet" as that might not be so easy to say. 😜
src/ui/profile_preferences.ui
Outdated
| <bool>true</bool> | ||
| </property> | ||
| <property name="title"> | ||
| <string>Character Encoding</string> |
There was a problem hiding this comment.
Should be Character encoding, but how about Text encoding for something less technical? Not that encoding isn't - but there's no way around that.
There was a problem hiding this comment.
I was just using the text from the old code that I pulled out with git! 😀
There was a problem hiding this comment.
Would Data encoding be a reasonable compromise - it implies something a "bit" technical and not something to be messed with causally - and given that Mudlet is all about responding to and handling text the word "Text" seems a bit too general here?
There was a problem hiding this comment.
I'll think on it.
Btw another reason to move off "Character encoding" is to a clueless person, a character in a MUD is the little guy they're playing.
src/TBuffer.cpp
Outdated
| void TBuffer::translateToPlainText( std::string & s, const bool isFromServer ) | ||
| { | ||
| //cout << "TRANSLATE<"<<s<<">"<<endl; | ||
| std::string localBuffer; // As well as enabling the prepending of left-over |
There was a problem hiding this comment.
Why not a comment above the text - it'll take up less lines and be more clang-format friendly.
src/TBuffer.cpp
Outdated
| else if (mpHost->mTelnet.getEncoding() == QLatin1String("WINDOWS-1252")) { | ||
| mMudLine.append(QString(decodeByteToWindows_1252(ch))); | ||
| } | ||
| else if (mpHost->mTelnet.getEncoding() == QLatin1String("UTF-8")) { |
There was a problem hiding this comment.
This is over my head and I won't catch up on this in time next release so leaving to @ahmedcharles to verify.
src/TBuffer.cpp
Outdated
| switch( utf8SequenceLength ) { | ||
| case 4: | ||
| if ((localBuffer[msPos+3] & 0xC0) != 0x80) { | ||
| qDebug() << "TBuffer::translateToPlainText(...) byte 4 in sequence as UTF-8 rejected!"; |
There was a problem hiding this comment.
As we work out the sequence length by examining the first byte this is pointing out that the fourth byte does not meet with the expectation that the high bits are not 10.... as the encoding demands.
Edited to correct logic error!
There was a problem hiding this comment.
Oh. I thought it literally said 4. The error message still is not very useful: it does not say what the byte was or why was it rejected. How about "4th byte in UTF-8 sequence is invalid"
| newEncoding = QString::fromUtf8(lua_tostring(L,1)); | ||
| } | ||
|
|
||
| QPair<bool, QString> results = pHost->mTelnet.setEncoding(newEncoding); |
There was a problem hiding this comment.
This errors if you try calling with an encoding that's already set. That shouldn't be an error.
setServerEncoding: bad argument #1 value (no change to encoding, it already was: "UTF-8"
There was a problem hiding this comment.
Not sure there is a black and white answer to this. Given that you can determine the current encoding changing it unnecessarily might have an impact on the saved state of the outgoing encoder. so the code takes care to only make a change if it is one.
Using a command and it does not have any effect could be regarded as a run-time issue in some points of view so I felt it was safest to reserve "true" for "this command succeeded in setting the encoding to the given value"...
There was a problem hiding this comment.
Code taking care to only change it if it needs changing is good. Forcing the user to check what it is before allowing them to change it is unnecessary. With that philosophy, you could redo the whole of Mudlets API design to error if the function doesn't produce any changes...
There was a problem hiding this comment.
Ah, without the success message is it an issue that then there would not be clarity whether anything happened or not as a result of the command. Perhaps not, then, perhaps the user will be concerned with now and not what happened before, and of course if they WERE concerned they'd check it themselves so - OK - will change this.
src/TLuaInterpreter.cpp
Outdated
|
|
||
| if(results.first) { | ||
| lua_pushboolean(L, true); | ||
| lua_pushfstring(L, "setServerEncoding: success %s!", |
There was a problem hiding this comment.
No need for the string... the code will know what encoding it's setting to.
There was a problem hiding this comment.
Unless it uses the empty string - but that is mapped to "ASCII" anyhow - so yeah that could go.
💭 Am I going to have to code a sysServerEncodingChange event?
There was a problem hiding this comment.
Not sure about sysServerEncodingChange yet. Let's leave it be for now.
| return 0; | ||
| } | ||
|
|
||
| int TLuaInterpreter::setServerEncoding(lua_State * L) |
There was a problem hiding this comment.
Why server - might as well be getEncoding and setEncoding, it's not like we'll have a client encoding - that's bundled in the same setting.
There was a problem hiding this comment.
May be an issue when we allow the GUI to be language dependent - wanted to keep the two things clearly separate. Also, at least keyboards could have encodings and the term encoding might be used in other contexts e.g. cryptography by tying it to server I hope to avoid later confusions.
| return 2; | ||
| } | ||
|
|
||
| int TLuaInterpreter::getServerEncoding(lua_State * L) |
There was a problem hiding this comment.
Need a function to get the list of available encodings as well. I think it should be an encoding = true map so people can easily check if a certain encoding is available on users Mudlet - @keneanung thoughts on indexed vs map?
There was a problem hiding this comment.
Well the error case (for an invalid string) reports the valid ones...
There was a problem hiding this comment.
That's good - but that's not how you should be checking for the valid ones.
src/ui/profile_preferences.ui
Outdated
| <bool>true</bool> | ||
| </property> | ||
| <property name="toolTip"> | ||
| <string><html><head/><body><p>The way in which data is sent to and received from the MUD Server can vary in different parts of the World and usually has a relation to the language used by players on it. Whilst ASCII is the original form that the underlying Telnet protocol once used it (strictly speaking) only uses 7-bits and only supports 128 character codes (and a quarter of them are reserved for control purposes) so when it became possible to use all 8-bits in a byte various different encoding were developed. Most of these use the upper 128 character codes for extra characters needed for specific languages but which means that it is necessary to use the right encoding to get the right symbols for a particular language. This is the case for the <b>ISO 8859-*</b>and <b>WINDOWS-*</b> ranges; however the <b>UTF-8</b> uses a much more advanced encoding system and uses those same 128 codes to provide for up to 1,112,064 characters in a more complex manner that uses up to four bytes but in a way that the original 128 characters are still represented by single bytes.</p><p>MUDs capable of operating with languages other than English will likely offer a choice of encodings, possibly with a <i>charset</i> command or similar, review the choices that the MUD offers and choose a matching option from the list here. This setting can also be set and read via the Lua commands <i>setServerEncoding</i>(&quot;<b>encoding name</b><i>&quot;)</i> and <i>getServerEncoding()</i>.</p><p>In all cases Mudlet will spot incoming data that does not seem to fit the current encoding and each occurance will be marked with a <b>�</b> if you see these then it is possible that this setting does not match what the MUD Server is sending out - or there is a defect in the way Mudlet is handling it.</p><p><i>This feature is planned to be introduced in Mudlet 3.1.0 and is currently not known to be bug-free - please report any defects (including consistantly incorrect single characters for any of the encodings other than <b>ASCII</b>, <b>ISO 8859-<u>1</u></b> or <b>UTF-8</b> as it is possible for there to be errors in the other encoding).</span></p></body></html></string> |
There was a problem hiding this comment.
Very thorough description but it'll go over most people's heads - I'll help get you a more relatable text later.
|
So you are saying that the second read does seem correct (and you can read it OK?) |
|
Yes, the second read is fine. |
That is worrying in that I thought recently that |
|
@SlySven this was not the first read of the help file, by far - I read it many times before it got corrupted. |
|
Ah ha - I have found the source for Qt's own encoding tables src/corelib/codecs/qsimplecodec.cpp - now I can find out what they use for |
|
Brilliant! Could we make use of them instead of duplicating? |
|
I do not think that source material is exposed - on the other hand I can do some stuff with a spreadsheet and be much more certain about values used (and the differences between each one and Latin1) - and as it is GPLed we can publish the data in some backwater part of our Wiki... |
|
I want to look at this but don't have time at the moment. Probably over the weekend. |
|
I've resolved a merge conflict for you in this. |
|
Thanks...! |
|
@vadi2 Sure, but I can only test it for Linux. I will build it myself, no problem. |
|
@ecam85 There is support for other encodings and I have a few more (including "MACINTOSH") and what I think will be a faster system for all the others (apart from "ISO 8859-1", "UTF-8" and what will become the new default unless changed, good ol' 7-bit "ASCII") to follow shortly... |
|
Do we have MUDs that actually use those encodings? Adding more isn't going to be a feature if we have to add tests for all of them and nobody us gonna use them, it'll be a maintenance burden on us. |
|
Yes, I was about to mention that. |
|
Okay, they're in use, sounds good to me.
…On Fri, Apr 28, 2017 at 7:05 PM ecam85 ***@***.***> wrote:
Yes, I was about to mention that.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#969 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAGxjPSulwx27ZiKWZEOP_lq3I4BDg2Eks5r0hxigaJpZM4NJuB->
.
|
Added getServerEncodingsList() which provides a list with the exact strings to use - and which allows us to revised them if necessary. Revise setServerEncoding() to not regard setting the same encoding that is already being used from being a run-time error - now it will just return the same true as if the setting had been made - but as no change is actually taking place, no message will appear on the main console. This message is now an "[ OK ] -" form one without an ending ellipses whereas it was an "[ ALERT ] -" with one. Whether a change does occur or not a text message is not provided, only on the run-time error case of specifying an incorrect encoding will a nil + error-message now be produced. Comment at start of TBuffer::translateToPlainText(...) redone. Renamed data input argument to above function to ensure that the local version which can be slightly longer if there are bytes "left-over" from the previous packet that have been prepended to it. Doing this allowed me to find and fix places where that was not being used. Revised some debugging messages to be clearer. Revised the error handling for where the decoded UTF-8 data yields more than two QChars from when converted by a QString constructor - it should not happen but being paranoid is not unreasonable with initial coding attempts...! Also arranged that more error cases report the actual bytes that seemed to be in error. Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
Changed to use a C++11 ranged "for" instead of Qt macro "foreach" in
TBuffer constructor.
Changed format of run-time error for an invalid "encoding" string supplied to
to the Lua setServerEncoding("encoding") function.
Reordered encodings list so that UTF-8 follows immediately after ASCII
rather than being on the end of the list.
Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
src/TLuaInterpreter.cpp
Outdated
| else { | ||
| lua_pushnil(L); | ||
| lua_pushfstring(L, "setServerEncoding: bad argument #1 value %s", | ||
| lua_pushfstring(L, "Error: %s.", |
There was a problem hiding this comment.
Not even an Error - it's obvious it is an error, it is the 2nd return which can only be an error! just the actual error message itself please, leave the styling to the API.
src/ui/profile_preferences.ui
Outdated
| <bool>true</bool> | ||
| </property> | ||
| <property name="title"> | ||
| <string>Server Data Encoding</string> |
There was a problem hiding this comment.
A case of Needless Capitalisation! See other grouping names on the same menu.
| <property name="enabled"> | ||
| <bool>true</bool> | ||
| </property> | ||
| <property name="toolTip"> |
There was a problem hiding this comment.
This needs to be reworded, as it is unlikely to be understood by the vast majority of the population. What are you trying to communicate here - I can help you reword it but I'm not sure what's the desired outcome is.
There was a problem hiding this comment.
Going through what I put there by paragraph:
-
The first paragraph was an attempt to explain why this control is necessary and how the various choices come about, it indirectly suggests why UTF-8 is the "one-encoding to, in the darkness, bind them" and also suggest that it is a more complex thing to handle!
-
Invites the user to see what the settings that the MUD offers and suggests they try the word "charset" as one possibility. Having found out they will need to choose the one they want and tell the MUD server to use it before setting Mudlet to match.
-
Advises them what the symptoms of a mis-match between Mudlet and the MUD server will show up as (characters getting replaced with the '�' symbol).
-
Is temporary to warn that if there are any characters that are getting consistently displayed wrongly we might have an error in one of those encoding tables - (IIRC there are one or two characters that may be subject to "regional variations" and it could be that the user is not in the right "region" for a particular table to be right for them - this is not applicable for ASCII, UTF-8 or ISO 8859-1 but might be an issue for the other encodings currently). The paragraph is also intended to emphasis that this currently only sorts out the on-screen display and that other parts such as the trigger engine have not yet been fixed up to work with non-ASCII text.
-
Is also temporary and advises against using anything other than ASCII or ISO-8859-1 - eventually it may become a recommendation to use UTF-8!
There was a problem hiding this comment.
Thinking this through, we'd like Mudlet to work straight out of the box, just install and play - and the user will only ever need to use this option if the encoding is not set correctly already. So this really is a fix, a workaround, for when Mudlet didn't do its job right. So now we have an unhappy user because text isn't displayed right and we'd like to minimise the amount of time they're unhappy, so they can enjoy Mudlet and get into the game ASAP.
In order to do that, we need to make the text as short as possible so they've got the least amount of reading to do, and simple as possible so they have the best chance of understanding it. Otherwise, we'll just be keeping them in the unhappy state longer if they have to do a lot of reading and if they don't understand what's being said, they'll definitely be even more unhappy.
This is the reasoning behind not giving as much information about the general topic as you do, but instead being super concise and using simple language to get straight to the point.
I got distracted adding in this to the wiki so I'll provide the revised text later, but thought I'd provide the reasoning first so you'd know where I'm coming from.
There was a problem hiding this comment.
I got distracted...
I'd say so! That is quite a chunk of UI goodness...
There was a problem hiding this comment.
Here's a tooltip that I believe achieves the goals I've set out above. Do you you think?
I did also remove the temporary text because it's an unstated assumption that anything in the development branch is under development. The whole thing might or might not work and we don't support anyone running unreleased code.
There was a problem hiding this comment.
At present the trigger system is not going to fly with non-ASCII text - which is going to get in the way of anyone wanting to use the development code-base as anything other than a "technology preview" in those situations where they are using anything other than ASCII / ISO 8859-1 (or UTF-8 with just the characters from those charsets) - I still think a "this might make it look right but other things are not currently functional for anything other than that first pair of encodings" message is a requirement at the moment! 😮
There was a problem hiding this comment.
I don't think a tooltip is the right place for this - we'll have it in the release notes and such. Otherwise, this is a very well-hidden place to advertise such lack of functionality and we can't possibly depend on users to find it :\
There was a problem hiding this comment.
I've given it more thought and I've changed my mind, it'll be good to mention in both places that not everything is ready yet.
Run-time error string on supplying a bad encoding to setServerEncoding(...) is now: "Encoding \"XXXX\" does not exist; use one of the following: \"ASCII\", <list of other encodings (with UTF-8 at the top)>." Profile Preferences title for encoding is now "Server data encoding" - it has lost the extra capitalisation I originally used. 8-| Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
If the last bytes of a packet are multi-byte UTF-8 sequence it is necessary to store them if there is not enough bytes in the packet to complete the processing of them in TBuffer::translateToPlainText(...) unfortunately I had made an error in assessing whether there WAS enough bytes so that the code to store the incomplete bytes was being triggered when the last character in the packet exactly ended at the end of the packet, i.e. it DID fit in! Also: * Added a couple of QDebug()s - conditional on DEBUG_UTF8_PROCESSING that: report that an incomplete sequence has been detected and how many bytes are being stored; and report when they are being prepended onto the data in the next call to the method concerned. * Revised the code that prepended those bytes to slightly optimise the tests that detect it is needed. * Whilst using Mudlet replays to debug this issue I noticed that the replay starting and ending messages did not have the classification tag of the form "[ XXXX ] -" so did not display in any special way, I revised them to become an "[ INFO ]" message for replay starting and "[ OK ]" at the end...! Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
|
Here's another replay with the Log of how it looks like in html, incorrect bit is: You can tell it's extra as it throws the last | out of alignment. |
vadi2
left a comment
There was a problem hiding this comment.
I'd like to get this merged in, so approving it now!
I know there is a minor bug with encoding, but it should go in just so we don't delay merging it into development until the very end and don't get much testing on it at all because it was hidden away in a branch.
Conflicts resolved in: src/TLuaInterpreter.cpp src/TLuaInterpreter.h src/ui/profile_preferences.ui Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
|
Hang-on what is wrong now (I just merged the development branch into this PR - and had to resolve a couple of issues:
So how to resolve this "This branch cannot be rebased due to conflicts" now? |
|
That "19-05-2017#16-21-55.dat" file that was pointed out to me is showing corruption - and it is happening exactly at the point where the "there is not enough bytes to complete a UTF-8 sequence, we need to carry some over to the next packet" code is activated - so it does give me something to work with... |
Found a local automatic variable "msPos" in
TBuffer::trasnlateToPlainText(...) was shadowing an unnecessary TBuffer class
member so renamed/retyped the local to be more appropriate, also found that
a second class member was not needed outside the method so added a local
automatic to do the same job:
* (int) msPos ==> (size_t) localBufferPosition
* msLength ==> (size_t) localBufferLength
Also remove now unused and/or shadowed:
* (int) TBuffer::lookupColor(const QString&, int)
* (int) TBuffer::msLength
* (int) TBuffer::msPos
* (QString) TBuffer::mFormatSequenceRest
I found that there was some MXP related code that was still working on the
incoming string data rather than the possibly pre-pended local copy which
would cause (further) problems in the future for non-ASCII that use MXP! I
changed those to also work on the local version.
Within TBuffer::translateToPlainText(...) I also spotting some single line
if(...) {...} else {...} which I braced as appropriate, a couple of places
where different spacing helps to line up related code and some other
redundant code fragments which I removed.
Signed-off-by: Stephen Lyons <slysven@virginmedia.com>
|
So this should be good to go - but I cannot see how to get around the conflicts... |
|
What conflicts? Github says this branch is mergeable. |
|
Ah. Well, go ahead and squash, the issue in previous replays seems
addressed!
…On Mon, 22 May 2017 5:25 pm Stephen Lyons, ***@***.***> wrote:
Ah it was greyed out *for "Rebase and merge"*
[image: mudlet_snapshot19]
<https://cloud.githubusercontent.com/assets/6163092/26315970/b7af92ec-3f0a-11e7-83c1-a2404bed7720.png>
but that is slightly misleading because it can be "Squashed and merged" -
the drop down triangle to select the operation - being black on light grey
- also looks to be disabled *but it is not* - it just turns green when
the other option is selected:
[image: mudlet_snapshot20]
<https://cloud.githubusercontent.com/assets/6163092/26316124/243ab8b0-3f0b-11e7-948a-81d587270f76.png>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#969 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAGxjMeKgLJqpO22fmHqK2qPeErEs4zWks5r8ajVgaJpZM4NJuB->
.
|








This should start us on the road to I18n as it mean Mudlet can now at least display UTF-8 encoded MUD Server data and additionally adds support for a number of other encodings that make use of the upper character numbers as simple "Extended ASCII".
By default this will expect to be processing pure ASCII and will put in a "replacement character" for any byte with it's most significant bit set for those that makes it's way through the Telnet protocol decoding to head towards the screen.
This might help with #505 and #702 and another Issue that I cannot find that wanted "unprintable" characters to be marked on the screen...
The selected text encoding (controllable both via a saved per profile setting AND lua commands) is also used for outgoing text data - though the behaviour should the wanted outgoing text contain characters that cannot be encoded in the selected manner might not be exactly clear at the moment.
Note that text that contains non-ASCII characters cannot yet be detected by the trigger system as that needs a good going over.