Enhancement - optional transparent homoglyph encoding of a few characters in certain languages for more compact and efficient text messages#4491
Conversation
|
|
|
That is the best MR overview I've read in a long time. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4491 +/- ##
=========================================
+ Coverage 9.21% 10.79% +1.58%
=========================================
Files 427 429 +2
Lines 14351 14408 +57
Branches 2389 2389
=========================================
+ Hits 1322 1555 +233
+ Misses 12776 12558 -218
- Partials 253 295 +42 ☔ View full report in Codecov by Sentry. |
DaneEvans
left a comment
There was a problem hiding this comment.
Good implementation too,
A few minor changes, but this looks pretty good to merge.
…for strings with some amount of homoglyphs, only homoglyphs, no homoglyphs, only latin characters, and only non-latin characters that are impossible to be presented by latin homoglyphs (PR 4491)
jamesarich
left a comment
There was a problem hiding this comment.
All good on my end, thanks for the thorough writeup and clean submission.
I'll ask the robot to do a review pass, but it shouldn't come up with much 👍
There was a problem hiding this comment.
Pull request overview
Adds an optional, user-controlled “homoglyph encoding” feature intended to reduce UTF-8 byte size of Cyrillic text messages by substituting visually identical ASCII characters, improving effective message capacity within byte limits.
Changes:
- Adds a new application settings toggle for “Compact encoding for Cyrillic” backed by a new
HomoglyphPrefsSharedPreferences store. - Introduces a
HomoglyphCharacterStringTransformerand wires preference state into messaging UI/view models. - Adds unit tests for the transformer and updates
feature/messagingtest dependencies.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| feature/settings/src/main/kotlin/org/meshtastic/feature/settings/radio/RadioConfigViewModel.kt | Injects homoglyph prefs and exposes a Flow/toggle API for Settings UI. |
| feature/settings/src/main/kotlin/org/meshtastic/feature/settings/SettingsScreen.kt | Adds a switch in App Settings to enable/disable homoglyph encoding. |
| feature/messaging/src/main/kotlin/org/meshtastic/feature/messaging/MessageViewModel.kt | Injects homoglyph prefs and exposes an enabled Flow to the UI layer. |
| feature/messaging/src/main/kotlin/org/meshtastic/feature/messaging/Message.kt | Uses the pref to alter byte-counting behavior in the message input UI. |
| feature/messaging/src/main/kotlin/org/meshtastic/feature/messaging/HomoglyphCharacterStringTransformer.kt | New transformer implementing Cyrillic→ASCII substitutions. |
| feature/messaging/src/test/kotlin/org/meshtastic/feature/messaging/HomoglyphCharacterTransformTest.kt | New unit tests covering transformer behavior and byte-size shrinkage claims. |
| feature/messaging/build.gradle.kts | Adds JUnit/MockK test dependencies for the messaging module. |
| core/strings/src/commonMain/composeResources/values/strings.xml | Adds the user-facing setting label string. |
| core/prefs/src/main/kotlin/org/meshtastic/core/prefs/homoglyph/HomoglyphPrefs.kt | New preference interface + implementation for the toggle. |
| core/prefs/src/main/kotlin/org/meshtastic/core/prefs/di/PrefsModule.kt | Registers homoglyph prefs + SharedPreferences provider in Hilt DI. |
Comments suppressed due to low confidence (1)
feature/messaging/src/main/kotlin/org/meshtastic/feature/messaging/Message.kt:479
- Homoglyph encoding is only used to compute
canSend/byte length, but the text that gets sent is stillmessageInputState.text(raw). This can enable sending messages that exceedmaxByteSizeand the optimization won't actually be applied to transmitted content. Apply the same transformation to the string passed toSendMessage(or move the transformation into the send pipeline, e.g.,MessageViewModel.sendMessage, so all send paths are covered).
MessageInput(
isEnabled = connectionState.isConnected(),
isHomoglyphEncodingEnabled = homoglyphEncodingEnabled,
textFieldState = messageInputState,
onSendMessage = {
val messageText = messageInputState.text.toString().trim()
if (messageText.isNotEmpty()) {
onEvent(MessageScreenEvent.SendMessage(messageText, replyingToPacketId))
}
Hi, James, thanks for the reply! I see that Copilot has raised some fair points, like the extra de facto unused dependencies I have copy-pasted from other modules, and a few other things. I gonna fix that and reply here again |
…e feature.messaging module (PR 4491)
…ment of cyrillic capital letter "ze" with a digit (PR 4491)
…RadioConfigState data class. Made homoglyphEncodingEnabledFlow immutable (PR 4491)
…sure that the feature is effective and consistent across indirect approaches to sending messages as well (quick-chat, reply, etc.) (PR 4491)
|
@DaneEvans @jamesarich I went through the list of issues that Copilot highlighted and committed the fixes. I would be grateful if you could take a look and give feedback when it is convenient for you. Thanks! |
jamesarich
left a comment
There was a problem hiding this comment.
Looks good, thanks again @LuigiVampa92 ! 👍
It is a pleasure :) Thank you again for your work. I hope that I will come up with some more ideas in future |
Hi everyone. First of all - thank you for an awesome project, I have been using it for a while and it works great, fun to tinker with, and very useful to solve some of my real life problems. I express my great gratitude to all the contributors.
I would like to suggest a small enhancement to the project.
The general idea - we can reduce the binary size of the text message by replacing some 2 byte characters to 1 byte characters without actually losing anything in the process.
The basic problem this change solves is this:
The text message that can be transmitted by Meshtastic has a maximum size limit, depending on the preset. In case of usual LongFast preset it is 200 characters. This is a reasonable amount for a short message. The thing is - it not literally the symbol count that matters, it is its binary size, and this is where the things get tricky.
In UTF8 unicode the characters of the latin alphabet occupy exactly 1 byte each, because they fit into the single byte without its highest bit (less than 128), and that gives us the originally intended capacity of 200 symbols for a single message. However, in the various languages their national alphabets use different non latin characters, and in their case each character occupies at least 2 bytes, which reduces the effective maximum transmitted message size by half - only to 100 non-latin symbols per a single message. In practice it is around 115-120 * characters (200/2 = 100 + some characters are spaces and special signs that occupy only 1 byte), but still that is not too much, and sometimes that is not enough to combine a message with a single complete thought, so it would be nice to have some more available capacity for that.
So the idea I came with is this:
In various national alphabets there are letters that basically repeat latin letters, but they still have their own dedicated character codes in unicode.
For instance, in cyrillic alphabets (also in greek and a few others) many characters have exactly the same visual appearance as the latin characters, but due to usage of their dedicated character codes they occupy 2 bytes each, though such letters can be easily replaced with their latin analogues without any loss of their meaning or visual change.
Some example of such cyrillic letters: "А", "В", "С", "Е", etc. So they can naturally be replaced with corresponding latin "A", "B", "C", "E" and we could save one byte per each of their occasion in the message.
These are fairly common symbols in cyrillic alphabtes - half of the vowels and very frequently used consonants. According to statistics, they form about 20-25% of all letters in the average text. Replacing them with latin characters will save the user the same proportion of available message space and can fit more letters into the message or to deliver the same text message with higher reliability (because it will become shorter). The average transmitted message volume can then be increased from ~115-120 characters to ~140-145 and sometimes even more.
I gonna give you a couple of examples:
(1) Let's take a look at the cyrillic text "Мештастик - это проект с открытым исходным кодом" ("Meshtastic is an open source project"). The original byte size of this message is 88 bytes: 40 cyrillic letters, 2 bytes each, + 7 spaces, 1 byte each + 1 special character, also 1 byte). By replacing the cyrillic characters with their corresponding latin character homoglypghs we will get this: "Meштacтик - этo пpoeкт c oткpытым иcxoдным кoдoм". It looks absolutely the same, but now it is just 72 bytes, because we have replaced letters "M", "e", "a", "c", "o", "p", "x" to latin and now we have only 24 cyrillic letters, 2 bytes each, 16 latin letters, 1 byte each, and spaces and special characters remain the same - 7 and 1 respectfully. So as you can see we have saved 16 bytes by using 16 latin letters instead of cyrillic, and that is roughly 20% - not bad.
Let's look at another example:
(2) Let's take this cyrillic text this time - "Всему свое время" ("All in good time"). The original is 30 bytes: 14 cyrillic letters, 2 bytes each + 2 spaces. After the homoglyph transform: "Bceмy cвoe вpeмя" is now just 21 bytes: just 5 cyrillic letters, 2 bytes each, 9 latin letters and 2 spaces. We have replaced "Bce y c oe pe" this time, 9 bytes saved, which is 30% of the message binary size this time.
In short - it is great! I have been using this for a while with my personal node, it works flawlessly, and I thought that it is a good idea to contribute this to the main project code.
So this pull request contain the following changes:
(1) I have added a new toggle to the settings tab, to the "application" group. The feature is can be turned on and off there and is disabled by default.
(2) I have added a new prefs file that stores the current state of this setting - whether it is turned off or on
(3) I have added the object that has a static method that accepts a string as a parameters and returns the transformed string
(4) In the Message implementation I check whether this feature is enabled and if it is, then all the characters from predefined list are replaced by corresponding latin characters
For now the replaced characters are from the basic cyrillic block (U+0400-U+04FF), and are used mostly in the languages of the eastern Europe, Balkan, some central Asian countries, etc, but maybe there are something else that can be handled the same way. I have left all the links to unicode characters descriptions in the comments inside the transformer implementation.
Please note that only 100% "reliable", completely visually identical characters are replaced. The characters that look like latin but have minor differences like various descenders, hooks, strokes, etc are not replaced with "simplified" latin appearance and will remain 2 byte unicode, as usual (though maybe in future it might make sense to make some "aggressive" mode that will replace them as well)
Here are some screenshots of how it works:
(settings screen update)
And some examples of me typing messages with disabled and enabled feature respectfully
Example 1:
Example 2:
Example 3:
I have added a basic test that covers this implementation. Spotless, Detekt and tests are all passed, everything is ok. Hope I didn't miss anything.
I hope that you will consider merging this enhancement, since it is a huge iprovement for those who use languages that use cyrillic alphabets.
Thank you again!