Codecov Report
@@            Coverage Diff             @@
##             main     #295      +/-   ##
==========================================
- Coverage   80.69%   80.51%   -0.19%
==========================================
  Files          30       30
  Lines        2829     2858      +29
==========================================
+ Hits         2283     2301      +18
- Misses        546      557      +11
|
|
👍 Question: how are compound emoji and utf8 sequences handled? I can see the tests and it seems to be magically handled by |
|
TextWrapper does wrap inside compound characters (grapheme clusters), which will break emoji sequences and possibly decomposed umlauts, but it never splits a single code point, even one that takes multiple bytes in UTF-8 (if those are the correct terms). So the wrapped document will be valid UTF-8 instead of byte-garbage. At worst it contains symbols that won't render correctly while wrapped, which is fixed by unwrapping the lines during parsing.

from textwrap import TextWrapper
WRAP = TextWrapper(
width=4, initial_indent="", subsequent_indent=" ", break_long_words=True, break_on_hyphens=True,
expand_tabs=False, replace_whitespace=False, fix_sentence_endings=False, drop_whitespace=False
)
# https://emojipedia.org/couple-with-heart-woman-man-light-skin-tone-dark-skin-tone/
EMOJI = '\U0001f469\U0001f3fb\u200d\u2764\ufe0f\u200d\U0001f468\U0001f3ff'
print(EMOJI, len(EMOJI))
print(WRAP.fill(EMOJI * 10))
# https://en.wikipedia.org/wiki/Zalgo_text
ZALGO = '\u0074\u0334\u0313\u031b\u0307\u0351\u030d\u0309\u031a\u0309\u0300\u0308\u0307\u030c\u033f\u030c\u0355\u034d\u0316\u032e\u031f\u033a\u035a\u0326\u0322\u0326\u0320\u0316\u0317\u031c\u0318\u0348\u0355\u0317\u035c\u033c\u034d\u032b\u0325\u0354\u033b\u032f\u0331\u031e\u0333\u031c\u0332\u032b\u0356\u0359\u0333\u0348\u031f\u0354\u0326\u0353\u0359\u0329\u0328\u0326\u0065\u0335\u0358\u0308\u034b\u0311\u0308\u0352\u0306\u0303\u0303\u030f\u030b\u0351\u0343\u0300\u0304\u032f\u033a\u0322\u0349\u032e\u031c\u031f\u0333\u0317\u0353\u033c\u0353\u0317\u032d\u0317\u033a\u0353\u035c\u032f\u0326\u032f\u0353\u034d\u0073\u0337\u030e\u0303\u0346\u035d\u0307\u0307\u0312\u0360\u0301\u033f\u0304\u0307\u030f\u0313\u030f\u0308\u0302\u0341\u0314\u0315\u031b\u0308\u0303\u0305\u031a\u0341\u035d\u0310\u0313\u0358\u0344\u0351\u034c\u030b\u030d\u034b\u0352\u0340\u0333\u034e\u0356\u032e\u0316\u031f\u0347\u0355\u0353\u032d\u033c\u0320\u0331\u0345\u032d\u0353\u034d\u0316\u0359\u0326\u031e\u0324\u0323\u0320\u0319\u0332\u0329\u0345\u032e\u0330\u0328\u033b\u0325\u0355\u031e\u033b\u0356\u031d\u0354\u032e\u031f\u0322\u0349\u0074\u0334\u0307\u0343\u034e\u032f\u032b\u0333\u0323\u035a\u034e\u0321\u031c\u0339\u0318\u032d\u0316\u0327\u031f\u0354\u035a\u0317\u032c\u032a\u035c\u034d\u031f\u0331\u034d\u0323\u0321\u0349\u032e\u0329\u0319'
print(ZALGO, len(ZALGO))
print(WRAP.fill(ZALGO))

Output:
👩🏻❤️👨🏿 8
👩🏻❤ ️👨 🏿👩🏻 ❤️ 👨🏿 👩🏻 ❤️ 👨🏿👩 🏻❤ ️👨 🏿👩🏻 ❤️ 👨🏿 👩🏻 ❤️ 👨🏿👩 🏻❤ ️👨 🏿👩🏻 ❤️ 👨🏿 👩🏻 ❤️ 👨🏿👩 🏻❤ ️👨 🏿
t̴̢̨̛͕͍̖̮̟̺͚̦̦̠̖̗̜̘͈͕̗̼͍̫̥͔̻̯̱̞̳̜̲̫͖͙̳͈̟͔̦͓͙̩̦̓̇͑̍̉̉̀̈̇̌̿̌̚͜ë̵̢̯̺͉̮̜̟̳̗͓̼͓̗̭̗̺͓̯̦̯͓͍͋̑̈͒̆̃̃̏̋͑̓̀̄͘͜s̷̨̢̛̳͎͖̮̖̟͇͕͓̭̼̠̱̭͓͍̖͙̦̞̤̣̠̙̲̩̮̰̻̥͕̞̻͖̝͔̮̟͉̎̃͆̇̇̒́̿̄̇̏̓̏̈̂́̔̈̃̅́̐̓̈́͑͌̋̍͋͒̀̕̚͘͝͠͝ͅͅṫ̴̡̧̡͎̯̫̳̣͚͎̜̹̘̭̖̟͔͚̗̬̪͍̟̱͍̣͉̮̩̙̓͜ 218
t̴̛̓ ̇͑̍ ̉̉̚ ̀̈̇ ̌̿̌ ͕͍̖ ̮̟̺ ̢͚̦ ̦̠̖ ̗̜̘ ͈͕̗ ̼͍͜ ̫̥͔ ̻̯̱ ̞̳̜ ̲̫͖ ͙̳͈ ̟͔̦ ͓͙̩ ̨̦e ̵̈͘ ͋̑̈ ͒̆̃ ̃̏̋ ͑̓̀ ̯̺̄ ̢͉̮ ̜̟̳ ̗͓̼ ͓̗̭ ̗̺͓ ̯̦͜ ̯͓͍ s̷̎ ̃͆͝ ̇̇̒ ́̿͠ ̄̇̏ ̓̏̈ ̂́̔ ̛̈̕ ̃̅̚ ́̐͝ ̓̈́͘ ͑͌̋ ̍͋͒ ̳͎̀ ͖̮̖ ̟͇͕ ͓̭̼ ̠̱ͅ ̭͓͍ ̖͙̦ ̞̤̣ ̠̙̲ ̩̮ͅ ̨̰̻ ̥͕̞ ̻͖̝ ͔̮̟ ̢͉t ̴̇̓ ͎̯̫ ̳̣͚ ̡͎̜ ̹̘̭ ̧̖̟ ͔͚̗ ̬̪͜ ͍̟̱ ̡͍̣ ͉̮̩ ̙
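
To make the "valid UTF-8 instead of byte-garbage" point concrete, here is a minimal sketch that contrasts cutting a Python str (a sequence of code points) with cutting its encoded bytes. The helper is_valid_utf8 is a hypothetical name introduced only for this illustration and is not part of the PR.

```python
# Hypothetical helper for illustration only: does this byte string decode as UTF-8?
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Same emoji sequence as in the snippet above.
EMOJI = '\U0001f469\U0001f3fb\u200d\u2764\ufe0f\u200d\U0001f468\U0001f3ff'
ENCODED = EMOJI.encode("utf-8")

# Cutting the str at any index always lands between code points,
# so both halves still encode to valid UTF-8.
STR_CUTS_VALID = all(
    is_valid_utf8(EMOJI[:i].encode("utf-8")) and is_valid_utf8(EMOJI[i:].encode("utf-8"))
    for i in range(len(EMOJI) + 1)
)

# Cutting the encoded bytes at arbitrary offsets can land inside a code point.
BYTE_CUTS_VALID = all(is_valid_utf8(ENCODED[:i]) for i in range(len(ENCODED) + 1))

print(len(EMOJI), len(ENCODED))         # 8 code points, 28 bytes
print(STR_CUTS_VALID, BYTE_CUTS_VALID)  # True False
```

Since TextWrapper operates on str, it can only ever make the first kind of cut, which is why the wrapped text stays decodable even when a visible symbol is split across lines.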
|
OK so |
|
Exactly! The intermediate representation still seems to be valid, so no other parser should have problems reading it, and the tests show that we always get the exact same data back after unwrapping.
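
The round trip described here can be sketched roughly as follows. The names fold and unfold are hypothetical, and the unwrapping shown (dropping each newline together with the single-space continuation indent) is an assumption for illustration. The actual parser in this PR may do it differently. The sketch also assumes the input contains no newlines of its own.

```python
from textwrap import TextWrapper

# Same wrapper configuration as in the comment above.
WRAP = TextWrapper(
    width=4, initial_indent="", subsequent_indent=" ",
    break_long_words=True, break_on_hyphens=True,
    expand_tabs=False, replace_whitespace=False,
    fix_sentence_endings=False, drop_whitespace=False,
)


def fold(text: str) -> str:
    """Hypothetical wrapper: wrap text into short, space-indented continuation lines."""
    return WRAP.fill(text)


def unfold(folded: str) -> str:
    """Hypothetical inverse: drop each line break plus the one-space continuation indent."""
    return folded.replace("\n ", "")


EMOJI = '\U0001f469\U0001f3fb\u200d\u2764\ufe0f\u200d\U0001f468\U0001f3ff'
SAMPLES = [EMOJI * 10, "u\u0308" * 7, "hello world"]

for sample in SAMPLES:
    folded = fold(sample)
    # Every physical line respects the width limit (counted in code points)...
    assert all(len(line) <= 4 for line in folded.split("\n"))
    # ...and unfolding restores the original string exactly.
    assert unfold(folded) == sample
```

Because drop_whitespace is False, the wrapper discards nothing, so removing the inserted line breaks and indents is enough to recover the input in this sketch.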
fixes #215