fix: make Ellipsis operate on runes instead of bytes to prevent Unicode truncation#796
Conversation
…de truncation The Ellipsis function previously used byte-based length counting (len(str)) and byte-based slicing (str[:length-3]), which could split multi-byte Unicode characters in the middle, producing garbled output. This changes the function to use []rune conversion so the length parameter counts Unicode code points instead of bytes. Emoji, CJK ideographs, and other multi-byte characters are now never split in the middle. Fixes samber#520 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Optimized code (without convertions) func Ellipsis(str string, length int) string {
str = strings.TrimSpace(str)
const ellipsis = "..."
cutPosition := 0
for i := range str {
if length == len(ellipsis) {
cutPosition = i
}
if length--; length < 0 {
return strings.TrimSpace(str[:cutPosition]) + ellipsis
}
}
return str
} |
|
Nice optimization! Using |
Use range-based iteration to count runes without allocating a []rune slice, per reviewer suggestion. The early-return for length < 3 is kept explicit for clarity.
|
In my opinion An early return that almost completely duplicates the main algorithm I even abandoned the check however, at least reuse the |
|
Fair point -- the early return does duplicate logic for a rare edge case. I've simplified it: removed the early return and just use the |
|
Thanks guys, for the work on this 👏 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #796 +/- ##
=======================================
Coverage 94.29% 94.30%
=======================================
Files 18 18
Lines 2857 2860 +3
=======================================
+ Hits 2694 2697 +3
Misses 149 149
Partials 14 14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I had surprise by using runes Here the code won't do what is expected with Unicode character made of multiple Unicode char |
|
As mentioned here:
Originally posted by @yasuaki640 in #520 (comment) Using @rivo uniseg could be a solution. But as stated before adding a dependency is not recorded by @samber, so here checking what uniseg code do abd replicate it could be a solution |
|
Another option would be to add @ccoVeille i'm going to add your example as a commented test case. If you want to make a PR to try to fix it, please do ! |
Summary
Ellipsispreviously used byte-based length counting (len(str)) and byte-based slicing (str[:length-3]), which could split multi-byte Unicode characters (emoji, CJK, accented letters) in the middle, producing garbled output.Ellipsisto use[]runeconversion so thelengthparameter counts Unicode code points (runes) instead of bytes. Multi-byte characters are now never split.README.md,string.gogodoc, anddocs/data/core-ellipsis.md.Details
The core change is minimal (3 lines in
string.go):Note on grapheme clusters: This fix addresses rune-level truncation. Complex grapheme clusters composed of multiple code points (e.g.,
🧜♂️which uses zero-width joiners) may still be split between their constituent runes. Full grapheme segmentation would require either an external dependency or a substantial internal implementation, which the maintainer indicated should be avoided. The rune-based approach fixes the vast majority of real-world Unicode truncation issues.Test plan
TestEllipsistests continue to pass (rune count == byte count for ASCII)世界,你好)🏠🐶🐱🌟🎉🌈)café)go test ./...🤖 Generated with Claude Code