Skip to content

fix: make Ellipsis operate on runes instead of bytes to prevent Unicode truncation#796

Merged
samber merged 3 commits intosamber:masterfrom
veeceey:fix/ellipsis-unicode-truncation
Feb 21, 2026
Merged

fix: make Ellipsis operate on runes instead of bytes to prevent Unicode truncation#796
samber merged 3 commits intosamber:masterfrom
veeceey:fix/ellipsis-unicode-truncation

Conversation

@veeceey
Copy link
Contributor

@veeceey veeceey commented Feb 10, 2026

Summary

  • Fixes Ellipsis: avoid Unicode character truncation #520: Ellipsis previously used byte-based length counting (len(str)) and byte-based slicing (str[:length-3]), which could split multi-byte Unicode characters (emoji, CJK, accented letters) in the middle, producing garbled output.
  • Changed Ellipsis to use []rune conversion so the length parameter counts Unicode code points (runes) instead of bytes. Multi-byte characters are now never split.
  • Added comprehensive Unicode test cases covering CJK characters, emoji, and accented characters.
  • Updated documentation in README.md, string.go godoc, and docs/data/core-ellipsis.md.

Details

The core change is minimal (3 lines in string.go):

// Before (byte-based, breaks Unicode):
if len(str) > length {
    return strings.TrimSpace(str[:length-3]) + "..."
}

// After (rune-based, Unicode-safe):
runes := []rune(str)
if len(runes) > length {
    return strings.TrimSpace(string(runes[:length-3])) + "..."
}

Note on grapheme clusters: This fix addresses rune-level truncation. Complex grapheme clusters composed of multiple code points (e.g., 🧜‍♂️ which uses zero-width joiners) may still be split between their constituent runes. Full grapheme segmentation would require either an external dependency or a substantial internal implementation, which the maintainer indicated should be avoided. The rune-based approach fixes the vast majority of real-world Unicode truncation issues.

Test plan

  • All existing ASCII-based TestEllipsis tests continue to pass (rune count == byte count for ASCII)
  • New test cases verify correct behavior with CJK characters (世界, 你好)
  • New test cases verify correct behavior with emoji (🏠🐶🐱🌟🎉🌈)
  • New test cases verify correct behavior with accented characters (café)
  • Full test suite passes: go test ./...

🤖 Generated with Claude Code

…de truncation

The Ellipsis function previously used byte-based length counting (len(str))
and byte-based slicing (str[:length-3]), which could split multi-byte
Unicode characters in the middle, producing garbled output.

This changes the function to use []rune conversion so the length parameter
counts Unicode code points instead of bytes. Emoji, CJK ideographs, and
other multi-byte characters are now never split in the middle.

Fixes samber#520

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@d-enk
Copy link
Contributor

d-enk commented Feb 10, 2026

Optimized code (without convertions)

func Ellipsis(str string, length int) string {
	str = strings.TrimSpace(str)

	const ellipsis = "..."

	cutPosition := 0
	for i := range str {
		if length == len(ellipsis) {
			cutPosition = i
		}

		if length--; length < 0 {
			return strings.TrimSpace(str[:cutPosition]) + ellipsis
		}
	}

	return str
}

@veeceey
Copy link
Contributor Author

veeceey commented Feb 12, 2026

Nice optimization! Using range to iterate runes avoids the []rune(str) allocation entirely. I've adopted your approach with one tweak: added an early return for the length < 3 edge case before the loop to keep the intent explicit, rather than relying on cutPosition staying at 0. Updated the PR -- thanks for the suggestion!

Use range-based iteration to count runes without allocating a []rune
slice, per reviewer suggestion. The early-return for length < 3 is
kept explicit for clarity.
@d-enk
Copy link
Contributor

d-enk commented Feb 12, 2026

In my opinion

An early return that almost completely duplicates the main algorithm
for the sake of an extremely rare use case seems unnecessary.

I even abandoned the check if cutPosition == 0 {}

however, at least reuse the const ellipsis = "..."

@veeceey
Copy link
Contributor Author

veeceey commented Feb 14, 2026

Fair point -- the early return does duplicate logic for a rare edge case. I've simplified it: removed the early return and just use the ellipsis const for the fallback. Cleaner this way, thanks for the feedback.

@veeceey
Copy link
Contributor Author

veeceey commented Feb 17, 2026

Hi @samber, just a gentle ping on this one — I've incorporated the optimized rune iteration approach that @d-enk suggested and simplified the edge case handling. The fix prevents Unicode truncation in Ellipsis by operating on runes instead of bytes. Would love your review!

@samber
Copy link
Owner

samber commented Feb 21, 2026

Thanks guys, for the work on this 👏

@samber samber merged commit 0b4623d into samber:master Feb 21, 2026
10 of 11 checks passed
@codecov
Copy link

codecov bot commented Feb 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.30%. Comparing base (d269b33) to head (ad37e77).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #796   +/-   ##
=======================================
  Coverage   94.29%   94.30%           
=======================================
  Files          18       18           
  Lines        2857     2860    +3     
=======================================
+ Hits         2694     2697    +3     
  Misses        149      149           
  Partials       14       14           
Flag Coverage Δ
unittests 94.30% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ccoVeille
Copy link
Contributor

I had surprise by using runes

Here the code won't do what is expected with Unicode character made of multiple Unicode char

https://go.dev/play/p/s-pb7mlpQVj

@ccoVeille
Copy link
Contributor

As mentioned here:

This bug could be fixed by counting the string grapheme cluster at github.com/rivo/uniseg .

If this policy is acceptable, I'll work on a PR for the fix!

Originally posted by @yasuaki640 in #520 (comment)

Using @rivo uniseg could be a solution. But as stated before adding a dependency is not recorded by @samber, so here checking what uniseg code do abd replicate it could be a solution

@samber
Copy link
Owner

samber commented Feb 21, 2026

Another option would be to add github.com/rivo/uniseg in the comment of the helper for advanced cases.

@ccoVeille i'm going to add your example as a commented test case. If you want to make a PR to try to fix it, please do !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ellipsis: avoid Unicode character truncation

4 participants