implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings by luadebug · Pull Request #7241 · xmake-io/xmake

luadebug · 2026-01-18T14:14:55Z

Resolves: #7238

This aims for behavior similar to Python 3 slicing:

>>> "🎆🎉🌮"[:-2]
'🎆'
>>> "star"[:-2]
'st'
>>> "ЗВЕЗДА"[:-2]
'ЗВЕЗ'
>>> "源文件"[:-2]
'源'
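The slicing behavior above can be sketched in C as follows. This is a minimal illustration, not the PR's implementation: the function name `sketch_utf8_prefix_bytes` is hypothetical, and the input is assumed to be valid UTF-8. It returns how many bytes make up `s[:end]` when `end` is counted in characters.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the prefix s[:end], where end is
 * counted in UTF-8 characters and negative values count from the end,
 * as in Python. Assumes s[0..n) is valid UTF-8. */
size_t sketch_utf8_prefix_bytes(const char *s, size_t n, long end)
{
    /* Every byte that is not a continuation byte (10xxxxxx) starts a
     * new character, so counting those bytes counts characters. */
    long count = 0;
    for (size_t i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80) count++;

    /* Resolve a negative index against the character count, then clamp. */
    if (end < 0) end += count;
    if (end < 0) end = 0;
    if (end > count) end = count;

    /* Walk forward until `end` characters have been passed. */
    size_t pos = 0;
    long seen = 0;
    while (pos < n) {
        if (((unsigned char)s[pos] & 0xC0) != 0x80) {
            if (seen == end) break;
            seen++;
        }
        pos++;
    }
    return pos;
}
```

With the returned length, a caller could print the prefix using `printf("%.*s", (int)len, s)`.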

Do we need to implement more functions? Let me know.

We should also decide what gets shadowed. As discussed, string.sub should not be shadowed, which is why we will have both string.sub and string.utfsub rather than a single string.sub backed by the utfsub implementation. What about utflen/utffind/utfreverse: should they shadow len/find/reverse?

gemini-code-assist · 2026-01-18T14:15:10Z

Summary of Changes

Hello @luadebug, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's string manipulation capabilities by introducing a utfsub function. This function correctly handles substring extraction for UTF-8 encoded strings, addressing the limitations of byte-based substring operations when dealing with multi-byte characters. This ensures proper text processing for internationalized content and improves overall string utility robustness.

Highlights

New UTF-8 Substring Function: Implemented a new utfsub function for accurate substring extraction in UTF-8 encoded strings, correctly handling multi-byte characters.
Lua Module Integration: The new utfsub functionality has been integrated into the Lua string module, making it directly accessible to users as string.utfsub.
Dedicated Implementation File: A new C source file, core/src/xmake/string/utfsub.c, was added to house the core logic for UTF-8 character offset calculation and substring extraction.
Comprehensive Unit Tests: Extensive unit tests have been added in tests/modules/string/test.lua to validate the utfsub function's behavior across various character sets, including ASCII, Cyrillic, Chinese, and emoji characters, covering different indexing scenarios.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a utfsub function for extracting substrings from UTF-8 encoded strings, making it available in Lua scripts. The implementation correctly handles multi-byte characters.

My review focuses on improving the maintainability of the C code and increasing test coverage.

In core/src/xmake/string/utfsub.c, I've suggested refactoring the xm_utf_offset function to eliminate code duplication, which will make the code easier to read and maintain.
In tests/modules/string/test.lua, I've recommended adding more test cases to cover negative indices and various edge cases, ensuring the function's robustness.

Overall, this is a valuable addition. Addressing these points will further improve the quality of the implementation.

core/src/xmake/string/utfsub.c

tests/modules/string/test.lua

luadebug · 2026-01-18T16:09:32Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces utfsub and utflen functions for UTF-8 string manipulation, mimicking Python's slicing behavior. The implementation is functionally correct and includes a good set of tests. My review focuses on improving code maintainability by addressing duplicated code and enhancing performance for certain edge cases. I've suggested extracting the UTF-8 character processing logic into a helper function to be reused across the new files. Additionally, I've pointed out a performance optimization for handling negative indices in utfsub and recommended adding a test case to make the behavior of a special end_idx value explicit.
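The review summary does not spell out the optimization for negative indices. One common way to avoid a full forward scan is to walk backward from the end of the string, skipping continuation bytes. The sketch below illustrates that idea under the assumption of valid UTF-8 input; the name `sketch_utf8_offset_from_end` is hypothetical and not taken from the PR.

```c
#include <stddef.h>

/* Hypothetical helper: byte offset at which the last k characters of
 * s[0..n) begin. Returns n when k is 0 and 0 when the string holds
 * fewer than k characters. Assumes valid UTF-8. */
size_t sketch_utf8_offset_from_end(const char *s, size_t n, size_t k)
{
    size_t pos = n;
    while (k > 0 && pos > 0) {
        pos--;
        /* Back up over continuation bytes (10xxxxxx) to the lead byte. */
        while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80)
            pos--;
        k--;
    }
    return pos;
}
```

For a slice such as `s[:-2]`, the returned offset is also the byte length of the result, and the cost depends only on how many characters are skipped from the end rather than on the length of the whole string.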

core/src/xmake/string/utflen.c

core/src/xmake/string/utfsub.c

tests/modules/string/test.lua


luadebug · 2026-01-18T17:22:28Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces useful UTF-8 string manipulation functions (utfsub, utflen, utffind), which is a great addition. The implementation is functionally correct and includes a good set of tests.
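To illustrate what a UTF-8-aware find adds over a plain byte search, here is a minimal sketch. It assumes the result should be a 1-based character index (as in Lua) rather than a byte offset; that assumption and the name `sketch_utf8_find` are not confirmed by the PR.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1-based character index of the first occurrence
 * of needle in s, or 0 if it does not occur. Both arguments must be
 * valid, NUL-terminated UTF-8. */
size_t sketch_utf8_find(const char *s, const char *needle)
{
    /* A byte search is sufficient: in valid UTF-8 a lead byte never
     * looks like a continuation byte, so a match cannot start in the
     * middle of a character. */
    const char *hit = strstr(s, needle);
    if (!hit) return 0;

    /* Convert the byte offset into a character index. */
    size_t chars_before = 0;
    for (const char *p = s; p < hit; p++)
        if (((unsigned char)*p & 0xC0) != 0x80) chars_before++;
    return chars_before + 1;
}
```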

My main feedback is regarding significant code duplication across the newly added files (utffind.c, utflen.c, utfsub.c). The logic for counting UTF-8 characters and calculating byte offsets is repeated in multiple places. I've left specific comments suggesting that this shared logic be extracted into a common utility file (e.g., utf_utils.c with a corresponding header). This will greatly improve the maintainability and readability of the code.

Once the duplication is addressed, this will be a solid contribution.

core/src/xmake/string/utffind.c

core/src/xmake/string/utflen.c

core/src/xmake/string/utfsub.c


luadebug · 2026-01-18T17:39:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces several useful UTF-8 string manipulation functions. The implementations are generally correct and the added tests for utfsub, utflen, and utffind are comprehensive.

My main feedback is about code duplication. The logic for parsing UTF-8 characters is repeated in all four new C files (utffind.c, utflen.c, utfreverse.c, utfsub.c). This should be refactored into common helper functions in a shared location to improve maintainability.

Additionally, I've pointed out a small redundancy in utffind.c that can be simplified.

Finally, a critical point is the lack of tests for the new utfreverse function. Please add tests to ensure it works as expected.
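As a rough idea of what such tests would exercise, the sketch below reverses a string character by character. It is not the PR's utfreverse: the name `sketch_utf8_reverse` is hypothetical, the caller supplies the output buffer, and valid UTF-8 input is assumed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the characters of s[0..n) into out in
 * reverse order. out must have room for n bytes and is not
 * NUL-terminated by this function. Assumes valid UTF-8. */
void sketch_utf8_reverse(const char *s, size_t n, char *out)
{
    size_t end = n; /* one past the last byte of the current character */
    size_t w = 0;   /* write position in out */
    while (end > 0) {
        size_t start = end - 1;
        /* Back up over continuation bytes to reach the lead byte. */
        while (start > 0 && ((unsigned char)s[start] & 0xC0) == 0x80)
            start--;
        memcpy(out + w, s + start, end - start);
        w += end - start;
        end = start;
    }
}
```

Useful test inputs would include ASCII, two-, three-, and four-byte characters, and the empty string. Note that this reverses code points, so a base letter followed by a combining mark would come out with the mark first.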

tests/modules/string/test.lua

core/src/xmake/string/utffind.c

core/src/xmake/string/utflen.c

core/src/xmake/string/utfreverse.c

core/src/xmake/string/utfsub.c


luadebug · 2026-01-18T18:19:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces useful UTF-8 string manipulation functions (utfsub, utflen, utffind, utfreverse), which is a great addition. The implementations correctly handle multi-byte characters and edge cases like negative indices, and the accompanying tests are comprehensive.

My main feedback is regarding significant code duplication across the new files. The logic for iterating through UTF-8 characters and calculating character counts/offsets is repeated in utffind.c, utflen.c, utfreverse.c, and utfsub.c. Consolidating this logic into shared helper functions would greatly improve maintainability and reduce redundancy. I've left specific comments pointing out these duplications.

Additionally, there's a minor point about error handling for lua_newuserdata in utfreverse.c.

Overall, this is a solid contribution. Addressing the code duplication will make it even better.

core/src/xmake/string/utffind.c

core/src/xmake/string/utflen.c

core/src/xmake/string/utfreverse.c

core/src/xmake/string/utfsub.c

core/src/xmake/string/utfreverse.c


luadebug · 2026-01-18T19:24:53Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces valuable UTF-8 string manipulation functions (utfsub, utflen, utffind, utfreverse), which is a great addition. The implementations are functionally correct and the test coverage is good.

My main feedback revolves around significant code duplication across the new C files. The logic for determining UTF-8 character byte length and for counting characters in a string is repeated in multiple places. To enhance maintainability and robustness, I strongly recommend centralizing this logic into shared helper functions. For instance, the xm_utf_char_count and xm_utf_offset functions in utffind.c could be made non-static and declared in a shared header (like core/src/xmake/string/prefix.h) for reuse in utflen.c and utfsub.c. This would make the codebase cleaner and easier to manage.
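A shared pair of helpers along the lines suggested here might look like the sketch below. The `sketch_` names and signatures are illustrative assumptions, not the PR's actual xm_utf_char_count and xm_utf_offset; in a real tree the prototypes would go in a shared header and the definitions in one source file.

```c
#include <stddef.h>

/* Hypothetical shared helper: number of UTF-8 characters in s[0..n).
 * Counts every byte that is not a continuation byte (10xxxxxx).
 * Assumes valid UTF-8. */
size_t sketch_utf_char_count(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
    return count;
}

/* Hypothetical shared helper: byte offset of the character with
 * 0-based index idx, or n if idx is at or past the end. */
size_t sketch_utf_offset(const char *s, size_t n, size_t idx)
{
    size_t pos = 0;
    while (pos < n) {
        if (((unsigned char)s[pos] & 0xC0) != 0x80) {
            if (idx == 0) break;
            idx--;
        }
        pos++;
    }
    return pos;
}
```

With these two functions in one place, a length function reduces to a call to the first, and a substring function to two calls to the second.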

core/src/xmake/string/utffind.c

core/src/xmake/string/utflen.c

core/src/xmake/string/utfsub.c

gemini-code-assist · 2026-01-18T19:27:11Z

core/src/xmake/string/utffind.c

+            if (b >= 0xF0) len = 4;
+            else if (b >= 0xE0) len = 3;
+            else if (b >= 0xC0) len = 2;


This logic for determining the byte length of a UTF-8 character is duplicated within this file (in xm_utf_offset) and across other new files. It's a good candidate for extraction into its own static helper function, for example xm_utf_char_len(). This would improve code clarity and make it easier to maintain. Also, the nested if can be simplified since the b >= 0xC0 check is already done.

Suggested change

-            if (b >= 0xF0) len = 4;
-            else if (b >= 0xE0) len = 3;
-            else if (b >= 0xC0) len = 2;
+            if (b >= 0xF0) len = 4;
+            else if (b >= 0xE0) len = 3;
+            else len = 2;
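Extracted into a standalone helper, the length check could look like the sketch below. The name follows the reviewer's suggestion with a `sketch_` prefix to mark it as illustrative. Unlike the inline version, a standalone helper cannot rely on an enclosing `b >= 0xC0` check, so it keeps that comparison.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that
 * starts with lead byte b. Continuation bytes (0x80..0xBF) are not
 * valid lead bytes; this sketch returns 1 for them so that a caller
 * looping over malformed input still advances. */
size_t sketch_utf_char_len(unsigned char b)
{
    if (b >= 0xF0) return 4; /* 11110xxx */
    if (b >= 0xE0) return 3; /* 1110xxxx */
    if (b >= 0xC0) return 2; /* 110xxxxx */
    return 1;                /* ASCII, or a stray continuation byte */
}
```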

gemini-code-assist · 2026-01-18T19:27:11Z

core/src/xmake/string/utfreverse.c

+            if (b >= 0xF0) len = 4;
+            else if (b >= 0xE0) len = 3;
+            else if (b >= 0xC0) len = 2;


This logic for determining UTF-8 character length is duplicated from other files. Please consider extracting it to a shared helper function to improve maintainability. The nested if can also be simplified.

Suggested change

-            if (b >= 0xF0) len = 4;
-            else if (b >= 0xE0) len = 3;
-            else if (b >= 0xC0) len = 2;
+            if (b >= 0xF0) len = 4;
+            else if (b >= 0xE0) len = 3;
+            else len = 2;

gemini-code-assist · 2026-01-18T19:27:11Z

core/src/xmake/string/utfsub.c

+            if (b >= 0xF0) len = 4;
+            else if (b >= 0xE0) len = 3;
+            else if (b >= 0xC0) len = 2;


This logic for determining UTF-8 character length is duplicated from other files. Please consider extracting it into a shared helper function to improve maintainability. The nested if can also be simplified.

if (b >= 0xF0) len = 4; else if (b >= 0xE0) len = 3; else len = 2;

luadebug · 2026-01-18T21:02:57Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a comprehensive set of UTF-8 string manipulation functions, which is a great addition. The implementations are generally correct and the accompanying tests provide good coverage for various scenarios, including edge cases and negative indexing.

The main area for improvement is the significant code duplication across the newly added C files. Helper functions for UTF-8 character counting and offset calculation are either duplicated or re-implemented in multiple places. Consolidating this logic into shared helper functions would greatly improve maintainability and reduce the chance of future bugs. I've provided specific suggestions for refactoring.

core/src/xmake/string/utfbyte.c

core/src/xmake/string/utffind.c

core/src/xmake/string/utflen.c

core/src/xmake/string/utfreverse.c

core/src/xmake/string/utfsub.c

core/src/xmake/string/utfchar.c


waruqi · 2026-01-19T01:53:49Z

Many of these APIs overlap with string and differ only in whether they support UTF-8. This makes string usage confusing: users will find it hard to tell when to use string.sub/lastof and when to use utfsub/utflastof. Furthermore, implementing these directly on UTF-8 is extremely inefficient.

We should provide better support through a new ustring module. This can be discussed in detail in a new issue regarding its design and implementation.
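One way to read the efficiency concern: indexing by character in raw UTF-8 requires a scan on every call, whereas decoding once into fixed-width code points makes later indexing constant-time. The sketch below shows such a decode step. It is an assumption about the general approach, not the design of the proposed ustring module, and the name `sketch_utf8_decode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode s[0..n) into code points, writing at
 * most cap entries to out. Returns the number of code points written.
 * Assumes valid UTF-8 and performs no validation beyond stopping at a
 * truncated trailing sequence. */
size_t sketch_utf8_decode(const char *s, size_t n, uint32_t *out, size_t cap)
{
    size_t i = 0, count = 0;
    while (i < n && count < cap) {
        unsigned char b = (unsigned char)s[i];
        uint32_t cp;
        size_t len;
        if (b < 0x80)       { cp = b;        len = 1; }
        else if (b >= 0xF0) { cp = b & 0x07; len = 4; }
        else if (b >= 0xE0) { cp = b & 0x0F; len = 3; }
        else                { cp = b & 0x1F; len = 2; }
        if (i + len > n) break; /* truncated sequence at end of input */
        /* Each continuation byte contributes its low six bits. */
        for (size_t k = 1; k < len; k++)
            cp = (cp << 6) | ((unsigned char)s[i + k] & 0x3F);
        out[count++] = cp;
        i += len;
    }
    return count;
}
```

After decoding, the character at index i is simply `out[i]`, at the cost of extra memory and one pass over the input.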

Additionally, any new API additions must first be created as a new feature request in an issue, undergo detailed API design, and be approved before a pull request is opened for implementation.

For example, see #7184 (comment).

waruqi · 2026-01-19T02:25:13Z

#7242

waruqi · 2026-01-21T02:19:12Z

#7246

implement utfsub function for substring extraction in UTF-8 strings

07364b1

gemini-code-assist bot reviewed Jan 18, 2026

core/src/xmake/string/utfsub.c (outdated, resolved)

tests/modules/string/test.lua (resolved)

luadebug added 2 commits January 18, 2026 17:22

fixup

ed03a8b

Add utflen function for UTF-8 string length calculation and update tests

d6679a3

luadebug changed the title ~~implement utfsub function for substring extraction in UTF-8 strings~~ implement utfsub/utflen functions for substring extraction in UTF-8 strings Jan 18, 2026

gemini-code-assist bot reviewed Jan 18, 2026

core/src/xmake/string/utflen.c (resolved)

core/src/xmake/string/utfsub.c (resolved)

core/src/xmake/string/utfsub.c (outdated, resolved)

tests/modules/string/test.lua (resolved)

luadebug added 2 commits January 18, 2026 19:17

Try optimize

ead90ec

Add utffind function for substring search in UTF-8 strings and corresponding tests

d4983ff

luadebug changed the title ~~implement utfsub/utflen functions for substring extraction in UTF-8 strings~~ implement utfsub/utflen/utffind functions for substring extraction in UTF-8 strings Jan 18, 2026

gemini-code-assist bot reviewed Jan 18, 2026

core/src/xmake/string/utffind.c (resolved)

core/src/xmake/string/utflen.c (resolved)

core/src/xmake/string/utfsub.c (outdated, resolved)

core/src/xmake/string/utfsub.c (outdated, resolved)

Add utfreverse function for reversing UTF-8 strings and update string functions

a92f822

luadebug changed the title ~~implement utfsub/utflen/utffind functions for substring extraction in UTF-8 strings~~ implement utfsub/utflen/utffind/utfreverse functions for substring extraction in UTF-8 strings Jan 18, 2026

Refactor UTF-8 character counting logic into a separate function for improved readability and reuse

f35c669

gemini-code-assist bot reviewed Jan 18, 2026


luadebug added 2 commits January 18, 2026 20:47

Remove negative index handling from utffind function and add tests for utfreverse function

2f9f3f3

Refactor xm_string_utffind to calculate absolute index directly from the beginning of the string

ba67437

gemini-code-assist bot reviewed Jan 18, 2026


luadebug added 2 commits January 18, 2026 21:54

Remove buffer allocation check in xm_string_utfreverse function

e6829bc

Refactor xm_string_utfsub to handle negative indices and improve offset calculations for UTF-8 substring extraction

755b722

gemini-code-assist bot reviewed Jan 18, 2026


luadebug added 2 commits January 18, 2026 23:08

Add utfbyte function for UTF-8 byte extraction and implement tests

7ca9068

Add utfchar function for UTF-8 character creation and implement tests

7dfb95b

luadebug changed the title ~~implement utfsub/utflen/utffind/utfreverse functions for substring extraction in UTF-8 strings~~ implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar functions for substring extraction in UTF-8 strings Jan 18, 2026

gemini-code-assist bot reviewed Jan 18, 2026

Add utflastof function to find the last occurrence of a substring in a UTF-8 string and implement tests

9e1152c

luadebug changed the title ~~implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar functions for substring extraction in UTF-8 strings~~ implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings Jan 18, 2026

waruqi mentioned this pull request Jan 20, 2026

Add utf8 module #7246

Merged

waruqi closed this Jan 21, 2026

luadebug deleted the utf branch January 21, 2026 09:09
