Skip to content

implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings#7241

Closed
luadebug wants to merge 14 commits intoxmake-io:devfrom
luadebug:utf
Closed

implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings#7241
luadebug wants to merge 14 commits intoxmake-io:devfrom
luadebug:utf

Conversation

@luadebug
Copy link
Member

@luadebug luadebug commented Jan 18, 2026

Resolves: #7238

Achieve similiar behavior as python3 slicing...

>>> "🎆🎉🌮"[:-2]
'🎆'
>>> "star"[:-2]
'st'
>>> "ЗВЕЗДА"[:-2]
'ЗВЕЗ'
>>> "源文件"[:-2]
'源'

Perhaps maybe we need more functions to implement? Let me know...

Also we should decide what should be shadowed... As we discussed string.sub should not be shadowed that is why we going to have string.sub and string.utfsub, instead of just string.sub (with implementation of string.utfsub)... What about utflen/utffind/utfreverse... should they shadow len/find/reverse?

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @luadebug, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's string manipulation capabilities by introducing a utfsub function. This function correctly handles substring extraction for UTF-8 encoded strings, addressing the limitations of byte-based substring operations when dealing with multi-byte characters. This ensures proper text processing for internationalized content and improves overall string utility robustness.

Highlights

  • New UTF-8 Substring Function: Implemented a new utfsub function for accurate substring extraction in UTF-8 encoded strings, correctly handling multi-byte characters.
  • Lua Module Integration: The new utfsub functionality has been integrated into the Lua string module, making it directly accessible to users as string.utfsub.
  • Dedicated Implementation File: A new C source file, core/src/xmake/string/utfsub.c, was added to house the core logic for UTF-8 character offset calculation and substring extraction.
  • Comprehensive Unit Tests: Extensive unit tests have been added in tests/modules/string/test.lua to validate the utfsub function's behavior across various character sets, including ASCII, Cyrillic, Chinese, and emoji characters, covering different indexing scenarios.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a utfsub function for extracting substrings from UTF-8 encoded strings, making it available in Lua scripts. The implementation correctly handles multi-byte characters.

My review focuses on improving the maintainability of the C code and increasing test coverage.

  • In core/src/xmake/string/utfsub.c, I've suggested refactoring the xm_utf_offset function to eliminate code duplication, which will make the code easier to read and maintain.
  • In tests/modules/string/test.lua, I've recommended adding more test cases to cover negative indices and various edge cases, ensuring the function's robustness.

Overall, this is a valuable addition. Addressing these points will further improve the quality of the implementation.

@luadebug
Copy link
Member Author

/gemini review

@luadebug luadebug changed the title implement utfsub function for substring extraction in UTF-8 strings implement utfsub/utflen functions for substring extraction in UTF-8 strings Jan 18, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces utfsub and utflen functions for UTF-8 string manipulation, mimicking Python's slicing behavior. The implementation is functionally correct and includes a good set of tests. My review focuses on improving code maintainability by addressing duplicated code and enhancing performance for certain edge cases. I've suggested extracting the UTF-8 character processing logic into a helper function to be reused across the new files. Additionally, I've pointed out a performance optimization for handling negative indices in utfsub and recommended adding a test case to make the behavior of a special end_idx value explicit.

@luadebug luadebug changed the title implement utfsub/utflen functions for substring extraction in UTF-8 strings implement utfsub/utflen/utffind functions for substring extraction in UTF-8 strings Jan 18, 2026
@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces useful UTF-8 string manipulation functions (utfsub, utflen, utffind), which is a great addition. The implementation is functionally correct and includes a good set of tests.

My main feedback is regarding significant code duplication across the newly added files (utffind.c, utflen.c, utfsub.c). The logic for counting UTF-8 characters and calculating byte offsets is repeated in multiple places. I've left specific comments suggesting that this shared logic be extracted into a common utility file (e.g., utf_utils.c with a corresponding header). This will greatly improve the maintainability and readability of the code.

Once the duplication is addressed, this will be a solid contribution.

@luadebug luadebug changed the title implement utfsub/utflen/utffind functions for substring extraction in UTF-8 strings implement utfsub/utflen/utffind/utfreverse functions for substring extraction in UTF-8 strings Jan 18, 2026
@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several useful UTF-8 string manipulation functions. The implementations are generally correct and the added tests for utfsub, utflen, and utffind are comprehensive.

My main feedback is about code duplication. The logic for parsing UTF-8 characters is repeated in all four new C files (utffind.c, utflen.c, utfreverse.c, utfsub.c). This should be refactored into common helper functions in a shared location to improve maintainability.

Additionally, I've pointed out a small redundancy in utffind.c that can be simplified.

Finally, a critical point is the lack of tests for the new utfreverse function. Please add tests to ensure it works as expected.

@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces useful UTF-8 string manipulation functions (utfsub, utflen, utffind, utfreverse), which is a great addition. The implementations correctly handle multi-byte characters and edge cases like negative indices, and the accompanying tests are comprehensive.

My main feedback is regarding significant code duplication across the new files. The logic for iterating through UTF-8 characters and calculating character counts/offsets is repeated in utffind.c, utflen.c, utfreverse.c, and utfsub.c. Consolidating this logic into shared helper functions would greatly improve maintainability and reduce redundancy. I've left specific comments pointing out these duplications.

Additionally, there's a minor point about error handling for lua_newuserdata in utfreverse.c.

Overall, this is a solid contribution. Addressing the code duplication will make it even better.

@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces valuable UTF-8 string manipulation functions (utfsub, utflen, utffind, utfreverse), which is a great addition. The implementations are functionally correct and the test coverage is good.

My main feedback revolves around significant code duplication across the new C files. The logic for determining UTF-8 character byte length and for counting characters in a string is repeated in multiple places. To enhance maintainability and robustness, I strongly recommend centralizing this logic into shared helper functions. For instance, the xm_utf_char_count and xm_utf_offset functions in utffind.c could be made non-static and declared in a shared header (like core/src/xmake/string/prefix.h) for reuse in utflen.c and utfsub.c. This would make the codebase cleaner and easier to manage.

Comment on lines +44 to +46
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else if (b >= 0xC0) len = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for determining the byte length of a UTF-8 character is duplicated within this file (in xm_utf_offset) and across other new files. It's a good candidate for extraction into its own static helper function, for example xm_utf_char_len(). This would improve code clarity and make it easier to maintain. Also, the nested if can be simplified since the b >= 0xC0 check is already done.

Suggested change
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else if (b >= 0xC0) len = 2;
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else len = 2;

Comment on lines +55 to +57
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else if (b >= 0xC0) len = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for determining UTF-8 character length is duplicated from other files. Please consider extracting it to a shared helper function to improve maintainability. The nested if can also be simplified.

Suggested change
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else if (b >= 0xC0) len = 2;
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else len = 2;

Comment on lines +83 to +85
if (b >= 0xF0) len = 4;
else if (b >= 0xE0) len = 3;
else if (b >= 0xC0) len = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for determining UTF-8 character length is duplicated from other files. Please consider extracting it into a shared helper function to improve maintainability. The nested if can also be simplified.

            if (b >= 0xF0) len = 4;
            else if (b >= 0xE0) len = 3;
            else len = 2;

@luadebug luadebug changed the title implement utfsub/utflen/utffind/utfreverse functions for substring extraction in UTF-8 strings implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar functions for substring extraction in UTF-8 strings Jan 18, 2026
@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive set of UTF-8 string manipulation functions, which is a great addition. The implementations are generally correct and the accompanying tests provide good coverage for various scenarios, including edge cases and negative indexing.

The main area for improvement is the significant code duplication across the newly added C files. Helper functions for UTF-8 character counting and offset calculation are either duplicated or re-implemented in multiple places. Consolidating this logic into shared helper functions would greatly improve maintainability and reduce the chance of future bugs. I've provided specific suggestions for refactoring.

@luadebug luadebug changed the title implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar functions for substring extraction in UTF-8 strings implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings Jan 18, 2026
@waruqi
Copy link
Member

waruqi commented Jan 19, 2026

There are many APIs that overlap with string, only differing in whether or not UTF-8 support is provided. This leads to confusion in current string usage, making it difficult for users to distinguish when to use string.sub lastof and when to use utfsub utflastof. Furthermore, implementing these directly based on UTF-8 is extremely inefficient.

We should provide better support through a new ustring module. This can be discussed in detail in a new issue regarding its design and implementation.

Additionally, any new API additions must first be created as a new feature request in an issue, undergo detailed API design, and be approved before a pull request is opened for implementation.

like this #7184 (comment)

@waruqi
Copy link
Member

waruqi commented Jan 19, 2026

#7242

@waruqi waruqi mentioned this pull request Jan 20, 2026
@waruqi
Copy link
Member

waruqi commented Jan 21, 2026

#7246

@waruqi waruqi closed this Jan 21, 2026
@luadebug luadebug deleted the utf branch January 21, 2026 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

string.sub non-English characters via xmake l string.sub?

2 participants