implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings#7241
implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings#7241luadebug wants to merge 14 commits intoxmake-io:devfrom
Conversation
Summary of ChangesHello @luadebug, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the project's string manipulation capabilities by introducing a Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a utfsub function for extracting substrings from UTF-8 encoded strings, making it available in Lua scripts. The implementation correctly handles multi-byte characters.
My review focuses on improving the maintainability of the C code and increasing test coverage.
- In
core/src/xmake/string/utfsub.c, I've suggested refactoring thexm_utf_offsetfunction to eliminate code duplication, which will make the code easier to read and maintain. - In
tests/modules/string/test.lua, I've recommended adding more test cases to cover negative indices and various edge cases, ensuring the function's robustness.
Overall, this is a valuable addition. Addressing these points will further improve the quality of the implementation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces utfsub and utflen functions for UTF-8 string manipulation, mimicking Python's slicing behavior. The implementation is functionally correct and includes a good set of tests. My review focuses on improving code maintainability by addressing duplicated code and enhancing performance for certain edge cases. I've suggested extracting the UTF-8 character processing logic into a helper function to be reused across the new files. Additionally, I've pointed out a performance optimization for handling negative indices in utfsub and recommended adding a test case to make the behavior of a special end_idx value explicit.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces useful UTF-8 string manipulation functions (utfsub, utflen, utffind), which is a great addition. The implementation is functionally correct and includes a good set of tests.
My main feedback is regarding significant code duplication across the newly added files (utffind.c, utflen.c, utfsub.c). The logic for counting UTF-8 characters and calculating byte offsets is repeated in multiple places. I've left specific comments suggesting that this shared logic be extracted into a common utility file (e.g., utf_utils.c with a corresponding header). This will greatly improve the maintainability and readability of the code.
Once the duplication is addressed, this will be a solid contribution.
…improved readability and reuse
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces several useful UTF-8 string manipulation functions. The implementations are generally correct and the added tests for utfsub, utflen, and utffind are comprehensive.
My main feedback is about code duplication. The logic for parsing UTF-8 characters is repeated in all four new C files (utffind.c, utflen.c, utfreverse.c, utfsub.c). This should be refactored into common helper functions in a shared location to improve maintainability.
Additionally, I've pointed out a small redundancy in utffind.c that can be simplified.
Finally, a critical point is the lack of tests for the new utfreverse function. Please add tests to ensure it works as expected.
…r utfreverse function
…the beginning of the string
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces useful UTF-8 string manipulation functions (utfsub, utflen, utffind, utfreverse), which is a great addition. The implementations correctly handle multi-byte characters and edge cases like negative indices, and the accompanying tests are comprehensive.
My main feedback is regarding significant code duplication across the new files. The logic for iterating through UTF-8 characters and calculating character counts/offsets is repeated in utffind.c, utflen.c, utfreverse.c, and utfsub.c. Consolidating this logic into shared helper functions would greatly improve maintainability and reduce redundancy. I've left specific comments pointing out these duplications.
Additionally, there's a minor point about error handling for lua_newuserdata in utfreverse.c.
Overall, this is a solid contribution. Addressing the code duplication will make it even better.
…et calculations for UTF-8 substring extraction
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces valuable UTF-8 string manipulation functions (utfsub, utflen, utffind, utfreverse), which is a great addition. The implementations are functionally correct and the test coverage is good.
My main feedback revolves around significant code duplication across the new C files. The logic for determining UTF-8 character byte length and for counting characters in a string is repeated in multiple places. To enhance maintainability and robustness, I strongly recommend centralizing this logic into shared helper functions. For instance, the xm_utf_char_count and xm_utf_offset functions in utffind.c could be made non-static and declared in a shared header (like core/src/xmake/string/prefix.h) for reuse in utflen.c and utfsub.c. This would make the codebase cleaner and easier to manage.
| if (b >= 0xF0) len = 4; | ||
| else if (b >= 0xE0) len = 3; | ||
| else if (b >= 0xC0) len = 2; |
There was a problem hiding this comment.
This logic for determining the byte length of a UTF-8 character is duplicated within this file (in xm_utf_offset) and across other new files. It's a good candidate for extraction into its own static helper function, for example xm_utf_char_len(). This would improve code clarity and make it easier to maintain. Also, the nested if can be simplified since the b >= 0xC0 check is already done.
| if (b >= 0xF0) len = 4; | |
| else if (b >= 0xE0) len = 3; | |
| else if (b >= 0xC0) len = 2; | |
| if (b >= 0xF0) len = 4; | |
| else if (b >= 0xE0) len = 3; | |
| else len = 2; |
| if (b >= 0xF0) len = 4; | ||
| else if (b >= 0xE0) len = 3; | ||
| else if (b >= 0xC0) len = 2; |
There was a problem hiding this comment.
This logic for determining UTF-8 character length is duplicated from other files. Please consider extracting it to a shared helper function to improve maintainability. The nested if can also be simplified.
| if (b >= 0xF0) len = 4; | |
| else if (b >= 0xE0) len = 3; | |
| else if (b >= 0xC0) len = 2; | |
| if (b >= 0xF0) len = 4; | |
| else if (b >= 0xE0) len = 3; | |
| else len = 2; |
| if (b >= 0xF0) len = 4; | ||
| else if (b >= 0xE0) len = 3; | ||
| else if (b >= 0xC0) len = 2; |
There was a problem hiding this comment.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive set of UTF-8 string manipulation functions, which is a great addition. The implementations are generally correct and the accompanying tests provide good coverage for various scenarios, including edge cases and negative indexing.
The main area for improvement is the significant code duplication across the newly added C files. Helper functions for UTF-8 character counting and offset calculation are either duplicated or re-implemented in multiple places. Consolidating this logic into shared helper functions would greatly improve maintainability and reduce the chance of future bugs. I've provided specific suggestions for refactoring.
…a UTF-8 string and implement tests
|
There are many APIs that overlap with We should provide better support through a new Additionally, any new API additions must first be created as a new feature request in an issue, undergo detailed API design, and be approved before a pull request is opened for implementation. like this #7184 (comment) |
Resolves: #7238
Achieve similiar behavior as python3 slicing...
Perhaps maybe we need more functions to implement? Let me know...
Also we should decide what should be shadowed... As we discussed string.sub should not be shadowed that is why we going to have string.sub and string.utfsub, instead of just string.sub (with implementation of string.utfsub)... What about utflen/utffind/utfreverse... should they shadow len/find/reverse?