Skip to content

add string case conversion functions: lower and upper#7235

Merged
waruqi merged 43 commits intoxmake-io:devfrom
luadebug:case
Jan 18, 2026
Merged

add string case conversion functions: lower and upper#7235
waruqi merged 43 commits intoxmake-io:devfrom
luadebug:case

Conversation

@luadebug
Copy link
Member

@luadebug luadebug commented Jan 16, 2026

Resolves: #7234

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @luadebug, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the string manipulation capabilities of the system by integrating new lower and upper functions. These additions provide a standardized way to change the case of strings, with a focus on internationalization by correctly handling various character encodings, thereby improving the flexibility and global compatibility of string operations.

Highlights

  • New String Case Conversion Functions: Introduced lower and upper functions for string objects, allowing conversion of strings to their lowercase and uppercase equivalents.
  • Unicode Support: The new case conversion functions are implemented with robust support for multi-byte character sets, specifically handling UTF-8 input by converting to wide characters (UTF-16 on Windows, UTF-32 on other systems) for conversion, and then back to UTF-8.
  • New Source File: A new C source file, core/src/xmake/string/case.c, has been added to encapsulate the logic for these string case conversions.
  • Unit Tests: Comprehensive unit tests have been added in tests/modules/string/test.lua to verify the correctness of the lower and upper functions, including tests for both ASCII and non-ASCII (Cyrillic) characters.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces string.lower() and string.upper() functions for case conversion, with support for Unicode characters. The implementation correctly handles character set conversions.

My review focuses on improving robustness and maintainability. I've identified potential integer overflow vulnerabilities in memory allocation calculations that could lead to security issues with very large input strings. I've also suggested refactoring the main logic to simplify error handling and improve code readability.

The added tests are good and cover both ASCII and non-ASCII strings.

@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces lower and upper string case conversion functions, which is a valuable addition. The implementation correctly handles Unicode by converting strings to a wide-character format before processing, and this is validated with tests for both ASCII and non-ASCII character sets. I have provided one suggestion to refactor the core logic in case.c to improve its readability, maintainability, and memory efficiency. Overall, the changes are well-implemented.

@luadebug luadebug marked this pull request as draft January 17, 2026 00:24
@luadebug luadebug marked this pull request as ready for review January 17, 2026 02:46
@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces string.lower() and string.upper() functions for case conversion of strings, with full UTF-8 support. This is achieved by adding the utf8proc library as a new dependency. The implementation is solid, correctly handling UTF-8 characters and even preserving invalid byte sequences. The integration into the build system and the addition of comprehensive test cases are well done. I have one minor suggestion to improve code hygiene in the new implementation file.

@waruqi
Copy link
Member

waruqi commented Jan 17, 2026

Please do not include third-party dependencies. xmake strictly controls the inclusion of third-party libraries and may gradually remove some existing third-party library dependencies in the future. tbox has built-in encoding handling and cross-platform support; we should improve tbox as much as possible to improve and implement these features.

@luadebug
Copy link
Member Author

Please do not include third-party dependencies. xmake strictly controls the inclusion of third-party libraries and may gradually remove some existing third-party library dependencies in the future. tbox has built-in encoding handling and cross-platform support; we should improve tbox as much as possible to improve and implement these features.

OK. What is a solution for this situation?

@waruqi
Copy link
Member

waruqi commented Jan 17, 2026

Supporting full character set conversion seems like a huge undertaking, potentially increasing the xmake binary by over 1-2MB.

https://github.com/JuliaStrings/utf8proc/blob/e5e799221b45bbb90f5fdc5c69b6b8dfbf017e78/utf8proc_data.c (2.25MB)

We should improve the charset module of tbox to provide tolower/toupper interfaces.

Then, for some system platforms, we can use existing system APIs to support it (e.g., Windows CharUpperBuffW/CharLowerBuffW, Apple NSString uppercaseString/lowercaseString, CFStringLowercase). If no existing system API support is available, we can fall back to our own simplified implementation, only providing conversions for some commonly used language character sets.

This might be quite complex and time-consuming. If you can't implement it in a short time, you can leave it for now. I'll try to support it when I have more time later.

@luadebug
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds string.lower() and string.upper() functions with Unicode support. The implementation is well-designed, with an optimization for ASCII strings and correct handling of Unicode characters via wide-character conversion. The approach to managing locales across different platforms is thorough. However, I've found a potential bug in the setlocale-based implementation where a failure to save the current locale could lead to it not being restored, which could have side effects. I've included a suggestion to make this more robust. The new tests provide good coverage for various character sets.

@luadebug
Copy link
Member Author

wait for this patch. tboox/tbox#300

then we can use tb_charset_utf8_tolower/tb_charset_utf8_toupper, it will call tb_wcslwr/tb_wcsupr to do it.

It makes some optimizations. Since 90% of the time the strings are ASCII strings, executing the Unicode tolower/toupper completely every time would significantly impact performance, so we don't need to do that.

tb_long_t tb_charset_utf8_tolower(tb_char_t* s, tb_size_t n)
{
    // check
    tb_assert_and_check_return_val(s, -1);

    // try ascii tolower first
    tb_char_t* p = s;
    tb_char_t* e = s + n;
    while (p < e && *p)
    {
        if ((*p) & 0x80) break; 
        *p = tb_tolower(*p);
        p++;
    }
    if (p == e || !*p) return p - s;

    // convert the suffix to wchar_t
    tb_long_t   r = -1;
    tb_size_t   wn = e - p + 1;
    tb_wchar_t  wb[256];
    tb_wchar_t* w = (wn <= 256)? wb : (tb_wchar_t*)tb_malloc(wn * sizeof(tb_wchar_t));
    if (w)
    {
        // to lower
        if (tb_mbstowcs(w, p, wn) != -1)
        {
            tb_wcslwr(w);
            r = tb_wcstombs(p, w, wn);
            if (r != -1) r += (p - s);
        }

        // free it
        if (w != wb) tb_free(w);
    }
    return r;
}

Looks cool, would be nice to have it clean without windows.h or some appleframework?

@SirLynix
Copy link
Member

I don't know if tb_wchar_t maps to wchar_t but beware on Windows wchar_t is UTF-16 and has surrogate pairs, which means a single wchar_t may be half a character.

@waruqi
Copy link
Member

waruqi commented Jan 18, 2026

I don't know if tb_wchar_t maps to wchar_t but beware on Windows wchar_t is UTF-16 and has surrogate pairs, which means a single wchar_t may be half a character.

Therefore, I would prioritize using the CharUpperW API to implement complete string conversion, as is the case on Apple's platform. A while-loop + towupper is the fallback solution. @luadebug

@luadebug
Copy link
Member Author

I don't know if tb_wchar_t maps to wchar_t but beware on Windows wchar_t is UTF-16 and has surrogate pairs, which means a single wchar_t may be half a character.

Therefore, I would prioritize using the CharUpperW API to implement complete string conversion, as is the case on Apple's platform. A while-loop + towupper is the fallback solution. @luadebug

Well sounds good as we have them included at any case.

@waruqi
Copy link
Member

waruqi commented Jan 18, 2026

We also need to update tbox.config.h for each platform to add detection results such as wcsupr and towlower. Usually only towlower and towupper exist.

@waruqi
Copy link
Member

waruqi commented Jan 18, 2026

wait for this patch #7237

@waruqi
Copy link
Member

waruqi commented Jan 18, 2026

rebase dev and try it again.

tb_size_t i;
for (i = 0; i < size; i++) {
// Stop at first non-ascii char
if ((tb_byte_t)str[i] >= 0x80) break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* //////////////////////////////////////////////////////////////////////////////////////
* implementation
*/
tb_int_t xm_string_lower(lua_State* lua) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please move to lower.c and upper.c file

@waruqi
Copy link
Member

waruqi commented Jan 18, 2026

like this

/* string.upper
 *
 * @param str       the string
 *
 * @code
 *      local result = string.upper(str)
 * @endcode
 */
tb_int_t xm_string_toupper(lua_State *lua) {
    // check
    tb_assert_and_check_return_val(lua, 0);

    // get string
    size_t           size = 0;
    tb_char_t const* cstr = luaL_checklstring(lua, 1, &size);
    tb_check_return_val(cstr, 0);

    // empty?
    if (!size) {
        lua_pushstring(lua, "");
        return 1;
    }

    // copy string to buffer
    tb_char_t* buffer = tb_malloc_bytes(size + 1);
    if (buffer) {
        tb_memcpy(buffer, cstr, size);
        buffer[size] = '\0';

        // to upper
        tb_long_t real_size = tb_charset_utf8_toupper(buffer, size);
        
        // push result
        if (real_size >= 0) {
            lua_pushlstring(lua, buffer, real_size);
        } else {
            lua_pushlstring(lua, cstr, size);
        }
        tb_free(buffer);
    } else {
        lua_pushnil(lua);
    }
    return 1;
}

@waruqi waruqi added this to the v3.0.7 milestone Jan 18, 2026
@waruqi waruqi merged commit d570511 into xmake-io:dev Jan 18, 2026
35 of 36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lowercase non-English characters via xmake l string.lower?

3 participants