Add utf8 module #7246

Merged
waruqi merged 23 commits into dev from utf8 on Jan 21, 2026

waruqi commented Jan 20, 2026

#7242
#7238
#7241

Xmake utf8 Module Interface

The utf8 module provides comprehensive support for UTF-8 string manipulation, extending standard Lua functionality with additional helpers and convenient character-based indexing for common operations.

Overview

Mixed Indexing Modes:

  • Byte Indices: Used by len, codepoint, offset, and codes (consistent with Lua 5.3+ utf8 standard) for efficient low-level iteration.
  • Character Indices: Used by sub, find, lastof, and byte for convenient high-level text processing (1-based, treating each UTF-8 sequence as one unit).
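The difference between the two index spaces can be illustrated with a small sketch. It assumes a stock Lua 5.3+ interpreter and uses only the standard utf8 library, so it does not depend on the xmake module itself.

```lua
-- Assumption: stock Lua 5.3+; only standard utf8 functions are used.
local s = "你好xmake"

-- Each Chinese character takes 3 bytes in UTF-8, so the byte length is 3 + 3 + 5.
local byte_len = #s           -- 11 bytes
local char_len = utf8.len(s)  -- 7 characters

-- utf8.offset maps a character index to the byte index where that character starts.
local third_char_byte = utf8.offset(s, 3)  -- 7: 'x' is character 3 but starts at byte 7

print(byte_len, char_len, third_char_byte)
```

Character index 3 and byte index 7 refer to the same 'x', which is why it matters which index space a given function expects.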

Interface Description

utf8.len(s [, i [, j [, lax]]])

Returns the number of UTF-8 characters in string s that start between byte positions i and j (both inclusive).

  • Parameters:
    • s (string): The input string.
    • i (number, optional): Start byte position (default: 1).
    • j (number, optional): End byte position (default: -1).
    • lax (boolean, optional): If true, ignores invalid sequences.
  • Returns: (number) The count of characters, or nil plus the byte position of the first invalid byte if the string contains an invalid sequence.
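As a rough illustration of the byte-based arguments and the failure return, the following sketch runs against the standard Lua 5.3+ utf8.len. It is assumed here, based on the description above, that the xmake function behaves the same way for these inputs.

```lua
-- Assumption: stock Lua 5.3+ utf8.len; the xmake module is described as compatible.
local s = "你好xmake"

print(utf8.len(s))     -- 7
print(utf8.len(s, 4))  -- 6: counting starts at byte 4, where '好' begins

-- Byte 2 is a continuation byte of '你', so decoding fails at that position.
print(utf8.len(s, 2))  -- nil  2

-- 0xFF can never appear in valid UTF-8; the failure is reported at byte 3.
print(utf8.len("ab\xFFcd"))  -- nil  3
```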

utf8.sub(s, i [, j])

Returns the substring of s that starts at character index i and ends at character index j.

  • Parameters:
    • s (string): The input string.
    • i (number): Start character index.
    • j (number, optional): End character index (default: -1).
  • Returns: (string) The substring.
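One way to picture character-indexed substrings is to translate the character indices into byte offsets and then slice with string.sub. The sketch below does that with the standard Lua 5.3+ utf8 library. The helper name char_sub and its handling of negative indices are assumptions for illustration, not the module's actual C implementation.

```lua
-- Hypothetical helper: character-indexed substring built on stock Lua 5.3+.
-- Assumes s is valid UTF-8.
local function char_sub(s, i, j)
    local n = utf8.len(s)
    j = j or -1
    -- Negative indices count from the end (an assumption mirroring string.sub).
    if i < 0 then i = n + i + 1 end
    if j < 0 then j = n + j + 1 end
    if i < 1 then i = 1 end
    if j > n then j = n end
    if i > j then return "" end
    local first = utf8.offset(s, i)         -- byte where character i starts
    local last = utf8.offset(s, j + 1) - 1  -- byte just before character j + 1
    return s:sub(first, last)
end

print(char_sub("你好xmake", 1, 2))  -- 你好
print(char_sub("你好xmake", 3))     -- xmake
print(char_sub("你好xmake", 2, 3))  -- 好x
```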

utf8.find(s, pattern [, init [, plain]])

Looks for the first match of pattern in the string s.

  • Parameters:
    • s (string): The input string.
    • pattern (string): The pattern to search for.
    • init (number, optional): Start character index for search (default: 1).
    • plain (boolean, optional): If true, treats pattern as a plain string.
  • Returns: (number, number) Start and end character indices of the match, or nil.
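The character indices returned here can be thought of as a translation of the byte indices that string.find reports. The following sketch shows that translation using stock Lua 5.3+ functions. The name char_find is hypothetical, and the sketch assumes valid UTF-8 input and a pattern that matches whole characters.

```lua
-- Hypothetical helper: string.find with character indices, on stock Lua 5.3+.
local function char_find(s, pattern, init, plain)
    -- Convert the starting character index to a byte index.
    local byte_init = utf8.offset(s, init or 1)
    if not byte_init then
        return nil
    end
    local b_start, b_end = string.find(s, pattern, byte_init, plain)
    if not b_start then
        return nil
    end
    -- Characters that start before the match, plus one, give the start index.
    local c_start = utf8.len(s, 1, b_start - 1) + 1
    -- Characters inside the match give the end index.
    local c_end = c_start + utf8.len(s, b_start, b_end) - 1
    return c_start, c_end
end

print(char_find("你好xmake", "xmake"))        -- 3  7
print(char_find("你好xmake", "好", 1, true))  -- 2  2
print(char_find("你好xmake", "zzz"))          -- nil
```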

utf8.lastof(s, pattern [, plain])

Finds the last occurrence of pattern in the string s.

  • Parameters:
    • s (string): The input string.
    • pattern (string): The pattern to search for.
    • plain (boolean, optional): If true, treats pattern as a plain string.
  • Returns: (number) The start character index of the last match, or nil.
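A minimal sketch of the idea, assuming stock Lua 5.3+: scan forward with string.find, remember the byte position of the last match, then convert it to a character index. The helper name char_lastof is hypothetical, and the native implementation may search differently (for example, from the end of the string).

```lua
-- Hypothetical helper: last occurrence as a character index, on stock Lua 5.3+.
-- Assumes s is valid UTF-8 and the pattern matches whole characters.
local function char_lastof(s, pattern, plain)
    local last_byte
    local init = 1
    while true do
        local b_start, b_end = string.find(s, pattern, init, plain)
        if not b_start then
            break
        end
        last_byte = b_start
        -- Continue after the match; step at least one byte on empty matches.
        init = math.max(b_start, b_end) + 1
    end
    if not last_byte then
        return nil
    end
    -- Characters that start before the match, plus one.
    return utf8.len(s, 1, last_byte - 1) + 1
end

print(char_lastof("你好/xmake/你好", "/", true))  -- 9
print(char_lastof("你好/xmake/你好", "?", true))  -- nil
```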

utf8.byte(s [, i [, j]])

Returns the internal numeric codes (codepoints) of the characters s[i], s[i+1], ..., s[j].

  • Parameters:
    • s (string): The input string.
    • i (number, optional): Start character index (default: 1).
    • j (number, optional): End character index (default: i).
  • Returns: (number...) The codepoints.
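The relationship between this character-indexed function and the byte-indexed utf8.codepoint can be sketched as follows, assuming stock Lua 5.3+. The helper name char_byte is hypothetical and only illustrates the index translation.

```lua
-- Hypothetical helper: codepoints by character index, on stock Lua 5.3+.
-- Assumes s is valid UTF-8.
local function char_byte(s, i, j)
    i = i or 1
    j = j or i
    local first = utf8.offset(s, i)  -- byte where character i starts
    local last = utf8.offset(s, j)   -- byte where character j starts
    if not first or first > #s then
        return  -- index past the end: no values
    end
    if not last or last > #s then
        last = #s
    end
    -- utf8.codepoint returns every character that starts in [first, last].
    return utf8.codepoint(s, first, last)
end

print(char_byte("你好xmake", 1))     -- 20320
print(char_byte("你好xmake", 1, 3))  -- 20320  22909  120
```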

utf8.codepoint(s [, i [, j [, lax]]])

Returns the codepoints (as integers) from all characters in s that start between byte position i and j.

  • Parameters:
    • s (string): The input string.
    • i (number, optional): Start byte position (default: 1).
    • j (number, optional): End byte position (default: i).
    • lax (boolean, optional): If true, ignores invalid sequences.
  • Returns: (number...) The codepoints.
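For comparison with the character-indexed variant, here is how the byte-indexed arguments behave in the standard Lua 5.3+ library. It is assumed that the xmake function matches this behavior, as the overview suggests.

```lua
-- Assumption: stock Lua 5.3+ utf8.codepoint.
local s = "你好xmake"

print(utf8.codepoint(s, 1))      -- 20320
print(utf8.codepoint(s, 1, 4))   -- 20320  22909 (characters starting at bytes 1 and 4)
print(utf8.codepoint(s, 7, -1))  -- 120  109  97  107  101

-- Byte 2 is in the middle of '你', so decoding from there raises an error.
local ok = pcall(utf8.codepoint, s, 2)
print(ok)  -- false
```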

utf8.char(...)

Receives zero or more integers, converts each one to its corresponding UTF-8 character, and returns a string with the concatenation of all these characters.

  • Returns: (string) The resulting string.
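A short example of the encoding direction, run against the standard Lua 5.3+ utf8.char (assumed to match the xmake function):

```lua
-- Assumption: stock Lua 5.3+ utf8.char.
local s = utf8.char(20320, 22909, 120)
print(s)   -- 你好x
print(#s)  -- 7 (3 + 3 + 1 bytes)

-- With no arguments the result is the empty string.
print(utf8.char() == "")  -- true
```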

utf8.offset(s, n [, i])

Returns the byte index at which the n-th character of s starts, counting from byte position i.

  • Parameters:
    • s (string): The input string.
    • n (number): The number of characters to advance (can be negative).
    • i (number, optional): The starting byte position (default: 1).
  • Returns: (number) The byte index.
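The following calls show typical results with the standard Lua 5.3+ utf8.offset. Note that in stock Lua the default for i is the end of the string when n is negative; whether the xmake module does the same is an assumption here.

```lua
-- Assumption: stock Lua 5.3+ utf8.offset.
local s = "你好xmake"

print(utf8.offset(s, 1))     -- 1: '你' starts at byte 1
print(utf8.offset(s, 2))     -- 4: '好' starts at byte 4
print(utf8.offset(s, 3))     -- 7: 'x' starts at byte 7
print(utf8.offset(s, -1))    -- 11: the last character, 'e'
print(utf8.offset(s, 2, 4))  -- 7: second character when counting from byte 4
```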

utf8.width(s) or utf8.width(codepoint)

Returns the display width of a string or a single codepoint (handling full-width characters).

  • Returns: (number) The width (usually 1 or 2 per character).
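To show what "display width" means in practice, here is a deliberately simplified sketch for stock Lua 5.3+. The helper names and the short list of wide ranges are assumptions for illustration only. The real module presumably uses a fuller width table, and this sketch ignores zero-width and combining characters.

```lua
-- Hypothetical, simplified width estimate; NOT the module's actual table.
local function is_wide(cp)
    return (cp >= 0x1100 and cp <= 0x115F)  -- Hangul Jamo initial consonants
        or (cp >= 0x3041 and cp <= 0x30FF)  -- Hiragana and Katakana
        or (cp >= 0x4E00 and cp <= 0x9FFF)  -- CJK Unified Ideographs
        or (cp >= 0xAC00 and cp <= 0xD7A3)  -- Hangul Syllables
        or (cp >= 0xFF01 and cp <= 0xFF60)  -- Fullwidth Forms
end

local function display_width(x)
    if type(x) == "number" then
        return is_wide(x) and 2 or 1
    end
    local total = 0
    for _, cp in utf8.codes(x) do
        total = total + (is_wide(cp) and 2 or 1)
    end
    return total
end

print(display_width("你好"))   -- 4
print(display_width("xmake"))  -- 5
print(display_width(20320))    -- 2
```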

utf8.reverse(s)

Reverses the UTF-8 string s.

  • Returns: (string) The reversed string.
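Reversal has to move whole characters rather than bytes, otherwise multi-byte sequences would be scrambled. A minimal sketch for stock Lua 5.3+ follows. The helper name reverse_codepoints is hypothetical, and it reverses codepoints, so combining marks would end up detached from their base characters.

```lua
-- Hypothetical helper: reverse a valid UTF-8 string codepoint by codepoint.
local function reverse_codepoints(s)
    local chars = {}
    for _, cp in utf8.codes(s) do
        chars[#chars + 1] = utf8.char(cp)
    end
    -- Swap entries from both ends toward the middle.
    local i, j = 1, #chars
    while i < j do
        chars[i], chars[j] = chars[j], chars[i]
        i = i + 1
        j = j - 1
    end
    return table.concat(chars)
end

print(reverse_codepoints("你好"))       -- 好你
print(reverse_codepoints("你好xmake"))  -- ekamx好你
```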

utf8.codes(s [, lax])

Returns an iterator that yields a (byte_pos, codepoint) pair for each character in the string.
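A typical loop over the iterator, shown with the standard Lua 5.3+ utf8.codes (assumed to match the xmake function), collects the byte position and codepoint of every character:

```lua
-- Assumption: stock Lua 5.3+ utf8.codes.
local s = "你好x"
local positions, codepoints = {}, {}

for pos, cp in utf8.codes(s) do
    positions[#positions + 1] = pos
    codepoints[#codepoints + 1] = cp
    print(pos, cp, utf8.char(cp))
end
-- 1  20320  你
-- 4  22909  好
-- 7  120    x
```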

Usage Examples

local s = "你好xmake"

-- Length
print(utf8.len(s)) -- 7 (2 Chinese + 5 English)

-- Substring (Character Index)
print(utf8.sub(s, 1, 2)) -- "你好"
print(utf8.sub(s, 3))    -- "xmake"

-- Find (Character Index)
local start, finish = utf8.find(s, "xmake")
print(start, finish) -- 3, 7

-- Byte/Codepoint
-- utf8.byte uses character index
print(utf8.byte(s, 1)) -- 20320 (codepoint for '你')
-- utf8.codepoint uses byte index
print(utf8.codepoint(s, 1, 3)) -- 20320 ('你' is the only character that starts within bytes 1-3; '好' starts at byte 4)

-- Width
print(utf8.width("你好")) -- 4 (2 * 2)
print(utf8.width("xmake")) -- 5

-- Reverse
print(utf8.reverse("你好")) -- "好你"

@gemini-code-assist

Summary of Changes

Hello @waruqi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Xmake's string processing capabilities by integrating a new utf8 module. This module provides a complete set of functions for character-aware operations on UTF-8 strings, such as determining character length, extracting substrings, converting between codepoints and characters, and performing searches. This addition is crucial for correctly handling internationalized text within build scripts and aligns Xmake's Lua environment with modern string handling standards, similar to Lua 5.3's native utf8 library.

Highlights

  • New UTF-8 Module: A dedicated utf8 module has been introduced, providing comprehensive support for UTF-8 encoded strings within Xmake's Lua environment.
  • Core UTF-8 Functionality: Essential UTF-8 string manipulation functions are now available, including len, char, codepoint, offset, codes, sub, lastof, and find.
  • Lua Integration: These new C-implemented functions are seamlessly exposed to Lua scripts, enabling robust handling of multi-byte characters.
  • Compatibility and Testing: Includes a lua_Unsigned typedef for broader Lua version compatibility and comprehensive unit tests to ensure the correctness and reliability of the new UTF-8 features.

gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive utf8 module with Lua bindings, which is a valuable addition for improved Unicode string handling. The implementation is extensive, including C functions for various UTF-8 operations and corresponding Lua wrappers and tests. However, my review has identified several critical issues that need to be addressed before merging. These include a buffer overflow vulnerability, incorrect return values leading to stack corruption, and improper handling of invalid UTF-8 sequences. I have provided detailed comments and code suggestions to fix these problems.

waruqi marked this pull request as ready for review on January 20, 2026 15:18

waruqi commented Jan 21, 2026

/gemini review

gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive native utf8 module, which is a significant improvement for handling UTF-8 strings in xmake. The implementation is robust, covering a wide range of functionality compatible with Lua's standard utf8 library, and includes extensive tests.

I've made a couple of suggestions for improvement:

  • Optimizing single-character searches in xm_utf8_lastof_impl for better performance.
  • Using luaL_Buffer in xm_utf8_reverse for safer and more idiomatic buffer management.

Overall, this is an excellent contribution that enhances xmake's string manipulation capabilities.

waruqi merged commit f8d4400 into dev on Jan 21, 2026
73 checks passed
waruqi deleted the utf8 branch on January 21, 2026 04:00