Add utf8 module by waruqi · Pull Request #7246 · xmake-io/xmake

waruqi · 2026-01-20T09:43:12Z

Xmake utf8 Module Interface

The utf8 module provides comprehensive support for UTF-8 string manipulation, extending standard Lua functionality with additional helpers and convenient character-based indexing for common operations.

Overview

Mixed Indexing Modes:

Byte Indices: Used by len, codepoint, offset, and codes (consistent with Lua 5.3+ utf8 standard) for efficient low-level iteration.
Character Indices: Used by sub, find, lastof, and byte for convenient high-level text processing (1-based, treating each UTF-8 sequence as one unit).
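
The difference between the two index kinds can be illustrated with the stock Lua 5.3+ utf8 library alone. This is a minimal sketch that assumes a standard Lua 5.3+ interpreter and a UTF-8 encoded source file; it does not call the xmake module itself.

```lua
-- Assumption: stock Lua 5.3+ utf8 library, UTF-8 encoded source file.
local s = "你好xmake"

-- Byte length vs. character length.
local nbytes = #s            -- 3 + 3 + 5 = 11 bytes
local nchars = utf8.len(s)   -- 7 characters

-- Character index 3 ('x') starts at byte index 7.
local byte_index = utf8.offset(s, 3)

-- Going back: count the characters that start within bytes 1..byte_index.
local char_index = utf8.len(s, 1, byte_index)

print(nbytes, nchars, byte_index, char_index)
```

Each Chinese character occupies three bytes here, which is why character index 3 and byte index 7 refer to the same position.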

Interface Description

`utf8.len(s [, i [, j [, lax]]])`

Returns the number of UTF-8 characters in string s that start between byte positions i and j (both inclusive).

Parameters:
- s (string): The input string.
- i (number, optional): Start byte position (default: 1).
- j (number, optional): End byte position (default: -1).
- lax (boolean, optional): If true, ignores invalid sequences.
Returns: (number) The count of characters, or nil plus the byte position of the first invalid byte if an invalid sequence is found.
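
A short sketch of the byte-range and error behavior, assuming the module matches the stock Lua 5.3+ `utf8.len` (the calls below were checked against stock Lua semantics only):

```lua
-- Assumption: stock Lua 5.3+ utf8.len semantics.
local s = "你好xmake"

local total = utf8.len(s)             -- all characters
local from_second = utf8.len(s, 4)    -- characters starting at byte 4 or later

-- Byte 2 is a continuation byte of '你', so decoding fails right there.
local bad, bad_pos = utf8.len(s, 2)

-- 0xFF can never appear in valid UTF-8.
local bad2, bad2_pos = utf8.len("ab\xff")

print(total, from_second, bad, bad_pos, bad2, bad2_pos)
```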

`utf8.sub(s, i [, j])`

Returns the substring of s that starts at character index i and ends at character index j.

Parameters:
- s (string): The input string.
- i (number): Start character index.
- j (number, optional): End character index (default: -1).
Returns: (string) The substring.
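
To make the character-index semantics concrete, here is a pure-Lua sketch built on the stock Lua 5.3+ utf8 library. The name `utf8_sub` and the `string.sub`-like clamping of out-of-range indices are assumptions for illustration; this is not the C implementation from this PR.

```lua
-- Hypothetical helper: character-indexed substring on top of stock Lua 5.3+.
local function utf8_sub(s, i, j)
    local n = utf8.len(s)
    if not n then return nil end          -- s is not valid UTF-8
    j = j or -1
    -- Normalize negative and out-of-range indices the way string.sub does.
    if i < 0 then i = math.max(n + i + 1, 1) elseif i == 0 then i = 1 end
    if j < 0 then j = n + j + 1 elseif j > n then j = n end
    if i > j then return "" end
    -- Map character indices to byte indices, then slice by bytes.
    local first = utf8.offset(s, i)
    local last = utf8.offset(s, j + 1) - 1
    return s:sub(first, last)
end

local s = "你好xmake"
print(utf8_sub(s, 1, 2), utf8_sub(s, 3), utf8_sub(s, -5), utf8_sub(s, 2, 2))
```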

`utf8.find(s, pattern [, init [, plain]])`

Looks for the first match of pattern in the string s.

Parameters:
- s (string): The input string.
- pattern (string): The pattern to search for.
- init (number, optional): Start character index for search (default: 1).
- plain (boolean, optional): If true, treats pattern as a plain string.
Returns: (number, number) Start and end character indices of the match, or nil.
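
One way to picture the index conversion is to wrap the byte-oriented `string.find` and translate its results. The sketch below assumes stock Lua 5.3+, valid UTF-8 input, and a non-empty pattern; `utf8_find` is a hypothetical name, and the PR's native implementation may work differently.

```lua
-- Hypothetical helper: string.find with character indices in and out.
local function utf8_find(s, pattern, init, plain)
    init = init or 1
    -- Character index -> byte index for the search start.
    local byte_init = utf8.offset(s, init)
    if not byte_init then return nil end
    local bs, be = string.find(s, pattern, byte_init, plain)
    if not bs then return nil end
    -- Byte index -> character index: count characters starting in 1..byte.
    return utf8.len(s, 1, bs), utf8.len(s, 1, be)
end

local s = "你好xmake"
print(utf8_find(s, "xmake"))        -- character indices of the match
print(utf8_find(s, "好", 1, true))
print(utf8_find(s, "你", 2, true))  -- no match after the first character
```

Note that non-plain Lua patterns still match bytes, so a pattern such as `.` can split a multi-byte character in this sketch.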

`utf8.lastof(s, pattern [, plain])`

Finds the last occurrence of pattern in the string s.

Parameters:
- s (string): The input string.
- pattern (string): The pattern to search for.
- plain (boolean, optional): If true, treats pattern as a plain string.
Returns: (number) The start character index of the last match, or nil.
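
A rough pure-Lua equivalent for the plain-string case, assuming stock Lua 5.3+ and valid UTF-8 input. `utf8_lastof` is a hypothetical name, and pattern (non-plain) matching is deliberately left out of this sketch.

```lua
-- Hypothetical helper: last plain-text occurrence, as a character index.
local function utf8_lastof(s, needle)
    local last_byte
    local init = 1
    while true do
        -- Plain search (4th argument true), moving forward one byte at a time.
        local bs = string.find(s, needle, init, true)
        if not bs then break end
        last_byte = bs
        init = bs + 1
    end
    if not last_byte then return nil end
    -- Convert the byte index of the last hit to a character index.
    return utf8.len(s, 1, last_byte)
end

print(utf8_lastof("你/好/xmake", "/"))
print(utf8_lastof("abc", "z"))
```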

`utf8.byte(s [, i [, j]])`

Returns the internal numeric codes (codepoints) of the characters s[i], s[i+1], ..., s[j].

Parameters:
- s (string): The input string.
- i (number, optional): Start character index (default: 1).
- j (number, optional): End character index (default: i).
Returns: (number...) The codepoints.
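
The relationship to `utf8.codepoint` can be sketched by converting the character indices to byte indices first. This assumes stock Lua 5.3+ and indices that fall inside the string; `utf8_byte` is a hypothetical name and does no range clamping.

```lua
-- Hypothetical helper: codepoints of characters i..j (character indices).
local function utf8_byte(s, i, j)
    i = i or 1
    j = j or i
    local first = utf8.offset(s, i)   -- byte where character i starts
    local last = utf8.offset(s, j)    -- byte where character j starts
    if not first or not last then return end
    return utf8.codepoint(s, first, last)
end

local s = "你好xmake"
print(utf8_byte(s, 1))       -- codepoint of '你'
print(utf8_byte(s, 2, 3))    -- codepoints of '好' and 'x'
```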

`utf8.codepoint(s [, i [, j [, lax]]])`

Returns the codepoints (as integers) from all characters in s that start between byte position i and j.

Parameters:
- s (string): The input string.
- i (number, optional): Start byte position (default: 1).
- j (number, optional): End byte position (default: i).
- lax (boolean, optional): If true, ignores invalid sequences.
Returns: (number...) The codepoints.
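
For comparison with `utf8.byte`, a few byte-indexed calls, assuming the module behaves like the stock Lua 5.3+ `utf8.codepoint`:

```lua
-- Assumption: stock Lua 5.3+ utf8.codepoint semantics.
local s = "你好xmake"

local first = utf8.codepoint(s)                 -- character starting at byte 1
local a, b = utf8.codepoint(s, 1, 4)            -- characters starting in bytes 1..4
local tail = { utf8.codepoint(s, 7, -1) }       -- 'x', 'm', 'a', 'k', 'e'

-- Byte 2 is in the middle of '你'; stock Lua raises an error here.
local ok = pcall(utf8.codepoint, s, 2)

print(first, a, b, #tail, ok)
```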

`utf8.char(...)`

Receives zero or more integers, converts each one to its corresponding UTF-8 character, and returns a string with the concatenation of all these characters.

Returns: (string) The resulting string.
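
A small round-trip example, assuming the stock Lua 5.3+ `utf8.char` behavior:

```lua
-- Assumption: stock Lua 5.3+ utf8.char semantics.
local hello = utf8.char(20320, 22909)              -- U+4F60, U+597D
local name = utf8.char(120, 109, 97, 107, 101)     -- ASCII codepoints
local empty = utf8.char()                          -- zero arguments

-- Encoding and decoding are inverse operations.
local roundtrip = utf8.char(utf8.codepoint("你好xmake", 1, -1))

print(hello, name, #empty, roundtrip)
```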

`utf8.offset(s, n [, i])`

Returns the byte index in s where the n-th character (counting from byte position i) starts.

Parameters:
- s (string): The input string.
- n (number): The number of characters to advance (can be negative).
- i (number, optional): The starting byte position (default: 1).
Returns: (number) The byte index.
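
Some concrete values, assuming the stock Lua 5.3+ `utf8.offset` semantics. In stock Lua the default for i is `#s + 1` when n is negative; whether the xmake module does the same is not stated in this description.

```lua
-- Assumption: stock Lua 5.3+ utf8.offset semantics.
local s = "你好xmake"

local first = utf8.offset(s, 1)       -- start of '你'
local second = utf8.offset(s, 2)      -- start of '好'
local third = utf8.offset(s, 3)       -- start of 'x'
local last = utf8.offset(s, -1)       -- start of the last character 'e'
local owner = utf8.offset(s, 0, 5)    -- start of the character containing byte 5
local past = utf8.offset(s, 8)        -- one past the end (#s + 1)
local none = utf8.offset(s, 9)        -- no such character

print(first, second, third, last, owner, past, none)
```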

`utf8.width(s)` or `utf8.width(codepoint)`

Returns the display width of a string or a single codepoint (handling full-width characters).

Returns: (number) The width (usually 1 or 2 per character).
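
The idea behind the width calculation can be approximated in pure Lua. The sketch below assumes stock Lua 5.3+ and uses a small, hand-picked set of wide ranges (roughly the classic wcwidth ones); it ignores zero-width and combining characters, and the names `is_wide` and `display_width` are hypothetical. The native implementation may use different tables.

```lua
-- Rough approximation: treat common East Asian ranges as width 2.
local function is_wide(cp)
    return (cp >= 0x1100 and cp <= 0x115F)    -- Hangul Jamo initial consonants
        or (cp >= 0x2E80 and cp <= 0xA4CF)    -- CJK radicals through Yi
        or (cp >= 0xAC00 and cp <= 0xD7A3)    -- Hangul syllables
        or (cp >= 0xF900 and cp <= 0xFAFF)    -- CJK compatibility ideographs
        or (cp >= 0xFE30 and cp <= 0xFE6F)    -- CJK compatibility forms
        or (cp >= 0xFF00 and cp <= 0xFF60)    -- fullwidth forms
        or (cp >= 0xFFE0 and cp <= 0xFFE6)    -- fullwidth signs
end

-- Accepts either a string or a single codepoint, like the documented API.
local function display_width(v)
    if type(v) == "number" then
        return is_wide(v) and 2 or 1
    end
    local w = 0
    for _, cp in utf8.codes(v) do
        w = w + (is_wide(cp) and 2 or 1)
    end
    return w
end

print(display_width("你好"), display_width("xmake"), display_width(0x4F60))
```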

`utf8.reverse(s)`

Reverses the UTF-8 string s.

Returns: (string) The reversed string.
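
A pure-Lua sketch of character-wise reversal, assuming stock Lua 5.3+ and valid UTF-8 input. `utf8_reverse` is a hypothetical name; it reverses codepoints, so combining sequences would end up detached from their base characters.

```lua
-- Hypothetical helper: reverse a string codepoint by codepoint.
local function utf8_reverse(s)
    local chars = {}
    -- Split into one encoded character per table entry.
    for _, cp in utf8.codes(s) do
        chars[#chars + 1] = utf8.char(cp)
    end
    -- Reverse the table in place.
    local i, j = 1, #chars
    while i < j do
        chars[i], chars[j] = chars[j], chars[i]
        i, j = i + 1, j - 1
    end
    return table.concat(chars)
end

print(utf8_reverse("你好"), utf8_reverse("a你b"))
```

A plain `string.reverse` would reverse individual bytes and produce invalid UTF-8 for the same input.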

`utf8.codes(s [, lax])`

Returns an iterator that yields (byte_pos, codepoint) pairs, one per character in s.
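
A typical loop over the iterator, assuming the stock Lua 5.3+ `utf8.codes` behavior:

```lua
-- Assumption: stock Lua 5.3+ utf8.codes semantics.
local s = "你好xm"
local seen = {}

-- pos is the byte index where each character starts, cp its codepoint.
for pos, cp in utf8.codes(s) do
    seen[#seen + 1] = string.format("%d:U+%04X", pos, cp)
end

local summary = table.concat(seen, " ")
print(summary)
```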

Usage Examples

local s = "你好xmake"

-- Length
print(utf8.len(s)) -- 7 (2 Chinese + 5 English)

-- Substring (Character Index)
print(utf8.sub(s, 1, 2)) -- "你好"
print(utf8.sub(s, 3))    -- "xmake"

-- Find (Character Index)
local start, finish = utf8.find(s, "xmake")
print(start, finish) -- 3, 7

-- Byte/Codepoint
-- utf8.byte uses character index
print(utf8.byte(s, 1)) -- 20320 (codepoint for '你')
-- utf8.codepoint uses byte index
print(utf8.codepoint(s, 1, 3)) -- 20320 ('你' is the only character that starts within bytes 1..3)

-- Width
print(utf8.width("你好")) -- 4 (2 * 2)
print(utf8.width("xmake")) -- 5

-- Reverse
print(utf8.reverse("你好")) -- "好你"

gemini-code-assist · 2026-01-20T09:43:40Z

Summary of Changes

Hello @waruqi, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Xmake's string processing capabilities by integrating a new utf8 module. This module provides a complete set of functions for character-aware operations on UTF-8 strings, such as determining character length, extracting substrings, converting between codepoints and characters, and performing searches. This addition is crucial for correctly handling internationalized text within build scripts and aligns Xmake's Lua environment with modern string handling standards, similar to Lua 5.3's native utf8 library.

Highlights

New UTF-8 Module: A dedicated utf8 module has been introduced, providing comprehensive support for UTF-8 encoded strings within Xmake's Lua environment.
Core UTF-8 Functionality: Essential UTF-8 string manipulation functions are now available, including len, char, codepoint, offset, codes, sub, lastof, and find.
Lua Integration: These new C-implemented functions are seamlessly exposed to Lua scripts, enabling robust handling of multi-byte characters.
Compatibility and Testing: Includes a lua_Unsigned typedef for broader Lua version compatibility and comprehensive unit tests to ensure the correctness and reliability of the new UTF-8 features.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a comprehensive utf8 module with Lua bindings, which is a valuable addition for improved Unicode string handling. The implementation is extensive, including C functions for various UTF-8 operations and corresponding Lua wrappers and tests. However, my review has identified several critical issues that need to be addressed before merging. These include a buffer overflow vulnerability, incorrect return values leading to stack corruption, and improper handling of invalid UTF-8 sequences. I have provided detailed comments and code suggestions to fix these problems.

core/src/xmake/utf8/codepoint.c

core/src/xmake/utf8/codes.c

core/src/xmake/utf8/utf8.c

waruqi · 2026-01-21T01:25:00Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a comprehensive native utf8 module, which is a significant improvement for handling UTF-8 strings in xmake. The implementation is robust, covering a wide range of functionality compatible with Lua's standard utf8 library, and includes extensive tests.

I've made a couple of suggestions for improvement:

Optimizing single-character searches in xm_utf8_lastof_impl for better performance.
Using luaL_Buffer in xm_utf8_reverse for safer and more idiomatic buffer management.

Overall, this is an excellent contribution that enhances xmake's string manipulation capabilities.

core/src/xmake/utf8/reverse.c

core/src/xmake/utf8/utf8.c

waruqi added this to the v3.0.7 milestone Jan 20, 2026

waruqi mentioned this pull request Jan 20, 2026

Add the utf8 module to better support the Unicode character set. #7242

Closed

gemini-code-assist bot reviewed Jan 20, 2026

core/src/xmake/utf8/codepoint.c (outdated, resolved)

core/src/xmake/utf8/codes.c (resolved)

core/src/xmake/utf8/utf8.c (resolved)

add utf8 module

e0f8104

waruqi marked this pull request as ready for review January 20, 2026 15:18

waruqi added 16 commits January 20, 2026 23:24

modify comments

f8ab010

improve code styles

a22dc93

format code

9242345

add utf8.sub and lastof

a9d5fe2

fix utf8.find

2371058

improve utf8.sub and find

6eade15

improve utf8.lastof

91843ff

improve lastof

443fc30

add utf8.width

4c08bae

mark wcswidth deprecated

b5a5068

add utf8.byte

f10baca

add utf8.reverse

ae8b2c4

mark string.encode/decode as deprecated

96ad9ae

move deprecated apis to tail

8a1c05d

fix review issues

2eeb428

improve test runner

3a60244

gemini-code-assist bot reviewed Jan 21, 2026

core/src/xmake/utf8/reverse.c (resolved)

core/src/xmake/utf8/utf8.c (outdated, resolved)

waruqi mentioned this pull request Jan 21, 2026

implement utfsub/utflen/utffind/utfreverse/utfbyte/utfchar/utflastof functions for substring extraction in UTF-8 strings #7241

Closed

waruqi merged commit f8d4400 into dev Jan 21, 2026
73 checks passed

waruqi deleted the utf8 branch January 21, 2026 04:00

waruqi mentioned this pull request Jan 21, 2026

string.sub non-English characters via xmake l string.sub? #7238

Closed

waruqi added 2 commits January 21, 2026 22:09

fix path.filename and lastof

197dfb6

format code

e36eec6

waruqi added 4 commits January 21, 2026 22:31

fix compile error

e53ded7

fix compile error

66cb827

fix error again

a5d36f7

improve lastof

6ff8486

apocelipes mentioned this pull request Jan 26, 2026

#7245 breaks xmake-repo #7265

Closed

waruqi merged 23 commits into dev from utf8