Add xml module #7025

Merged
waruqi merged 13 commits into dev from xml on Nov 15, 2025

@waruqi commented Nov 14, 2025

core.base.xml

The core.base.xml module provides a tiny DOM-style XML toolkit that works inside Xmake’s sandbox. It focuses on predictable data structures, JSON-like usability, and optional streaming so you can parse large XML documents without building the entire tree.

Node Structure

XML nodes are plain Lua tables. All constructors (xml.new, xml.text, xml.comment, etc.) return values shaped like:

{
    name     = "element-name" | nil, -- only for element nodes
    kind     = "element" | "text" | "comment" | "cdata" | "doctype" | "document",
    attrs    = { key = value, ... } or nil,
    text     = string or nil,
    children = { child1, child2, ... } or nil,
    prolog   = { comment/doctype nodes before root } or nil
}

Because these are regular tables, mutating them updates the DOM in place and the changes show up automatically when you call xml.encode or xml.savefile.
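Since nodes are ordinary tables, the shape above can be illustrated without the module at all. The sketch below hand-builds a tiny tree (real code would normally go through the constructors) and shows that a mutation made through one reference is visible from the root. The `count_kind` helper is hypothetical and not part of core.base.xml.

```lua
-- Hand-built node tables following the documented shape.
local item = {
    kind = "element", name = "item",
    attrs = {id = "foo"},
    children = {{kind = "text", text = "hello"}},
}
local root = {kind = "element", name = "root", attrs = {id = "1"}, children = {item}}

-- Hypothetical helper: count nodes of a given kind, depth-first.
local function count_kind(node, kind)
    local n = (node.kind == kind) and 1 or 0
    for _, child in ipairs(node.children or {}) do
        n = n + count_kind(child, kind)
    end
    return n
end

-- Mutating through `item` changes the tree reachable from `root`.
item.attrs.lang = "en"
table.insert(item.children, {kind = "comment", text = "note"})
```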

Quick Start

import("core.base.xml")

local doc = assert(xml.decode([[
<?xml version="1.0"?>
<root id="1">
  <item id="foo">hello</item>
</root>
]]))

local item = assert(xml.find(doc, "//item[@id='foo']"))
item.attrs.lang = "en"             -- mutate attrs directly
item.children = {xml.text("world")} -- replace existing text node
table.insert(doc.children, xml.comment("generated by xmake"))

local pretty = assert(xml.encode(doc, {pretty = true}))
assert(xml.savefile("out.xml", doc, {pretty = true}))

Streaming Example

local found
xml.scan(plist_text, function(node)
    if node.name == "key" and xml.text_of(node) == "NSPrincipalClass" then
        found = node
        return false -- early terminate
    end
end)

xml.scan walks nodes as they are completed; returning false stops the scan immediately. This is ideal for large files (e.g. Info.plist) when you only need a few entries.
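The callback contract (visit each completed node, stop as soon as the callback returns false) can be sketched on an already-built tree. This is only an analogy: the hypothetical `walk` below traverses tables in memory, whereas xml.scan works on the XML text itself. Children are visited before their parent to mimic "completed" order.

```lua
-- Hypothetical walker, not part of core.base.xml.
local function walk(node, callback)
    for _, child in ipairs(node.children or {}) do
        if walk(child, callback) == false then
            return false -- propagate the early stop upwards
        end
    end
    return callback(node)
end

local function key(name)
    return {kind = "element", name = "key", children = {{kind = "text", text = name}}}
end
local dict = {kind = "element", name = "dict", children = {key("A"), key("B"), key("C")}}

local visited, found = 0, nil
walk(dict, function(node)
    visited = visited + 1
    if node.name == "key" and node.children[1].text == "B" then
        found = node
        return false -- nodes after this one are never visited
    end
end)
```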

Options Summary

| Option | Applies to | Description |
| --- | --- | --- |
| trim_text = true | xml.decode, xml.scan | Strip leading/trailing spaces inside text nodes. Disabled by default to avoid data loss. |
| keep_whitespace_nodes = true | xml.decode, xml.scan | Preserve whitespace-only text nodes (by default they are discarded unless trim_text produced non-empty content). |
| pretty = true / indent / indentchar | xml.encode, xml.savefile | Enable formatting and control indentation. |

API Reference

xml.new(opt)

Create a custom node. opt may contain name, kind, attrs, children, and text. Usually you call the dedicated helpers below instead of xml.new directly.

Element/Text Helpers

local textnode    = xml.text("hello")
local empty       = xml.empty("br", {class = "line"})
local comment     = xml.comment("generated by xmake")
local cdata_node  = xml.cdata("if (value < 1) {...}")
local doctype     = xml.doctype('plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"')

All helpers return node tables that you can insert into children.

xml.decode(data, opt)

Parse an XML string into a node tree. Returns the single root element when there is exactly one element, or all top-level nodes when multiple elements exist. On failure returns nil, err.

Supports:

  • Comments, CDATA, DOCTYPE (stored in root.prolog when present).
  • Unquoted attributes such as <item flag=true path=/tmp/file>.
  • XPath-friendly structure (name, attrs, children).
  • trim_text and keep_whitespace_nodes options described above.

xml.encode(node, opt)

Serialize a node tree back into XML. Set {pretty = true, indent = 2} for multi-line output or pass a custom indentchar.

xml.loadfile(path, opt) / xml.savefile(path, node, opt)

Convenience wrappers that call io.readfile/io.writefile and reuse the decode/encode options.

xml.text_of(node)

Concatenate all direct text children and return the combined string. Useful for quickly reading <string>...</string> values.
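To make "direct text children" concrete, here is a minimal re-implementation written for illustration only; it is not the module's code. It assumes that only children with kind "text" contribute (how CDATA children are treated is not stated above) and that text nested inside child elements is ignored.

```lua
-- Illustrative stand-in for xml.text_of; assumptions are noted above.
local function text_of(node)
    local parts = {}
    for _, child in ipairs(node.children or {}) do
        if child.kind == "text" and child.text then
            parts[#parts + 1] = child.text
        end
    end
    return table.concat(parts)
end

local key = {
    kind = "element", name = "key",
    children = {
        {kind = "text", text = "NSPrincipal"},
        -- Nested element: its text is not a direct child, so it is skipped.
        {kind = "element", name = "b", children = {{kind = "text", text = "ignored"}}},
        {kind = "text", text = "Class"},
    },
}
local combined = text_of(key)
```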

xml.find(node, path)

XPath-like lookup supporting:

  • / child axis, // descendant axis.
  • Wildcards (*) and node tests (text(), comment(), cdata(), doctype()).
  • Attribute predicates ([@id='foo'], [@enabled]), text predicates ([text()='value']), positional indexes ([2]).

Returns the first node that matches or nil if nothing is found.

xml.scan(data, callback, opt)

Streaming parser. Calls callback(node) for each completed node; returning false stops the scan early. Accepts the same options as xml.decode (trim_text, keep_whitespace_nodes). Nodes produced by xml.scan have the same structure as those returned by xml.decode.

Attribute Parsing Notes

  • Both quoted and unquoted values are supported (a="1 2", b='foo', c=bare).
  • Attribute names may include colons, dashes, or underscores.
  • Entity references inside attribute values are decoded (&amp; becomes &).

Example: Parsing and Updating an Info.plist

import("core.base.xml")

local plist = assert(xml.loadfile("Info.plist"))
local dict = assert(xml.find(plist, "plist/dict"))
local version_node

-- <dict> children alternate between <key> and value elements
for i = 1, #dict.children, 2 do
    local key = dict.children[i]
    local value = dict.children[i + 1]
    if key and value and xml.text_of(key) == "CFBundleShortVersionString" then
        version_node = value
        break
    end
end

if version_node then
    version_node.children = {xml.text("2.0")}
    assert(xml.savefile("Info.plist", plist, {pretty = true}))
end

This example demonstrates decoding, querying via DOM traversal, mutating nodes, and writing the file back with pretty formatting.

@waruqi added this to the v3.0.5 milestone on Nov 14, 2025
@gemini-code-assist

Summary of Changes

Hello @waruqi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new xml module, significantly enhancing the project's capability to interact with XML data. The module offers a complete set of tools for parsing XML strings into a navigable Lua table representation and serializing these structures back into XML, with options for formatted output. It is designed to be robust, supporting various XML constructs, and comes with essential utilities for file operations and tree traversal, all validated by a thorough test suite.

Highlights

  • XML Module Introduction: A new xml module has been added, providing core functionalities for XML processing within the project.
  • Parsing and Serialization: The module supports decoding XML strings into a structured Lua table format and encoding these structures back into XML, including pretty-printing options.
  • Node Type Support: It handles various XML node types such as elements, text, comments, CDATA sections, and DOCTYPE declarations.
  • Utility Functions: Includes functions for loading/saving XML files, finding nodes by path, and extracting concatenated text from child nodes.
  • Comprehensive Testing: A dedicated test suite (tests/modules/xml/test.lua) has been added to ensure the correctness and reliability of all XML functionalities.
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new XML module, providing functionality for decoding and encoding XML strings and files. The module defines a clear node structure and includes helper functions for common XML operations like finding nodes and extracting text. Unit tests have been added to cover basic decoding, encoding, and special node types. The sandbox wrapper correctly integrates the new module and handles errors by raising exceptions. Overall, the implementation is a good foundation for XML processing.

    table.insert(children, xml.cdata(value))
    i = close + 3
elseif data:sub(lt + 1, lt + 8):upper() == "!DOCTYPE" then
    local close = data:find(">", lt + 8)

high

The string.find function is used here without the plain argument (true). This means the > character in the pattern is treated as a magic character for Lua patterns, which is not the intended behavior. It should be data:find(">", lt + 8, true) to ensure > is matched literally. This issue appears in multiple places within the decode function.

            local close = data:find(">", lt + 8, true)
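A note for readers checking this suggestion: Lua's pattern magic characters are `^$()%.[]*+-?`, and `>` is not among them, so for a bare ">" the plain flag is defensive rather than behavior-changing. It does matter for delimiters that contain a magic character, such as a comment terminator. The strings below are made up for illustration.

```lua
-- ">" has no special meaning in Lua patterns: both calls find the same index.
local doctype = "<!DOCTYPE plist>"
local with_pattern = doctype:find(">", 1)
local with_plain = doctype:find(">", 1, true)

-- "-->" is different: the second "-" acts as a lazy repetition of the first,
-- so as a pattern it means "zero or more '-' followed by '>'".
local text = "a > b -->"
local as_pattern = text:find("-->", 1)     -- 3: matches the lone ">"
local as_plain = text:find("-->", 1, true) -- 7: matches the literal "-->"
```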

    end
    i = close + 2
elseif data:sub(lt + 1, lt + 1) == "!" then
    local close = data:find(">", lt + 2)

high

Similar to the previous comment, the string.find function is used here without the plain argument (true). This could lead to incorrect parsing if the > character is interpreted as a pattern magic character. It should be data:find(">", lt + 2, true) to match > literally.

            local close = data:find(">", lt + 2, true)

    end
    i = close + 1
elseif data:sub(lt + 1, lt + 1) == "/" then
    local close = data:find(">", lt + 1)

high

The string.find function is used here without the plain argument (true). This means the > character in the pattern is treated as a magic character for Lua patterns, which is not the intended behavior. It should be data:find(">", lt + 1, true) to ensure > is matched literally.

            local close = data:find(">", lt + 1, true)

    end
    i = close + 1
else
    local close = data:find(">", lt + 1)

high

The string.find function is used here without the plain argument (true). This means the > character in the pattern is treated as a magic character for Lua patterns, which is not the intended behavior. It should be data:find(">", lt + 1, true) to ensure > is matched literally.

            local close = data:find(">", lt + 1, true)

Comment on lines +71 to +76
function xml._parse_attrs(attrstr)
    local attrs
    attrstr:gsub("([%w_:%-%.]+)%s*=%s*([\"'])(.-)%2", function(key, quote, value)
        attrs = attrs or {}
        attrs[key] = xml._decode_entities(value)
    end)

medium

The regular expression for parsing attributes expects values to be enclosed in either double or single quotes. However, XML attribute values can also be unquoted if they do not contain spaces or special characters. This current implementation might fail to parse valid XML where attributes are unquoted, for example, <element attr=value>. Consider expanding the regex to support unquoted attribute values for broader XML compatibility.
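One way the pattern could be extended is a two-pass parse: consume quoted values first, then look for bare values in whatever is left. The sketch below is an illustration under stated assumptions, not necessarily how the merged module does it. `parse_attrs`, `decode_entities`, and `ENTITIES` are hypothetical names, only the five predefined XML entities are decoded, and the tag delimiters (including a trailing `/` on self-closing tags) are assumed to have been stripped from the input already.

```lua
-- Illustrative only: a two-pass attribute parser.
local ENTITIES = {amp = "&", lt = "<", gt = ">", quot = '"', apos = "'"}

local function decode_entities(value)
    -- Unknown entity names map to nil, which leaves the original text as is.
    return (value:gsub("&(%a+);", ENTITIES))
end

local function parse_attrs(attrstr)
    local attrs = {}
    -- Pass 1: quoted values. Returning "" removes each match from `rest`,
    -- so pass 2 can never re-read text that was inside quotes.
    local rest = attrstr:gsub("([%w_:%-%.]+)%s*=%s*([\"'])(.-)%2", function(key, _, value)
        attrs[key] = decode_entities(value)
        return ""
    end)
    -- Pass 2: bare values run up to the next whitespace, quote, or '>'.
    for key, value in rest:gmatch("([%w_:%-%.]+)%s*=%s*([^%s\"'>]+)") do
        attrs[key] = decode_entities(value)
    end
    return attrs
end

local attrs = parse_attrs([[a="1 2" b='x &amp; y' flag=true path=/tmp/file]])
```

All values come back as strings, so `flag=true` yields the string "true" rather than a boolean.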

Comment on lines +83 to +84
if opt.trim_text ~= false then
    text = text:gsub("^%s+", ""):gsub("%s+$", "")

medium

The trim_text option defaults to true (because opt.trim_text ~= false evaluates to true if opt.trim_text is nil). While trimming whitespace is often desired, it can lead to data loss if significant whitespace needs to be preserved, such as in xml:space="preserve" contexts. It would be more explicit and safer to make trim_text false by default and require users to opt-in for trimming, or provide a clear option to disable it when necessary.
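The difference between the two defaults comes down to how a missing option (nil) is tested. A minimal sketch, with hypothetical helper names, of the default-on check quoted above next to an opt-in alternative:

```lua
-- Default-on: nil and true both enable trimming (the behavior flagged here).
local function should_trim_default_on(opt)
    return opt.trim_text ~= false
end

-- Opt-in: only an explicit true enables trimming.
local function should_trim_opt_in(opt)
    return opt.trim_text == true
end

-- Same trimming as the reviewed lines; the outer parentheses drop the
-- substitution count that gsub returns as a second value.
local function trim(text)
    return (text:gsub("^%s+", ""):gsub("%s+$", ""))
end
```

With an empty options table the first check trims and the second leaves the text untouched, which matches the "Disabled by default" wording in the options summary of the PR description.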

@waruqi merged commit d5e9f0e into dev on Nov 15, 2025
44 checks passed
@waruqi deleted the xml branch on November 15, 2025 16:16