String + character built-ins: Unicode-aware phase 2 #509

@aallan

Summary

v0.0.118 shipped 16 string utility + character classification built-ins (#470 + #471) with deliberately ASCII-only semantics. This issue tracks the phase-2 follow-up that adds Unicode awareness.

Context

Phase 1 operations work at the byte level:

  • Classifiers (is_digit, is_alpha, is_alphanumeric, is_whitespace, is_upper, is_lower) inspect the first byte and use ASCII ranges.
  • Case conversion (char_to_upper, char_to_lower) transforms only the first byte.
  • string_reverse reverses byte order, so the result is not valid UTF-8 when the input contains multi-byte sequences.
  • string_chars returns 1-byte slices, not Unicode codepoints or grapheme clusters.
  • string_words / string_lines and the trim / pad functions use ASCII whitespace (tab, LF, CR, space) only.
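
To make the byte-level behaviour concrete, here is a minimal Python sketch that emulates it on UTF-8 bytes. It is an illustration only, not Vera's inline WAT; the helper name `is_valid_utf8` is hypothetical.

```python
# Emulates the phase-1 byte-level semantics on UTF-8 bytes (illustrative only).
text = "h\u00e9llo"                 # 'é' (U+00E9) is two bytes in UTF-8: 0xC3 0xA9
data = text.encode("utf-8")

# Byte-level "chars": one element per byte, so 'é' is split into two slices.
byte_chars = [data[i:i + 1] for i in range(len(data))]

# Byte-level reverse: the two bytes of 'é' end up in the wrong order.
reversed_bytes = data[::-1]


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(text), len(byte_chars))      # 5 codepoints, 6 byte slices
print(is_valid_utf8(reversed_bytes))   # False
```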

This was the right v0.0.118 scope (inline WAT, zero host imports, bit-identical Python/browser output), but a Unicode-native phase is needed before Vera can claim full i18n support.

Proposed phase-2 operations

Codepoint-level splits (in addition to the existing byte-level string_chars):

  • string_codepoints(@String) -> @Array<String> — one element per Unicode codepoint, so a 4-byte emoji becomes a single element.
  • string_graphemes(@String) -> @Array<String> — one element per user-perceived grapheme cluster (follows UAX #29).
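
The difference between the two splits can be sketched in Python, where iterating a `str` already yields codepoints. The function below reuses the proposed name purely for illustration. Grapheme segmentation is deliberately left out, because the Python standard library does not provide a UAX #29 segmenter.

```python
def string_codepoints(s: str) -> list[str]:
    """Illustrative stand-in: one element per Unicode codepoint."""
    return list(s)


emoji = "\U0001F600"          # one codepoint, four bytes in UTF-8
combined = "e\u0301"          # 'e' + COMBINING ACUTE ACCENT: two codepoints

print(len(emoji.encode("utf-8")))         # 4 bytes
print(len(string_codepoints(emoji)))      # 1 element
print(len(string_codepoints(combined)))   # 2 elements, though a reader sees one character
```

The last case is where a string_graphemes operation would differ: a grapheme-level split would return a single element for `combined`.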

Whole-string case conversion:

  • string_to_upper(@String) -> @String / string_to_lower(@String) -> @String — transform every character (currently char_to_upper/char_to_lower only touch the first byte). Unicode-aware case mapping, restricted to mappings that are locale-independent.
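
As a rough picture of what locale-independent whole-string conversion implies, the sketch below uses Python's built-in `str.upper()` and `str.lower()`, which apply Unicode case mappings without consulting the locale. The function names mirror the proposal; whether Vera would adopt exactly these mappings is an assumption.

```python
def string_to_upper(s: str) -> str:
    """Illustrative stand-in using Python's locale-independent mapping."""
    return s.upper()


def string_to_lower(s: str) -> str:
    """Illustrative stand-in using Python's locale-independent mapping."""
    return s.lower()


word = "stra\u00dfe"                    # "straße"
print(string_to_upper(word))            # STRASSE (the result is one codepoint longer)
print(string_to_lower("\u00c0\u00c9"))  # "àé"
print(string_to_upper("i"))             # I (no Turkish dotted-I special case)
```

One consequence worth noting is that the output length can differ from the input length, so a host-side implementation cannot transform the string in place.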

Unicode-aware classifiers:

Either extend the existing is_digit etc. to accept codepoints beyond the ASCII range (risks behaviour change in existing programs) or add a parallel set with a unicode_ or uc_ prefix.
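
The behaviour-change risk can be seen by putting an ASCII-range check next to a Unicode-aware one. In this sketch `is_digit_ascii` emulates the phase-1 first-byte check, and `uc_is_digit` / `uc_is_alpha` are hypothetical names for the parallel set, built on `unicodedata.category`. Using the general categories `Nd` and `L*` is an assumption about how the classifiers would be defined.

```python
import unicodedata


def is_digit_ascii(s: str) -> bool:
    """Emulates phase 1: inspect the first UTF-8 byte against the ASCII range."""
    first = s.encode("utf-8")[:1]
    return b"0" <= first <= b"9"


def uc_is_digit(s: str) -> bool:
    """Hypothetical Unicode-aware variant: first codepoint is a decimal digit (Nd)."""
    return bool(s) and unicodedata.category(s[0]) == "Nd"


def uc_is_alpha(s: str) -> bool:
    """Hypothetical Unicode-aware variant: first codepoint is a letter (L*)."""
    return bool(s) and unicodedata.category(s[0]).startswith("L")


arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE
print(is_digit_ascii("7"), uc_is_digit("7"))                    # True True
print(is_digit_ascii(arabic_three), uc_is_digit(arabic_three))  # False True
print(uc_is_alpha("\u00e9"))                                    # True
```

The second line is the case that would change behaviour if the existing is_digit were extended in place.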

Unicode-safe reverse:

Either add string_reverse_codepoints or add a flag to string_reverse so that it operates at codepoint rather than byte level. The current byte reverse is correct for ASCII and knowingly produces invalid UTF-8 for strings that contain multi-byte sequences.
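
A codepoint-level reverse is easy to sketch in Python, since slicing a `str` works on codepoints. The function reuses the proposed name for illustration only. The second example shows a limit of the codepoint approach: a combining mark moves to a different base character, which only a grapheme-level reverse would avoid.

```python
def string_reverse_codepoints(s: str) -> str:
    """Illustrative stand-in: reverse by codepoint, not by byte."""
    return s[::-1]


precomposed = "h\u00e9llo"     # 'é' as the single codepoint U+00E9
print(string_reverse_codepoints(precomposed))   # "olléh", still valid UTF-8

decomposed = "e\u0301x"        # 'e' + combining acute, then 'x'
flipped = string_reverse_codepoints(decomposed)
print([hex(ord(c)) for c in flipped])           # ['0x78', '0x301', '0x65']
```

In the decomposed case the accent now follows 'x' rather than 'e', so the reversed string renders differently from a per-character mirror of the input.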

Implementation approach

Unlike phase 1, these operations cannot be implemented as inline WAT — Unicode tables are large enough that embedding them in every compiled module is impractical. The implementation would need to thread a host import for each operation and keep the Python (wasmtime) and browser (Node.js / JavaScript) runtimes bit-identical on Unicode output.

Python runtime: use the stdlib unicodedata module.
Browser runtime: use native String.prototype.toLowerCase()/toUpperCase() and the Intl.Segmenter API for graphemes.

Both runtimes would follow the host-import pattern already established by the log/trig math ops in v0.0.116.
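
A host-side operation of this kind might look roughly like the sketch below. Everything about the calling convention is an assumption made for illustration: the function name, the `(ptr, length, out_ptr)` signature, and the use of a plain `bytearray` as a stand-in for WebAssembly linear memory. The actual v0.0.116 import pattern is not shown in this issue.

```python
import unicodedata


def host_string_to_upper(memory: bytearray, ptr: int, length: int, out_ptr: int) -> int:
    """Hypothetical host import.

    Reads UTF-8 from memory[ptr:ptr+length], writes the upper-cased UTF-8
    at out_ptr, and returns the number of bytes written. No bounds checks.
    """
    text = bytes(memory[ptr:ptr + length]).decode("utf-8")
    result = text.upper().encode("utf-8")
    memory[out_ptr:out_ptr + len(result)] = result
    return len(result)


# A bytearray stands in for the module's linear memory (assumption).
memory = bytearray(64)
src = "stra\u00dfe".encode("utf-8")
memory[0:len(src)] = src

written = host_string_to_upper(memory, 0, len(src), 32)
print(bytes(memory[32:32 + written]).decode("utf-8"))   # STRASSE

# The Unicode data version differs between interpreter releases, which matters
# when comparing output against a JavaScript engine's own tables.
print(unicodedata.unidata_version)
```

Checking the Unicode data version on both sides is one way to reason about the bit-identical requirement, since the two runtimes ship their own copies of the character tables.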

Deferred — not blocking any current program

No current example or conformance program is blocked by the ASCII-only behaviour; phase 1 shipped deliberately scoped. Raising this separately so it's tracked but not confused with the already-shipped work.

Labels: enhancement
