Summary
v0.0.118 shipped 16 string utility + character classification built-ins (#470 + #471) with deliberately ASCII-only semantics. This issue tracks the phase-2 follow-up that adds Unicode awareness.
Context
Phase 1 operations work at the byte level:
Classifiers (is_digit, is_alpha, is_alphanumeric, is_whitespace, is_upper, is_lower) inspect the first byte and use ASCII ranges.
Case conversion (char_to_upper, char_to_lower) transforms only the first byte.
string_reverse reverses byte order, so the result is not valid UTF-8 if the input contains multi-byte sequences.
string_chars returns 1-byte slices, not Unicode codepoints or grapheme clusters.
string_words / string_lines and the trim / pad functions treat only ASCII whitespace (tab, LF, CR, space) as whitespace.
This was the right scope for v0.0.118 (inline WAT, zero host imports, bit-identical Python/browser output), but a Unicode-native phase is needed before Vera can claim full i18n support.
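To make the byte-level behaviour above concrete, here is a minimal Python model of it. The helper names (byte_reverse, byte_chars, is_valid_utf8) are hypothetical and exist only for this illustration; the sketch imitates the semantics described in this issue and is not Vera's actual inline-WAT implementation.

```python
# Hypothetical Python model of the phase-1 byte-level semantics.
# Illustration only; not Vera's actual WAT code.

def byte_reverse(s: str) -> bytes:
    """Reverse the UTF-8 bytes of s (models the described string_reverse)."""
    return s.encode("utf-8")[::-1]

def byte_chars(s: str) -> list[bytes]:
    """Split s into 1-byte slices (models the described string_chars)."""
    data = s.encode("utf-8")
    return [data[i:i + 1] for i in range(len(data))]

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

accented = "n\u00e9"  # U+00E9 is encoded as the two bytes C3 A9

print(byte_reverse("abc"))                    # b'cba'
print(is_valid_utf8(byte_reverse("abc")))     # True
print(is_valid_utf8(byte_reverse(accented)))  # False
print(len(byte_chars(accented)))              # 3
```

For pure ASCII input the byte reverse is also a correct character reverse. For the accented input the reversed bytes start with a continuation byte, so they no longer decode as UTF-8, and the 2-codepoint string yields three 1-byte slices.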
Proposed phase-2 operations
Codepoint-level splits (in addition to the existing byte-level string_chars):
string_codepoints(@String) -> @Array<String>: one element per Unicode codepoint, so a 4-byte emoji becomes a single element.
string_graphemes(@String) -> @Array<String>: one element per user-perceived grapheme cluster (follows UAX #29).
Whole-string case conversion:
string_to_upper(@String) -> @String / string_to_lower(@String) -> @String: transform every character (currently char_to_upper / char_to_lower only touch the first byte). Unicode-aware case conversion where the mapping is locale-independent.
Unicode-aware classifiers:
Either extend the existing is_digit etc. to accept codepoints beyond the ASCII range (risks behaviour change in existing programs) or add a parallel set with a unicode_ or uc_ prefix.
Unicode-safe reverse:
string_reverse_codepoints, or a flag on string_reverse, that operates at codepoint rather than byte level. The current byte reverse is fine for ASCII and knowingly produces invalid UTF-8 for strings that contain multi-byte sequences.
Implementation approach
Unlike phase 1, these operations cannot be implemented as inline WAT — Unicode tables are large enough that embedding them in every compiled module is impractical. The implementation would need to thread a host import for each operation and keep the Python (wasmtime) and browser (Node.js / JavaScript) runtimes bit-identical on Unicode output.
Python runtime: use the stdlib unicodedata module.
Browser runtime: use native String.prototype.toLowerCase() / toUpperCase() and the Intl.Segmenter API for graphemes.
Both runtimes already follow the pattern established by the log/trig math ops in v0.0.116.
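As a rough sketch of what the Python-side host functions could look like, the following uses only str methods and unicodedata. The function names mirror the proposals in this issue, but the bodies, the unicode_ prefix choice, and the decision to classify by general category are assumptions made for illustration. The sketch operates on Python str values and leaves out how strings would be marshalled across the WASM boundary. string_graphemes is omitted because, as far as I know, unicodedata does not provide UAX #29 segmentation.

```python
import unicodedata

# Hypothetical host-side helpers for the Python runtime.
# Names follow the issue's proposals; bodies are assumptions.

def string_codepoints(s: str) -> list[str]:
    """One element per Unicode codepoint (Python str iterates by codepoint)."""
    return list(s)

def string_to_upper(s: str) -> str:
    """Locale-independent uppercase mapping of every character."""
    return s.upper()

def string_to_lower(s: str) -> str:
    """Locale-independent lowercase mapping of every character."""
    return s.lower()

def string_reverse_codepoints(s: str) -> str:
    """Reverse by codepoint, so the result is always a valid string."""
    return s[::-1]

def unicode_is_digit(s: str) -> bool:
    """True if the first codepoint has general category Nd (assumed rule)."""
    return bool(s) and unicodedata.category(s[0]) == "Nd"

def unicode_is_alpha(s: str) -> bool:
    """True if the first codepoint is in a letter category L* (assumed rule)."""
    return bool(s) and unicodedata.category(s[0]).startswith("L")

print(len(string_codepoints("a\U0001F600")))                # 2
print(string_to_upper("stra\u00dfe"))                       # STRASSE
print(string_reverse_codepoints("n\u00e9") == "\u00e9n")    # True
print(unicode_is_digit("\u0663"))                           # True
```

Codepoint-level reverse still reorders combining marks relative to their base characters, which is one reason a grapheme-level split is listed separately. The browser runtime would need to produce the same results for every input, so the classification rule chosen here would have to be matched exactly on the JavaScript side.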
Deferred — not blocking any current program
No current example or conformance program is blocked by the ASCII-only behaviour; phase 1 shipped deliberately scoped. Raising this separately so it's tracked but not confused with the already-shipped work.