String + character built-ins: Unicode-aware phase 2 #509

## Summary

v0.0.118 shipped 16 string utility + character classification built-ins (#470 + #471) with deliberately ASCII-only semantics. This issue tracks the phase-2 follow-up that adds Unicode awareness.

## Context

Phase 1 operations work at the byte level:

- Classifiers (`is_digit`, `is_alpha`, `is_alphanumeric`, `is_whitespace`, `is_upper`, `is_lower`) inspect the **first byte** and use ASCII ranges.
- Case conversion (`char_to_upper`, `char_to_lower`) transforms only the **first byte**.
- `string_reverse` reverses **byte order**, so the result is not valid UTF-8 if the input contains multi-byte sequences.
- `string_chars` returns **1-byte slices** — not Unicode codepoints or grapheme clusters.
- `string_words` / `string_lines` and the trim / pad functions use **ASCII whitespace** (tab, LF, CR, space) only.
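
To make the byte-level behaviour concrete, here is a rough Python emulation of two of the phase-1 semantics described above. The function names `string_chars_bytes` and `is_alpha_first_byte` are hypothetical stand-ins, not the shipped built-ins, and the empty-string result is an assumption.

```python
def string_chars_bytes(s: str) -> list[bytes]:
    """Emulate 1-byte slicing: one element per UTF-8 byte."""
    data = s.encode("utf-8")
    return [data[i:i + 1] for i in range(len(data))]


def is_alpha_first_byte(s: str) -> bool:
    """Emulate an ASCII-range letter check on the first byte only."""
    data = s.encode("utf-8")
    if not data:
        return False  # assumption: empty input classifies as False
    b = data[0]
    return (0x41 <= b <= 0x5A) or (0x61 <= b <= 0x7A)


e_acute = "\u00e9"  # precomposed e with acute accent, 2 bytes in UTF-8

print(string_chars_bytes(e_acute))   # two 1-byte slices, neither valid UTF-8 alone
print(is_alpha_first_byte(e_acute))  # False: first byte 0xC3 is outside the ASCII ranges
print(is_alpha_first_byte("e"))      # True
```

A single accented letter therefore splits into two elements and fails the letter check, which is the gap phase 2 is meant to close.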

This was the right v0.0.118 scope (inline WAT, zero host imports, bit-identical Python/browser output), but a Unicode-native phase is needed before Vera can claim full i18n support.

## Proposed phase-2 operations

**Codepoint-level splits** (in addition to the existing byte-level `string_chars`):

- `string_codepoints(@String) -> @Array<String>` — one element per Unicode codepoint, so a 4-byte emoji becomes a single element.
- `string_graphemes(@String) -> @Array<String>` — one element per user-perceived grapheme cluster (follows UAX #29).
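
As a sketch of the codepoint-level split, the Python below models `string_codepoints` using the fact that a Python `str` is already a sequence of codepoints. It only illustrates the intended semantics and is not the proposed implementation. Grapheme splitting is left out because the Python stdlib has no UAX #29 segmenter.

```python
def string_codepoints(s: str) -> list[str]:
    """One element per Unicode codepoint."""
    return list(s)


emoji = "\U0001F600"      # 4 bytes in UTF-8, 1 codepoint
decomposed = "e\u0301"    # 'e' + combining acute: 2 codepoints, 1 grapheme

print(len(emoji.encode("utf-8")))           # 4
print(len(string_codepoints(emoji)))        # 1
print(len(string_codepoints(decomposed)))   # 2
```

The last line shows why `string_graphemes` is listed separately: a codepoint split still breaks a base letter away from its combining mark.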

**Whole-string case conversion**:

- `string_to_upper(@String) -> @String` / `string_to_lower(@String) -> @String`: transform **every** character (currently `char_to_upper`/`char_to_lower` only touch the first byte), using Unicode case mapping wherever the mapping is locale-independent. Full case mapping can change the length of the string.
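
A minimal sketch of the whole-string semantics, assuming the Python runtime would delegate to the built-in `str.upper()` / `str.lower()`, which apply locale-independent Unicode mappings. The wrapper functions are illustrative only.

```python
def string_to_upper(s: str) -> str:
    """Upper-case every character using Python's locale-independent mapping."""
    return s.upper()


def string_to_lower(s: str) -> str:
    """Lower-case every character using Python's locale-independent mapping."""
    return s.lower()


print(string_to_upper("stra\u00dfe"))   # STRASSE: sharp s expands to "SS"
print(string_to_lower("\u00c9COLE"))    # lower-cases the accented capital too
```

Because one codepoint can map to two, the host side cannot assume the output fits in the input's buffer.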

**Unicode-aware classifiers**:

Either extend the existing `is_digit` etc. to accept codepoints beyond the ASCII range (risks behaviour change in existing programs) or add a parallel set with a `unicode_` or `uc_` prefix.
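
If the parallel-set option is chosen, the classifiers might look something like the sketch below. The `uc_` names follow the prefix floated above but are hypothetical, and mapping each classifier onto a Unicode General_Category value is an assumption; derived properties such as Alphabetic or Uppercase would give slightly different answers.

```python
import unicodedata


def _first_category(s: str) -> str:
    """General_Category of the first codepoint, or '' for an empty string."""
    return unicodedata.category(s[0]) if s else ""


def uc_is_digit(s: str) -> bool:
    return _first_category(s) == "Nd"          # decimal digits in any script


def uc_is_alpha(s: str) -> bool:
    return _first_category(s).startswith("L")  # any letter category


def uc_is_upper(s: str) -> bool:
    return _first_category(s) == "Lu"


def uc_is_lower(s: str) -> bool:
    return _first_category(s) == "Ll"


print(uc_is_digit("\u0663"))  # True: ARABIC-INDIC DIGIT THREE
print(uc_is_digit("\u00b2"))  # False: SUPERSCRIPT TWO is category No
print(uc_is_upper("\u00c9"))  # True
```

Keeping these separate from `is_digit` and friends leaves the ASCII results of existing programs unchanged.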

**Unicode-safe reverse**:

`string_reverse_codepoints`, or a flag on `string_reverse`, that operates at codepoint rather than byte level. The current byte reverse is correct for ASCII and knowingly produces invalid UTF-8 for strings that contain multi-byte sequences.
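
The difference between the two reversal levels can be sketched in Python as follows. `string_reverse_bytes` is a hypothetical model of the current behaviour, and `string_reverse_codepoints` models the proposed one.

```python
def string_reverse_bytes(s: str) -> bytes:
    """Model of the current behaviour: reverse the UTF-8 bytes."""
    return s.encode("utf-8")[::-1]


def string_reverse_codepoints(s: str) -> str:
    """Model of the proposed behaviour: reverse codepoint by codepoint."""
    return s[::-1]


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "a\U0001F600b"

print(is_valid_utf8(string_reverse_bytes(text)))                 # False
print(string_reverse_codepoints(text) == "b\U0001F600a")         # True
print(string_reverse_codepoints("e\u0301x") == "x\u0301e")       # True
```

The last case is a caveat: a codepoint reverse always yields valid UTF-8, but it moves a combining mark onto a different base letter, so only a grapheme-level reverse preserves what the user sees.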

## Implementation approach

Unlike phase 1, these operations cannot be implemented as inline WAT — Unicode tables are large enough that embedding them in every compiled module is impractical. The implementation would need to thread a host import for each operation and keep the Python (wasmtime) and browser (Node.js / JavaScript) runtimes bit-identical on Unicode output.

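- Python runtime: use the stdlib `unicodedata` module, plus the built-in `str.upper()` / `str.lower()` for case conversion. The stdlib has no UAX #29 grapheme segmentation, so `string_graphemes` would need a third-party package or a hand-written segmenter.
- Browser runtime: use native `String.prototype.toUpperCase()` / `toLowerCase()` and the `Intl.Segmenter` API for graphemes.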

Both runtimes already have a precedent for this host-import pattern in the log/trig math ops added in v0.0.116.
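
For illustration only, the Python side of one such host import could have roughly the shape below. Everything about the calling convention is an assumption: the issue does not specify how strings cross the boundary, so the `(ptr, length, out_ptr)` signature and the `bytearray` standing in for WebAssembly linear memory are hypothetical.

```python
def host_string_to_upper(memory: bytearray, ptr: int, length: int, out_ptr: int) -> int:
    """Read UTF-8 at (ptr, length), write the upper-cased UTF-8 at out_ptr.

    Returns the byte length of the result. Assumes the caller reserved
    enough room at out_ptr; a real implementation would need an allocator.
    """
    text = bytes(memory[ptr:ptr + length]).decode("utf-8")
    result = text.upper().encode("utf-8")
    memory[out_ptr:out_ptr + len(result)] = result
    return len(result)


# Stand-in for linear memory, with the input string placed at offset 0.
memory = bytearray(64)
src = "stra\u00dfe".encode("utf-8")
memory[0:len(src)] = src

n = host_string_to_upper(memory, 0, len(src), 32)
print(bytes(memory[32:32 + n]).decode("utf-8"))  # STRASSE
```

The JavaScript runtime would need a function with the same contract, and the two would have to be checked against each other, since engines can ship different Unicode data versions.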

## Deferred — not blocking any current program

No current example or conformance program is blocked by the ASCII-only behaviour; phase 1 shipped deliberately scoped. Raising this separately so it's tracked but not confused with the already-shipped work.
