Default cp1252 encoding causes Windows test failures — audit file I/O for explicit UTF-8 #641

@aallan

Origin: Surfaced when the Windows CI matrix entries (added in PR #639 closing #637) ran for the first time. ~9 tests fail across tests/test_codegen.py, tests/test_codegen_monomorphize.py, tests/test_codegen_closures.py, and tests/test_html.py with two flavours of the same root cause:

  • UnicodeEncodeError: 'charmap' codec can't encode character '→' in position N: character maps to <undefined>. Python on Windows tries to write the right arrow to a stream whose default encoding is cp1252 (Windows' legacy code page).
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position N: invalid start byte. A file written with cp1252 (where 0x97 is the em-dash, U+2014) is then read back assuming UTF-8.
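
Both flavours can be reproduced on any platform by naming the codecs explicitly instead of relying on the locale default. The following is a minimal sketch, not code from the Vera repository; the helper names are hypothetical.

```python
ARROW = "\u2192"    # RIGHT ARROW, not representable in cp1252
EM_DASH = "\u2014"  # EM DASH, encoded as the single byte 0x97 in cp1252


def try_encode_arrow_as_cp1252():
    """Flavour 1: return the UnicodeEncodeError raised when encoding the arrow."""
    try:
        ARROW.encode("cp1252")
    except UnicodeEncodeError as exc:
        return exc
    return None


def try_decode_cp1252_bytes_as_utf8():
    """Flavour 2: write an em-dash as cp1252, then read the bytes back as UTF-8."""
    raw = EM_DASH.encode("cp1252")  # b"\x97"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return raw, exc
    return raw, None


if __name__ == "__main__":
    print(try_encode_arrow_as_cp1252())
    print(try_decode_cp1252_bytes_as_utf8())
```

The same failures occur implicitly on Windows when open() falls back to the locale encoding, which is why they only show up in the Windows CI entries.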

Failing tests (sample)

  • test_array_flatten_with_empty_inners — encode
  • test_array_sort_by_options — encode
  • test_array_sort_by_stability — decode 0x97 (em-dash)
  • test_modulecall_provenance_emits_guard_and_traps — encode
  • test_rhs_only_provenance_emits_guard_and_traps — encode
  • test_array_map_type_change_int_to_bool — encode
  • test_non_contiguous_int_capture_tail_shape — decode 0x97
  • test_html.py::test_all_vera_blocks_check (4 sub-blocks) — likely same root cause
  • test_html.py::test_all_vera_blocks_verify (4 sub-blocks) — likely same root cause

The pattern

Python on Windows uses locale.getpreferredencoding() for open() calls without an explicit encoding= argument. In en-US Windows that's cp1252; in other locales it's whatever the system default is. cp1252 can't represent most of Unicode; UTF-8 can.

Vera's source files contain Unicode characters in:

  • Test fixtures with → (right arrow) in error messages, contract violations, and trap diagnostics
  • HTML examples in docs/index.html
  • Regenerated llms.txt / SKILL.md outputs that contain em-dashes from CHANGELOG/spec prose

Any text-mode open() call that touches these files without encoding='utf-8' can fail on Windows, where the locale encoding is usually a legacy code page. The CI matrix before PR #639 never included Windows, so the bugs lay dormant.

Recommended fix

Two complementary approaches; the project should pick which one to land first:

Option A — explicit encoding='utf-8' everywhere

Audit the codebase for open(...) calls without an explicit encoding kwarg. Add encoding='utf-8' to every text-mode open. This is the universal Python pattern and the recommended fix per PEP 597.

# Bad on Windows:
with open(path) as f:
    return f.read()

# Good everywhere:
with open(path, encoding='utf-8') as f:
    return f.read()

A grep gives the audit surface (rough estimate: ~50-100 sites across vera/, scripts/, tests/).
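
As an alternative to a plain grep, the audit could be sketched with the standard-library ast module, which skips binary-mode opens and ignores matches inside strings or comments. This is a rough sketch under stated assumptions: the function names are hypothetical, and it only recognises direct calls to the builtin open().

```python
import ast
from pathlib import Path


def _is_binary_mode(call):
    """Return True if the call passes a literal mode string containing 'b'."""
    mode = call.args[1] if len(call.args) >= 2 else None
    for kw in call.keywords:
        if kw.arg == "mode":
            mode = kw.value
    return (
        isinstance(mode, ast.Constant)
        and isinstance(mode.value, str)
        and "b" in mode.value
    )


def find_open_without_encoding(source, filename="<string>"):
    """Return sorted line numbers of text-mode open() calls lacking an encoding."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        if not (isinstance(node.func, ast.Name) and node.func.id == "open"):
            continue
        if any(kw.arg == "encoding" for kw in node.keywords):
            continue
        if len(node.args) >= 4:  # encoding is the fourth positional parameter
            continue
        if _is_binary_mode(node):
            continue
        hits.append(node.lineno)
    return sorted(hits)


def audit_tree(root):
    """Map each .py file under root to its offending line numbers."""
    report = {}
    for path in Path(root).rglob("*.py"):
        lines = find_open_without_encoding(
            path.read_text(encoding="utf-8"), filename=str(path)
        )
        if lines:
            report[str(path)] = lines
    return report
```

The sketch does not catch Path.read_text() or Path.write_text() calls, which also fall back to the locale encoding when no encoding is given, so those would need a separate pass. On Python 3.10 and later, running the test suite with -X warn_default_encoding (PEP 597's opt-in EncodingWarning) is another way to surface the same sites at runtime.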

Option B — set PYTHONUTF8=1 in CI

Add PYTHONUTF8: 1 to the test job's environment, which forces Python to default to UTF-8 regardless of locale (PEP 540). Covers all open() calls without code changes.

- name: Run tests
  env:
    PYTHONUTF8: 1
  run: pytest -v -n auto
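
To confirm that the environment variable actually reaches the interpreter, a small probe along these lines could be run locally or in CI. It is a sketch only; probe_child is a hypothetical helper that starts a child Python with PYTHONUTF8 set and reports what the child sees.

```python
import os
import subprocess
import sys

# Code run in the child: print the UTF-8 mode flag and the default text encoding.
CHILD = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False).lower())"
)


def probe_child(pythonutf8):
    """Run a child interpreter with PYTHONUTF8 set; return (utf8_mode, encoding)."""
    env = dict(os.environ)
    env["PYTHONUTF8"] = pythonutf8
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    flag, encoding = result.stdout.split()
    return int(flag), encoding


if __name__ == "__main__":
    print("PYTHONUTF8=1:", probe_child("1"))
    print("PYTHONUTF8=0:", probe_child("0"))
```

With PYTHONUTF8=1 the child reports UTF-8 mode enabled and utf-8 as the default encoding; with PYTHONUTF8=0 the encoding depends on the machine's locale, which is the situation end users on Windows are in under Option B alone.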

Trade-offs:

  • A is more robust. Code that explicitly says encoding='utf-8' works regardless of environment. Users running vera on Windows without PYTHONUTF8=1 still hit the bug under Option B.
  • B is faster to land (one CI config change vs. ~50 code changes). Good as a stopgap; Option A is the proper fix.
  • Recommended: Option B to unblock #637 (Add windows-latest to CI test matrix), Option A as the durable fix. File a separate sub-issue for the audit if the project agrees.

Out of scope

  • Migrating Vera's emitted .wasm binaries (already always UTF-8 — WASM standard requires it).
  • Browser-runtime encoding (already always UTF-8 — browser standard).
  • Source files in tests/conformance/*.vera — these are read by the parser which already handles UTF-8 explicitly via lark.

Acceptance criteria

  • All ~9 failing tests pass on windows-latest, 3.{11,12,13}.
  • Either Option A is fully applied (preferred) or Option B is applied + Option A queued as follow-up.
  • A regression sentinel: a test that explicitly opens a file containing → and an em-dash, without PYTHONUTF8=1 set, and expects the right behaviour (catches future regressions of Option A's coverage).
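
The regression sentinel could look roughly like the sketch below. The helpers write_text_utf8 and read_text_utf8 are hypothetical stand-ins for whichever Vera I/O functions Option A ends up touching; the test is only meaningful as a sentinel when it runs on a Windows runner without PYTHONUTF8=1.

```python
import tempfile
from pathlib import Path

# Right arrow (U+2192) and em-dash (U+2014), the two characters from the failures.
SENTINEL_TEXT = "a \u2192 b \u2014 c\n"


def write_text_utf8(path, text):
    """Write text with an explicit encoding; newline='\\n' avoids CRLF translation."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


def read_text_utf8(path):
    """Read text with an explicit encoding, independent of the locale default."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def test_unicode_roundtrip_is_locale_independent():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sentinel.txt"
        write_text_utf8(path, SENTINEL_TEXT)
        # The bytes on disk must be UTF-8, whatever the locale encoding is.
        assert path.read_bytes() == SENTINEL_TEXT.encode("utf-8")
        assert read_text_utf8(path) == SENTINEL_TEXT
```

If either helper lost its encoding argument, the write would raise UnicodeEncodeError on a cp1252 machine, so the test fails loudly instead of silently producing a mis-encoded file.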
