Audit text-mode open()/read_text()/write_text() for explicit encoding='utf-8' (#641 follow-up) #645

@aallan

Origin: PR for #641 (cp1252-on-Windows fix) chose Option B (set PYTHONUTF8=1 in CI) as the primary fix because Option A (audit every open() / read_text() / write_text() call site for explicit encoding='utf-8') was too large for the same PR's scope. This issue tracks Option A as the durable follow-up.

The pattern

Without an explicit encoding= kwarg, text-mode open() / read_text() / write_text() calls use the locale encoding reported by locale.getpreferredencoding(). On Windows that is a legacy code page: cp1252 on en-US systems, and whatever the system default is in other locales. cp1252 can't represent most of Unicode; UTF-8 can represent all of it.
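A minimal sketch of the two failure modes, using only the standard library. The function names are made up for illustration; the cp1252 behaviour is simulated with explicit codec calls so the snippet behaves the same on any platform.

```python
ARROW = "\u2192"  # right arrow, which has no cp1252 representation


def cp1252_can_encode(text: str) -> bool:
    """Return True if every character in `text` exists in cp1252."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


def read_utf8_bytes_as_cp1252(text: str) -> str:
    """Simulate reading a UTF-8 file under a cp1252 default encoding.

    Note: some UTF-8 byte values are undefined in cp1252, so for other
    inputs this decode can raise UnicodeDecodeError instead of
    returning mojibake.
    """
    return text.encode("utf-8").decode("cp1252")


# Writing: the arrow cannot be encoded at all under cp1252.
arrow_encodable = cp1252_can_encode(ARROW)

# Reading: the arrow's UTF-8 bytes decode to three unrelated characters.
garbled = read_utf8_bytes_as_cp1252(ARROW)
```

Writes tend to fail loudly with UnicodeEncodeError, while reads often succeed silently with garbled text, which is why the read sites are easy to miss.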

Vera's source / docs / fixtures contain Unicode characters in:

  • Test fixtures with U+2192 (right arrow) and U+2014 (em-dash) characters in error messages, contract violations, trap diagnostics
  • HTML examples in docs/index.html
  • Regenerated llms.txt / SKILL.md outputs that contain prose from CHANGELOG / spec
  • Any .vera file that contains non-ASCII (string literals, comments)

PR for #641 covered CI via PYTHONUTF8=1 and added explicit encoding='utf-8' to the load-bearing vera/parser.py grammar load. Per-call-site coverage for the rest of the codebase is the durable fix: Windows users running locally without PYTHONUTF8=1 still hit the bug on individual files.
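To tell whether a given local environment is exposed, something like the following can report what an encoding-less text-mode call would use. This is a diagnostic sketch only; the function name is hypothetical and nothing like it exists in the repository as far as this issue states.

```python
import locale
import sys


def default_text_encoding_report() -> dict:
    """Describe the encoding an encoding-less text-mode call would use."""
    return {
        # True when PYTHONUTF8=1 or -X utf8 is in effect.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # The locale encoding used when encoding= is omitted.
        "preferred_encoding": locale.getpreferredencoding(False),
        "platform": sys.platform,
    }


report = default_text_encoding_report()
print(report)
```

On a Windows machine without PYTHONUTF8=1, utf8_mode would be expected to be False and preferred_encoding a legacy code page such as cp1252.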

Audit surface (rough, from the PR-for-#641 grep)

~30+ sites across vera/, scripts/, tests/:

vera/parser.py — already done (PR for #641)
vera/browser/emit.py — html_path.write_text(...)
scripts/check_conformance.py — MANIFEST_PATH.read_text()
scripts/check_version_sync.py — pyproject.read_text(), init.read_text(), index_html.read_text(), readme.read_text(), lock.read_text()
scripts/check_doc_counts.py — manifest.json, .pre-commit-config.yaml, ci.yml, TESTING.md, CONTRIBUTING.md, CLAUDE.md, README.md, SKILL.md, AGENTS.md, FAQ.md, docs/index.html, ROADMAP.md
scripts/check_limitations_sync.py — KNOWN_ISSUES.md, vera/README.md
tests/test_codegen_monomorphize.py — path.read_text() (×2)
tests/test_codegen_closures.py — path.read_text() (×2)
tests/test_tester_coverage.py — p.write_text(source)
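Each site listed above needs the same one-argument change. A sketch of the before/after shape, using a temporary file and made-up fixture text rather than any real path from the repository:

```python
import tempfile
from pathlib import Path

# Hypothetical fixture text containing a right arrow (U+2192).
SAMPLE = "expected Int \u2192 Bool"


def write_and_read(directory: Path) -> str:
    path = directory / "fixture.txt"

    # Before: path.write_text(SAMPLE)   (encoding depends on the locale)
    path.write_text(SAMPLE, encoding="utf-8")

    # Before: path.read_text()
    via_pathlib = path.read_text(encoding="utf-8")

    # The open() builtin takes the same keyword.
    with open(path, encoding="utf-8") as handle:
        via_open = handle.read()

    assert via_pathlib == via_open
    return via_pathlib


with tempfile.TemporaryDirectory() as tmp:
    result = write_and_read(Path(tmp))
```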

Recommended approach

A single audit-and-replace pass, plus a small ruff or grep-based check to enforce the convention going forward:

  1. Mechanical: grep + add encoding='utf-8' to every text-mode site.
  2. Add a pre-commit check (or a CI step) that fails on any text-mode open(...), read_text(), or write_text() call without an explicit encoding, for example a small scripts/check_explicit_encoding.py that greps the codebase.
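One possible shape for scripts/check_explicit_encoding.py is sketched below. It is an assumption, not the agreed implementation: it walks the AST instead of grepping, which avoids false positives from comments and strings, and the helper names are invented here. It only recognises encoding passed as a keyword, and it does not look at Path.open().

```python
from __future__ import annotations

import ast
from pathlib import Path

TEXT_METHODS = {"read_text", "write_text"}


def _is_binary_open(call: ast.Call) -> bool:
    """True if open() is given a literal mode string containing 'b'."""
    mode = call.args[1] if len(call.args) >= 2 else None
    for kw in call.keywords:
        if kw.arg == "mode":
            mode = kw.value
    return (
        isinstance(mode, ast.Constant)
        and isinstance(mode.value, str)
        and "b" in mode.value
    )


def find_missing_encoding(source: str) -> list[int]:
    """Return line numbers of text-mode calls that lack encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id == "open":
            if _is_binary_open(node):
                continue  # binary mode is out of scope
        elif isinstance(func, ast.Attribute) and func.attr in TEXT_METHODS:
            pass
        else:
            continue
        if not any(kw.arg == "encoding" for kw in node.keywords):
            hits.append(node.lineno)
    return sorted(hits)


def main(paths: list[str]) -> int:
    """Print offending sites; return 1 if any were found, else 0."""
    failures = 0
    for name in paths:
        path = Path(name)
        for lineno in find_missing_encoding(path.read_text(encoding="utf-8")):
            print(f"{path}:{lineno}: text-mode call without explicit encoding=")
            failures += 1
    return 1 if failures else 0
```

A real script would end with sys.exit(main(sys.argv[1:])) so that pre-commit or CI sees the non-zero exit code.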

Out of scope

  • Binary-mode opens ('rb', 'wb') — encoding is irrelevant.
  • urlopen / subprocess / network — no text encoding.
  • Third-party calls (e.g. lark.Lark(...)) that take an already-decoded string — encoding handled at the read site.
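The binary-mode exclusion can be checked directly: Python rejects an encoding argument on a binary open, so there is nothing to add at those sites. A small sketch with a hypothetical helper name and a temporary file:

```python
import tempfile
from pathlib import Path


def binary_open_rejects_encoding(path: Path) -> bool:
    """Return True if open() refuses encoding= in binary mode."""
    try:
        with open(path, "rb", encoding="utf-8"):
            pass
    except ValueError:
        # Binary mode does not accept an encoding argument.
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "blob.bin"
    sample.write_bytes(b"\xe2\x86\x92")  # raw UTF-8 bytes of a right arrow
    rejected = binary_open_rejects_encoding(sample)
    raw = sample.read_bytes()
```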

Acceptance criteria

  • Every text-mode open() / read_text() / write_text() in vera/, scripts/, tests/ has explicit encoding='utf-8'.
  • A pre-commit check enforces the convention.
  • The PYTHONUTF8: 1 line in .github/workflows/ci.yml's test job env can be removed (verified by a clean CI run without it).
