Origin: PR for #641 (cp1252-on-Windows fix) chose Option B (set PYTHONUTF8=1 in CI) as the primary fix because Option A (audit every open() / read_text() / write_text() call site for explicit encoding='utf-8') was too large for the same PR's scope. This issue tracks Option A as the durable follow-up.
The pattern
Python on Windows uses locale.getpreferredencoding() for text-mode open() / read_text() / write_text() calls without an explicit encoding= kwarg. On en-US Windows that's cp1252; in other locales it's whatever the system default is. cp1252 can't represent most of Unicode; UTF-8 can represent all of it.
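The mismatch can be reproduced on any platform by calling the codecs directly, so no Windows machine is needed. The sketch below is illustrative only: the sample string is made up, and it exercises Python's built-in cp1252 and utf-8 codecs rather than any Vera code.

```python
# Illustrative sample containing a right arrow (U+2192), which cp1252 lacks.
sample = "expected Int \u2192 got String"

# UTF-8 can encode any Unicode string.
utf8_bytes = sample.encode("utf-8")

# Encoding the same text as cp1252 (what a bare write_text() would attempt
# on en-US Windows) fails outright.
try:
    sample.encode("cp1252")
    cp1252_write_ok = True
except UnicodeEncodeError:
    cp1252_write_ok = False

# Decoding UTF-8 bytes as cp1252 does not fail for this sample; it silently
# produces mojibake instead, which is harder to notice.
misread = utf8_bytes.decode("cp1252")

print(cp1252_write_ok)                       # False
print(misread == sample)                     # False
print(utf8_bytes.decode("utf-8") == sample)  # True
```

Note the asymmetry: writes tend to fail loudly, while reads can succeed with corrupted text.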
Vera's source / docs / fixtures contain Unicode characters in:
- Test fixtures with → (right arrow) and — (em-dash) in error messages, contract violations, trap diagnostics
- HTML examples in docs/index.html
- Regenerated llms.txt / SKILL.md outputs that contain prose from CHANGELOG / spec
- Any .vera file that contains non-ASCII (string literals, comments)
PR for #641 covered CI via PYTHONUTF8=1 and added explicit encoding='utf-8' to the load-bearing vera/parser.py grammar load. Per-call-site coverage for the rest of the codebase is the durable fix: Windows users running locally without PYTHONUTF8=1 still hit the bug on individual files.
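A minimal sketch of what the per-call-site change looks like. The temporary file and its contents are hypothetical, and the explicit cp1252 read stands in for the en-US Windows default so the example behaves the same on every host.

```python
import tempfile
from pathlib import Path

# Hypothetical diagnostic text with an em-dash (U+2014) and a right arrow (U+2192).
text = "contract violated \u2014 expected x \u2192 y\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixture.txt"  # hypothetical fixture file

    # After the audit: both sides name the encoding, so the result does not
    # depend on the platform default.
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

    # Before the audit: a bare path.read_text() on en-US Windows behaves like
    # this explicit cp1252 read and returns mojibake.
    legacy_read = path.read_text(encoding="cp1252")

    # The same keyword applies to the built-in open().
    with open(path, encoding="utf-8") as fh:
        via_open = fh.read()

print(round_trip == text)   # True
print(via_open == text)     # True
print(legacy_read == text)  # False
```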
Audit surface (rough, from the PR-for-#641 grep)
~30+ sites across vera/, scripts/, tests/:
vera/parser.py — already done (PR for #641)
vera/browser/emit.py — html_path.write_text(...)
scripts/check_conformance.py — MANIFEST_PATH.read_text()
scripts/check_version_sync.py — pyproject.read_text(), init.read_text(), index_html.read_text(), readme.read_text(), lock.read_text()
scripts/check_doc_counts.py — manifest.json, .pre-commit-config.yaml, ci.yml, TESTING.md, CONTRIBUTING.md, CLAUDE.md, README.md, SKILL.md, AGENTS.md, FAQ.md, docs/index.html, ROADMAP.md
scripts/check_limitations_sync.py — KNOWN_ISSUES.md, vera/README.md
tests/test_codegen_monomorphize.py — path.read_text() (×2)
tests/test_codegen_closures.py — path.read_text() (×2)
tests/test_tester_coverage.py — p.write_text(source)
Recommended approach
A single audit-and-replace pass, plus a small ruff or grep-based check to enforce the convention going forward:
- Mechanical: grep + add encoding='utf-8' to every text-mode site.
- Add a pre-commit check (or a CI step) that fails on open(...), read_text(), write_text() without an explicit encoding. Use a small scripts/check_explicit_encoding.py that greps the codebase.
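One possible shape for the core of such a check is sketched below. It is an assumption-laden sketch, not the script itself: it swaps the grep approach for Python's ast module (so that binary-mode opens can be skipped reliably), and the names find_missing_encoding and _is_binary_open are hypothetical. It does not cover Path.open(), and it reports a positionally passed encoding as missing.

```python
import ast

TEXT_IO_METHODS = {"read_text", "write_text"}


def _is_binary_open(call: ast.Call) -> bool:
    """Return True if an open() call passes a literal mode containing 'b'."""
    mode = call.args[1] if len(call.args) >= 2 else None
    for kw in call.keywords:
        if kw.arg == "mode":
            mode = kw.value
    return (
        isinstance(mode, ast.Constant)
        and isinstance(mode.value, str)
        and "b" in mode.value
    )


def find_missing_encoding(source: str, filename: str = "<string>"):
    """Return sorted (line, name) pairs for text-mode calls lacking encoding=."""
    problems = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id == "open":
            if _is_binary_open(node):
                continue  # binary mode: encoding is irrelevant
            name = "open"
        elif isinstance(func, ast.Attribute) and func.attr in TEXT_IO_METHODS:
            name = func.attr
        else:
            continue
        # A **kwargs splat (kw.arg is None) might carry encoding; let it pass.
        if any(kw.arg in ("encoding", None) for kw in node.keywords):
            continue
        problems.append((node.lineno, name))
    return sorted(problems)


# Made-up input showing which call sites would be flagged.
example = '''
from pathlib import Path
Path("a.txt").read_text()
Path("a.txt").read_text(encoding="utf-8")
open("b.bin", "rb")
open("c.txt")
'''
print(find_missing_encoding(example))  # [(3, 'read_text'), (6, 'open')]
```

A full script would presumably walk the .py files under vera/, scripts/, and tests/, call this function on each, and exit non-zero if any file reports a problem.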
Out of scope
- Binary-mode opens ('rb', 'wb'): encoding is irrelevant.
- urlopen / subprocess / network: no text encoding.
- Third-party calls (e.g. lark.Lark(...)) that take an already-decoded string: encoding handled at the read site.
Acceptance criteria
- Every text-mode open() / read_text() / write_text() in vera/, scripts/, tests/ has explicit encoding='utf-8'.
- A pre-commit check enforces the convention.
- The PYTHONUTF8: 1 line in .github/workflows/ci.yml's test job env can be removed (verified by a clean CI run without it).
Pairs with