Audit text-mode open()/read_text()/write_text() for explicit encoding='utf-8' (#641 follow-up) #645

**Origin:** The PR for #641 (the cp1252-on-Windows fix) chose Option B (set `PYTHONUTF8=1` in CI) as the primary fix because Option A (audit every `open()` / `read_text()` / `write_text()` call site for explicit `encoding='utf-8'`) was too large for that PR's scope. This issue tracks Option A as the durable follow-up.

## The pattern

When a text-mode `open()` / `read_text()` / `write_text()` call has no explicit `encoding=` kwarg, Python falls back to the locale encoding (`locale.getpreferredencoding(False)`). On Windows that is the ANSI code page: cp1252 on en-US systems, and whatever the system default is in other locales. cp1252 is a single-byte encoding with only about 250 characters, so it can't represent most of Unicode; UTF-8 can represent all of it.
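The two failure modes can be reproduced on any OS by passing `encoding="cp1252"` explicitly, which stands in for the implicit Windows default. This is a minimal sketch; the file name and message text are made up for illustration.

```python
import tempfile
from pathlib import Path

# U+2192 (right arrow) has no cp1252 byte; the message itself is hypothetical.
text = "expected Int → got String"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixture.txt"

    # Writing: cp1252 cannot encode the arrow, so the write raises.
    try:
        path.write_text(text, encoding="cp1252")
        write_failed = False
    except UnicodeEncodeError:
        write_failed = True

    # Reading: a UTF-8 file decoded as cp1252 does not raise here;
    # it silently comes back as mojibake instead.
    path.write_text(text, encoding="utf-8")
    as_cp1252 = path.read_text(encoding="cp1252")
    as_utf8 = path.read_text(encoding="utf-8")

print("write failed:", write_failed)
print("cp1252 round-trip intact:", as_cp1252 == text)
print("utf-8 round-trip intact:", as_utf8 == text)
```

The read side is the quieter problem: depending on the bytes involved it either raises `UnicodeDecodeError` or returns corrupted text without any error.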

Vera's source / docs / fixtures contain Unicode characters in:

- Test fixtures with `→` (right arrow) and `—` (em-dash) in error messages, contract violations, trap diagnostics
- HTML examples in `docs/index.html`
- Regenerated `llms.txt` / `SKILL.md` outputs that contain prose from CHANGELOG / spec
- Any `.vera` file that contains non-ASCII (string literals, comments)

The PR for #641 covered CI via `PYTHONUTF8=1` and added explicit `encoding='utf-8'` to the load-bearing grammar load in `vera/parser.py`. Per-call-site coverage for the rest of the codebase is the durable fix: Windows users running locally without `PYTHONUTF8=1` can still hit the bug on individual files.

## Audit surface (rough, from the PR-for-#641 grep)

Roughly 30 or more call sites across `vera/`, `scripts/`, `tests/`:

```
vera/parser.py — already done (PR for #641)
vera/browser/emit.py — html_path.write_text(...)
scripts/check_conformance.py — MANIFEST_PATH.read_text()
scripts/check_version_sync.py — pyproject.read_text(), init.read_text(), index_html.read_text(), readme.read_text(), lock.read_text()
scripts/check_doc_counts.py — manifest.json, .pre-commit-config.yaml, ci.yml, TESTING.md, CONTRIBUTING.md, CLAUDE.md, README.md, SKILL.md, AGENTS.md, FAQ.md, docs/index.html, ROADMAP.md
scripts/check_limitations_sync.py — KNOWN_ISSUES.md, vera/README.md
tests/test_codegen_monomorphize.py — path.read_text() (×2)
tests/test_codegen_closures.py — path.read_text() (×2)
tests/test_tester_coverage.py — p.write_text(source)
```
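Each site in the listing needs the same mechanical change. The sketch below shows it for the three call shapes involved; the path and contents are hypothetical and not taken from the repository.

```python
import tempfile
from pathlib import Path

source = "expected Int → got String"  # hypothetical non-ASCII content

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"  # hypothetical file name

    # Before: path.write_text(source)
    path.write_text(source, encoding="utf-8")

    # Before: path.read_text()
    via_read_text = path.read_text(encoding="utf-8")

    # Before: open(path)
    with open(path, encoding="utf-8") as fh:
        via_open = fh.read()

    # Binary-mode opens stay as they are; they take no encoding.
    with open(path, "rb") as fh:
        raw = fh.read()

print(via_read_text == source, via_open == source)
```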

## Recommended approach

A single audit-and-replace pass, plus a small ruff or grep-based check to enforce the convention going forward:

1. Mechanical: grep + add `encoding='utf-8'` to every text-mode site.
2. Add a pre-commit check (or a CI step) that fails on `open(...)`, `read_text()`, or `write_text()` calls without an explicit encoding. A small `scripts/check_explicit_encoding.py` that scans the codebase would be enough.
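One possible shape for the check in step 2 is sketched below. It is an assumption, not an existing script: it uses Python's `ast` module rather than grep so that comments and string literals are not matched, and every function name in it is hypothetical. It covers the builtin `open(...)` plus `.read_text()` / `.write_text()` method calls. It does not cover `Path.open()`, and it treats a non-literal `mode` as text mode. Ruff's `unspecified-encoding` rule (PLW1514) targets the same pattern and may be an alternative if the project's Ruff version provides it.

```python
import ast
from pathlib import Path
from typing import Optional

# Positional-argument count at which `encoding` has been passed positionally:
# open(file, mode, buffering, encoding), read_text(encoding), write_text(data, encoding).
ENCODING_POSITION = {"open": 4, "read_text": 1, "write_text": 2}


def call_name(call: ast.Call) -> Optional[str]:
    """Return 'open', 'read_text' or 'write_text' for calls we audit, else None."""
    func = call.func
    if isinstance(func, ast.Name) and func.id == "open":
        return "open"
    if isinstance(func, ast.Attribute) and func.attr in ("read_text", "write_text"):
        return func.attr
    return None


def is_binary_mode(call: ast.Call) -> bool:
    """True when open() receives a literal mode string containing 'b'."""
    mode = call.args[1] if len(call.args) > 1 else None
    for kw in call.keywords:
        if kw.arg == "mode":
            mode = kw.value
    return (
        isinstance(mode, ast.Constant)
        and isinstance(mode.value, str)
        and "b" in mode.value
    )


def has_encoding(call: ast.Call, name: str) -> bool:
    """True when encoding is given by keyword, by position, or possibly via **kwargs."""
    if len(call.args) >= ENCODING_POSITION[name]:
        return True
    # kw.arg is None for a **kwargs splat; we cannot see inside it, so let it pass.
    return any(kw.arg in ("encoding", None) for kw in call.keywords)


def find_violations(source: str, filename: str = "<string>") -> list[tuple[int, str]]:
    """Return sorted (line, call name) pairs for text-mode calls lacking an encoding."""
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        if name is None or (name == "open" and is_binary_mode(node)):
            continue
        if not has_encoding(node, name):
            violations.append((node.lineno, name))
    return sorted(violations)


def check_tree(roots: list[str]) -> list[str]:
    """Scan every .py file under the given roots and return report lines."""
    report = []
    for root in roots:
        for path in sorted(Path(root).rglob("*.py")):
            text = path.read_text(encoding="utf-8")
            for lineno, name in find_violations(text, str(path)):
                report.append(f"{path}:{lineno}: {name}() without explicit encoding")
    return report


sample = (
    "from pathlib import Path\n"
    "p = Path('x.txt')\n"
    "p.write_text('hi')\n"                    # flagged
    "p.read_text(encoding='utf-8')\n"         # fine
    "open('x.txt')\n"                         # flagged
    "open('x.bin', 'rb')\n"                   # binary, skipped
    "open('x.txt', 'w', encoding='utf-8')\n"  # fine
)
print(find_violations(sample))
```

A pre-commit hook or CI step could call `check_tree(["vera", "scripts", "tests"])` and exit non-zero when the returned list is not empty.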

## Out of scope

- Binary-mode opens (`'rb'`, `'wb'`): no text decoding happens, so `encoding=` does not apply.
- `urlopen` / `subprocess` / network I/O: these are not file-open call sites, and they yield bytes unless decoding is requested explicitly.
- Third-party calls (e.g. `lark.Lark(...)`) that take an already-decoded string: encoding is handled at the read site.

## Acceptance criteria

- Every text-mode `open()` / `read_text()` / `write_text()` in `vera/`, `scripts/`, `tests/` has explicit `encoding='utf-8'`.
- A pre-commit check enforces the convention.
- The `PYTHONUTF8: 1` line in `.github/workflows/ci.yml`'s test job env can be removed (verified by a clean CI run without it).

## Pairs with

- **#641** — landed in v0.0.143 with the Option B (CI-side) fix.

