fix: use utf-8 encoding and LF newlines for all text file IO #204

Closed
NYCU-Chung wants to merge 1 commit into safishamsi:v4 from NYCU-Chung:fix/windows-encoding
@NYCU-Chung

Summary

Python's open() / Path.write_text() / Path.read_text() default to the
system locale encoding on Windows (CP1252 / CP932 / CP950), causing
UnicodeEncodeError when graph labels contain CJK characters or emojis.
This breaks file writes after LLM extraction is complete, discarding the
paid Claude API work from pass 3.
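
A minimal sketch of the failure and the fix described above. The helper `can_encode` is hypothetical and not part of graphify. Because the locale default cannot be switched from inside a script, `cp1252` is passed explicitly to stand in for a Windows default code page.

```python
import tempfile
from pathlib import Path


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be written using `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


label = "知識グラフ → 🚀"  # CJK, a Unicode arrow, and an emoji

# cp1252 stands in for a Western-European Windows locale default.
print(can_encode(label, "cp1252"))  # False
print(can_encode(label, "utf-8"))   # True

# The fix: name the encoding so the locale default is never consulted.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "GRAPH_REPORT.md"
    report.write_text(label, encoding="utf-8")
    roundtrip = report.read_text(encoding="utf-8")
```

With `encoding="utf-8"` given on both the write and the read, the result no longer depends on the machine's locale.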

Also forces LF line endings for git hook installation, extending the fix
direction of commit 210243f ("fix hook reinstall, CRLF labels") to the
remaining hook write sites.

Concrete failure modes

  1. GRAPH_REPORT.md / wiki .md / Cypher .txt writes crash on CJK Windows
    Unlike graph.json (which benefits from json.dumps's default
    ensure_ascii=True escaping), these files write raw CJK bytes. Under
    CP932 (JP) / CP950 (zh-TW) / CP949 (KR) locales, the final write step
    raises UnicodeEncodeError and destroys the extraction run.

  2. Git hook CRLF breaks sh interpreter
    Path.write_text() without newline="\n" converts \n to \r\n on
    Windows, producing a #!/bin/sh\r shebang. Git Bash / WSL then reports
    bad interpreter: No such file or directory when the hook fires. This
    is the same class of bug as 210243f.

  3. read_text() on UTF-8 files written by other platforms
    Reading graph.json / cache files / manifests created on Linux or
    macOS can fail on a Windows system with a non-UTF-8 default locale,
    even if the file itself is valid UTF-8.
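
Failure mode 3 can be reproduced on any platform by naming the wrong encoding explicitly. The sketch below is illustrative only: `read_with` is a hypothetical helper, and `cp1252` stands in for a non-UTF-8 locale default. Depending on the code page and the bytes involved, the mismatched read either raises UnicodeDecodeError or silently returns mojibake, and both outcomes are wrong.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_with(path: Path, encoding: str) -> Optional[str]:
    """Read `path` as text; return None if the bytes are invalid in `encoding`."""
    try:
        return path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        return None


original = '{"label": "知識"}'

with tempfile.TemporaryDirectory() as tmp:
    graph = Path(tmp) / "graph.json"
    # Write raw UTF-8 bytes, as a Linux or macOS run would have produced.
    graph.write_bytes(original.encode("utf-8"))

    as_utf8 = read_with(graph, "utf-8")      # explicit encoding: correct
    as_cp1252 = read_with(graph, "cp1252")   # simulated locale default: wrong
```

Only the read that names `utf-8` returns the original text.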

Changes

All text file IO in graphify/ now explicitly passes encoding="utf-8":

| File | Locations |
| --- | --- |
| graphify/export.py | to_json, to_cypher, to_obsidian |
| graphify/cache.py | save_cached, load_cached |
| graphify/detect.py | paper detection, word count, .graphifyignore load, manifest read/write |
| graphify/watch.py | GRAPH_REPORT.md, flag file |
| graphify/wiki.py | 3 wiki article write sites |
| graphify/hooks.py | install / uninstall / status; also adds newline="\n" on hook writes to force LF |
| graphify/serve.py | graph.json read on startup |
| graphify/benchmark.py | graph.json read |

No logic changes, no function signature changes, no behavior changes.
Binary-mode IO ("rb" / "wb") is untouched. newline="\n" is only
applied to git hook script writes — markdown / JSON outputs keep default
newline handling to match existing file conventions.
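
The effect of newline="\n" on hook writes can be sketched as follows. This is an illustration under assumptions, not the code in graphify/hooks.py: `HOOK_BODY`, `write_hook`, and the file names are hypothetical. Passing newline="\r\n" reproduces on any OS the translation that the default performs on Windows.

```python
import tempfile
from pathlib import Path

HOOK_BODY = "#!/bin/sh\necho 'graphify hook'\n"  # hypothetical hook content


def write_hook(path: Path, newline: str) -> bytes:
    """Write HOOK_BODY in text mode with the given newline policy; return raw bytes."""
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        f.write(HOOK_BODY)
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    # newline="\r\n" translates every "\n" to "\r\n", like the Windows default.
    crlf = write_hook(Path(tmp) / "post-commit.crlf", "\r\n")
    # newline="\n" is the fix: "\n" is written through untranslated.
    lf = write_hook(Path(tmp) / "post-commit.lf", "\n")

print(crlf[:11])  # b'#!/bin/sh\r\n'
print(lf[:10])    # b'#!/bin/sh\n'
```

The sketch uses open() rather than Path.write_text() because the newline parameter of Path.write_text() is only available from Python 3.10.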

Tests

New file: tests/test_encoding_roundtrip.py (5 tests):

  • test_write_read_text_cjk_roundtrip_primitive — baseline IO pattern
    with CJK + emoji + ensure_ascii=False JSON payloads
  • test_cache_save_load_uses_utf8 — exercises
    graphify.cache.save_cached / load_cached with a CJK label
  • test_to_json_open_pattern_cjk — reproduces the fixed to_json
    open() pattern with ensure_ascii=False to force UTF-8 bytes into
    the output
  • test_hook_install_lf_only — asserts
    graphify.hooks._install_hook produces no \r\n on fresh install
  • test_hook_append_lf_only — same assertion when appending to an
    existing hook

The hook CRLF tests were manually verified to fail on the un-patched
v4 HEAD on Windows (raw bytes contain \r\n) and pass after the fix,
so they function as real regression guards rather than tautologies.

All 5 tests pass on Windows Python 3.10.
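
For readers without the test file at hand, a round-trip check in the spirit of the primitive test might look like the sketch below. It is not the contents of tests/test_encoding_roundtrip.py: the function names and payload are hypothetical, and it exercises only the standard library rather than graphify itself.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def write_then_read(payload: dict, directory: Path) -> Tuple[bytes, dict]:
    """Dump `payload` as UTF-8 JSON, then return (raw bytes on disk, reloaded dict)."""
    path = directory / "graph.json"
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False forces real UTF-8 bytes instead of \uXXXX escapes.
        json.dump(payload, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)
    return path.read_bytes(), loaded


def test_cjk_roundtrip() -> None:
    payload = {"nodes": [{"id": "n1", "label": "知識 🚀"}]}
    with tempfile.TemporaryDirectory() as tmp:
        raw, loaded = write_then_read(payload, Path(tmp))
    assert "知識 🚀".encode("utf-8") in raw  # bytes on disk are UTF-8
    assert loaded == payload                 # content survives the round trip


test_cjk_roundtrip()
```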

Out of scope (intentionally)

  • os.chmod(0o755) in hook install is still a no-op on Windows — that's
    a separate cross-platform concern and not addressed here.
  • json.dumps(..., ensure_ascii=False) for human-readable graph.json
    output is a product decision, not touched.
  • Performance / parallelism of the extract loop is out of scope.
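
The ensure_ascii behaviour referred to in this description can be seen in a short standalone sketch (the `node` payload is hypothetical). With the default ensure_ascii=True, json.dumps escapes every non-ASCII character, so the output can be encoded under any ASCII-compatible code page. With ensure_ascii=False the characters are emitted as-is and the file encoding starts to matter.

```python
import json

node = {"label": "知識"}

escaped = json.dumps(node)                  # default ensure_ascii=True
raw = json.dumps(node, ensure_ascii=False)  # keeps CJK characters as-is

print(escaped.isascii())  # True
print(raw.isascii())      # False

# Both forms decode to the same data; only the on-disk representation differs.
same = json.loads(escaped) == json.loads(raw) == node
```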

Related

  • 210243f — "fix hook reinstall, CRLF labels" (prior fix in the same
    bug class; this PR sweeps the remaining sites)

Python's open() / Path.read_text() / Path.write_text() default to the
system locale encoding on Windows (CP1252/CP932/CP950), causing
UnicodeEncodeError when graph labels contain CJK characters or emojis.
This breaks the final write of graph.json, discarding all prior LLM
extraction work.

Also adds newline="\n" for git hook writes to prevent CRLF line endings
from breaking sh interpreter ("#!/bin/sh\r" -> bad interpreter on Unix/WSL).

Files touched:
- graphify/export.py (to_json, to_cypher, to_obsidian)
- graphify/cache.py (load_cached, save_cached)
- graphify/detect.py (manifest, graphifyignore, count_words, looks_like_paper)
- graphify/watch.py (GRAPH_REPORT.md, needs_update flag)
- graphify/wiki.py (community articles, god node articles, index)
- graphify/hooks.py (install/uninstall/status, LF-only for hook scripts)
- graphify/serve.py (graph.json read)
- graphify/benchmark.py (graph.json read)

Test: tests/test_encoding_roundtrip.py verifies CJK/emoji labels
round-trip through JSON export and cache IO, and confirms hook files
use LF-only line endings.

@pedrolbacelar pedrolbacelar left a comment

Confirmed this fixes the exact same issue I hit on Windows 11 with Python 3.11 — UnicodeEncodeError: 'charmap' codec can't encode character when GRAPH_REPORT.md contains Unicode arrows (→). Had to work around it with PYTHONUTF8=1 on every invocation. This PR is the proper fix. LGTM.

safishamsi added a commit that referenced this pull request Apr 11, 2026
- build/validate: accept NetworkX <=3.1 "links" key alongside "edges" (#212)
- __main__: skip version check during install/uninstall, deduplicate paths (#220)
- all file IO: explicit encoding="utf-8" to prevent crashes on Windows CJK locales (#204)
- hooks: add newline="\n" on write to prevent CRLF shebang breakage on Windows (#204)
- export: strip trailing .md from safe_name so "CLAUDE.md" doesn't become "CLAUDE.md.md" (#221)
- report: add Community Hubs navigation block so Obsidian vault stays connected (#221)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@safishamsi

Owner

Cherry-picked into v0.4.2 (encoding + hook CRLF fix) — thank you!

@safishamsi safishamsi closed this Apr 11, 2026
joyshmitz pushed a commit to joyshmitz/graphify that referenced this pull request Apr 13, 2026
…hamsi#221 into 0.4.2

- build/validate: accept NetworkX <=3.1 "links" key alongside "edges" (safishamsi#212)
- __main__: skip version check during install/uninstall, deduplicate paths (safishamsi#220)
- all file IO: explicit encoding="utf-8" to prevent crashes on Windows CJK locales (safishamsi#204)
- hooks: add newline="\n" on write to prevent CRLF shebang breakage on Windows (safishamsi#204)
- export: strip trailing .md from safe_name so "CLAUDE.md" doesn't become "CLAUDE.md.md" (safishamsi#221)
- report: add Community Hubs navigation block so Obsidian vault stays connected (safishamsi#221)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
safishamsi added a commit that referenced this pull request Apr 23, 2026
matzls pushed a commit to matzls/graphify that referenced this pull request May 10, 2026