fix: use utf-8 encoding and LF newlines for all text file IO#204
Closed
NYCU-Chung wants to merge 1 commit into
Python's open() / Path.read_text() / Path.write_text() default to the
system locale encoding on Windows (CP1252/CP932/CP950), causing
UnicodeEncodeError when graph labels contain CJK characters or emojis.
This breaks the final write of graph.json, discarding all prior LLM
extraction work.
Also adds newline="\n" to git hook writes so that CRLF line endings do not
break the sh interpreter ("#!/bin/sh\r" -> bad interpreter on Unix/WSL).
Files touched:
- graphify/export.py (to_json, to_cypher, to_obsidian)
- graphify/cache.py (load_cached, save_cached)
- graphify/detect.py (manifest, graphifyignore, count_words, looks_like_paper)
- graphify/watch.py (GRAPH_REPORT.md, needs_update flag)
- graphify/wiki.py (community articles, god node articles, index)
- graphify/hooks.py (install/uninstall/status, LF-only for hook scripts)
- graphify/serve.py (graph.json read)
- graphify/benchmark.py (graph.json read)
Test: tests/test_encoding_roundtrip.py verifies CJK/emoji labels
round-trip through JSON export and cache IO, and confirms hook files
use LF-only line endings.
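To check which encoding a bare open() would pick on a given machine, something like the following sketch may help. It uses only the standard library; the helper name describe_text_defaults is hypothetical and is not part of graphify.

```python
import locale
import sys


def describe_text_defaults() -> dict:
    """Report the settings that decide how text files are encoded by default.

    Hypothetical helper for illustration; not part of graphify.
    """
    return {
        # Locale encoding that open() / Path.read_text() fall back to
        # when no encoding= argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        # 1 when Python runs in UTF-8 mode (for example via PYTHONUTF8=1).
        "utf8_mode": sys.flags.utf8_mode,
        "platform": sys.platform,
    }


if __name__ == "__main__":
    for key, value in describe_text_defaults().items():
        print(f"{key}: {value}")
```

On a Windows machine with a legacy code page, the reported encoding would typically be something other than UTF-8 unless UTF-8 mode is enabled.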
pedrolbacelar
approved these changes
Apr 11, 2026
pedrolbacelar
left a comment
Confirmed this fixes the exact same issue I hit on Windows 11 with Python 3.11 — UnicodeEncodeError: 'charmap' codec can't encode character when GRAPH_REPORT.md contains Unicode arrows (→). Had to work around it with PYTHONUTF8=1 on every invocation. This PR is the proper fix. LGTM.
safishamsi
added a commit
that referenced
this pull request
Apr 11, 2026
- build/validate: accept NetworkX <=3.1 "links" key alongside "edges" (#212)
- __main__: skip version check during install/uninstall, deduplicate paths (#220)
- all file IO: explicit encoding="utf-8" to prevent crashes on Windows CJK locales (#204)
- hooks: add newline="\n" on write to prevent CRLF shebang breakage on Windows (#204)
- export: strip trailing .md from safe_name so "CLAUDE.md" doesn't become "CLAUDE.md.md" (#221)
- report: add Community Hubs navigation block so Obsidian vault stays connected (#221)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owner
Cherry-picked into v0.4.2 (encoding + hook CRLF fix). Thank you!
joyshmitz
pushed a commit
to joyshmitz/graphify
that referenced
this pull request
Apr 13, 2026
…hamsi#221 into 0.4.2
safishamsi
added a commit
that referenced
this pull request
Apr 23, 2026
matzls
pushed a commit
to matzls/graphify
that referenced
this pull request
May 10, 2026
…hamsi#221 into 0.4.2
Summary
Python's open() / Path.write_text() / Path.read_text() default to the
system locale encoding on Windows (CP1252 / CP932 / CP950), causing
UnicodeEncodeError when graph labels contain CJK characters or emojis.
This breaks file writes after LLM extraction is complete, discarding the
paid Claude API work from pass 3.
Also forces LF line endings for git hook installation, extending the fix
direction of commit 210243f ("fix hook reinstall, CRLF labels") to the
remaining hook write sites.
Concrete failure modes
- GRAPH_REPORT.md / wiki .md / Cypher .txt writes crash on CJK Windows.
  Unlike graph.json (which benefits from json.dumps's default
  ensure_ascii=True escaping), these files write raw CJK bytes. Under
  CP932 (JP) / CP950 (zh-TW) / CP949 (KR) locales, the final write step
  raises UnicodeEncodeError and destroys the extraction run.
- Git hook CRLF breaks the sh interpreter.
  Path.write_text() without newline="\n" converts \n → \r\n on
  Windows, producing a #!/bin/sh\r shebang. Git Bash / WSL then reports
  "bad interpreter: No such file or directory" when the hook fires. This
  is the same class of bug as 210243f.
- read_text() on UTF-8 files written by other platforms.
  Reading graph.json / cache files / manifests created on Linux or
  macOS can fail on a Windows system with a non-UTF-8 default locale,
  even if the file itself is valid UTF-8.
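The difference between the escaped graph.json path and the raw-text outputs can be illustrated with a small sketch. The label and the can_encode helper below are made up for the demonstration; the codec names are the standard Python names for the Windows code pages mentioned above. An emoji is included because no legacy code page can represent it.

```python
import json

# Hypothetical label mixing CJK text and an emoji.
LABEL = "知識グラフ 🚀"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for codec in ("cp1252", "cp932", "cp949", "cp950", "utf-8"):
        print(f"{codec}: {'ok' if can_encode(LABEL, codec) else 'UnicodeEncodeError'}")

    # The default ensure_ascii=True escapes non-ASCII characters, so the
    # serialized text is pure ASCII and survives any of these code pages.
    escaped = json.dumps({"label": LABEL})
    print(escaped.isascii())

    # With ensure_ascii=False the raw characters are kept, so the file
    # must be opened with an encoding that can represent them.
    raw = json.dumps({"label": LABEL}, ensure_ascii=False)
    print(LABEL in raw)
```

This is why writing Markdown or Cypher text through a handle opened with the locale default can fail where an escaped JSON dump does not.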
Changes
All text file IO in graphify/ now explicitly passes encoding="utf-8":
- graphify/export.py: to_json, to_cypher, to_obsidian
- graphify/cache.py: save_cached, load_cached
- graphify/detect.py: .graphifyignore load, manifest read/write
- graphify/watch.py: GRAPH_REPORT.md, flag file
- graphify/wiki.py
- graphify/hooks.py: newline="\n" on hook writes to force LF
- graphify/serve.py: graph.json read on startup
- graphify/benchmark.py: graph.json read
No logic changes, no function signature changes, no behavior changes.
Binary-mode IO ("rb" / "wb") is untouched. newline="\n" is only
applied to git hook script writes; markdown / JSON outputs keep default
newline handling to match existing file conventions.
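The two write patterns described here might look roughly like the sketch below. The function names, file names, and hook body are hypothetical and are not copied from graphify; only open() and Path.write_text() are real standard-library calls.

```python
import tempfile
from pathlib import Path

# Hypothetical hook content, used only for this demonstration.
HOOK_BODY = "#!/bin/sh\necho 'graph update'\n"


def write_report(path: Path, text: str) -> None:
    """Write a text output with an explicit encoding, default newline handling."""
    path.write_text(text, encoding="utf-8")


def write_hook(path: Path, script: str) -> None:
    """Write a hook script with LF-only line endings on every platform."""
    # newline="\n" turns off the "\n" -> os.linesep translation on write.
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        fh.write(script)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "GRAPH_REPORT.md"
        hook = Path(tmp) / "post-commit"
        write_report(report, "# 報告 → summary\n")
        write_hook(hook, HOOK_BODY)
        print(report.read_text(encoding="utf-8"))
        print(hook.read_bytes())
```

Passing newline="\n" to open() is used in the sketch because it works across Python 3 versions, whereas the newline argument of Path.write_text() requires Python 3.10 or later.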
Tests
New file: tests/test_encoding_roundtrip.py (5 tests):
- test_write_read_text_cjk_roundtrip_primitive: baseline IO pattern
  with CJK + emoji + ensure_ascii=False JSON payloads
- test_cache_save_load_uses_utf8: exercises
  graphify.cache.save_cached / load_cached with a CJK label
- test_to_json_open_pattern_cjk: reproduces the fixed to_json
  open() pattern with ensure_ascii=False to force UTF-8 bytes into
  the output
- test_hook_install_lf_only: asserts graphify.hooks._install_hook
  produces no \r\n on fresh install
- test_hook_append_lf_only: same assertion when appending to an
  existing hook
The hook CRLF tests were manually verified to fail on the un-patched
v4 HEAD on Windows (raw bytes contain \r\n) and pass after the
fix, so they function as real regression guards rather than
tautologies.
All 5 tests pass on Windows Python 3.10.
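A round-trip check in the spirit of these tests could be sketched as follows. This is not the content of tests/test_encoding_roundtrip.py; the payload shape and the roundtrip_json helper are assumptions made for illustration, and the test uses tempfile rather than any pytest fixture so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def roundtrip_json(directory: Path, payload: dict) -> dict:
    """Write payload as UTF-8 JSON with raw characters, then read it back."""
    target = directory / "graph.json"
    with open(target, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps CJK/emoji as real UTF-8 bytes in the file.
        json.dump(payload, fh, ensure_ascii=False)
    with open(target, "r", encoding="utf-8") as fh:
        return json.load(fh)


def test_cjk_emoji_roundtrip() -> None:
    # Hypothetical payload; the real graph schema may differ.
    payload = {"nodes": [{"id": "n1", "label": "知識圖譜 🚀"}]}
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        assert roundtrip_json(directory, payload) == payload
        # The raw bytes on disk contain the UTF-8 form, not \uXXXX escapes.
        raw = (directory / "graph.json").read_bytes()
        assert "知識圖譜".encode("utf-8") in raw


if __name__ == "__main__":
    test_cjk_emoji_roundtrip()
    print("round-trip ok")
```

Forcing ensure_ascii=False in the test matters because the default escaping would produce an ASCII-only file and hide an encoding regression.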
Out of scope (intentionally)
- os.chmod(0o755) in hook install is still a no-op on Windows; that's
  a separate cross-platform concern and not addressed here.
- json.dumps(..., ensure_ascii=False) for human-readable graph.json
  output is a product decision, not touched.
Related
- 210243f: "fix hook reinstall, CRLF labels" (prior fix in the same
  bug class; this PR sweeps the remaining sites)