fix(cli, fact-checker): reconfigure stdio to UTF-8 on Windows #1282
Merged
This was referenced May 2, 2026
The `python -m mempalace.fact_checker --stdin` entry point reads non-ASCII text through the system ANSI codepage (cp1252/cp1251/cp950) on Windows, which mojibakes characters before claim-extraction sees them. Reconfigure stdin/stdout/stderr to UTF-8 with `errors="strict"`, wrapped in try/except so a replaced stream (Jupyter, test harness) logs a warning rather than crashing the CLI. Mirrors the same fix shipped for `mcp_server.py:main()` (MemPalace#400) and `hooks_cli.py:run_hook()` (MemPalace#1280) -- this is the third and last stdin-reading entry point in the package.
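As a rough illustration of the failure described here, the sketch below decodes UTF-8 bytes through cp1252, the way a stdin defaulting to the ANSI codepage would. The sample string and the in-memory `io.TextIOWrapper` are assumptions made for the demo; nothing in it is taken from the package.

```python
import io

# Sample text chosen for the demo; any non-ASCII string behaves similarly.
text = "Привет"
raw = text.encode("utf-8")  # bytes a UTF-8 producer would pipe in


def read_as(encoding: str) -> str:
    """Decode the piped bytes the way a text-mode stdin with this encoding would."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, errors="replace")
    return stream.read()


garbled = read_as("cp1252")  # ANSI codepage: one wrong character per input byte
clean = read_as("utf-8")     # what a stream reconfigured to UTF-8 returns

# ascii() keeps the demo printable on any console encoding.
print(ascii(garbled))
print(ascii(clean))
```

The garbled string is longer than the original because every UTF-8 byte is decoded as its own character, which is why claim extraction never sees the intended text.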
The primary `mempalace` console_script (`cli.py:main()`) reads non-ASCII arguments via piped stdin and writes verbatim drawer text / wing names through `print()`. On Windows, Python defaults stdio to the system ANSI codepage (cp1252/cp1251/cp950), so:
- `mempalace search "..." > out.txt` mojibakes any drawer text containing non-Latin characters
- `mempalace ... < input.txt` mojibakes piped non-ASCII input
Reconfigure stdin/stdout/stderr to UTF-8 (`errors="strict"`) at the top of `main()`, mirroring the helper added in this PR for fact_checker's `__main__` block. Wrapped in try/except so a replaced stream (Jupyter, test harness) logs a warning and continues rather than crashing the CLI.
The reconfigure cascades through every `mempalace` subcommand (`init`/`mine`/`search`/`status`/`hook`/etc.) and through the interactive flows that read non-ASCII names via `input()` (onboarding, entity detector, room detector).
With this commit the package's three user-facing entry points (`mempalace`, `mempalace-mcp`, and `python -m mempalace.fact_checker`) all reconfigure stdio identically on Windows.
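The cascade works because `reconfigure()` changes the existing stream object in place rather than replacing it. A minimal sketch, assuming an in-memory `io.TextIOWrapper` as a stand-in for `sys.stdout`; the `subcommand_print` function is hypothetical and not part of the package:

```python
import io

# Stand-in for sys.stdout as it might start out on Windows (ANSI codepage).
stdout = io.TextIOWrapper(io.BytesIO(), encoding="cp1252", errors="strict")

# A reference taken before main() runs, as a subcommand module might hold.
held_by_subcommand = stdout


def subcommand_print(stream, text):
    """Hypothetical subcommand: writes to whatever stream object it was handed."""
    stream.write(text)
    stream.flush()


# What main() would do once, before dispatching to any subcommand.
stdout.reconfigure(encoding="utf-8", errors="strict")

# The earlier reference sees the new encoding because it is the same object.
subcommand_print(held_by_subcommand, "日本語")
written = stdout.buffer.getvalue()
```

Under cp1252 with `errors="strict"` the same write would have raised `UnicodeEncodeError`; after the in-place reconfigure the bytes reach the buffer as UTF-8.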
Previously all three streams reconfigured to UTF-8 with errors='strict'.
That kills 'mempalace search' the moment a drawer carrying a surrogate
half (round-tripped from a filename via surrogateescape) hits print(),
losing the rest of the result block. Same hazard for warning lines on
stderr.
Split the policy:
stdin -> surrogateescape (malformed bytes from a redirected file
survive as lone surrogates instead of crashing the read)
stdout -> replace (drawer text with a stray surrogate is written as '?'
instead of raising UnicodeEncodeError mid-print)
stderr -> replace (same protection for logger / warning paths)
Applied identically in the cli.py and fact_checker.py helpers; the DRY
extraction into a shared module is a separate cleanup ask, kept out of
this fix to keep the diff narrow.
Tests updated for the new per-stream assertion.
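The split can be checked with plain `bytes.decode` / `str.encode`, which accept the same error-handler names as the streams. This is a small sketch with made-up sample bytes. Note that when encoding, Python's `replace` handler emits `?`; U+FFFD is what it produces when decoding.

```python
bad_input = b"caf\xff"  # made-up sample: 0xFF is never valid in UTF-8

# stdin policy: the bad byte survives as a lone surrogate instead of raising.
decoded = bad_input.decode("utf-8", errors="surrogateescape")

# Old stdout policy: strict refuses to encode the lone surrogate.
try:
    decoded.encode("utf-8", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# New stdout/stderr policy: the surrogate is replaced and output continues.
replaced = decoded.encode("utf-8", errors="replace")

# surrogateescape is lossless: encoding with it restores the original bytes.
round_trip = decoded.encode("utf-8", errors="surrogateescape")
```

The lossless round trip is what makes `surrogateescape` a reasonable choice on the input side, while the output side only needs to avoid crashing.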
187af0f to 03643eb
Both cli.py and fact_checker.py carried identical 28-line Windows stdio reconfigure helpers; pull the loop into mempalace/_stdio.py so the same code drives the CLI, the fact_checker --stdin entry point, and the MCP server. The thin per-call-site wrappers stay so existing tests keep importing _reconfigure_stdio_utf8_on_windows from the same module they always have. CLI / fact_checker policy unchanged: stdin=surrogateescape (don't crash on a malformed redirected file), stdout/stderr=replace (don't crash mid-print on a surrogate half round-tripped from a filename).
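A shared helper of this kind might look roughly like the sketch below. The function name `reconfigure_stdio_utf8`, its keyword arguments, the `platform` override and the return value are assumptions pieced together from this PR's description, not the landed code in `mempalace/_stdio.py`; only `_reconfigure_stdio_utf8_on_windows` is a name the PR mentions.

```python
import sys


def reconfigure_stdio_utf8(stdin_errors="strict", stdout_errors="strict",
                           stderr_errors="strict", on_failure=None, platform=None):
    """Hypothetical shared helper: reconfigure each std stream to UTF-8 on Windows.

    Returns the names of the streams that were reconfigured (empty off Windows).
    """
    platform = sys.platform if platform is None else platform
    if platform != "win32":
        return []  # no-op off Windows

    if on_failure is None:
        def on_failure(name, exc):
            # Assumed default: warn on stderr and keep going.
            print(f"WARNING: could not reconfigure {name} to UTF-8: {exc}",
                  file=sys.stderr)

    policy = {"stdin": stdin_errors, "stdout": stdout_errors, "stderr": stderr_errors}
    done = []
    for name, errors in policy.items():
        stream = getattr(sys, name, None)
        try:
            # Replaced streams (e.g. a plain StringIO) may lack reconfigure().
            stream.reconfigure(encoding="utf-8", errors=errors)
        except (AttributeError, ValueError, OSError) as exc:
            on_failure(name, exc)
        else:
            done.append(name)
    return done


def _reconfigure_stdio_utf8_on_windows():
    """Thin per-call-site wrapper passing the CLI / fact_checker policy."""
    return reconfigure_stdio_utf8(stdin_errors="surrogateescape",
                                  stdout_errors="replace",
                                  stderr_errors="replace")
```

Keeping the policy in keyword arguments lets each entry point choose its own error handling while the loop and the failure path live in one place.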
xcarbo added a commit to xcarbo/mempalace that referenced this pull request on May 7, 2026
Catches up on a heavy upstream day -- 22 fixes merged in 24h plus prior backlog.
Highlights pulled in:
- MemPalace#1305 hooks: ~/.mempalace/ deletion is now a stable kill-switch (hooks no longer rebuild the dir hierarchy on Stop/PreCompact/SessionStart)
- MemPalace#1214 KG: reject inverted intervals (valid_to < valid_from) at write time -- prevents silently invisible triples
- MemPalace#1067/MemPalace#1105 chroma: ChromaBackend.close_palace() now actually releases the SQLite file lock (PersistentClient.close() on evict + invalidation)
- MemPalace#1215 entity_registry: atomic save (tmp+fsync+rename) -- no more corruption on crash mid-write
- MemPalace#1073/MemPalace#1107 mempalace compress: paginated drawer fetch -- no longer trips SQLITE_MAX_VARIABLE_NUMBER on palaces >32k drawers
- MemPalace#1282 stdio: Windows console UTF-8 reconfig for cli/mcp_server/hooks_cli
- MemPalace#1164/MemPalace#1167 mcp KG: sanitize_iso_date() blocks malformed date strings silently producing empty result sets
- MemPalace#1136/MemPalace#1160 mcp: per-path KG cache for multi-tenant hosts that rotate MEMPALACE_PALACE_PATH between tool calls
- MemPalace#1286 mcp: retry _get_collection() once on transient failure
- MemPalace#1138 lint cleanup, MemPalace#1019 search-crash fix
- 4 new tools/ scripts (backup_claude_jsonls, find_orphan_claude_jsonls, render_jsonl, save.md)
Conflict resolution (CHANGELOG.md only -- code files all auto-merged):
- 3.3.5 section: untouched (already merged in our prior commit; upstream added several new bug-fix entries which auto-merged cleanly)
- 3.3.4 Bug Fixes: kept upstream's new MemPalace#1305 entry; preserved our richer detail on topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) -- upstream's shorter topic-tunnels entry was a strict subset of ours.
xdev patches preserved (still on this branch, untouched by merge):
- 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings
- 3fad61d fix(config): allow leading dash in wing names
- 3fc821a fix(config): tighten leading-char to allow dash but not underscore
Tests: 1557 passed, 1 skipped (full unit suite excluding benchmarks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Reconfigure stdin/stdout/stderr to UTF-8 on Windows in two entry points, with a per-stream `errors` policy that matches what each one writes:
- `mempalace/cli.py:main()` -- the primary `mempalace` console_script
- `mempalace/fact_checker.py:__main__` -- `python -m mempalace.fact_checker --stdin`
Per-stream policy:
- stdin = `surrogateescape`: a malformed byte from a redirected file (or a misbehaving caller) becomes a lone surrogate the consumer's parser surfaces, instead of `UnicodeDecodeError` killing the read on the first bad byte.
- stdout = `replace`: `mempalace search` and the fact_checker `--stdin` path both print verbatim drawer / fact text. A drawer that round-tripped a filename through `surrogateescape` can carry a lone surrogate; `strict` would raise `UnicodeEncodeError` mid-print and lose the rest of the result block. `replace` substitutes `?` (Python's replacement marker when encoding) instead and the result still renders.
- stderr = `replace`: same hazard for warning lines that quote user-supplied paths.
Why
On Windows, Python defaults stdio to the system ANSI codepage (cp1252/cp1251/cp950 depending on locale). That mojibakes non-ASCII content at the process boundary -- a hard bug to debug because verbatim drawer text gets corrupted in pipes, and arguments / interactive input read through `input()` come back garbled.
After auditing every stdio entry point on `develop`, three user-facing console_scripts / module invocations route non-ASCII content through `sys.stdin` / `sys.stdout`:
- `mempalace/mcp_server.py:main()` -- already fixed in fix: reconfigure MCP stdin/stdout to UTF-8 on Windows (fixes #363) #400
- `mempalace/hooks_cli.py:run_hook()` -- already fixed in Fix/windows hook stdio utf8 #1280
- `mempalace/cli.py:main()` and `mempalace/fact_checker.py:__main__` -- this PR
After this PR all three of the package's user-facing stdio entry points reconfigure identically on Windows.
`mempalace/cli.py:main()`
The primary CLI dispatches to subcommands that print verbatim drawer text and wing/room names (`mempalace search`, `mempalace status`, `mempalace wake-up`) and read non-ASCII names via `input()` through interactive flows (`mempalace init` -> onboarding -> entity / room detectors).
Concrete failure modes:
- `mempalace search "..." > out.txt` -- piped stdout mojibakes drawer text containing Cyrillic / CJK
- `mempalace ... < input.txt` -- piped stdin mojibakes non-ASCII content before the subcommand sees it
The reconfigure cascades to every subcommand because `sys.stdin` / `sys.stdout` are the same module-global streams that `cmd_init`, `cmd_search`, `cmd_status`, `cmd_hook`, etc. inherit.
`mempalace/fact_checker.py:__main__`
`fact_checker.py:325` calls `sys.stdin.read()` from the `__main__` block when invoked as `python -m mempalace.fact_checker --stdin`. Same Windows codepage failure mode -- non-ASCII fact text comes back as mojibake before pattern parsing sees it. Low-traffic CLI utility, fixed for sweep consistency rather than in response to a user-filed bug.
How
Shared helper in `mempalace/_stdio.py`. No-op off Windows. Each stream's reconfigure is wrapped in try/except so a replaced stream (Jupyter, test harness) routes through the `on_failure` callback (defaults to a `WARNING:` line on `sys.stderr`) and continues rather than crashing the entry point.
`cli.py` and `fact_checker.py` ship thin wrappers that pass the CLI policy (`stdout_errors="replace"`, `stderr_errors="replace"`); the MCP-side reconfigure (#400) shares the same helper with its strict policy via the same module. The thin wrappers preserve the existing `_reconfigure_stdio_utf8_on_windows()` import surface so existing tests stay shape-compatible.
Tests
`tests/test_cli.py`:
- `test_reconfigures_stdio_to_utf8_on_windows` -- patches `sys.platform = "win32"` plus a `ReconfigurableStringIO` for each stream; asserts each received the right per-stream `reconfigure(encoding="utf-8", errors=...)` exactly once (stdin=surrogateescape, stdout/stderr=replace).
- `test_reconfigure_stdio_is_noop_off_windows` -- patches `sys.platform = "linux"`; asserts no reconfigure call.
`tests/test_fact_checker.py::TestCLI`: `fact_checker._reconfigure_stdio_utf8_on_windows`.
Local run: 83 passed (cli + fact_checker suites).
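The platform-patching approach these tests describe could be sketched as follows. Only the class name `ReconfigurableStringIO` comes from the PR; its body, the stand-in `reconfigure_under_test` function and the `run_with_platform` driver are assumptions written for illustration, not the repository's test code.

```python
import io
import sys
from unittest import mock


class ReconfigurableStringIO(io.StringIO):
    """StringIO that records reconfigure() calls (plain StringIO has no such method)."""

    def __init__(self):
        super().__init__()
        self.reconfigure_calls = []

    def reconfigure(self, **kwargs):
        self.reconfigure_calls.append(kwargs)


def reconfigure_under_test():
    """Stand-in for the real wrapper; hypothetical, for illustration only."""
    if sys.platform != "win32":
        return
    policy = {"stdin": "surrogateescape", "stdout": "replace", "stderr": "replace"}
    for name, errors in policy.items():
        getattr(sys, name).reconfigure(encoding="utf-8", errors=errors)


def run_with_platform(platform):
    """Patch sys.platform and all three streams, run the helper, return recorded calls."""
    streams = {name: ReconfigurableStringIO() for name in ("stdin", "stdout", "stderr")}
    with mock.patch.object(sys, "platform", platform), \
            mock.patch.object(sys, "stdin", streams["stdin"]), \
            mock.patch.object(sys, "stdout", streams["stdout"]), \
            mock.patch.object(sys, "stderr", streams["stderr"]):
        reconfigure_under_test()
    return {name: stream.reconfigure_calls for name, stream in streams.items()}
```

A test would then compare `run_with_platform("win32")` against the expected per-stream keyword arguments and check that `run_with_platform("linux")` records no calls at all.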
`ruff check .` and `ruff format --check .` clean.
Out of scope
- `fact_checker` detection logic is unchanged.
- `open()` / `Path.read_text()` lacking explicit `encoding="utf-8"` are a separate bug class (mojibake on file content, not stdio) and would warrant their own audit.
- `python -m mempalace.<module>` for development (dialect, diary_ingest, repair, spellcheck, etc.) are not in this sweep -- they are reached through `mempalace ...` subcommands which now reconfigure at `cli.py:main()` and inherit the UTF-8 streams.
Body updated 2026-05-03 to match landed code: `03643eb` switched the `errors` policy from blanket `strict` to per-stream (stdin=surrogateescape, stdout/stderr=replace) so a redirected file with bad bytes does not crash the read and a drawer carrying a surrogate half from a filename round-trip does not crash mid-print; `285b3b4` extracted the loop into `mempalace/_stdio.py` so the CLI / fact_checker / mcp_server entry points share one helper instead of carrying duplicate copies.