Skip to content

fix: reconfigure MCP stdin/stdout to UTF-8 on Windows (fixes #363)#400

Open
mvalentsev wants to merge 3 commits into
MemPalace:developfrom
mvalentsev:fix/mcp-stdin-utf8-windows
Open

fix: reconfigure MCP stdin/stdout to UTF-8 on Windows (fixes #363)#400
mvalentsev wants to merge 3 commits into
MemPalace:developfrom
mvalentsev:fix/mcp-stdin-utf8-windows

Conversation

@mvalentsev

@mvalentsev mvalentsev commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes #363. On Windows, Python defaults sys.stdin to the system ANSI codepage (cp1251, cp1252, etc.), not UTF-8. When an MCP client sends JSON containing non-ASCII characters, the bytes get decoded through the wrong codepage and produce mojibake with surrogate escapes. These strings pass Python type checks but crash the HuggingFace tokenizer inside ChromaDB with TypeError: TextInputSequence must be str in query.

The fix adds sys.stdin.reconfigure(encoding="utf-8") at the top of main(), guarded by sys.platform == "win32". stdout and stderr get the same treatment so non-ASCII JSON responses serialize cleanly. On non-Windows platforms the code is a no-op.

This effectively makes mempalace_search work for every non-English speaker on Windows (Russian, Chinese, Hebrew, etc.), where it currently crashes on the very first query containing non-ASCII text.

How to test

On Windows with a non-English locale:

python -m mempalace.mcp_server
# from an MCP client, send: mempalace_search with query "привет" or "你好"
# Before: TypeError crash
# After: normal search results

On Linux/macOS: no behavior change (guard skips reconfigure).

ruff check mempalace/mcp_server.py
python -c "import mempalace.mcp_server; print('OK')"

Checklist

  • Linter passes (ruff check .)
  • No hardcoded paths
  • Python 3.9 compatible (reconfigure available since 3.7)
  • No new dependencies

@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from f7a7e0a to c570213 Compare April 9, 2026 17:09
@bgauryy

bgauryy commented Apr 9, 2026

Copy link
Copy Markdown

PR Review: fix: reconfigure MCP stdin/stdout to UTF-8 on Windows (fixes #363)

Executive Summary

Aspect Value
PR Goal Fix Windows encoding mismatch that crashes ChromaDB's HuggingFace tokenizer via MCP stdin/stdout
Files Changed 1
Risk Level LOW — platform-gated, try/except guarded, zero impact on non-Windows
Review Mode Quick
Review Effort 1 — trivial, well-scoped fix
Recommendation APPROVE with minor suggestions

Affected Areas: mempalace/mcp_server.pymain() entry point

Business Impact: Unblocks all Windows users encountering non-ASCII content via MCP (#363)

Flow Changes: All sys.stdin.readline() and sys.stdout.write() calls in the main loop now decode/encode as UTF-8 on Windows instead of the system ANSI codepage

Ratings

Aspect Score
Correctness 5/5
Security 5/5
Performance 5/5
Maintainability 4/5

PR Health

High Priority Issues

(Must fix before merge)

None. The fix is correct and well-scoped.

  • sys is already imported (line ~13)
  • sys.platform == "win32" is the canonical Python Windows check
  • reconfigure(encoding="utf-8", errors="strict") is the right call — strict mode prevents silently corrupting data; any UnicodeDecodeError is caught by the main loop's except Exception as e handler at line ~943 which logs and continues
  • The try/except guard handles Python < 3.7 (no reconfigure()) and non-standard streams
  • Placement before the main loop ensures all downstream I/O benefits

Medium Priority Issues

(Should fix, not blocking)

[Error Handling] #1: Silent failure swallows reconfigure errors with no diagnostic

Location: mempalace/mcp_server.py:930 (PR branch) | Confidence: HIGH

If reconfigure() fails on a Windows machine (e.g., frozen executable with custom I/O wrappers), the original bug silently persists with zero diagnostic output. The user gets the same crash as #363 and no clue that the fix was bypassed.

         except Exception:
-            pass  # older Python or non-reconfigurable stream
+            logger.warning("Could not reconfigure stdio to UTF-8: %s", e)

Note: requires renaming the except clause to except Exception as e:.


[Bug] #2: sys.stderr not reconfigured — logger output with non-ASCII may still corrupt on Windows

Location: mempalace/mcp_server.py:928-929 (PR branch) | Confidence: MED

The logger.error(f"Server error: {e}") at line ~943 and logger.exception() calls throughout the file write to stderr. If the exception message contains non-ASCII content (e.g., user-provided memory text in a traceback), stderr still uses the system codepage and could produce mojibake in logs.

             sys.stdin.reconfigure(encoding="utf-8", errors="strict")
             sys.stdout.reconfigure(encoding="utf-8", errors="strict")
+            sys.stderr.reconfigure(encoding="utf-8", errors="strict")

Lower priority since it only affects log readability, not protocol correctness.


Low Priority Issues

(Nice to have)

None.


Flow Impact Analysis

main()
  ├── [NEW] sys.stdin.reconfigure(utf-8)   ← only on win32
  ├── [NEW] sys.stdout.reconfigure(utf-8)  ← only on win32
  │
  ├── sys.stdin.readline()     ← now decodes as UTF-8 (was system codepage)
  │     └── json.loads(line)   ← receives clean UTF-8 strings
  │           └── handle_request(request)
  │                 └── TOOLS[...]["handler"](**tool_args)
  │                       └── ChromaDB → HuggingFace tokenizer  ← no more surrogate escapes
  │
  └── sys.stdout.write(json.dumps(response))  ← now encodes as UTF-8

Blast radius: Zero on non-Windows. On Windows, every stdin read and stdout write in the main loop is affected — this is the intended behavior.


Created by Octocode MCP https://octocode.ai

@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from c570213 to b1b9b5d Compare April 9, 2026 17:31
@mvalentsev

mvalentsev commented Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

Applied both: stderr is now reconfigured alongside stdin/stdout, and the except clause logs the reason instead of silently passing.


Edit 2026-05-03: two follow-ups landed since the above.

ea816f9 refined the policy from blanket errors="strict" to per-stream: stdin=surrogateescape (so a malformed byte from a misbehaving MCP client survives as a lone surrogate the downstream json.loads exception path can surface as a clean -32700 instead of UnicodeDecodeError killing the readline loop on the first bad byte), stdout/stderr stay strict (server-controlled writes, failures stay loud). Adds two mocked Windows tests pinning the per-stream policy.

dbdf5e6 extracts the per-stream loop to mempalace/_stdio.py so the CLI / fact_checker entry points (#1282) share the same helper -- internal-only refactor, no public API change. The MCP-side reconfigure now reads as a one-line reconfigure_stdio_utf8_on_windows(on_failure=...) call with the strict defaults, while the CLI side overrides stdout/stderr to replace.

@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from b1b9b5d to 0e9e614 Compare April 9, 2026 18:04
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from 0e9e614 to ebdd2c2 Compare April 9, 2026 21:05

@web3guru888 web3guru888 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix. The Windows codepage issue is real — non-English users on Windows get mojibake that eventually crashes ChromaDB's tokenizer downstream. The guard (sys.platform == "win32") is clean, and the try/except means it gracefully degrades.

Two notes:

  1. errors="strict" — Good call using strict mode rather than the default "replace". With MCP, you want to fail fast on encoding issues rather than silently corrupt JSON data.

  2. stderr too — Nice that you also reconfigure stderr. If the MCP server logs a non-ASCII palace path or drawer content in an error message, that would have crashed on Windows too.

This is a no-op on Linux/macOS, so zero risk to existing setups. We run on Linux but cross-platform compat is important for the ecosystem.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

@mvalentsev mvalentsev closed this Apr 10, 2026
@mvalentsev mvalentsev reopened this Apr 10, 2026
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch 2 times, most recently from 3182a4c to 7833da1 Compare April 10, 2026 16:41
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from 7833da1 to 5247e55 Compare April 11, 2026 11:29
@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch 4 times, most recently from 932337b to 9f52c4e Compare April 13, 2026 21:46
@igorls igorls added area/mcp MCP server and tools area/windows Windows-specific bugs and compatibility bug Something isn't working labels Apr 14, 2026
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch 2 times, most recently from c231860 to 3ddfb72 Compare April 26, 2026 19:00
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from 3ddfb72 to ea816f9 Compare May 3, 2026 16:33
@mvalentsev mvalentsev requested a review from igorls as a code owner May 3, 2026 16:33
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request May 3, 2026
The `python -m mempalace.fact_checker --stdin` entry point reads non-ASCII
text through the system ANSI codepage (cp1252/cp1251/cp950) on Windows,
which mojibakes characters before claim-extraction sees them. Reconfigure
stdin/stdout/stderr to UTF-8 with `errors="strict"`, wrapped in try/except
so a replaced stream (Jupyter, test harness) logs a warning rather than
crashing the CLI.

Mirrors the same fix shipped for `mcp_server.py:main()` (MemPalace#400) and
`hooks_cli.py:run_hook()` (MemPalace#1280) -- this is the third and last
stdin-reading entry point in the package.
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from dbdf5e6 to b79e09a Compare May 6, 2026 05:45
@mvalentsev

Copy link
Copy Markdown
Contributor Author

Rebased on develop now that #1060 has landed. Repositioned this PR's scope: with mempalace/_stdio.py shipped on develop, this PR wires mcp_server.py:main() through that shared helper instead of keeping a parallel inline reconfigure block. The helper's docstring specifies strict for stdout/stderr (server-controlled JSON-RPC; encode failures should be loud) and surrogateescape for stdin (a malformed byte survives as a lone surrogate so the json.loads error path can surface a clean -32700 instead of UnicodeDecodeError killing readline). The current inline block uses errors="replace" for stdin + stdout only with no stderr coverage; this PR brings main() in line with that helper's policy.

If the lighter replace form is preferred, happy to close -- the bug itself is functionally fixed by #1060.

@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch 5 times, most recently from 19c18b5 to 2a14a35 Compare May 15, 2026 13:32
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch 5 times, most recently from dc48ddd to ddc1e76 Compare May 22, 2026 05:32
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch 4 times, most recently from dec29ca to efa941f Compare May 24, 2026 23:25
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from efa941f to 93a921d Compare May 30, 2026 17:45
…ce#363)

On Windows, Python defaults stdin to the system ANSI codepage
(cp1251, cp1252, etc). When an MCP client sends UTF-8 JSON with
non-ASCII characters (Cyrillic, CJK, Hebrew, emoji), the bytes
are decoded through the wrong codepage and produce mojibake with
surrogate escapes. These broken strings pass Python type checks
but crash the HuggingFace tokenizer inside ChromaDB's embedding
function with "TextInputSequence must be str in query".

The fix calls sys.stdin.reconfigure(encoding="utf-8") at the top
of main(), guarded by sys.platform == "win32" so it only runs
where needed. stdout gets the same treatment so JSON responses
with non-ASCII content serialize cleanly.

Fixes MemPalace#363
…tests

The previous block reconfigured stdin / stdout / stderr to UTF-8 with
errors='strict'. On Windows, a malformed byte from a misbehaving MCP
client (or a stray BOM) then raises UnicodeDecodeError inside
sys.stdin.readline(), bypassing the inner try/except for json.loads
and killing the whole server on the first bad byte -- worse than the
pre-PR cp1252 behavior, which produced wrong-but-valid str downstream.

Split the policy:
  stdin  -> surrogateescape (lone surrogates pass through; json.loads
            then raises a recoverable parse error -> -32700 instead
            of process death)
  stdout -> strict (we control writes here; failures are real bugs)
  stderr -> strict (same)

Iterate over the three streams via getattr(sys, name) so a test
harness that strips a stream entirely degrades gracefully.

Add two mocked tests via _ReconfigurableStdio to pin the per-stream
policy AND verify the off-Windows branch is a no-op. Closes the
'no test added' coverage gap from pre-merge review.
Move the per-stream policy loop out of mcp_server.main() into
mempalace/_stdio.py so the same helper can serve other Windows entry
points (CLI, fact_checker) without copy-pasting the iteration. Behaviour
unchanged: stdin=surrogateescape, stdout=strict, stderr=strict;
on_failure callback routes failures through logger.warning as before.
@mvalentsev mvalentsev force-pushed the fix/mcp-stdin-utf8-windows branch from 93a921d to 2c4a08a Compare June 6, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP server and tools area/windows Windows-specific bugs and compatibility bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Windows: mempalace_search fails with "TextInputSequence must be str in query" on any non-ASCII query due to stdin encoding

5 participants