fix: sanitize malformed Unicode in MCP responses #39625
pavelfeldman merged 3 commits into microsoft:main
Also looks like agent-generated blob that fails lint. Better place to handle it though. Assume Node 20 and hand-craft the tests please.
Add sanitizeUnicode() function to replace lone surrogates with U+FFFD before JSON serialization. This prevents "invalid high surrogate in string" errors when page content contains malformed Unicode. Uses String.prototype.toWellFormed() on Node 20+, with fallback for Node 18 compatibility. Integrated into response serialization pipeline alongside existing redactText() function.
Add comprehensive tests for malformed Unicode handling in MCP responses:
- Lone high surrogates
- Lone low surrogates
- Valid surrogate pairs (emoji)
- Mixed CJK content with malformed Unicode
- Multiple consecutive lone surrogates
- Console messages with lone surrogates
All tests verify that MCP responses don't fail with JSON serialization errors when encountering malformed Unicode from page content.
- Remove Node 18 fallback, use toWellFormed() only
- Rewrite tests to match MCP test patterns (3 focused tests)
- Use server.setContent() for simpler test setup
- Reduce test complexity while maintaining coverage
Thanks for the feedback. I've simplified the changes based on your suggestions: Changes made:
The branch has also been rebased onto latest main and the PR is now up to date.
Replace lone surrogates with U+FFFD for all language bindings, not just non-JavaScript ones, so that protocol messages are always well-formed. Fixes microsoft#39625
 */
function sanitizeUnicode(text: string): string {
  // Use native toWellFormed() when available (Node 20+)
  if (typeof text.toWellFormed === 'function') {
safe to assume Node 20+ for MCP, we are moving to being 20 altogether.
Test results for "MCP": 5295 passed, 165 skipped. Merge workflow run.
Summary
Fixes "invalid high surrogate in string" JSON serialization errors in MCP responses when page content contains malformed Unicode (lone surrogates).
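To make the failure mode concrete, here is a minimal sketch of how a lone surrogate ends up in a serialized payload. The sample string and the hasLoneSurrogate() helper are illustrative assumptions, not code from this PR. Note that JSON.stringify() in Node does not throw here; it emits a `\ud83d` escape, which is what a stricter downstream JSON consumer may then reject.

```typescript
// Illustrative only: the sample text and helper name are assumptions.

// A lone high surrogate: the first half of an emoji whose second half is missing.
const malformed = 'price: \uD83D';

// Returns true when the text contains a surrogate code unit that is not part
// of a valid high+low pair. Valid pairs are removed first, so any surrogate
// that remains must be unpaired.
function hasLoneSurrogate(text: string): boolean {
  const withoutPairs = text.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '');
  return /[\uD800-\uDFFF]/.test(withoutPairs);
}

// JSON.stringify() escapes the lone surrogate instead of failing.
const payload = JSON.stringify({ text: malformed });

console.log(hasLoneSurrogate(malformed)); // true
console.log(payload);                     // {"text":"price: \ud83d"}
```

The escaped form is still valid for JavaScript's own JSON.parse(), which is why the problem only surfaces once the response leaves the process.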
Changes
- Add sanitizeUnicode() function (response.ts)
  - Uses String.prototype.toWellFormed() on Node 20+ for native Unicode sanitization
- Integrate into response serialization
  - Runs alongside the existing redactText() function in the serialize() method
- Add comprehensive tests (unicode-serialization.spec.ts)
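The changes listed above can be sketched roughly as follows. This is not the code from response.ts: the regex path is an assumption added so the sketch also runs where toWellFormed() is missing, the redactText() body is a hypothetical stand-in (the real function is not shown in this PR), and the order of the two steps in serializeText() is likewise assumed.

```typescript
// Sketch under stated assumptions; names other than sanitizeUnicode() and
// redactText() are hypothetical.

// Lets the sketch type-check even when the TS lib lacks toWellFormed().
type MaybeWellFormed = string & { toWellFormed?: () => string };

// Matches either a valid surrogate pair or any single surrogate code unit.
const PAIR_OR_SURROGATE = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

function sanitizeUnicode(text: string): string {
  // Preferred path: native String.prototype.toWellFormed() (Node 20+).
  const native = (text as MaybeWellFormed).toWellFormed;
  if (typeof native === 'function')
    return native.call(text);
  // Assumed fallback: keep valid pairs, replace lone surrogates with U+FFFD.
  return text.replace(PAIR_OR_SURROGATE, match => match.length === 2 ? match : '\uFFFD');
}

// Hypothetical stand-in for redactText(): masks each known secret value.
function redactText(text: string, secrets: string[]): string {
  let result = text;
  for (const secret of secrets) {
    if (secret)
      result = result.split(secret).join('***');
  }
  return result;
}

// Assumed ordering: redact first, then sanitize, before JSON serialization.
function serializeText(text: string, secrets: string[]): string {
  return sanitizeUnicode(redactText(text, secrets));
}
```

Calling serializeText('token=abc \uD83D', ['abc']) in this sketch masks the secret and turns the lone surrogate into U+FFFD, while a valid emoji pair passes through untouched.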
Context
Closes microsoft/playwright-mcp#1447
Previous attempt (PR #1448 in playwright-mcp) was closed because it fixed the issue at the transport layer in cli.js. The proper fix location is in response.ts, where text processing already happens via redactText().
Verification
npm run ctest-mcp unicode-serialization # 6 passed (35.3s)
All tests pass, confirming that: