fix(feishu): recover Chinese filenames from Latin-1 mojibake in Content-Disposition#50435
fix(feishu): recover Chinese filenames from Latin-1 mojibake in Content-Disposition#50435lishuaigit wants to merge 2 commits into
Conversation
Greptile SummaryThis PR fixes a well-known Latin-1 mojibake problem affecting Chinese filenames in Feishu
Confidence Score: 4/5
Last reviewed commit: "fix(feishu): recover..." |
WingedDragon
left a comment
There was a problem hiding this comment.
✅ Approved
Scope: Recovers Chinese filenames from Latin-1 mojibake in Feishu Content-Disposition headers.
Strengths:
tryRecoverLatin1AsUtf8is textbook: fast ASCII path, Latin-1 range check,TextDecoderwithfatal: truefor strict UTF-8 validation, catch returns original- Handles the real-world scenario where Node.js HTTP parser decodes UTF-8 bytes as ISO-8859-1 per RFC 7230
- Tests cover both Chinese filename recovery and ASCII passthrough
- Integration with existing
decodeDispositionFileNameis clean — recovery runs on the plainfilename=match, after thefilename*=UTF-8''check
No concerns. Ship it.
WingedDragon
left a comment
There was a problem hiding this comment.
✅ Approved
Scope: Fix Feishu Chinese filenames garbled by Latin-1 mojibake in Content-Disposition headers. Node.js HTTP parser decodes headers as ISO-8859-1 per RFC 7230, but Feishu sends UTF-8 filenames.
Strengths:
tryRecoverLatin1AsUtf8is a well-considered implementation:- Fast path for ASCII (no recovery needed)
- Guard: only attempts recovery if all chars are in Latin-1 range (U+0000–U+00FF)
- Uses
TextDecoderwithfatal: true— if the bytes aren't valid UTF-8, returns original unchanged
- Two tests: Chinese filename recovery (
何不同舟渡_2.txt) and ASCII preservation (report.pdf) - Applied at the
decodeDispositionFileNamelevel — covers all file downloads, not just a specific code path
No concerns. Correct fix for a real i18n encoding issue. Ship it.
e6ba829 to
b48b713
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b48b713cdc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (plainMatch?.[1]) { | ||
| const raw = plainMatch[1].trim(); | ||
| return tryRecoverLatin1AsUtf8(raw); |
There was a problem hiding this comment.
Avoid rewriting valid Latin-1 filenames as UTF-8
Applying tryRecoverLatin1AsUtf8 to every non-ASCII filename= value corrupts some filenames that were already correct. For example, a real filename like é.txt or £ rates.txt is valid Latin-1 text, but TextDecoder will reinterpret those bytes as UTF-8 and change it to é.txt / £ rates.txt. Since sanitizeFileNameForUpload preserves these names on upload, downloads through this path no longer round-trip those legitimate filenames.
Useful? React with 👍 / 👎.
…nt-Disposition Node.js HTTP parser decodes header values as ISO-8859-1 (per RFC 7230). When Feishu returns a Content-Disposition header with a UTF-8 filename in the plain `filename` parameter (without `filename*=UTF-8'...`), each multibyte UTF-8 character becomes multiple Latin-1 code points, producing garbled names like "æµ_è_æ_ä_2.txt" instead of "何不同舟渡_2.txt". Add tryRecoverLatin1AsUtf8() which detects the mojibake pattern (all chars in U+0000–U+00FF range) and attempts to reconstruct the original UTF-8 string from the byte values. The recovery is skipped for pure ASCII strings (fast path) and strings that already contain non-Latin-1 characters. Closes openclaw#48388
d8beb21 to
af083d4
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: af083d4ff6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| return plainMatch?.[1]?.trim(); | ||
| if (plainMatch?.[1]) { | ||
| const raw = plainMatch[1].trim(); | ||
| return tryRecoverLatin1AsUtf8(raw); |
There was a problem hiding this comment.
Gate UTF-8 recovery behind a mojibake heuristic
Calling tryRecoverLatin1AsUtf8 for every non-ASCII filename= value rewrites some valid Latin-1 names that were not mojibake (for example "£ rates.txt" becomes "£ rates.txt"), so legitimate filenames can be silently changed on download. I verified this with fresh evidence by running the new helper logic locally and reproducing that exact transformation, which means this can break filename round-tripping whenever users intentionally include byte patterns that are valid UTF-8 sequences.
Useful? React with 👍 / 👎.
|
This is superseded by #72388, which has landed as the canonical ProjectClownfish fix path for this cluster. Closing this now that the validated fix is merged. If this still reproduces on current main with a different path, reply here and we can reopen or split it back out. |
Summary
Content-Disposition: attachment; filename="何不同舟渡_2.txt"using the plainfilenameparameter (without RFC 5987filename*=UTF-8'...), each 3-byte UTF-8 Chinese character becomes 3 separate Latin-1 characters.tryRecoverLatin1AsUtf8()which detects the mojibake pattern (all chars in U+0000–U+00FF range, contains non-ASCII) and reconstructs the original UTF-8 string. The recovery is safely skipped for pure ASCII strings and strings with genuine non-Latin-1 Unicode characters.Change Type
Scope
Linked Issue
User-visible / Behavior Changes
Before: File "何不同舟渡_2.txt" → saved as "æµ_è_æ_ä_2---uuid.txt"
After: File "何不同舟渡_2.txt" → saved as "何不同舟渡_2---uuid.txt"
Security Impact
None. Pure string transformation with no external I/O.
Evidence
Compatibility
filename*=UTF-8'...path (already correct) is tried first.Risks
Minimal. The recovery only fires when ALL characters are in the Latin-1 range AND the bytes form valid UTF-8. If the bytes are not valid UTF-8, the original string is returned unchanged.
[AI-assisted development by OpenClaw agent 虾干 🦐]