fix(feishu): recover Chinese filenames from Latin-1 mojibake in Content-Disposition #50435

Closed
lishuaigit wants to merge 2 commits into openclaw:main from lishuaigit:fix/feishu-chinese-filename-encoding
@lishuaigit

Contributor

Summary

  • Problem: When receiving files via Feishu with Chinese characters in the filename (e.g. "何不同舟渡_2.txt"), the saved filename is garbled (e.g. "æµ_è_æ_ä_2.txt"). This is a classic UTF-8 → Latin-1 mojibake.
  • Root cause: The Node.js HTTP parser decodes header values as ISO-8859-1 (per RFC 7230). When Feishu returns Content-Disposition: attachment; filename="何不同舟渡_2.txt" using the plain filename parameter (without the RFC 5987 form filename*=UTF-8''...), each 3-byte UTF-8 Chinese character becomes 3 separate Latin-1 characters.
  • Fix: Add tryRecoverLatin1AsUtf8() which detects the mojibake pattern (all chars in U+0000–U+00FF range, contains non-ASCII) and reconstructs the original UTF-8 string. The recovery is safely skipped for pure ASCII strings and strings with genuine non-Latin-1 Unicode characters.
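
The checks listed in the Fix bullet can be sketched roughly as follows. This is a reconstruction from the description, not the PR's actual diff (which is not shown on this page). Only the function name and the use of TextDecoder with fatal: true come from this thread; the loop structure is an assumption.

```typescript
// Hypothetical reconstruction of the helper described above.
function tryRecoverLatin1AsUtf8(value: string): string {
  const bytes = new Uint8Array(value.length);
  let hasNonAscii = false;
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    // A code unit above U+00FF cannot be a mis-decoded byte, so the string
    // already holds genuine Unicode and is returned untouched.
    if (code > 0xff) return value;
    if (code > 0x7f) hasNonAscii = true;
    // Every code unit is in 0x00-0xFF here, so it maps back to one byte.
    bytes[i] = code;
  }
  // Fast path: pure ASCII is identical in Latin-1 and UTF-8.
  if (!hasNonAscii) return value;
  try {
    // fatal: true makes decode() throw on malformed UTF-8 instead of
    // silently substituting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // Not valid UTF-8: treat the input as real Latin-1 text.
    return value;
  }
}
```

The sketch assumes TextDecoder is available as a global, as it is in current Node.js releases and browsers.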

Change Type

  • Bug fix

Scope

  • Feishu/Lark channel

Linked Issue

User-visible / Behavior Changes

Before: File "何不同舟渡_2.txt" → saved as "æµ_è_æ_ä_2---uuid.txt"

After: File "何不同舟渡_2.txt" → saved as "何不同舟渡_2---uuid.txt"

Security Impact

None. Pure string transformation with no external I/O.

Evidence

  • 32 tests passing (30 existing + 2 new filename recovery tests):
      ✓ recovers Chinese filenames from Latin-1 mojibake in Content-Disposition
      ✓ preserves ASCII filenames without modification

Compatibility

  • Backward compatible. ASCII filenames are unchanged (fast path).
  • The filename*=UTF-8''... path (already correct) is tried first.

Risks

Minimal. The recovery only fires when ALL characters are in the Latin-1 range AND the bytes form valid UTF-8. If the bytes are not valid UTF-8, the original string is returned unchanged.
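
To illustrate the fallback described above, here is a small sketch of how strict decoding separates the two cases. The helper name decodeStrictUtf8 and the sample byte arrays are illustrative and not taken from the PR.

```typescript
// Returns the decoded text, or null when the bytes are not well-formed UTF-8.
function decodeStrictUtf8(bytes: Uint8Array): string | null {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}

// "café.txt" as Latin-1 bytes. 0xE9 would open a 3-byte UTF-8 sequence, but
// the next byte (0x2E, ".") is not a continuation byte, so strict decoding
// fails and a caller can keep the original string.
const latin1Cafe = new Uint8Array([0x63, 0x61, 0x66, 0xe9, 0x2e, 0x74, 0x78, 0x74]);

// "何" (U+4F55) as UTF-8 bytes: a well-formed 3-byte sequence.
const utf8He = new Uint8Array([0xe4, 0xbd, 0x95]);

console.log(decodeStrictUtf8(latin1Cafe), decodeStrictUtf8(utf8He));
```

Without fatal: true, the decoder would replace the malformed byte with U+FFFD rather than throw, so the caller could not tell that recovery was inappropriate.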

[AI-assisted development by OpenClaw agent 虾干 🦐]

@openclaw-barnacle openclaw-barnacle Bot added the channel: feishu and size: S labels Mar 19, 2026
@greptile-apps

greptile-apps Bot commented Mar 19, 2026

Contributor

Greptile Summary

This PR fixes a well-known Latin-1 mojibake problem affecting Chinese filenames in Feishu Content-Disposition headers. Node.js decodes HTTP/1.x header values as ISO-8859-1, so a UTF-8 Chinese filename sent by Feishu (without filename*=UTF-8'' encoding) arrives as garbled Latin-1 characters. The new tryRecoverLatin1AsUtf8() helper detects this pattern and reconstructs the original string using TextDecoder("utf-8", { fatal: true }) for safe, lossless recovery.

  • Implementation is correct and well-guarded: the fast ASCII path skips the conversion entirely; the Latin-1 range guard prevents false attempts on strings that already contain genuine Unicode codepoints above U+00FF; and TextDecoder with fatal: true ensures the fallback to the original string whenever the bytes don't form valid UTF-8.
  • Known edge-case trade-off: a genuine Latin-1 filename whose bytes happen to form a valid UTF-8 sequence (e.g., Ã©file.txt, whose Latin-1 bytes are also the UTF-8 encoding of éfile.txt) would be silently remapped. This is an inherent ambiguity of the heuristic and is acknowledged in the PR description; for a Chinese-first service like Feishu it is an acceptable risk in practice.
  • Test coverage simulates the exact Node.js HTTP parser behavior (String.fromCharCode over UTF-8 bytes) and verifies both the recovery and the ASCII no-op paths. Adding a test with a European Latin-1 filename (e.g., café.txt) would further document the safe-fallback boundary, though it is not blocking.
  • No regressions: the filename*=UTF-8'' (RFC 5987) path is unchanged and still tried first.
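
The parser behavior that the tests are said to simulate (String.fromCharCode over UTF-8 bytes) can be reproduced with a few lines. This is a sketch; the function name simulateLatin1HeaderDecode is made up for illustration and is not from the PR's test file.

```typescript
// Encode the text as UTF-8, then map every byte to the code point with the
// same value, which is what an ISO-8859-1 decode of those bytes produces.
function simulateLatin1HeaderDecode(text: string): string {
  let out = "";
  new TextEncoder().encode(text).forEach((byte) => {
    out += String.fromCharCode(byte);
  });
  return out;
}

// One Chinese character becomes three Latin-1 range characters.
const garbled = simulateLatin1HeaderDecode("何");
console.log(garbled.length, garbled === "\u00e4\u00bd\u0095");
```

ASCII input passes through unchanged because its UTF-8 and Latin-1 encodings are byte-for-byte identical.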

Confidence Score: 4/5

  • Safe to merge — the change is a targeted, backward-compatible string transformation with no external I/O and a safe fallback path.
  • The core logic is correct: ASCII fast path, Latin-1 range guard, and TextDecoder with fatal: true give the right behaviour for every tested category (Chinese UTF-8 mojibake, plain ASCII, invalid UTF-8 sequences). The one point deducted is for the acknowledged false-positive edge case with Latin-1 filenames whose bytes form a coincidentally valid UTF-8 sequence, and for the absence of a test that explicitly documents that boundary.
  • No files require special attention.

Last reviewed commit: "fix(feishu): recover..."

@WingedDragon WingedDragon left a comment

✅ Approved

Scope: Recovers Chinese filenames from Latin-1 mojibake in Feishu Content-Disposition headers.

Strengths:

  • tryRecoverLatin1AsUtf8 is textbook: fast ASCII path, Latin-1 range check, TextDecoder with fatal: true for strict UTF-8 validation, catch returns original
  • Handles the real-world scenario where Node.js HTTP parser decodes UTF-8 bytes as ISO-8859-1 per RFC 7230
  • Tests cover both Chinese filename recovery and ASCII passthrough
  • Integration with existing decodeDispositionFileName is clean — recovery runs on the plain filename= match, after the filename*=UTF-8'' check
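
For readers without the diff at hand, the ordering described in the last bullet might look like the sketch below. The regular expressions and the recoverPlain parameter are assumptions made for illustration; the real decodeDispositionFileName is not shown on this page and may differ.

```typescript
// Hypothetical sketch of the lookup order: filename*=UTF-8'' first, then
// plain filename= with a recovery step applied to the raw value.
function decodeDispositionFileName(
  header: string,
  // Stand-in for tryRecoverLatin1AsUtf8; defaults to returning the value as-is.
  recoverPlain: (raw: string) => string = (raw) => raw,
): string | undefined {
  // 1. Prefer the RFC 5987 extended parameter, which is percent-encoded UTF-8.
  const extMatch = header.match(/filename\*\s*=\s*UTF-8''([^;]+)/i);
  if (extMatch?.[1]) {
    try {
      return decodeURIComponent(extMatch[1].trim());
    } catch {
      // Malformed percent-encoding: fall through to the plain parameter.
    }
  }
  // 2. Fall back to plain filename= (quoted or bare) and run the recovery.
  const plainMatch = header.match(/filename\s*=\s*"?([^";]+)"?/i);
  if (plainMatch?.[1]) {
    return recoverPlain(plainMatch[1].trim());
  }
  return undefined;
}
```

Passing the recovery step as a parameter keeps the sketch self-contained; in the PR the helper is called directly.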

No concerns. Ship it.

@WingedDragon WingedDragon left a comment

✅ Approved

Scope: Fix Feishu Chinese filenames garbled by Latin-1 mojibake in Content-Disposition headers. Node.js HTTP parser decodes headers as ISO-8859-1 per RFC 7230, but Feishu sends UTF-8 filenames.

Strengths:

  • tryRecoverLatin1AsUtf8 is a well-considered implementation:
    • Fast path for ASCII (no recovery needed)
    • Guard: only attempts recovery if all chars are in Latin-1 range (U+0000–U+00FF)
    • Uses TextDecoder with fatal: true — if the bytes aren't valid UTF-8, returns original unchanged
  • Two tests: Chinese filename recovery (何不同舟渡_2.txt) and ASCII preservation (report.pdf)
  • Applied at the decodeDispositionFileName level — covers all file downloads, not just a specific code path

No concerns. Correct fix for a real i18n encoding issue. Ship it.

@lishuaigit lishuaigit force-pushed the fix/feishu-chinese-filename-encoding branch from e6ba829 to b48b713 March 20, 2026 07:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b48b713cdc

Comment on lines +126 to +128
    if (plainMatch?.[1]) {
      const raw = plainMatch[1].trim();
      return tryRecoverLatin1AsUtf8(raw);

P2 Badge Avoid rewriting valid Latin-1 filenames as UTF-8

Applying tryRecoverLatin1AsUtf8 to every non-ASCII filename= value corrupts some filenames that were already correct. For example, a real filename like Ã©.txt or Â£ rates.txt is valid Latin-1 text, but TextDecoder will reinterpret those bytes as UTF-8 and change them to é.txt / £ rates.txt. Since sanitizeFileNameForUpload preserves these names on upload, downloads through this path no longer round-trip those legitimate filenames.
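
The class of names affected by this concern can be identified with a small predicate. This is an illustrative sketch, and the name wouldBeRewritten is hypothetical: it reports whether a string consists only of U+0000–U+00FF characters, contains at least one non-ASCII character, and has byte values that form well-formed UTF-8.

```typescript
// True when unconditional Latin-1-to-UTF-8 recovery would change the name.
function wouldBeRewritten(value: string): boolean {
  const bytes = new Uint8Array(value.length);
  let hasNonAscii = false;
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0xff) return false; // genuine Unicode, never touched
    if (code > 0x7f) hasNonAscii = true;
    bytes[i] = code;
  }
  if (!hasNonAscii) return false; // ASCII fast path, never touched
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true; // bytes are valid UTF-8, so the name would be replaced
  } catch {
    return false; // invalid UTF-8, original kept
  }
}

console.log(wouldBeRewritten("\u00c2\u00a3 rates.txt")); // "Â£ rates.txt"
console.log(wouldBeRewritten("\u00a3 rates.txt")); // "£ rates.txt"
```

Names such as £ rates.txt or café.txt are not at risk because a lone 0xA3 or 0xE9 byte is not valid UTF-8; only names that already look like mojibake, such as Â£ rates.txt, are ambiguous.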

@openclaw-barnacle openclaw-barnacle Bot added the docs label Mar 20, 2026
…nt-Disposition

Node.js HTTP parser decodes header values as ISO-8859-1 (per RFC 7230).
When Feishu returns a Content-Disposition header with a UTF-8 filename
in the plain `filename` parameter (without `filename*=UTF-8'...`),
each multibyte UTF-8 character becomes multiple Latin-1 code points,
producing garbled names like "æµ_è_æ_ä_2.txt" instead of "何不同舟渡_2.txt".

Add tryRecoverLatin1AsUtf8() which detects the mojibake pattern (all
chars in U+0000–U+00FF range) and attempts to reconstruct the original
UTF-8 string from the byte values. The recovery is skipped for pure
ASCII strings (fast path) and strings that already contain non-Latin-1
characters.

Closes openclaw#48388
@lishuaigit lishuaigit force-pushed the fix/feishu-chinese-filename-encoding branch from d8beb21 to af083d4 March 30, 2026 03:00
@openclaw-barnacle openclaw-barnacle Bot removed the docs label Mar 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af083d4ff6

    - return plainMatch?.[1]?.trim();
    + if (plainMatch?.[1]) {
    +   const raw = plainMatch[1].trim();
    +   return tryRecoverLatin1AsUtf8(raw);

P2 Badge Gate UTF-8 recovery behind a mojibake heuristic

Calling tryRecoverLatin1AsUtf8 for every non-ASCII filename= value rewrites some valid Latin-1 names that were not mojibake (for example "Â£ rates.txt" becomes "£ rates.txt"), so legitimate filenames can be silently changed on download. I verified this with fresh evidence by running the new helper logic locally and reproducing that exact transformation, which means this can break filename round-tripping whenever users intentionally include byte patterns that are valid UTF-8 sequences.
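
One possible shape for such a gate, offered only as a sketch: accept the recovered string only when it contains at least one character above U+00FF, meaning the decoded name could not have been written in Latin-1 at all. The function name recoverIfLikelyMojibake and the gating rule are assumptions for illustration; this thread does not say what heuristic, if any, the superseding fix uses.

```typescript
// Hypothetical gated variant of the recovery helper.
function recoverIfLikelyMojibake(value: string): string {
  const bytes = new Uint8Array(value.length);
  let hasNonAscii = false;
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code > 0xff) return value; // already genuine Unicode
    if (code > 0x7f) hasNonAscii = true;
    bytes[i] = code;
  }
  if (!hasNonAscii) return value; // ASCII fast path
  try {
    const decoded = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    // Gate (assumption): require evidence that Latin-1 could not represent
    // the decoded name, e.g. CJK characters.
    for (let i = 0; i < decoded.length; i++) {
      if (decoded.charCodeAt(i) > 0xff) return decoded;
    }
    return value; // decoded text fits in Latin-1, so keep the original
  } catch {
    return value; // not valid UTF-8
  }
}
```

With this rule "Â£ rates.txt" is left alone while mis-decoded Chinese names are still recovered. The trade-off is that a UTF-8 name made only of Latin-1 range characters (for example é.txt arriving as Ã©.txt) stays garbled, and a Latin-1 name whose bytes happen to decode above U+00FF can still be rewritten, so the gate narrows the ambiguity rather than removing it.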

@vincentkoc

Member

This is superseded by #72388, which has landed as the canonical ProjectClownfish fix path for this cluster.

Closing this now that the validated fix is merged. If this still reproduces on current main with a different path, reply here and we can reopen or split it back out.

@vincentkoc vincentkoc closed this Apr 28, 2026

Labels

channel: feishu, clawsweeper, size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: Feishu file names with Chinese characters are garbled (UTF-8 encoding issue)

3 participants