
fix(markdown): decode HTML entities in code blocks and inline code #658

Merged: esengine merged 1 commit into main from fix/issue-657-decode-html-entities on May 11, 2026

@esengine (Owner)

Summary

The reporter showed a JSON snippet rendered as `{ &quot;apiKey&quot;: &quot;...&quot; }` inside a code fence: the model emitted literal HTML entities instead of `"`. marked passes the entities through verbatim (its tokens carry raw text, not HTML-escaped output), and our renderer hands that to the terminal unchanged. Terminals don't render entities, so `&quot;` / `&amp;` / `&lt;` leak as visible artifacts.

This is a known LLM artifact: models sometimes HTML-escape inside code blocks — especially on JSON / HTML / XML output — because their training data had plenty of HTML-encoded code in web posts and docs. Claude Code and Cursor both decode entities at the rendering boundary; doing the same here.

Scope

  • New src/cli/ui/html-entities.ts with decodeHtmlEntities(): handles the five common named entities (quot / apos / amp / lt / gt) and nbsp, plus numeric forms (`&#34;` / `&#x22;`). Unknown named entities pass through so prose that quotes entity names by name doesn't get corrupted. Fast-path early return when no & is present so paragraph text pays nothing.
  • Decode applied at four sites: CodeBlock text and codespan text in both markdown.tsx and markdown-lines.ts.
  • Prose paragraph text is left alone: limiting scope to code keeps the edge case of "user genuinely wrote about `&amp;`" working in prose.
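
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: only the function name `decodeHtmlEntities` and the listed behaviors come from the description, while the `NAMED_ENTITIES` table, the `ENTITY_RE` pattern, and the out-of-range guard are assumptions made for the example.

```typescript
// Hypothetical sketch; the real src/cli/ui/html-entities.ts may differ.
const NAMED_ENTITIES: Record<string, string> = {
  quot: '"',
  apos: "'",
  amp: "&",
  lt: "<",
  gt: ">",
  nbsp: "\u00A0",
};

// Matches &name;, &#34; and &#x22;. A fragment without the closing ";"
// does not match and is therefore left alone.
const ENTITY_RE = /&(?:#x([0-9a-f]+)|#([0-9]+)|([a-z][a-z0-9]*));/gi;

function decodeHtmlEntities(text: string): string {
  // Fast path: no ampersand means there is nothing to decode.
  if (!text.includes("&")) return text;

  return text.replace(
    ENTITY_RE,
    (match: string, hex?: string, dec?: string, name?: string): string => {
      if (name !== undefined) {
        // Unknown names (for example &copy;) pass through unchanged.
        return NAMED_ENTITIES[name.toLowerCase()] ?? match;
      }
      const codePoint =
        hex !== undefined ? parseInt(hex, 16) : parseInt(dec ?? "", 10);
      // Assumption: values beyond the Unicode range are left as written,
      // since String.fromCodePoint would throw a RangeError on them.
      if (Number.isNaN(codePoint) || codePoint > 0x10ffff) return match;
      return String.fromCodePoint(codePoint);
    },
  );
}

// The JSON pattern from the issue decodes back to plain quotes.
console.log(decodeHtmlEntities("{ &quot;apiKey&quot;: &quot;...&quot; }"));
```

Because the replacement runs in a single pass, a double-escaped input such as `&amp;quot;` decodes one level only, to `&quot;`, which seems the safer choice when the goal is to undo one layer of escaping at the rendering boundary.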

Closes #657

Test plan

  • npm run verify — 2601 passed (added 9), 2 skipped
  • tests/html-entities.test.ts covers: no-& fast path, five named entities, `&nbsp;` → NBSP, decimal + hex numeric (incl. emoji codepoint), unknown-name pass-through, case-insensitivity, malformed & fragments left alone, and the literal real-world JSON pattern from the issue
  • Manual: ask the model to emit a JSON code block and verify quotes render as `"` not `&quot;`

The reporter showed a JSON snippet rendered as
`{ &quot;apiKey&quot;: &quot;...&quot; }` inside a code fence: the
model emitted literal HTML entities instead of `"`. marked passes the
entities through verbatim (its tokens carry the raw text, not
HTML-escaped output), and our renderer rightly hands that to the
terminal unchanged. Terminals don't render entities, so they leak as
visible `&quot;` / `&amp;` / `&lt;` etc.

This is a known LLM artifact: models sometimes HTML-escape inside code
blocks, especially on JSON / HTML / XML output, because their training
saw a lot of HTML-encoded code in web posts and docs. Both Claude Code
and Cursor decode entities at the rendering boundary; doing the same
here.

Scope: only code blocks and inline code spans (the contexts where
models leak entities the most). Prose paragraphs are left alone: if
someone genuinely writes "use the `&amp;` entity to escape ampersand"
in non-code text, the entity name stays visible. Numeric forms
(`&#34;` / `&#x22;`) and the five common named forms (quot / apos /
amp / lt / gt) plus nbsp decode; unknown names pass through so we
don't corrupt prose that quotes entity names.

Closes #657
esengine merged commit cf0e920 into main on May 11, 2026
3 checks passed
esengine deleted the fix/issue-657-decode-html-entities branch on May 11, 2026 at 07:38
esengine mentioned this pull request on May 11, 2026
ChasLui pushed a commit to ChasLui/DeepSeek-Reasonix that referenced this pull request May 23, 2026
Development

Successfully merging this pull request may close these issues.

Support for special characters
