HTML numeric entities above U+FFFF are truncated during parse #725

@mcdurdin

  • Are you running the latest version?
  • Have you included sample input, output, error, and expected output?
  • Have you checked if you are using correct configuration?
  • Did you try the online tool?
  • Have you checked the docs for helpful APIs and examples?

Description

HTML numeric entities outside the Basic Multilingual Plane (U+0000 - U+FFFF) are truncated to their low 16 bits during parsing, e.g. U+1F60A (😊) becomes U+F60A.
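
The truncation described here matches what JavaScript's String.fromCharCode does with values above 0xFFFF: it keeps only the low 16 bits. Whether the parser actually goes through that call is an assumption, not something confirmed in this report; the sketch below only illustrates the arithmetic.

```javascript
// Illustration only: assumes (unconfirmed) that the entity value passes
// through a 16-bit conversion such as String.fromCharCode.
const codePoint = 0x1F60A; // 😊

// String.fromCharCode keeps only the low 16 bits of its argument.
const truncated = String.fromCharCode(codePoint);
console.log(truncated.codePointAt(0).toString(16)); // f60a

// String.fromCodePoint covers the full Unicode range and emits a surrogate pair.
const full = String.fromCodePoint(codePoint);
console.log(full.codePointAt(0).toString(16)); // 1f60a
console.log(full.length); // 2
```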

Input

<?xml version="1.0" encoding="UTF-8"?>
<note>&#x1F60A;&#128523;</note>

Code

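// XMLParser is assumed to come from the fast-xml-parser package.
const { XMLParser } = require("fast-xml-parser");

// The XML shown under "Input" above.
const xmlData = `<?xml version="1.0" encoding="UTF-8"?>
<note>&#x1F60A;&#128523;</note>`;

const options = {
    attributeNamePrefix: "",
    ignoreAttributes: false,
    processEntities: true,
    htmlEntities: true,
};
const parser = new XMLParser(options);
const result = parser.parse(xmlData);
console.dir(result);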

Output

{ "?xml": { version: "1.0", encoding: "UTF-8" }, note: "" }

For clarity, the 'note' field contains U+F60A U+F60B (private-use code points, which may render as blank) instead of the expected emoji U+1F60A U+1F60B (😊😋).

Expected data

{ '?xml': { version: "1.0", encoding: "UTF-8" }, note: "😊😋" }
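
For reference, a minimal sketch of decoding that would yield the expected value. The helper name decodeNumericEntities is hypothetical and is not part of the parser's API; it only shows that String.fromCodePoint handles both the hexadecimal and the decimal reference from the input.

```javascript
// Hypothetical helper, not part of the parser's API: decodes numeric
// character references (&#xHHHH; and &#DDDD;) with String.fromCodePoint.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, (match, ref) => {
    const value = ref[0] === "x"
      ? parseInt(ref.slice(1), 16)
      : parseInt(ref, 10);
    // Leave out-of-range references untouched rather than throwing.
    if (value > 0x10FFFF) return match;
    return String.fromCodePoint(value);
  });
}

const decoded = decodeNumericEntities("&#x1F60A;&#128523;");
console.log(decoded); // 😊😋
```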

Would you like to work on this issue?

  • Yes
  • No
