POD reader: many E<...> escape sequences are not handled correctly #11015

@njohnston

Thanks to @silby for the POD reader; it's useful.

The POD reader supports the POD E<...> escape syntax, based on HTML character entity names (and some POD-specific ones). For example E<trade> should produce a trademark symbol (™).

Many entity names are handled incorrectly: they are not recognised, so instead of being replaced with the specified character, the E<...> sequence is passed through literally, as the POD specification prescribes for unknown names.

Test document:

=pod

E<trade>

E<ccaron>

Test command:

pandoc -i pandoc.pod -w plain

Actual output (E<...> sequences not recognised, so per the POD spec they are passed through "as is"):

E<trade>

E<ccaron>

Expected output (E<...> replaced with corresponding characters):

™

č

Pandoc version:

pandoc 3.7.0.2-nightly-2025-08-01

Investigation

The POD reader delegates entity name look-up to lookupEntity in Text.Pandoc.XML, which in turn uses Commonmark.Entity. Most entries in that module's entity list are keyed by the character name with a trailing semicolon (for example ("ccaron;", "\x010D")). If a semicolon is added inside the E<...> escape sequence in the POD source (for example E<ccaron;>), pandoc handles the escape sequence correctly. However, requiring semicolons inside E<...> is undesirable.
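
To illustrate the mismatch, here is a minimal sketch in Python rather than Haskell. It uses the standard library's html.entities.html5 table as a stand-in for the Commonmark.Entity list. That is an assumption: the two are separate implementations, although the Python table shows the same shape described above, with most names stored with a trailing semicolon. The lookup helper is hypothetical and is not pandoc's lookupEntity.

```python
from html.entities import html5


def lookup(name):
    """Look the name up exactly as given, with no normalisation."""
    return html5.get(name)


# The bare names used in E<...> miss; the semicolon-suffixed keys hit.
for name in ("trade", "trade;", "ccaron", "ccaron;"):
    print(f"{name!r:10} -> {lookup(name)!r}")
```

In this stand-in table the bare names trade and ccaron are absent while trade; and ccaron; are present, which matches the behaviour reported for the POD reader.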

Could the POD reader append a semicolon to each E<...> value before calling lookupEntity?
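
A rough sketch of that idea, again in Python with html.entities.html5 standing in for the real entity list (an assumption, not pandoc's code). The resolve_escape function and the POD_SPECIFIC table are hypothetical names; the two POD-specific entries are the ones listed in perlpod. Numeric escapes such as E<0x201E> are left out to keep the sketch short.

```python
from html.entities import html5

# POD-specific escape names documented in perlpod (checked before HTML names).
POD_SPECIFIC = {"verbar": "|", "sol": "/"}


def resolve_escape(name: str) -> str:
    """Return the character for E<name>, or the literal escape if unknown."""
    if name in POD_SPECIFIC:
        return POD_SPECIFIC[name]
    # Append the semicolon that the entity table expects in its keys.
    char = html5.get(name + ";")
    if char is not None:
        return char
    # Unknown name: pass the sequence through unchanged.
    return f"E<{name}>"


for name in ("trade", "ccaron", "verbar", "nosuchname"):
    print(name, "->", resolve_escape(name))
```

Because the semicolon is always appended, a source that already writes E<ccaron;> would no longer resolve under this sketch; whether to keep accepting that form is a separate design decision.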

The HTML spec states:

Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:

Named character references

The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

... implying that the semicolon is not part of the name, but also notes:

It is intentional, for legacy compatibility, that many code points have multiple character reference names. For example, some appear both with and without the trailing semicolon, or with different capitalizations.
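
One way to probe that note is to list the names that exist without a trailing semicolon and check whether each also has a semicolon-terminated twin with the same value. The sketch below does this against Python's html.entities.html5 table, used here as an assumed proxy for the HTML list; it says nothing definitive about the contents of Commonmark.Entity. The variable names are illustrative.

```python
from html.entities import html5

# Names stored without a trailing semicolon (the legacy forms).
bare = sorted(name for name in html5 if not name.endswith(";"))

# Legacy names whose semicolon-terminated twin is missing or differs.
missing_twin = [name for name in bare if html5.get(name + ";") != html5[name]]

print(len(bare), "of", len(html5), "names have no trailing semicolon")
print("examples:", bare[:5])
print("bare names without an identical ';' twin:", missing_twin)
```

If the second list is empty, appending a semicolon before the look-up would not lose any name that resolves today, at least for this table.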
