Description
Thanks to @silby for the POD reader; it's useful.
The POD reader supports the POD E<...> escape syntax, based on HTML character entity names (and some POD-specific ones). For example E<trade> should produce a trademark symbol (™).
Many entity names are handled incorrectly: instead of being replaced with the specified character, they are unrecognised and then, per the POD specification, the E<...> sequence is passed through literally.
Test document:
=pod
E<trade>
E<ccaron>
Test command:
pandoc -i pandoc.pod -w plain
Actual output (E<...> sequences not recognised, so per the POD spec they are passed through "as is"):
E<trade>
E<ccaron>
Expected output (E<...> replaced with corresponding characters):
™
č
Pandoc version:
pandoc 3.7.0.2-nightly-2025-08-01
Investigation
The POD reader delegates entity name look-up to Text.Pandoc.XML (lookupEntity), which in turn uses Commonmark.Entity. Most entries in that module's entity list store the name with a trailing semicolon (for example "ccaron;", "\x010D"). If a semicolon is added inside the E<...> escape in the POD source (for example E<ccaron;>), pandoc handles the escape correctly. However, requiring semicolons inside E<...> is undesirable, because POD writes the entity name bare.
Could the POD reader append a semicolon to each E<...> value before calling lookupEntity?
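As a rough illustration of that suggestion (not the pandoc code itself), here is a sketch in Python. Its standard html.entities.html5 table has the same shape as the list described above: most names are stored with a trailing semicolon. The helper lookup_pod_entity is hypothetical. It tries the name as written, then retries with a semicolon appended.

```python
from html.entities import html5  # HTML5 named character references; most keys end in ";"


def lookup_pod_entity(name):
    """Resolve the name inside a POD E<...> escape, or return None.

    Hypothetical helper for illustration only; this is not pandoc's
    lookupEntity, and the table is Python's, not Commonmark.Entity's.
    """
    # Try the name exactly as written, then with the semicolon
    # that the table expects for most entries.
    for candidate in (name, name + ";"):
        if candidate in html5:
            return html5[candidate]
    return None
```

With this fallback, lookup_pod_entity("trade") and lookup_pod_entity("ccaron") both resolve, while an unknown name still returns None, so the E<...> sequence can be passed through literally as the POD spec requires.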
The HTML spec states:
Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:
Named character references
The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).
... implying that the semicolon is not part of the name, but also notes:
It is intentional, for legacy compatibility, that many code points have multiple character reference names. For example, some appear both with and without the trailing semicolon, or with different capitalizations.
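The mix of names with and without a semicolon can be seen in any table derived from the HTML named character references. The sketch below uses Python's html.entities.html5 as a stand-in. That this matches the contents of Commonmark.Entity is an assumption, based only on the description above. The variable names are illustrative.

```python
from html.entities import html5  # keys are entity names, mostly ending in ";"

# Names that the table also stores without a trailing semicolon
# (the legacy forms mentioned in the HTML spec note).
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

# A legacy name such as "amp" is present in both spellings.
amp_both_forms = "amp" in html5 and "amp;" in html5

# A name such as "ccaron" exists only with the semicolon, so a
# look-up of the bare name fails unless the semicolon is appended.
ccaron_bare_only_fails = "ccaron" not in html5 and "ccaron;" in html5
```

In this table, only a minority of names have a bare form, which would explain why a few E<...> escapes work today while most do not.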