proposal: encoding/xml: update character ranges for names to fifth edition (2008) specification #28124

@iand

Currently the validation of XML names is based on the original 1998 specification, which defines a large set of code point ranges to accept. These ranges were widened and simplified in the fifth edition of the spec, published in 2008, which is now the current version.

The name production rules are now:

NameStartChar  ::=       ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | 
                           [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | 
                           [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | 
                           [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 
                           [#x10000-#xEFFFF]
NameChar       ::=       NameStartChar | "-" | "." | [0-9] | #xB7 | 
                           [#x0300-#x036F] | [#x203F-#x2040]
Name           ::=       NameStartChar (NameChar)*
Names          ::=       Name (#x20 Name)*
Nmtoken        ::=       (NameChar)+
Nmtokens       ::=       Nmtoken (#x20 Nmtoken)*
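As a rough illustration of how small the fifth-edition rules are, the productions above could be transcribed along these lines. This is only a sketch, not the code from any CL: the names `runeRange`, `nameStartRanges`, `nameExtraRanges`, `inRanges`, `isNameStartChar`, `isNameChar` and `isName` are hypothetical, and rejecting invalid UTF-8 up front is an assumption of the sketch (otherwise a bad byte would decode to U+FFFD, which falls inside [#xFDF0-#xFFFD]).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeRange is an inclusive range of code points (hypothetical helper type).
type runeRange struct{ lo, hi rune }

// nameStartRanges transcribes the NameStartChar production.
var nameStartRanges = []runeRange{
	{':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
	{0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF},
	{0x370, 0x37D}, {0x37F, 0x1FFF}, {0x200C, 0x200D},
	{0x2070, 0x218F}, {0x2C00, 0x2FEF}, {0x3001, 0xD7FF},
	{0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF},
}

// nameExtraRanges lists what NameChar allows in addition to NameStartChar.
var nameExtraRanges = []runeRange{
	{'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
	{0x0300, 0x036F}, {0x203F, 0x2040},
}

// inRanges reports whether r falls inside any of the given ranges.
func inRanges(r rune, ranges []runeRange) bool {
	for _, rr := range ranges {
		if rr.lo <= r && r <= rr.hi {
			return true
		}
	}
	return false
}

func isNameStartChar(r rune) bool { return inRanges(r, nameStartRanges) }

func isNameChar(r rune) bool {
	return isNameStartChar(r) || inRanges(r, nameExtraRanges)
}

// isName reports whether s matches Name ::= NameStartChar (NameChar)*.
// Invalid UTF-8 is rejected first (an assumption of this sketch).
func isName(s string) bool {
	if s == "" || !utf8.ValidString(s) {
		return false
	}
	for i, r := range s {
		if i == 0 {
			if !isNameStartChar(r) {
				return false
			}
			continue
		}
		if !isNameChar(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"xml:lang", "_a-b.c1", "1abc", "a b"} {
		fmt.Printf("%q: %v\n", s, isName(s))
	}
}
```

A linear scan over 16 ranges is used here for readability; a real implementation would more likely use a lookup table.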

This may also address most of the requirements for XML 1.1 support (#25755), since the changes between 1.0 and 1.1 were the expansion of the name character ranges, the addition of two line-ending characters (U+0085, U+2028), and the specification of additional normalisation rules.

The current ranges span 300 lines of code in the xml package, so changing this will also contribute to #26775.

If there is interest then I can submit a CL.

Labels: FeatureRequest, Proposal