Skip to content

fix: Extra content at the end of the document#161

Merged
kou merged 2 commits into
ruby:masterfrom
naitoh:fix_multiple_root_elements
Jul 7, 2024
Merged

fix: Extra content at the end of the document#161
kou merged 2 commits into
ruby:masterfrom
naitoh:fix_multiple_root_elements

Conversation

@naitoh

@naitoh naitoh commented Jul 4, 2024

Copy link
Copy Markdown
Contributor

Why?

XML with additional content at the end of the document is invalid.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

[27]    Misc       ::=          Comment | PI | S

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

[16]    PI         ::=          '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

[17]    PITarget           ::=          Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

## Why?

XML with multiple root elements is invalid.

See: ruby#160 (comment)
Comment thread lib/rexml/parsers/baseparser.rb Outdated
Comment thread lib/rexml/parsers/baseparser.rb Outdated
Comment thread lib/rexml/parsers/baseparser.rb Outdated
Comment thread test/parser/test_base_parser.rb Outdated
Comment thread test/parser/test_base_parser.rb Outdated
Comment thread test/parser/test_base_parser.rb Outdated
Comment thread test/parser/test_base_parser.rb Outdated
@naitoh naitoh force-pushed the fix_multiple_root_elements branch from 641d9d1 to 4e9de51 Compare July 5, 2024 04:19
@naitoh naitoh requested a review from kou July 5, 2024 04:30
naitoh added a commit to naitoh/rexml that referenced this pull request Jul 6, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: ruby#161 (comment)
Comment thread test/parse/test_processing_instruction.rb Outdated
Comment thread test/parse/test_comment.rb Outdated
Comment thread test/parse/test_element.rb Outdated
Comment thread test/parse/test_element.rb Outdated
Comment thread test/parse/test_text.rb Outdated
Comment thread lib/rexml/parsers/baseparser.rb Outdated
@source.position -= "<".bytesize
end
if @tags.empty? and @have_root
if text.strip != ""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strip allocates a new string. Can we avoid it?
For example: /\A\s*\z/.match?(text)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.
I see.

## Why?

XML with additional content at the end of the document is invalid.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Misc

```
[27]   	Misc	   ::=   	Comment | PI | S
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PI

```
[16]   	PI	   ::=   	'<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget

```
[17]   	PITarget	   ::=   	Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
```
@naitoh naitoh force-pushed the fix_multiple_root_elements branch from 4e9de51 to c094825 Compare July 7, 2024 00:40
@naitoh naitoh requested a review from kou July 7, 2024 00:45
naitoh added a commit to naitoh/rexml that referenced this pull request Jul 7, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: ruby#161 (comment)
@kou kou merged commit eb45c8d into ruby:master Jul 7, 2024
@kou

kou commented Jul 7, 2024

Copy link
Copy Markdown
Member

Thanks.

kou pushed a commit that referenced this pull request Jul 7, 2024
## Why?
XML declaration must be the first item.

https://www.w3.org/TR/2006/REC-xml11-20060816/#document

```
[1]   document   ::=   ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-prolog

```
[22]   prolog   ::=   	XMLDecl Misc* (doctypedecl Misc*)?
```

https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-XMLDecl

```
[23]   XMLDecl  ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
```

See: #161 (comment)
@naitoh naitoh deleted the fix_multiple_root_elements branch July 7, 2024 21:02
@DmitryPogrebnoy

DmitryPogrebnoy commented Oct 15, 2024

Copy link
Copy Markdown
Contributor

@naitoh @kou After this change, parsing of '<threads><thread id="1"/></threads><threads><thread id="1"/></threads>' using REXML::Parsers::PullParser.new(socket).pull throw an illegal seek exception. Should I create a separate issue for this problem?

The main cause is this if statement at lib/rexml/parsers/baseparser.rb:498

if @tags.empty? and @have_root
  raise ParseException.new("Malformed XML: Extra tag at the end of the document (got '<#{tag}')", @source)
end

@kou

kou commented Oct 16, 2024

Copy link
Copy Markdown
Member

<threads><thread id="1"/></threads><threads><thread id="1"/></threads> is an invalid XML.
Are you suggesting that we should support an invalid XML?

@DmitryPogrebnoy

DmitryPogrebnoy commented Oct 16, 2024

Copy link
Copy Markdown
Contributor

@kou Well, I use socket to get xml messages from the server and parse them using PullParser. Each message is complete and valid. Before change it worked like a charm. Now it doesn't work anymore.
If the current way with REXML::Parsers::PullParser.new(socket).pull does not work, what is the right approach for such a use case?

@kou

kou commented Oct 16, 2024

Copy link
Copy Markdown
Member

OK. Could you open a new issue for it?

@DmitryPogrebnoy

Copy link
Copy Markdown
Contributor

@kou Yep, here it is - #214

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants