[DomCrawler] HTML5 not recognized when document starts with a comment

**Symfony version(s) affected**: 4.4.11 (`symfony/dom-crawler`)

**Description**
Symfony's DomCrawler can use an HTML5 parser when the respective package (`masterminds/html5`) is installed. However, the crawler only selects this parser when an HTML5 doctype is the first non-whitespace content in the document. The following situation therefore does not work (see reproduction):

**How to reproduce** 
Consider the following file `sample.html`: 
```html
<!-- some comment -->
<!DOCTYPE html>
<html lang="en">
<body>
    <h1>Hello</h1>
</body>
</html>
```

Next, we create a crawler with this content:
```php
$crawler = new \Symfony\Component\DomCrawler\Crawler(file_get_contents('sample.html'), 'https://example.com');
```

The file above is now parsed using the regular, non-HTML5 parser.

As seen on this line https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L186,
it evaluates to `parseXhtml` instead of the expected `parseHtml5`:
```php
$dom = null !== $this->html5Parser && strspn($content, " \t\r\n") === stripos($content, '<!doctype html>') ? $this->parseHtml5($content, $charset) : $this->parseXhtml($content, $charset);
```
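To make the failure easier to see, here is a minimal standalone sketch of that condition, without Symfony. The helper name `looksLikeHtml5()` is hypothetical and only wraps the `strspn`/`stripos` comparison quoted above; the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper (not part of DomCrawler): reproduces only the
// doctype comparison from the quoted line, without the parser check.
function looksLikeHtml5(string $content): bool
{
    // True only when the doctype starts right after the leading whitespace.
    return strspn($content, " \t\r\n") === stripos($content, '<!doctype html>');
}

$plain = "\n<!DOCTYPE html>\n<html lang=\"en\"><body><h1>Hello</h1></body></html>";
$withComment = "<!-- comment -->\n" . $plain;

var_dump(looksLikeHtml5($plain));       // bool(true)
var_dump(looksLikeHtml5($withComment)); // bool(false)
```

With a leading comment, `strspn` returns 0 while `stripos` returns the offset of the doctype after the comment, so the comparison fails and the crawler falls back to `parseXhtml`.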

This causes problems, since the file is actually an HTML5 document.

P.S. I don't know whether the HTML sample above conforms to the spec.

**Possible Solution** 
**1)**
A dirty fix I'm using is simply discarding any HTML comments using a regex:
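```php
use Symfony\Component\DomCrawler\Crawler;

// Strip all HTML comments so that `<!DOCTYPE html>` becomes the first content.
$content = preg_replace('/<!--.*?-->/s', '', file_get_contents('sample.html'));
$crawler = new Crawler($content, ...);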
```
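A narrower variant of this workaround would remove only the comments that precede the doctype and leave comments inside the document alone. The following is a sketch under that assumption; `stripLeadingComments()` is a hypothetical helper name, and it does not handle other content (such as `<script>` tags) in front of the doctype.

```php
<?php
// Hypothetical helper: removes comments and whitespace that appear
// before an HTML5 doctype, and nothing else.
function stripLeadingComments(string $html): string
{
    // (?:(?!-->).)* keeps each match inside a single comment, so markup
    // between two comments is never swallowed.
    $pattern = '/\A(?:\s*<!--(?:(?!-->).)*-->)+\s*(?=<!doctype html>)/is';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = "<!-- generated -->\n<!DOCTYPE html>\n<html lang=\"en\"><body><h1>Hello</h1></body></html>";
$clean = stripLeadingComments($html);

var_dump(stripos($clean, '<!doctype html>')); // int(0)
```

The cleaned string can then be passed to the `Crawler` constructor in place of the raw file contents. Input without a doctype is returned unchanged.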

This is unlikely to be a complete solution. I can imagine websites that have `<script>` tags or even other HTML elements before the `<!DOCTYPE html>` declaration. Again, I do not know whether this violates the HTML5 spec.

**2)**
Add a feature so that the HTML5 parser can be forced for any content you pass. I do not know what the implications would be, since non-HTML5 content would then also be parsed by the HTML5 parser.
