Symfony version(s) affected: 4.4.11 (symfony/dom-crawler)
Description
The Symfony DOM crawler can use an HTML5 parser when the respective package (masterminds/html5) is installed. However, the crawler only selects this parser when an HTML5 doctype is the first content in the HTML, ignoring only leading whitespace. The following situation therefore does not work (see reproduction):
How to reproduce
Consider the following file sample.html:
<!--
This is a comment
-->
<!DOCTYPE html>
<html lang="en">
<body>
<h1>Hello</h1>
</body>
</html>
Next, we create a crawler with this content:
$crawler = new \Symfony\Component\DomCrawler\Crawler(file_get_contents('sample.html'), 'https://example.com');
The file above is now parsed with the regular, non-HTML5 parser.
As seen on this line https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L186,
the condition selects parseXhtml instead of the expected parseHtml5:
$dom = null !== $this->html5Parser && strspn($content, " \t\r\n") === stripos($content, '<!doctype html>') ? $this->parseHtml5($content, $charset) : $this->parseXhtml($content, $charset);
This causes problems, since the file is actually an HTML5 document.
P.S. I don't know whether the HTML sample above conforms to the spec.
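To make the failing condition easier to see, here is a minimal standalone sketch that copies the comparison from the line quoted above into a plain function, so it runs without symfony/dom-crawler installed. The function name startsWithHtml5Doctype and both sample strings are made up for illustration; only the strspn/stripos comparison comes from the Crawler source.

```php
<?php

/**
 * Standalone copy of the doctype condition quoted above.
 * The function name is hypothetical and not part of Symfony.
 */
function startsWithHtml5Doctype(string $content): bool
{
    // strspn() counts the leading whitespace characters; stripos() returns
    // the offset of the first case-insensitive match, or false if none.
    // They are equal only when the doctype directly follows the whitespace.
    return strspn($content, " \t\r\n") === stripos($content, '<!doctype html>');
}

// Same shape as sample.html: a comment precedes the doctype.
$withComment = "<!--\nThis is a comment\n-->\n<!DOCTYPE html>\n<html lang=\"en\">\n<body>\n<h1>Hello</h1>\n</body>\n</html>\n";

// Only whitespace precedes the doctype.
$withoutComment = "\n  <!DOCTYPE html>\n<html lang=\"en\"><body><h1>Hello</h1></body></html>";

var_dump(startsWithHtml5Doctype($withComment));    // bool(false)
var_dump(startsWithHtml5Doctype($withoutComment)); // bool(true)
```

For the first string, strspn() returns 0 while stripos() returns the offset of the doctype after the comment, so the comparison fails and the non-HTML5 branch is taken.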
Possible Solution
1)
A dirty fix I'm using is simply discarding any HTML comments using a regex:
$content = preg_replace('/<!--.*?-->/s', '', file_get_contents('sample.html'));
$crawler = new Crawler($content, ...);
This is unlikely to be a complete solution. I can imagine websites that have <script> tags or even other HTML elements before the <!DOCTYPE html> declaration. Again, I do not know whether that is against the HTML5 spec.
2)
Add a feature so that the HTML5 parser can be forced for any content you pass. I do not know what the implications are, because non-HTML5 content would then also be parsed by the HTML5 parser.
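As a somewhat narrower variant of the workaround in option 1, the sketch below removes only the comments and whitespace that come before the doctype, and leaves comments elsewhere in the document alone. The helper name stripLeadingComments is hypothetical, and this is only a workaround under the assumption that nothing but comments and whitespace precedes the doctype; it does not cover the <script>-before-doctype case mentioned above.

```php
<?php

/**
 * Hypothetical helper: drop whitespace and comments that precede an HTML5
 * doctype, leaving everything from the doctype onwards untouched.
 */
function stripLeadingComments(string $content): string
{
    // \A anchors at the very start of the string. The atomic group (?>...)
    // stops a comment match from being stretched past its first "-->".
    // The lookahead requires a doctype right after the leading comments,
    // so documents without that shape are returned unchanged.
    $result = preg_replace('/\A(?>\s*<!--.*?-->)+\s*(?=<!doctype html>)/is', '', $content);

    // preg_replace() returns null on error; fall back to the input.
    return $result ?? $content;
}

$content = "<!--\nThis is a comment\n-->\n<!DOCTYPE html>\n<html lang=\"en\">\n<body>\n<!-- kept -->\n<h1>Hello</h1>\n</body>\n</html>\n";
$cleaned = stripLeadingComments($content);

var_dump(stripos($cleaned, '<!doctype html>'));        // int(0)
var_dump(strpos($cleaned, '<!-- kept -->') !== false); // bool(true)
```

The cleaned string can then be passed to the Crawler constructor in place of the raw file contents, in the same way as the regex workaround in option 1.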