Skip to content

Parsing to normalize HTML in spec tests not always appropriate #702

@ollpu

Description

@ollpu

Currently when normalizing HTML for the correct answer and library output, the HTML is parsed with html5ever. This doesn't seem appropriate because some tests explicitly pass through invalid HTML.

I enabled error reporting like this:

diff
diff --git a/tests/lib.rs b/tests/lib.rs
index c1c2181..583c8fb 100644
--- a/tests/lib.rs
+++ b/tests/lib.rs
@@ -1,6 +1,8 @@
 #![cfg(feature = "html")]
 
 use html5ever::serialize::{serialize, SerializeOpts};
+use html5ever::tokenizer::TokenizerOpts;
+use html5ever::tree_builder::TreeBuilderOpts;
 use html5ever::{driver as html, local_name, namespace_url, ns, QualName};
 use markup5ever_rcdom::{Handle, NodeData, RcDom, SerializableHandle};
 use pulldown_cmark::{Options, Parser};
@@ -38,7 +40,12 @@ pub fn test_markdown_html(input: &str, output: &str, smart_punct: bool, metadata
     let p = Parser::new_ext(input, opts);
     pulldown_cmark::html::push_html(&mut s, p);
 
-    assert_eq!(normalize_html(output), normalize_html(&s));
+    println!("normalizing correct HTML");
+    let correct = normalize_html(output);
+    println!("ok");
+    println!("normalizing result HTML");
+    let result = normalize_html(&s);
+    assert_eq!(correct, result);
 }
 
 lazy_static::lazy_static! {
@@ -114,7 +121,16 @@ lazy_static::lazy_static! {
 fn make_html_parser() -> html::Parser<RcDom> {
     html::parse_fragment(
         RcDom::default(),
-        html::ParseOpts::default(),
+        html::ParseOpts {
+            tokenizer: TokenizerOpts {
+                exact_errors: true,
+                ..Default::default()
+            },
+            tree_builder: TreeBuilderOpts {
+                exact_errors: true,
+                ..Default::default()
+            },
+        },
         QualName::new(None, ns!(html), local_name!("div")),
         vec![],
     )
@@ -128,7 +144,13 @@ fn normalize_html(s: &str) -> String {
     let mut ret_val = Vec::new();
     serialize(&mut ret_val, &body, opts)
         .expect("Writing to a string shouldn't fail (expect on OOM)");
-    String::from_utf8(ret_val).expect("html5ever should always produce UTF8")
+    let result = String::from_utf8(ret_val).expect("html5ever should always produce UTF8");
+    assert!(
+        dom.errors.is_empty(),
+        "Errors while parsing HTML:\n{:#?}\nfrom string\n{s:?}\noutput\n{result:?}",
+        dom.errors
+    );
+    result
 }
 
 fn normalize_dom(dom: &RcDom) -> Handle {

There are many cases where the HTML is changed in a questionable manner before checking for equality.

Questionable normalizations

Preprocessing tag is converted to a comment:

---- suite::regression::regression_test_13 stdout ----
normalizing correct HTML
thread 'suite::regression::regression_test_13' panicked at 'Errors while parsing HTML:
[
    "Saw ? in state TagOpen",
]
from string
"<h2>a &lt;?php this is not a valid processing tag</h2>\n<p>b <?php but this is ?></p>\n"
output
"<h2>a &lt;?php this is not a valid processing tag</h2> <p>b <!--?php but this is ?--></p>"', tests/lib.rs:148:5

Here the library outputs nested anchors, which are technically illegal. The correct answer expects them to be separated (unintentionally?). Either way, the difference is masked by normalization and the test passes:

---- suite::regression::regression_test_29 stdout ----
normalizing correct HTML
ok
normalizing result HTML
thread 'suite::regression::regression_test_29' panicked at 'Errors while parsing HTML:
[
    "Unexpected token Tag { kind: StartTag, name: Atom(\\'a\\' type=static), self_closing: false, attrs: [Attribute { name: QualName { prefix: None, ns: Atom(\\'\\' type=static), local: Atom(\\'href\\' type=static) }, value: Tendril<UTF8>(owned: \\\"http://one\\\") }] } in insertion mode InBody",
    "Found special tag while closing generic tag",
]
from string
"<p><a href=\"url\"><a href=\"http://one\">http://one</a> <a href=\"http://two\">http://two</a></a></p>\n"
output
"<p><a href=\"url\"></a><a href=\"http://one\">http://one</a> <a href=\"http://two\">http://two</a></p>"', tests/lib.rs:148:5

Anchor split into multiple parts:

---- suite::regression::regression_test_66 stdout ----
normalizing correct HTML
thread 'suite::regression::regression_test_66' panicked at 'Errors while parsing HTML:
[
    "Unexpected open element while closing Atom('p' type=static)",
    "Unexpected open element while closing Atom('blockquote' type=static)",
    "Unexpected open element while closing Atom('blockquote' type=static)",
    "Unexpected open tag {http://www.w3.org/1999/xhtml}:a at end of body",
]
from string
"<blockquote>\n<blockquote>\n<p>a <a href\n=\"yo\nlo\"></p>\n</blockquote>\n</blockquote>\n"
output
"<blockquote><blockquote><p>a <a href=\"yo\nlo\"></a></p><a href=\"yo\nlo\"></a></blockquote><a href=\"yo\nlo\"></a></blockquote><a href=\"yo\nlo\"></a>"', tests/lib.rs:148:5

<!DOCTYPE> is removed completely:

---- suite::regression::regression_test_86 stdout ----
normalizing correct HTML
thread 'suite::regression::regression_test_86' panicked at 'Errors while parsing HTML:
[
    "DOCTYPE in insertion mode InBody",
]
from string
"<!doctype html>\n"
output
""', tests/lib.rs:148:5

I also found some actual missing closing tags, for which I'll file a PR.

Parsing normalization might be needed for some examples, to insert <tbody> tags and such. Then again, some Commonmark spec examples are also normalized in an unwanted way:

---- suite::spec::spec_test_181 stdout ----
normalizing correct HTML
thread 'suite::spec::spec_test_181' panicked at 'Errors while parsing HTML:
[
    "DOCTYPE in insertion mode InBody",
]
from string
"<!DOCTYPE html>\n"
output
""', tests/lib.rs:148:5

Not sure what to do with the 3rd party specs, but at least for our own tests I think the normalization should be a little simpler.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions