
Parsing breaks after <script> or <style> block, followed by an entity (&blah;) #1426

@KillyMXI


Input:

import { parseDocument } from 'htmlparser2';

const document = parseDocument(
  '<style>a{}</style>&apos;<br/>',
  { decodeEntities: true }
);

console.log(document);

Observed output:

<ref *1> Document {
  type: 'root',
  parent: null,
  prev: null,
  next: null,
  startIndex: null,
  endIndex: null,
  children: [
    Element {
      type: 'style',
      parent: [Circular *1],
      prev: null,
      next: [Text],
      startIndex: null,
      endIndex: null,
      children: [Array],
      name: 'style',
      attribs: {}
    },
    Text {
      type: 'text',
      parent: [Circular *1],
      prev: [Element],
      next: null,
      startIndex: null,
      endIndex: null,
      data: "'<br/>"
    }
  ]
}

Expected: the Text node contains only "'" and is followed by an Element of type "tag" named "br".
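To make the difference between the observed and expected trees easier to compare, here is a minimal sketch. The `summarize` helper is hypothetical (not part of htmlparser2), and the two trees are hand-written stand-ins rather than real parser output. It assumes nodes expose `type`, `name`, `data` and `children` as in the dump above, and that the `<style>` element holds a single text child `a{}`, which the dump does not show.

```javascript
// Hypothetical helper: reduce a DOM-like node list to a compact summary.
// Assumes each node has `type`, and either `data` (text) or `name`/`children`.
function summarize(nodes) {
  return nodes.map((node) =>
    node.type === 'text'
      ? { text: node.data }
      : { tag: node.name, children: summarize(node.children || []) }
  );
}

// Hand-written stand-in for the observed tree (assumed style child: 'a{}').
const observed = [
  { type: 'style', name: 'style', children: [{ type: 'text', data: 'a{}' }] },
  { type: 'text', data: "'<br/>" },
];

// Hand-written stand-in for the expected tree: text "'" followed by <br>.
const expected = [
  { type: 'style', name: 'style', children: [{ type: 'text', data: 'a{}' }] },
  { type: 'text', data: "'" },
  { type: 'tag', name: 'br', children: [] },
];

console.log(JSON.stringify(summarize(observed)));
console.log(JSON.stringify(summarize(expected)));
```

With the real parser, the same helper could presumably be applied to `parseDocument(...).children` to check which of the two shapes a given version produces.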

When the input is changed to <style>a{}</style>\'<br/> or <style>a{}</style><br/>&apos;<br/>, parsing works as expected.

When decodeEntities is set to false, the original input also parses as expected.

Version 6.1.0 is the last one that works as expected; the regression appeared in version 7.0.0.

First reported by @galenhuntington in html-to-text/node-html-to-text#285
