Fix existing and add new options by patrick-steele-idem · Pull Request #74 · fb55/htmlparser2

patrick-steele-idem · 2014-02-06T20:52:18Z

To allow a hybrid approach to parsing HTML code that may have certain XML constructs, a few more options have been added:

recognizeSelfClosing: If set to true then self-closing tags will result in the tag being closed even if xmlMode is not set to true
recognizeCDATA: If set to true then CDATA text will result in the context event being fired even if xmlMode is not set to true

Also, the lowerCaseTags and lowerCaseAttributeNames options have been fixed so that case conversion can be disabled in non-XML mode.
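
As a rough illustration of how the new options relate to xmlMode, the sketch below models the flag resolution in isolation. The helper name resolveParsingFlags is hypothetical and not part of htmlparser2; it only mirrors the behavior described above, where each XML-like feature is active when either xmlMode or its dedicated option is set.

```javascript
// Hypothetical helper (not part of htmlparser2): models the behavior
// described above, where each XML-like feature is enabled either by
// xmlMode or by its own dedicated option.
function resolveParsingFlags(options) {
    options = options || {};
    return {
        selfClosing: !!(options.xmlMode || options.recognizeSelfClosing),
        cdata: !!(options.xmlMode || options.recognizeCDATA)
    };
}

// HTML mode with only self-closing tags recognized
var flags = resolveParsingFlags({ recognizeSelfClosing: true });
console.log(flags.selfClosing, flags.cdata); // prints: true false
```

With no options at all, both flags stay off, which matches the claim that existing code is unaffected unless the new options are enabled.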

fb55 · 2014-02-09T20:24:22Z

You want to dynamically enable xmlMode? That will require further changes to the tokenizer. Otherwise, I'd prefer to just add another attribute to the instance (instead of the method).

patrick-steele-idem · 2014-02-10T16:55:03Z

I am just looking for more fine-grained control over the parser and no changes to the tokenizer are required. I updated the parser to allow the option to recognize self-closing tags and CDATA while still in "HTML mode". The issue was that "xmlMode" was a single option that was being used to control a lot of different parsing behaviors. In my case, I have HTML files that may include XML constructs such as self-closing tags and CDATA sections. The changes I made will not impact existing code at all because there will only be an impact on parsing if the new options are enabled.

As a separate fix, I fixed the bug where the "lowerCaseTags" and "lowerCaseAttributeNames" options were not working as expected. I pushed to the same branch so that commit was merged into this Pull Request.

Please let me know if you need more clarification. Thanks for looking into this Pull Request.

fb55 · 2014-02-10T20:16:48Z

What I was saying is this: If you don't want dynamic xmlMode (which you apparently don't): Add a _lowerCaseTagNames attribute to the instance and get rid of the isLowerCaseTagsEnabled() method.

patrick-steele-idem · 2014-02-10T23:51:33Z

I'll make the change and add a new commit. Thanks.

patrick-steele-idem · 2014-02-10T23:55:18Z

I have updated my Pull Request with the suggested change. Please review and let me know if you find any issues. Thanks.

fb55 · 2014-02-11T13:31:19Z

lib/Parser.js

This fails if options is undefined (have a look at the failing test). Also, I would prefer a solution that accepts all truthy values, not only true (the property should be a bool, though). The in operator is probably the better choice here as well.
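
The failure mode mentioned here can be reproduced without the parser: reading a property of undefined throws a TypeError, and so does the in operator when its right-hand side is undefined. A minimal sketch follows; readOption and the two unguarded functions are hypothetical names for illustration, not the code under review.

```javascript
// Hypothetical helper, not the code under review: reads an option
// safely even when no options object was passed.
function readOption(options, name) {
    options = options || {};  // guard against undefined or null
    return !!options[name];   // accept any truthy value, return a boolean
}

// Without the guard, both forms below throw a TypeError
// when called with undefined.
function unguardedAccess(options) { return options.lowerCaseTags; }
function unguardedIn(options) { return "lowerCaseTags" in options; }
```

Calling readOption(undefined, "lowerCaseTags") returns false instead of throwing, and a truthy value such as 1 is reported as true.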

patrick-steele-idem · 2014-02-11T14:53:37Z

I think we are close. I updated the code to handle the case where options is null and verified that the tests pass. I also updated the code to allow truthy values for the lowerCaseTags and lowerCaseAttributeNames options. However, I chose not to use "in" since that only checks for the property being in the options and would evaluate to true even if one of the option values was set to false. Please review the updated code and let me know if you would like me to make any additional updates. Thanks.

philidem · 2014-02-14T01:23:42Z

I reviewed the changes independently and they look good. I like the fine-grained control over XML-like parsing behavior instead of having only a single "xmlMode" flag. The only line that didn't need to be changed was "this._options = options || {};" (which was split into two lines).

Are there any other changes that need to be made before this can be merged?

fb55 · 2014-02-14T01:32:21Z

I don't really understand why the in operator shouldn't be used. == null does pretty much the same, but ignores properties with the values null and undefined. I would also prefer to avoid polymorphic variables (i.e. options).
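
The difference between the two presence checks mentioned here can be shown on a plain object. This is only a sketch with hypothetical helper names (hasViaIn, hasViaNullCheck); it is not taken from the parser source.

```javascript
// Two ways to test whether an option was supplied.
function hasViaIn(options, name) { return name in options; }
function hasViaNullCheck(options, name) { return options[name] != null; }

// One property explicitly set to undefined, one set to false.
var opts = { lowerCaseTags: undefined, xmlMode: false };

var inSeesUndefined = hasViaIn(opts, "lowerCaseTags");          // true
var nullCheckSeesUndefined = hasViaNullCheck(opts, "lowerCaseTags"); // false
var inSeesFalse = hasViaIn(opts, "xmlMode");                    // true
var nullCheckSeesFalse = hasViaNullCheck(opts, "xmlMode");      // true
```

Both checks treat an explicit false as "supplied"; they differ only for properties whose value is null or undefined.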

patrick-steele-idem · 2014-02-14T05:32:17Z

The reason "in" is not a good choice is that the following code would not work as expected with it:

var parser = new htmlparser.Parser(handlers, {
    lowerCaseTags: false
});

The user would expect the lowerCaseTags option to be set to false, but because that property is present in the options it is actually treated as true, since:

("lowerCaseTags" in options) === true

The following expression will result in the correct boolean value:

(!!options.lowerCaseTags) === false
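
To make the disagreement concrete, the sketch below runs both checks over three option objects. The function names viaIn and viaCoercion are hypothetical and exist only for this comparison.

```javascript
// Compare the two checks debated above on three inputs.
var inputs = [{ lowerCaseTags: false }, {}, { lowerCaseTags: true }];

function viaIn(options) { return "lowerCaseTags" in options; }
function viaCoercion(options) { return !!options.lowerCaseTags; }

var inResults = inputs.map(viaIn);             // [true, false, true]
var coercedResults = inputs.map(viaCoercion);  // [false, false, true]
```

Using "in" alone misreads an explicit false as true. Coercion alone reads the value correctly but reports false when the option is omitted, so, assuming lowercasing is meant to stay on by default in HTML mode, it cannot supply that default by itself.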

For your second point, why would you want to avoid a polymorphic options argument? The parser already supported an options argument, so I don't see the harm in extending it with new options.

fb55 · 2014-02-14T09:29:33Z

lib/Parser.js

this._lowerCaseAttributeNames = "lowerCaseAttributeNames" in this._options ? !!this._options.lowerCaseAttributeNames : !this._options.xmlMode;
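
The rule in the line above can be exercised as a standalone function. This is a sketch under stated assumptions: resolveLowerCaseAttributeNames is a hypothetical name, and the guard against a missing options object is added here for self-containment, whereas the real code reads from this._options.

```javascript
// Sketch of the suggested rule as a standalone function (hypothetical
// name; the real code assigns to this._lowerCaseAttributeNames).
function resolveLowerCaseAttributeNames(options) {
    options = options || {};
    return "lowerCaseAttributeNames" in options
        ? !!options.lowerCaseAttributeNames  // explicit value wins, coerced to boolean
        : !options.xmlMode;                  // otherwise default to the opposite of xmlMode
}
```

An explicit false disables lowercasing even in HTML mode, a truthy value such as 1 enables it even in XML mode, and an omitted option falls back to the xmlMode-based default.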

patrick-steele-idem · 2014-02-14T19:50:27Z

I updated the code to use the in operator as suggested.

Feel free to merge my change and make any additional syntactic changes if you think they're necessary. My end-goal is to have additional control over parsing behavior in HTML parsing mode via the new options.

Thanks.

fb55 · 2014-02-14T19:54:39Z

lib/Parser.js

Just move this to a single line & I'll merge.

fb55 · 2014-02-14T19:55:44Z

It would also be great if you could update the wiki page to reflect these changes.

patrick-steele-idem · 2014-02-14T19:58:02Z

Please review the latest change with the options init code merged into a single line.

patrick-steele-idem · 2014-02-14T19:59:20Z

Also, I think it would be beneficial to move the documentation on the Wiki into the README so that the documentation is versioned with the code. Thoughts? Would you like me to make that change?

fb55 · 2014-02-14T19:59:23Z

Merged and done! Thanks a lot, @patrick-steele-idem!

fb55 · 2014-02-14T20:05:19Z

It would probably be beneficial to move the documentation to the module, but e.g. the cheerio docs link to the page & I doubt it's worth the effort.

patrick-steele-idem added 2 commits February 6, 2014 13:48

fb55#73 Added support for recognizing self-closing tags and CDATA in …

6585609

…non-XML mode

Fix option to disable lower case tags and attrs in non-XML mode

6c173b8

Added this._lowerCaseTagNames and this._lowerCaseAttributeNames

357be1d

fb55 reviewed Feb 11, 2014

Handle case where options is null and allow truthy values

bdb1273

fb55 reviewed Feb 14, 2014

Switched to using "in" operator for options

adfaafb

fb55 reviewed Feb 14, 2014

Merged options initialization into a single line

54f33ad

fb55 added a commit that referenced this pull request Feb 14, 2014

Merge pull request #74 from patrick-steele-idem/master

4497ee4

Fix existing and add new options

fb55 merged commit 4497ee4 into fb55:master Feb 14, 2014

fb55 mentioned this pull request Feb 24, 2014

html parser doesn't handle cdata #59

Closed

fb55 added a commit that referenced this pull request Oct 21, 2018

Merge pull request #74 from patrick-steele-idem/master

55ce8d2

Fix existing and add new options

vassudanagunta mentioned this pull request Nov 28, 2021

Test additions to expose/clarify perviously uncovered distinctions between "self closing tags" and void elements #1023

Closed

vassudanagunta mentioned this pull request Dec 14, 2021

tests: update Events/07 test to clarify interpretation of tag end slashes #1046

Merged

3 participants