
Htmlparser beatifulsoup #119

Closed
PetrDlouhy wants to merge 13 commits into linkchecker:master from PetrDlouhy:htmlparser-beatifulsoup

Conversation

@PetrDlouhy
Contributor

@PetrDlouhy PetrDlouhy commented Jan 7, 2018

PR for #26

This completely removes the built-in parser and replaces it with BeautifulSoup.
The current implementation mimics the API of the old parser. This means that if somebody wanted to keep using the old parser (maybe it is faster?), they could turn it into a standalone project and easily swap the libraries.

So far this approach works, but there are a few remaining failing tests, and the code would need a bit of cleanup.

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 8 times, most recently from f10135e to 01fedd0 Compare January 9, 2018 20:38
@PetrDlouhy
Contributor Author

I have updated this.
There are some remaining problems:

  • The failing tests are probably all due to differences between the parsers, but this would need to be verified and the tests updated (or some removed, in the case of test_parser.py).
  • It was not possible to get line numbers and positions for the tags. As far as I could find, BeautifulSoup doesn't provide this kind of information.

@mgedmin
Contributor

mgedmin commented Jan 9, 2018

I'd be very happy to see the custom C parser replaced with something external and maintained.

It was not possible to get line numbers and positions for the tags. As far as I could find, BeautifulSoup doesn't provide this kind of information.

That's sad :(

Are there other HTML parsers we could consider? Does lxml.html provide line number information?

Can we file a wishlist bug request to BeautifulSoup, or maybe even consider a PR?

Can we live without line numbers? Search exists, as long as we know http://example.com is a broken link on some_page.html, we can find it.

Contributor

@mgedmin mgedmin left a comment


I'm greatly in favour of the idea.

Charset handling worries me a bit in the current implementation. (Then again I'm not familiar with the LinkChecker sources, and cannot say if it overrides the charset to the one it gets from HTTP headers in code not touched by this PR.)

self.handler = handler

def feed(self, feed_text):
if type(feed_text) == bytes:

if isinstance(feed_text, bytes): would be more idiomatic.

I really hope the caller of this API won't ever feed us partial UTF-8 characters!
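To illustrate that worry with a hedged, made-up sample: a naive per-chunk `bytes.decode()` breaks when a multi-byte UTF-8 character straddles two `feed()` calls, while the stdlib's incremental decoder buffers the partial sequence:

```python
import codecs

data = "žluva".encode("utf-8")       # b'\xc5\xbeluva'
chunk1, chunk2 = data[:1], data[1:]  # split inside the two-byte 'ž'

# chunk1.decode("utf-8") would raise UnicodeDecodeError here;
# the incremental decoder keeps the partial sequence between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(chunk1) + decoder.decode(chunk2, final=True)
print(text)  # žluva
```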

self.html_doc = u""

def flush(self):
soup = BeautifulSoup(self.html_doc, 'html.parser')

Oh, how about this instead: make self.html_doc a StringIO(), write to it in feed(), and here do html_doc = self.html_doc.getvalue(), and then do the if isinstance(html_doc, bytes): html_doc = html_doc.decode(self.encoding)?

Also, what if the document encoding is not actually ISO-8859-1?
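A minimal sketch of the buffering approach suggested here (class and method names are illustrative, not the PR's actual code). Buffering bytes and decoding once in `flush()` also sidesteps the partial-character problem:

```python
from io import BytesIO

class BufferingFeeder:
    """Illustrative sketch of the suggestion above; this is a made-up
    name, not the PR's actual class."""

    def __init__(self, encoding="iso-8859-1"):
        self.encoding = encoding
        self.buf = BytesIO()

    def feed(self, chunk):
        # accept str or bytes, but store everything as bytes
        if isinstance(chunk, str):
            chunk = chunk.encode(self.encoding)
        self.buf.write(chunk)

    def flush(self):
        # decode once, at the end; the resulting string would then be
        # handed to BeautifulSoup(html_doc, 'html.parser')
        return self.buf.getvalue().decode(self.encoding)

f = BufferingFeeder()
f.feed(b"<p>caf\xe9</p>")  # 0xE9 is 'é' in ISO-8859-1
f.feed("<br>")
result = f.flush()
print(result)  # <p>café</p><br>
```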

for tag in soup.find_all():
attrs = ListDict()
for k, v_list in tag.attrs.items():
if type(v_list) != list:

if not isinstance(v_list, list):

if type(v_list) != list:
v_list = [v_list]
for v in v_list:
if type(v) == str:

if isinstance(v, bytes):

self.handler.comment(comment)

def debug(self, text):
raise NotImplementedError("number is not implemented")

number -> debug?

raise NotImplementedError("Peeking is deprecated in favor of 'handler.start_element' attribute 'element_text'")


def parser(handler = None):

PEP-8 says no spaces around = here.

@@ -1,4 +1,5 @@
# required:
bs4

This also needs to go into install_requires in setup.py.

'linkcheck.configuration',
'linkcheck.director',
'linkcheck.htmlutil',
'linkcheck.HtmlParser',

This sort of thing is greatly simplified by using setuptools.find_packages().

@anarcat
Contributor

anarcat commented Jan 10, 2018

just a little note to send a big cheer down that lane: great initiative, and i hope we can make this work. it's utterly reckless to have our own C-based HTML parser...

In fact, I often wondered if we should just reuse those Javascript-based engines like Phantom or whatever that DD ended up with. ;)

can't we file a bug against beautiful soup to get line numbers reported or is that something completely out of scope there? progressive enhancement is good..

there's a conflict on the travis config file here, should be easy to fix, and i otherwise agree with the current review.

keep up the good work!

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 2 times, most recently from 53087ae to f60731f Compare January 12, 2018 11:35
@PetrDlouhy
Contributor Author

I filed the issue here: https://bugs.launchpad.net/beautifulsoup/+bug/1742921

I also tried to implement this with Python's HTMLParser here: PetrDlouhy@b013538
The implementation is even more straightforward, but I wasn't able to get past the peeking. The peek function is a speciality of the built-in parser; its behaviour is not easy to simulate and couldn't be worked around as easily as with BeautifulSoup.
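For illustration, the stdlib approach could look roughly like this (a hedged sketch, not the code in the linked commit; `LinkCollector` is a made-up name). Notably, `HTMLParser.getpos()` provides the line numbers that BeautifulSoup lacks:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Rough sketch only; not the actual implementation."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # getpos() returns (line, column) of the start tag --
            # exactly the information BeautifulSoup does not expose
            self.links.append((attrs["href"],) + self.getpos())

p = LinkCollector()
p.feed('<p>\n<a href="http://example.com/">x</a></p>')
print(p.links)  # [('http://example.com/', 2, 0)]
```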

I am not sure about the encoding handling either. I think we should add some tests focused on the encoding and on the line numbers before merging this.

Contributor

@anarcat anarcat left a comment


so i'd be really happy to merge this, but we need to make sure we don't crash on weird encodings. travis seems to be failing on this right now - not sure what's up there, but obviously that needs to be fixed as well...

line numbers aren't that important. can't we get another locator of some sort from Beautiful Soup, say something like an XPath? or just the original link string? then people can just grep for that - it is the way I use this in the first place.

this will definitely break the GUI, but we already have other issues with that thing (e.g. #105).

@type data: string
@return: None
"""
data = data.encode(self.encoding, "ignore")

i agree encoding is an issue. we don't want to start crashing on weird encodings we find out there. and we don't need to be super fancy - from what i can tell from the previous source, we just drop weird chars... in fact, i wonder why those encoding lines are dropped here in the first place... isn't that where we would fix the encoding?
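For reference, the "just drop weird chars" behaviour described above corresponds to Python's `errors` argument to `bytes.decode()`; a short sketch with made-up data:

```python
raw = b"ok \xff\xfe end"  # contains bytes that are invalid UTF-8

# "ignore" silently drops undecodable bytes, like the old parser
# reportedly did; "replace" keeps a visible U+FFFD marker instead.
dropped = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")
print(dropped)   # 'ok  end'
```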

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 3 times, most recently from bd0cdfe to 37d6148 Compare January 12, 2018 23:32
@PetrDlouhy
Contributor Author

PetrDlouhy commented Jan 12, 2018

I think I solved the encoding. I removed the decoding parts, and the document is now stored as bytes (I will try to rewrite it to use BytesIO).
It seems that it now works properly, unlike the original LinkChecker: the charset tests, which are separate in #125, are passing with this implementation!

anarcat
anarcat previously approved these changes Jan 13, 2018
@anarcat anarcat dismissed their stale review January 13, 2018 04:08

need to review the new patches, but i don't want to block

@anarcat
Contributor

anarcat commented Jan 13, 2018

after getting confirmation that weird encodings don't even work now (#125), i retract my opposition to this PR. i think the concerns of @mgedmin still stand, but at least we shouldn't block on encoding.

thanks for doing all this research!

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 5 times, most recently from 29acf9a to 7964412 Compare January 14, 2018 23:47
@PetrDlouhy
Contributor Author

I made several corrections.
I fixed all the tests, but I had to change test_parser massively to achieve this. The new parser behaves differently on edge cases, but I hope closer to browsers in many cases. Anyway, this would require careful review to ensure that it works properly.

This also doesn't include line number reporting or any other locator (except the link name and URL, which LinkChecker already gives).

@PetrDlouhy PetrDlouhy changed the title (WIP) Htmlparser beatifulsoup Htmlparser beatifulsoup Jan 15, 2018
@mgedmin
Contributor

mgedmin commented Jan 15, 2018

I've no energy to review this today. Please ping me if I forget to come back to it.

@anarcat
Contributor

anarcat commented Apr 13, 2018

@mgedmin ping :)

in retrospect, i think i'd be tempted to be more lax with a new beautifulsoup parser than anything else: rolling our own parser (written in C, nothing less!) is a really bad idea and i would be quite glad to get rid of it, even if it means a few more warts.

@anarcat anarcat mentioned this pull request Apr 13, 2018
Contributor

@mgedmin mgedmin left a comment


LGTM!

else:
self.html_doc = StringIO()
self.html_doc.truncate(0)
self.html_doc.seek(0)

I don't think there's much point in truncating and seeking in an empty html_doc you've just created.

def reset(self):
self.html_doc.truncate(0)
self.html_doc.seek(0)
self.html_doc = None

Again, I don't see the point in truncating a StringIO you're about to delete. Let the garbage collector take care of it.

self.handler.cdata(content)
elif isinstance(content, ProcessingInstruction):
if hasattr(self.handler, 'pi'):
self.handler.pi(content[:-1])

Would be nice to have a comment explaining the [:-1] bit.

return 0

def pos(self, text):
return 0

I think if we merge this we should also remove all calls of these placeholder methods (and the methods themselves). I do not want faked line numbers to show up in user-oriented messages.

self.title = None

def start_element (self, tag, attrs):
def start_element (self, tag, attrs, title=None):

I would prefer it if the argument was called element_text. Otherwise it can be very confusing for elements like <a> that have a title attribute.

@@ -0,0 +1,3 @@
<meta http-equiv="Content-Type" content="text/html; charset=cp1250">
<a href="file.html">Žluťoučký kůň úpěl ďábelské ódy ����</a>
<img src="img.png" alt="Žluťoučký kůň úpěl ďábelské ódy ����"/>

I'm amazed that GitHub can figure out the right encoding and display this correctly!

url http://www.example.com/
cache key http://www.example.com/
real url http://www.example.com/
url http://www.example.com/"

Fair enough, I suppose.

@@ -0,0 +1,40 @@
# -*- coding: iso-8859-1 -*-

I wouldn't add encoding declarations to files that do not contain any non-ASCII characters.

@anarcat
Contributor

anarcat commented Apr 15, 2018

@PetrDlouhy i know it's been a long time, but @mgedmin finally reviewed your work, and while we could merge as is, it seems there are minor tweaks that would improve the result... could you take a look?

(i took the liberty of resolving conflicts after #147 - that could be rebased to remove the merge commit...)

@mgedmin
Contributor

mgedmin commented Apr 15, 2018

We still have one test failure because of the missing line numbers.

@anarcat
Contributor

anarcat commented Jun 21, 2018

do we have an upstream issue for the line number stuff?

I would be ready to drop that feature in favor of having a non-C, non-housemade parser. what do you think @mgedmin ?

@mgedmin
Contributor

mgedmin commented Jun 22, 2018

do we have an upstream issue for the line number stuff?

I don't know. I never filed one.

I would be ready to drop that feature in favor of having a non-C, non-housemade parser. what do you think @mgedmin ?

Oh, yes! Only I'd like to do it properly (i.e. remove any line numbers from output instead of printing 0 in their place if we currently do so), and somebody ought to fix the test failure before we can merge this.

I don't have the energy currently to push this (or any other linkchecker issue), but I will try to provide code reviews.

@anarcat
Contributor

anarcat commented Jun 22, 2018

actually @PetrDlouhy has already filed an issue at https://bugs.launchpad.net/beautifulsoup/+bug/1742921 so we've got that covered. it seems there was zero activity on that and, in general, beautifulsoup does not seem very active so I am a little worried about blocking on this in general. the response from the mailing list was this, mind you:

All three parsers used by Beautiful Soup have some way of reading off
the current line number, but Beautiful Soup does not record this
information. That's something I could add, but for your project I
recommend making direct use of html.parser, html5lib, or lxml.

so it seems that technically the information is there somewhere, just not available in the API... but maybe we could monkeypatch our way around that?

in any case: let's not block on this, but do this properly. @PetrDlouhy unless you want to patch beautifulsoup, i guess the next step here is to rip out line number support from linkchecker itself so that we get tests to pass and, as @mgedmin says, do this properly. :)

thanks!

@mgedmin mgedmin mentioned this pull request Nov 6, 2018
@anarcat
Contributor

anarcat commented Nov 29, 2018

update: upstream doesn't plan on adding the feature, but is open to a pull request.

so the next step here is either to rip out the line number code or patch beautifulsoup to include those.

@PetrDlouhy are you up for this?

@cjmayo
Contributor

cjmayo commented Apr 14, 2019

Are there other HTML parsers we could consider? Does lxml.html provide line number information?

Can we file a wishlist bug request to BeautifulSoup, or maybe even consider a PR?

Can we live without line numbers? Search exists, as long as we know http://example.com is a broken link on some_page.html, we can find it.

lxml Element has a property sourceline:

Original line number as found by the parser or None if unknown.
https://lxml.de/api/lxml.etree._Element-class.html#sourceline

Don't know if that would work.
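A quick way to check (a minimal sketch, assuming lxml is installed; the markup and URL are made up):

```python
from lxml import html  # third-party: pip install lxml

doc = html.fromstring(
    "<html><body>\n"
    "<p>text</p>\n"
    '<a href="http://example.com/">broken link</a>\n'
    "</body></html>"
)
for el in doc.iter("a"):
    # sourceline is the line on which the parser saw the start tag,
    # or None if it could not be determined
    print(el.get("href"), el.sourceline)
```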

@anarcat
Contributor

anarcat commented Apr 15, 2019

Don't know if that would work.

that would be awesome, but at this point I think we're ready to ditch our homegrown parser in favor of beautifulsoup even if it means losing line numbers. it's too much of a liability.

@cjmayo
Contributor

cjmayo commented Apr 16, 2019

Beautiful Soup is using somebody else's parser, optionally lxml:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
So, maybe we can have both.
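Indeed, switching backends is a one-argument change in Beautiful Soup, so the same bs4 code could sit on top of lxml (sketch assumes both packages are installed; the markup is made up):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

markup = '<p><a href="http://example.com/">x</a>'
soup_py = BeautifulSoup(markup, "html.parser")  # pure-Python stdlib backend
soup_lx = BeautifulSoup(markup, "lxml")         # libxml2-based backend
# the tree-navigation API is identical regardless of backend
assert soup_py.a["href"] == soup_lx.a["href"] == "http://example.com/"
```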

cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 1, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 3, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 4, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 4, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request Jul 22, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request Sep 20, 2019
@cjmayo
Contributor

cjmayo commented Oct 30, 2019

All the commits and code comments plus line number support are addressed in #337.

@cjmayo cjmayo closed this Oct 30, 2019