
Htmlparser beatifulsoup #119

Closed
PetrDlouhy wants to merge 13 commits into linkchecker:master from PetrDlouhy:htmlparser-beatifulsoup

Conversation

@PetrDlouhy
Contributor

@PetrDlouhy PetrDlouhy commented Jan 7, 2018

PR for #26

This completely removes the built-in parser and replaces it with BeautifulSoup.
The current implementation mimics the API of the old parser. This means that if somebody wanted to keep using the old parser (maybe it is faster?), they could turn it into a standalone project and easily swap the libraries.

So far this approach works, but there are a few remaining failing tests, and the code would need a bit of cleanup.

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 8 times, most recently from f10135e to 01fedd0 Compare January 9, 2018 20:38
@PetrDlouhy
Contributor Author

I have updated this.
There are some remaining problems:

  • The failing tests are probably all due to differences between the parsers, but this would need to be verified and the tests updated (or some removed, in the case of test_parser.py).
  • It was not possible to get line numbers and positions for the tags. As far as I could find, BeautifulSoup doesn't provide this kind of information.

@mgedmin
Contributor

mgedmin commented Jan 9, 2018

I'd be very happy to see the custom C parser replaced with something external and maintained.

It was not possible to get line numbers and positions for the tags. As far as I could find, BeautifulSoup doesn't provide this kind of information.

That's sad :(

Are there other HTML parsers we could consider? Does lxml.html provide line number information?

Can we file a wishlist bug request to BeautifulSoup, or maybe even consider a PR?

Can we live without line numbers? Search exists, as long as we know http://example.com is a broken link on some_page.html, we can find it.

Contributor

@mgedmin mgedmin left a comment


I'm greatly in favour of the idea.

Charset handling worries me a bit in the current implementation. (Then again I'm not familiar with the LinkChecker sources, and cannot say if it overrides the charset to the one it gets from HTTP headers in code not touched by this PR.)

self.handler = handler

def feed(self, feed_text):
if type(feed_text) == bytes:

if isinstance(feed_text, bytes): would be more idiomatic.

I really hope the caller of this API won't ever feed us partial UTF-8 characters!
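To illustrate that worry with a hedged, made-up sample: a naive per-chunk `bytes.decode()` breaks when a multi-byte UTF-8 character straddles two `feed()` calls, while the stdlib's incremental decoder buffers the partial sequence:

```python
import codecs

data = "žluva".encode("utf-8")       # b'\xc5\xbeluva'
chunk1, chunk2 = data[:1], data[1:]  # split inside the two-byte 'ž'

# chunk1.decode("utf-8") would raise UnicodeDecodeError here;
# the incremental decoder keeps the partial sequence between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(chunk1) + decoder.decode(chunk2, final=True)
print(text)  # žluva
```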

self.html_doc = u""

def flush(self):
soup = BeautifulSoup(self.html_doc, 'html.parser')

Oh, how about this instead: make self.html_doc a StringIO(), write to it in feed(), and here do html_doc = self.html_doc.getvalue(), and then do the if isinstance(html_doc, bytes): html_doc = html_doc.decode(self.encoding)?

Also, what if the document encoding is not actually ISO-8859-1?
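A minimal sketch of the buffering approach suggested here (class and method names are illustrative, not the PR's actual code). Buffering bytes and decoding once in `flush()` also sidesteps the partial-character problem:

```python
from io import BytesIO

class BufferingFeeder:
    """Illustrative sketch of the suggestion above; this is a made-up
    name, not the PR's actual class."""

    def __init__(self, encoding="iso-8859-1"):
        self.encoding = encoding
        self.buf = BytesIO()

    def feed(self, chunk):
        # accept str or bytes, but store everything as bytes
        if isinstance(chunk, str):
            chunk = chunk.encode(self.encoding)
        self.buf.write(chunk)

    def flush(self):
        # decode once, at the end; the resulting string would then be
        # handed to BeautifulSoup(html_doc, 'html.parser')
        return self.buf.getvalue().decode(self.encoding)

f = BufferingFeeder()
f.feed(b"<p>caf\xe9</p>")  # 0xE9 is 'é' in ISO-8859-1
f.feed("<br>")
result = f.flush()
print(result)  # <p>café</p><br>
```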

for tag in soup.find_all():
attrs = ListDict()
for k, v_list in tag.attrs.items():
if type(v_list) != list:

if not isinstance(v_list, list):

if type(v_list) != list:
v_list = [v_list]
for v in v_list:
if type(v) == str:

if isinstance(v, bytes):

self.handler.comment(comment)

def debug(self, text):
raise NotImplementedError("number is not implemented")

number -> debug?

raise NotImplementedError("Peeking is deprecated in favor of 'handler.start_element' attribute 'element_text'")


def parser(handler = None):

PEP-8 says no spaces around = here.

@@ -1,4 +1,5 @@
# required:
bs4

This also needs to go into install_requires in setup.py.

'linkcheck.configuration',
'linkcheck.director',
'linkcheck.htmlutil',
'linkcheck.HtmlParser',

This sort of thing is greatly simplified by using setuptools.find_packages().

@anarcat
Contributor

anarcat commented Jan 10, 2018

just a little note to send a big cheer down that lane: great initiative, and i hope we can make this work. it's utterly reckless to have our own C-based HTML parser...

In fact, I often wondered if we should just reuse those Javascript-based engines like Phantom or whatever that DD ended up with. ;)

can't we file a bug against beautiful soup to get line numbers reported or is that something completely out of scope there? progressive enhancement is good..

there's a conflict on the travis config file here, should be easy to fix, and i otherwise agree with the current review.

keep up the good work!

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 2 times, most recently from 53087ae to f60731f Compare January 12, 2018 11:35
@PetrDlouhy
Contributor Author

I filed the issue here: https://bugs.launchpad.net/beautifulsoup/+bug/1742921

I also tried to implement this with Python's HTMLParser here: PetrDlouhy@b013538
The implementation is even more straightforward, but I wasn't able to get past the peeking. The peek function is a speciality of the built-in parser; its behaviour is not easy to simulate and couldn't be worked around as easily as with BeautifulSoup.
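For illustration, the stdlib approach could look roughly like this (a hedged sketch, not the code in the linked commit; `LinkCollector` is a made-up name). Notably, `HTMLParser.getpos()` provides the line numbers that BeautifulSoup lacks:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Rough sketch only; not the actual implementation."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # getpos() returns (line, column) of the start tag --
            # exactly the information BeautifulSoup does not expose
            self.links.append((attrs["href"],) + self.getpos())

p = LinkCollector()
p.feed('<p>\n<a href="http://example.com/">x</a></p>')
print(p.links)  # [('http://example.com/', 2, 0)]
```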

I am not sure about the encoding handling either. I think we should add some tests focused on the encoding and on the line numbers before merging this.

Contributor

@anarcat anarcat left a comment


so i'd be really happy to merge this, but we need to make sure we don't crash on weird encodings. travis seems to be failing on this right now - not sure what's up there, but obviously that needs to be fixed as well...

line numbers aren't that important. can't we get another locator of some sort from Beautiful Soup, say something like an XPath? or just the original link string? then people can just grep for that - it is the way I use this in the first place.

this will definitely break the GUI, but we already have other issues with that thing (e.g. #105).

@type data: string
@return: None
"""
data = data.encode(self.encoding, "ignore")

i agree encoding is an issue. we don't want to start crashing on weird encodings we find out there. and we don't need to be super fancy - from what i can tell from the previous source, we just drop weird chars... in fact, i wonder why those encoding lines are dropped here in the first place... isn't that where we would fix the encoding?
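For reference, the "just drop weird chars" behaviour described above corresponds to Python's `errors` argument to `bytes.decode()`; a short sketch with made-up data:

```python
raw = b"ok \xff\xfe end"  # contains bytes that are invalid UTF-8

# "ignore" silently drops undecodable bytes, like the old parser
# reportedly did; "replace" keeps a visible U+FFFD marker instead.
dropped = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")
print(dropped)   # 'ok  end'
```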

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 3 times, most recently from bd0cdfe to 37d6148 Compare January 12, 2018 23:32
@PetrDlouhy
Contributor Author

PetrDlouhy commented Jan 12, 2018

I think I solved the encoding. I removed the decoding parts, and the document is now stored as bytes (I will try to rewrite it to use BytesIO).
It seems that it now works properly, unlike the original LinkChecker: the charset tests, which are separate in #125, are passing with this implementation!

anarcat
anarcat previously approved these changes Jan 13, 2018
@anarcat anarcat dismissed their stale review January 13, 2018 04:08

need to review the new patches, but i don't want to block

@anarcat
Contributor

anarcat commented Jan 13, 2018

after getting confirmation that weird encodings don't even work now (#125), i retract my opposition to this PR. i think the concerns of @mgedmin still stand, but at least we shouldn't block on encoding.

thanks for doing all this research!

@PetrDlouhy PetrDlouhy force-pushed the htmlparser-beatifulsoup branch 5 times, most recently from 29acf9a to 7964412 Compare January 14, 2018 23:47
@PetrDlouhy
Contributor Author

I made several corrections.
I fixed all the tests, but I had to change test_parser massively to achieve this. The new parser behaves differently on edge cases, but I hope closer to browsers in many cases. Anyway, this would require careful review to ensure that it works properly.

This also doesn't include line number reporting or any other locator (except the link name and URL, which LinkChecker already gives).

@PetrDlouhy PetrDlouhy changed the title (WIP) Htmlparser beatifulsoup Htmlparser beatifulsoup Jan 15, 2018
@mgedmin
Contributor

mgedmin commented Jan 15, 2018

I've no energy to review this today. Please ping me if I forget to come back to it.

@anarcat
Contributor

anarcat commented Apr 13, 2018

@mgedmin ping :)

in retrospect, i think i'd be tempted to be more lax with a new beautifulsoup parser than anything else: rolling our own parser (written in C, nothing less!) is a really bad idea and i would be quite glad to get rid of it, even if it means a few more warts.

@anarcat anarcat mentioned this pull request Apr 13, 2018
Contributor

@mgedmin mgedmin left a comment


LGTM!

else:
self.html_doc = StringIO()
self.html_doc.truncate(0)
self.html_doc.seek(0)

I don't think there's much point in truncating and seeking in an empty html_doc you've just created.

def reset(self):
self.html_doc.truncate(0)
self.html_doc.seek(0)
self.html_doc = None

Again, I don't see the point in truncating a StringIO you're about to delete. Let the garbage collector take care of it.

self.handler.cdata(content)
elif isinstance(content, ProcessingInstruction):
if hasattr(self.handler, 'pi'):
self.handler.pi(content[:-1])

Would be nice to have a comment explaining the [:-1] bit.

return 0

def pos(self, text):
return 0

I think if we merge this we should also remove all calls of these placeholder methods (and the methods themselves). I do not want faked line numbers to show up in user-oriented messages.

self.title = None

def start_element (self, tag, attrs):
def start_element (self, tag, attrs, title=None):

I would prefer it if the argument was called element_text. Otherwise it can be very confusing for elements like <a> that have a title attribute.

@@ -0,0 +1,3 @@
<meta http-equiv="Content-Type" content="text/html; charset=cp1250">
<a href="file.html">Žluťoučký kůň úpěl ďábelské ódy ����</a>
<img src="img.png" alt="Žluťoučký kůň úpěl ďábelské ódy ����"/>

I'm amazed that GitHub can figure out the right encoding and display this correctly!

url http://www.example.com/
cache key http://www.example.com/
real url http://www.example.com/
url http://www.example.com/"

Fair enough, I suppose.

@@ -0,0 +1,40 @@
# -*- coding: iso-8859-1 -*-

I wouldn't add encoding declarations to files that do not contain any non-ASCII characters.

@anarcat
Contributor

anarcat commented Apr 15, 2018

@PetrDlouhy i know it's been a long time, but @mgedmin finally reviewed your work, and while we could merge as is, it seems there are minor tweaks that would improve the result... could you take a look?

(i took the liberty of resolving conflicts after #147 - that could be rebased to remove the merge commit...)

@mgedmin
Contributor

mgedmin commented Apr 15, 2018

We still have one test failure because of the missing line numbers.

@anarcat
Contributor

anarcat commented Jun 21, 2018

do we have an upstream issue for the line number stuff?

I would be ready to drop that feature in favor of having a non-C, non-housemade parser. what do you think @mgedmin ?

@mgedmin
Contributor

mgedmin commented Jun 22, 2018

do we have an upstream issue for the line number stuff?

I don't know. I never filed one.

I would be ready to drop that feature in favor of having a non-C, non-housemade parser. what do you think @mgedmin ?

Oh, yes! Only I'd like to do it properly (i.e. remove any line numbers from output instead of printing 0 in their place if we currently do so), and somebody ought to fix the test failure before we can merge this.

I don't have the energy currently to push this (or any other linkchecker issue), but I will try to provide code reviews.

@anarcat
Contributor

anarcat commented Jun 22, 2018

actually @PetrDlouhy has already filed an issue at https://bugs.launchpad.net/beautifulsoup/+bug/1742921 so we've got that covered. it seems there was zero activity on that and, in general, beautifulsoup does not seem very active so I am a little worried about blocking on this in general. the response from the mailing list was this, mind you:

All three parsers used by Beautiful Soup have some way of reading off
the current line number, but Beautiful Soup does not record this
information. That's something I could add, but for your project I
recommend making direct use of html.parser, html5lib, or lxml.

so it seems that technically the information is there somewhere, just not available in the API... but maybe we could monkeypatch our way around that?

in any case: let's not block on this, but do this properly. @PetrDlouhy unless you want to patch beautifulsoup, i guess the next step here is to rip out line number support from linkchecker itself so that we get tests to pass and, as @mgedmin says, do this properly. :)

thanks!

@mgedmin mgedmin mentioned this pull request Nov 6, 2018
@anarcat
Contributor

anarcat commented Nov 29, 2018

update: upstream doesn't plan on adding the feature, but is open to a pull request.

so the next step here is either to rip out the line number code or patch beautifulsoup to include those.

@PetrDlouhy are you up for this?

@cjmayo
Contributor

cjmayo commented Apr 14, 2019

Are there other HTML parsers we could consider? Does lxml.html provide line number information?

Can we file a wishlist bug request to BeautifulSoup, or maybe even consider a PR?

Can we live without line numbers? Search exists, as long as we know http://example.com is a broken link on some_page.html, we can find it.

lxml Element has a property sourceline:

Original line number as found by the parser or None if unknown.
https://lxml.de/api/lxml.etree._Element-class.html#sourceline

Don't know if that would work.
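A quick way to check (a minimal sketch, assuming lxml is installed; the markup and URL are made up):

```python
from lxml import html  # third-party: pip install lxml

doc = html.fromstring(
    "<html><body>\n"
    "<p>text</p>\n"
    '<a href="http://example.com/">broken link</a>\n'
    "</body></html>"
)
for el in doc.iter("a"):
    # sourceline is the line on which the parser saw the start tag,
    # or None if it could not be determined
    print(el.get("href"), el.sourceline)
```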

@anarcat
Contributor

anarcat commented Apr 15, 2019

Don't know if that would work.

that would be awesome, but at this point I think we're ready to ditch our homegrown parser in favor of beautifulsoup even if it means losing line numbers. it's too much of a liability.

@cjmayo
Contributor

cjmayo commented Apr 16, 2019

Beautiful Soup is using somebody else's parser, optionally lxml:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
So, maybe we can have both.
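Indeed, switching backends is a one-argument change in Beautiful Soup, so the same bs4 code could sit on top of lxml (sketch assumes both packages are installed; the markup is made up):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

markup = '<p><a href="http://example.com/">x</a>'
soup_py = BeautifulSoup(markup, "html.parser")  # pure-Python stdlib backend
soup_lx = BeautifulSoup(markup, "lxml")         # libxml2-based backend
# the tree-navigation API is identical regardless of backend
assert soup_py.a["href"] == soup_lx.a["href"] == "http://example.com/"
```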

cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 1, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 3, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 4, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request May 4, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request Jul 22, 2019
cjmayo added a commit to cjmayo/linkchecker that referenced this pull request Sep 20, 2019
@cjmayo
Contributor

cjmayo commented Oct 30, 2019

All the commits and code comments plus line number support are addressed in #337.

@cjmayo cjmayo closed this Oct 30, 2019