Conversation
Force-pushed f10135e to 01fedd0 (compare)
I have updated this.
I'd be very happy to see the custom C parser replaced with something external and maintained.
That's sad :( Are there other HTML parsers we could consider? Does lxml.html provide line number information? Can we file a wishlist bug request to BeautifulSoup, or maybe even consider a PR? Can we live without line numbers? Search exists: as long as we know http://example.com is a broken link on some_page.html, we can find it.
mgedmin left a comment:
I'm greatly in favour of the idea.
Charset handling worries me a bit in the current implementation. (Then again I'm not familiar with the LinkChecker sources, and cannot say if it overrides the charset to the one it gets from HTTP headers in code not touched by this PR.)
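One defensive pattern for the charset worry (a sketch under my own assumptions, not code from this PR): prefer the charset the HTTP headers declared, then fall back through likely encodings; latin-1 accepts any byte, so the chain cannot raise.

```python
def decode_html(data, http_charset=None):
    """Decode raw HTML bytes: try the server-declared charset first,
    then common fallbacks. latin-1 maps every byte, so this never fails."""
    candidates = []
    if http_charset:
        candidates.append(http_charset)
    candidates += ["utf-8", "latin-1"]
    for enc in candidates:
        try:
            return data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue

print(decode_html(b"caf\xc3\xa9"))           # 'café' (valid UTF-8)
print(decode_html(b"caf\xe9", "latin-1"))    # 'café' (header said latin-1)
```

BeautifulSoup's own UnicodeDammit does a more thorough job of this, including honoring <meta charset> declarations in the document itself.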
linkcheck/HtmlParser/htmlsax.py (outdated)
        self.handler = handler

    def feed(self, feed_text):
        if type(feed_text) == bytes:
if isinstance(feed_text, bytes): would be more idiomatic.
I really hope the caller of this API won't ever feed us partial UTF-8 characters!
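The partial-character worry can be handled with an incremental decoder from the stdlib, which buffers an incomplete multi-byte sequence across feed() calls instead of raising (a sketch, not this PR's code):

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# 'é' is b'\xc3\xa9'; split it across two feed() calls:
part1 = decoder.decode(b"caf\xc3")   # trailing \xc3 is buffered, not an error
part2 = decoder.decode(b"\xa9!")     # completes the character
print(part1 + part2)  # 'café!'
```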
linkcheck/HtmlParser/htmlsax.py (outdated)
        self.html_doc = u""

    def flush(self):
        soup = BeautifulSoup(self.html_doc, 'html.parser')
Oh, how about this instead: make self.html_doc a StringIO(), write to it in feed(), and here do html_doc = self.html_doc.getvalue(), and then do the if isinstance(html_doc, bytes): html_doc = html_doc.decode(self.encoding)?
Also, what if the document encoding is not actually ISO-8859-1?
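One way to realize the StringIO suggestion above (illustrative class shape; only feed/flush come from the hunk, and this variant decodes in feed rather than flush for simplicity):

```python
from io import StringIO

class Parser:
    def __init__(self, encoding="utf-8"):
        self.encoding = encoding
        self.html_doc = StringIO()   # accumulate chunks; avoids O(n^2) str +=

    def feed(self, feed_text):
        # Accept str or bytes; decode bytes before buffering.
        if isinstance(feed_text, bytes):
            feed_text = feed_text.decode(self.encoding)
        self.html_doc.write(feed_text)

    def flush(self):
        return self.html_doc.getvalue()

p = Parser()
p.feed(b"<p>caf\xc3\xa9</p>")
p.feed("<p>more</p>")
print(p.flush())  # '<p>café</p><p>more</p>'
```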
linkcheck/HtmlParser/htmlsax.py (outdated)
        for tag in soup.find_all():
            attrs = ListDict()
            for k, v_list in tag.attrs.items():
                if type(v_list) != list:
if not isinstance(v_list, list):
linkcheck/HtmlParser/htmlsax.py (outdated)
                if type(v_list) != list:
                    v_list = [v_list]
                for v in v_list:
                    if type(v) == str:
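Context for the list check above: BeautifulSoup returns multi-valued attributes such as class as lists and everything else as plain strings, so the normalization has to flatten both cases. A standalone sketch (a plain dict stands in for tag.attrs):

```python
def attr_pairs(attrs):
    """Yield (name, value) pairs, expanding bs4-style multi-valued
    attributes (lists) into one pair per value."""
    for k, v_list in attrs.items():
        if not isinstance(v_list, list):   # plain attribute -> wrap once
            v_list = [v_list]
        for v in v_list:
            yield (k, v)

# 'class' is multi-valued in bs4, 'href' is not:
print(list(attr_pairs({"class": ["btn", "big"], "href": "x.html"})))
# [('class', 'btn'), ('class', 'big'), ('href', 'x.html')]
```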
linkcheck/HtmlParser/htmlsax.py (outdated)
            self.handler.comment(comment)

    def debug(self, text):
        raise NotImplementedError("number is not implemented")
linkcheck/HtmlParser/htmlsax.py (outdated)
        raise NotImplementedError("Peeking is deprecated in favor of 'handler.start_element' attribute 'element_text'")


def parser(handler = None):
PEP-8 says no spaces around = here.
@@ -1,4 +1,5 @@
 # required:
 bs4
This also needs to go into install_requires in setup.py.
        'linkcheck.configuration',
        'linkcheck.director',
        'linkcheck.htmlutil',
        'linkcheck.HtmlParser',
This sort of thing is greatly simplified by using setuptools.find_packages().
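For reference, the pattern being suggested (a generic fragment, not this project's actual setup.py):

```python
from setuptools import find_packages

# find_packages() walks the source tree and returns every package that has
# an __init__.py, so the long hand-maintained list above can be replaced
# with a single call in setup.py:
#
#     setup(..., packages=find_packages())
#
print(find_packages())  # e.g. ['linkcheck', 'linkcheck.director', ...] when run in the project root
```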
just a little note to send a big cheer down that lane: great initiative, and i hope we can make this work. it's utterly reckless to have our own C-based HTML parser... in fact, i often wondered if we should just reuse those Javascript-based engines like Phantom or whatever DD ended up with. ;) can't we file a bug against Beautiful Soup to get line numbers reported, or is that something completely out of scope there? progressive enhancement is good. there's a conflict on the travis config file here, which should be easy to fix, and i otherwise agree with the current review. keep up the good work!
Force-pushed 53087ae to f60731f (compare)
I filed the issue here: https://bugs.launchpad.net/beautifulsoup/+bug/1742921

I also tried to implement this with Python's

I am not sure about the encoding handling either. I think we should make some tests focused on the encoding and on the line numbers prior to pulling this.
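For what it's worth, Python's stdlib html.parser (possibly the alternative alluded to above) does expose positions via HTMLParser.getpos(); a minimal sketch of line-number-aware link collection, with names of my own invention:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (line, column, href) for every <a href=...> encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                line, col = self.getpos()  # position of the start tag's '<'
                self.links.append((line, col, href))

parser = LinkCollector()
parser.feed("<html>\n<body>\n<a href='http://example.com'>x</a>\n</body>")
print(parser.links)  # [(3, 0, 'http://example.com')]
```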
anarcat left a comment:
so i'd be really happy to merge this, but we need to make sure we don't crash on weird encodings. travis seems to be failing on this right now - not sure what's up there, but obviously that needs to be fixed as well...
line numbers aren't that important. can't we get another locator of some sort from Beautiful Soup, say something like an XPath? or just the original link string? then people can just grep for that - that's how i use this in the first place.
this will definitely break the GUI, but we have other issues with that thing already (e.g. #105).
        @type data: string
        @return: None
        """
        data = data.encode(self.encoding, "ignore")
i agree encoding is an issue. we don't want to start crashing on weird encodings we find out there, and we don't need to be super fancy - from what i can tell from the previous source, we just drop weird chars... in fact, i wonder why those encoding lines are dropped here in the first place. isn't that where we would fix the encoding?
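In the spirit of the old drop-weird-chars behavior described above, a lenient helper could look like this (a sketch with a made-up name, not LinkChecker's actual code):

```python
def to_text(data, encoding="utf-8"):
    """Decode bytes leniently: bytes invalid in `encoding` are dropped
    instead of raising UnicodeDecodeError."""
    if isinstance(data, bytes):
        return data.decode(encoding, errors="ignore")
    return data  # already text

print(to_text(b"caf\xe9 ok"))             # 'caf ok' - stray latin-1 byte dropped
print(to_text(b"caf\xe9 ok", "latin-1"))  # 'café ok'
```

errors="replace" would keep a visible U+FFFD marker per bad byte instead of silently dropping it, which may be friendlier for debugging.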
Force-pushed bd0cdfe to 37d6148 (compare)
I think I solved the encoding. I removed the decoding parts and it is now stored as bytes (I will try to rewrite it to be
need to review the new patches, but i don't want to block
Force-pushed 29acf9a to 7964412 (compare)
I made several corrections for this. It still doesn't include line number reporting or any other location reference (except the link name and URL, which LinkChecker already gives).
I've no energy to review this today. Please ping me if I forget to come back to it.
@mgedmin ping :) in retrospect, i think i'd be tempted to be more lax with a new beautifulsoup parser than anything else: rolling our own parser (written in C, nothing less!) is a really bad idea and i would be quite glad to get rid of it, even if it means a few more warts.
        else:
            self.html_doc = StringIO()
            self.html_doc.truncate(0)
            self.html_doc.seek(0)
I don't think there's much point in truncating and seeking in an empty html_doc you've just created.
    def reset(self):
        self.html_doc.truncate(0)
        self.html_doc.seek(0)
        self.html_doc = None
Again, I don't see the point in truncating a StringIO you're about to delete. Let the garbage collector take care of it.
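The simpler pattern both comments point at, sketched out (hypothetical class, method names mirroring the hunks above): just bind a fresh StringIO and let the old one be collected.

```python
from io import StringIO

class Buffered:
    def __init__(self):
        self.html_doc = StringIO()

    def feed(self, text):
        self.html_doc.write(text)

    def reset(self):
        # No need to truncate/seek a buffer we are discarding:
        # rebinding leaves the old one to the garbage collector.
        self.html_doc = StringIO()

b = Buffered()
b.feed("<p>hi</p>")
b.reset()
print(b.html_doc.getvalue())  # ''
```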
                self.handler.cdata(content)
            elif isinstance(content, ProcessingInstruction):
                if hasattr(self.handler, 'pi'):
                    self.handler.pi(content[:-1])
Would be nice to have a comment explaining the [:-1] bit.
        return 0

    def pos(self, text):
        return 0
I think if we merge this we should also remove all calls of these placeholder methods (and the methods themselves). I do not want faked line numbers to show up in user-oriented messages.
        self.title = None

-    def start_element (self, tag, attrs):
+    def start_element (self, tag, attrs, title=None):
I would prefer it if the argument was called element_text. Otherwise it can be very confusing for elements like <a> that have a title attribute.
@@ -0,0 +1,3 @@
<meta http-equiv="Content-Type" content="text/html; charset=cp1250">
<a href="file.html">�lu�ou�k� k�� �p�l ��belsk� �dy ����</a>
<img src="img.png" alt="�lu�ou�k� k�� �p�l ��belsk� �dy ����"/>
I'm amazed that GitHub can figure out the right encoding and display this correctly!
url http://www.example.com/
cache key http://www.example.com/
real url http://www.example.com/
url http://www.example.com/"
@@ -0,0 +1,40 @@
# -*- coding: iso-8859-1 -*-
I wouldn't add encoding declarations to files that do not contain any non-ASCII characters.
@PetrDlouhy i know it's been a long time, but @mgedmin finally reviewed your work, and while we could merge as is, it seems there are minor tweaks that would improve the result... could you take a look? (i took the liberty of resolving conflicts after #147 - that could be rebased to remove the merge commit...)
We still have one test failure because of the missing line numbers.
do we have an upstream issue for the line number stuff? I would be ready to drop that feature in favor of having a non-C, non-homemade parser. what do you think @mgedmin?
I don't know. I never filed one.
Oh, yes! Only I'd like to do it properly (i.e. remove any line numbers from the output instead of printing fake ones).

I don't have the energy currently to push this (or any other linkchecker issue), but I will try to provide code reviews.
actually @PetrDlouhy has already filed an issue at https://bugs.launchpad.net/beautifulsoup/+bug/1742921 so we've got that covered. it seems there was zero activity on that and, in general, beautifulsoup does not seem very active, so i am a little worried about blocking on this in general. the response from the mailing list was this, mind you:

so it seems that technically the information is there somewhere, just not available in the API... but maybe we could monkeypatch our way around that? in any case: let's not block on this, but do this properly. @PetrDlouhy unless you want to patch beautifulsoup, i guess the next step here is to rip out line number support from linkchecker itself so that we get tests to pass and, as @mgedmin says, do this properly. :) thanks!
update: upstream doesn't plan on adding the feature, but is open to a pull request. so the next step here is either to rip out the line number code or patch beautifulsoup to include those. @PetrDlouhy are you up for this?
lxml Element has a property for this. Don't know if that would work.
that would be awesome, but at this point i think we're ready to ditch our homegrown parser in favor of beautifulsoup even if it means losing line numbers. it's too much of a liability.

Beautiful Soup is using somebody else's parser, optionally lxml:

All the commits and code comments plus line number support are addressed in #337.
PR for #26
This completely removes the built-in parser and replaces it with BeautifulSoup.
The current implementation mocks the API of the old parser. That means if somebody would like to keep using the old parser (maybe it is faster?), they could turn it into a standalone project and easily switch libraries.
So far this approach works, but there are a few remaining failing tests and it would require a bit of cleaning.
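To illustrate the BeautifulSoup-based approach (a minimal sketch of my own, not the PR's actual code, which mocks the old parser's handler API):

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<a href="http://example.com/">example</a>
<img src="img.png" alt="logo"/>
</body></html>
"""

def extract_links(html):
    """Return (tag, attribute, url) for common link-bearing attributes."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for tag in soup.find_all():
        for attr in ("href", "src"):
            url = tag.attrs.get(attr)
            if url:
                found.append((tag.name, attr, url))
    return found

print(extract_links(HTML))
# [('a', 'href', 'http://example.com/'), ('img', 'src', 'img.png')]
```

A real checker would also cover src-set, CSS url() references, and the other attributes the old C parser handled, but the traversal shape stays the same.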