[WIP] Convert HtmlParser to use Beautiful Soup #249
cjmayo wants to merge 21 commits into linkchecker:master
Conversation
Force-pushed from 04e2ca8 to 400208f
this raises the question of how long we actually want to support python 2. or is your goal eventually to backport this to py2 as well? thanks for your work!
Force-pushed from 5d1c236 to 66d8370
No, I think this is Python 3 only. There shouldn't be too many commits to review when it is all tidied up. I'm going to have more of a look at the parsing. This uses Beautiful Soup, which itself wraps an HTML parser, and then creates another parser for the Beautiful Soup output. Maybe there can be fewer layers, and we can get the line numbers back.
I personally plan to stop supporting Python 2 when it gets EOLed by upstream, which AFAIR is going to happen on January 1st 2020. |
I've got line numbers working. Patch to Beautiful Soup submitted here: I'll update this branch in the coming days. |
Force-pushed from 66d8370 to 60394eb
Well, it is working with my patched Beautiful Soup (Python 3.7 Travis run), cloned to GitHub for use with tox and easier viewing: I need to look at compatibility with the current Beautiful Soup.
Force-pushed from 60394eb to 8885622
Tests passing. I've also addressed the outstanding comments in #119, in a commit on the end for now. I think that's it for this one for the time being; I need to see what happens with Beautiful Soup line numbers and get the python3 branch merged. There will be tidying up and reviewing to do. I've also selected the patched Beautiful Soup for Python 3.7 in requirements.txt, because it was easy to do. If Beautiful Soup is modified, there will need to be another way of testing with old and new versions for a while.
Force-pushed from b02a5a0 to 05dde78
SchoolGuy left a comment:
I am not able to see any critical changes. Since CI is green, I am satisfied.
here we need to decide what we do with our patched beautifulsoup. it's the only reason the test suite is passing, and i don't know if we want to carry a fork of this in the long term. upstream seemed open to the feature but hasn't responded to the patch since april, so i'm worried.
@anarcat I could do a plus one and comment on that patch to bring it to their attention. Just ping me the PR.
it's listed above but hidden behind a "load more" button; here it is again: https://bugs.launchpad.net/beautifulsoup/+bug/1742921
Force-pushed from 05dde78 to 24e8b89
Beautiful Soup now supports line numbers (not yet released): the code in this PR has been tweaked to use it.
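A minimal sketch of the new API (not linkchecker's actual code): from Beautiful Soup 4.8.1 onwards, tags carry `sourceline`/`sourcepos` attributes when the underlying parser supports them. The function name here is illustrative, and the `try/except` lets it degrade gracefully when bs4 is absent or too old.

```python
def link_position(html):
    """Return (line, column) of the first <a> tag, or None when
    Beautiful Soup is missing or predates line-number support."""
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        return None
    a = BeautifulSoup(html, "html.parser").find("a")
    # Older bs4 releases have no sourceline attribute at all
    line = getattr(a, "sourceline", None)
    if line is None:
        return None
    return (line, a.sourcepos)
```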
oh my, that's amazing! now the question is when this will be released in bs4, and whether we need to wait for that to happen... why do tests pass here? is it because we patch to use the latest bs4 from bzr upstream? or did we take the line numbers out of the test suite? it would be nice to still be able to run with older bs4 releases, if that's possible: that way we could release with this change right away without either breaking the test suite or having to wait for the upstream release.
I've created a GitHub mirror with the latest bs4 code plus a modified version number, and tox is using that for the Python 3.7 tests (line 26 in 24e8b89) but not for Python 3.5 or 3.6 (hence the modified version number, so that you can tell from the Travis log which one was installed). test_all_parts.py checks whether bs4 supports line numbers and picks the appropriate results file by skipping a test; maybe there is a better way to code that. The results file used when bs4 does not have line number support shows that in that case you get values of None:
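One possible runtime probe for the capability check described above (illustrative, not the PR's exact test-suite code): parse a trivial document and see whether the resulting tag reports a source line.

```python
def bs4_supports_line_numbers():
    """True if the installed Beautiful Soup (>= 4.8.1) records the
    source line of parsed tags; False otherwise."""
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        return False
    tag = BeautifulSoup("<p>hi</p>", "html.parser").p
    return getattr(tag, "sourceline", None) is not None
```

A test module could then use this with `unittest.skipUnless` to select which results file to compare against.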
this looks pretty good, although it would be better to source directly from the bs4 source tree than from your fork... do they mirror on github or provide dev snapshots in some way? awesome work.
I haven't researched that; the issue being that bzr still has __version__ = "4.8.0", so you wouldn't be able to tell from the Travis log which version you had downloaded. Cheers.
OK, #230 is in. Votes please for one of:
not sure i know the impact of each well enough for my vote to be relevant, but... wouldn't merging this before py3 make the latter easier, since we'd be removing a shit ton of code? then again, bs4 is not absolutely necessary for our survival in the short term; py3 is. so i'd be tempted to do that first... i'd go #210.
I like lots of tiny PRs, those are fun and easy to review. |
UnicodeDammit input has to be non-unicode to trigger character set detection.
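The commit message above can be illustrated with a small sketch (the helper name is ours, not linkchecker's): UnicodeDammit only performs character set detection when given bytes; str input leaves `original_encoding` as None. Wrapped in try/except so it degrades when bs4 is not installed.

```python
def detected_encoding(raw):
    """Return the charset UnicodeDammit detected, or None when the
    input was already str (detection skipped) or bs4 is unavailable."""
    try:
        from bs4 import UnicodeDammit
    except ImportError:
        return None
    return UnicodeDammit(raw).original_encoding
```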
Read-only property with new Beautiful Soup parser.
Needs miniboa >= 1.0.8 for telnet test on Python 3.7.
Test with older Beautiful Soup without line number support on Python 3.5.
Resolve tox deprecation warning: Matching undeclared envs is deprecated. Be sure all the envs that Tox should run are declared in the tox config.
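The tox warning above is silenced by declaring every env tox may run; a minimal sketch of the relevant tox.ini fragment (env names are illustrative, not necessarily linkchecker's exact list):

```ini
[tox]
envlist = py35, py36, py37
```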
Shown every time linkchecker is run: /usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used. 'The soupsieve package is not installed. CSS selectors cannot be used.'
File "/usr/lib/python3.7/site-packages/linkcheck/httputil.py", line 68, in asn1_generaltime_to_seconds
line: res = datetime.strptime(timestr, timeformat + 'Z')
locals:
res = <local> None
datetime = <global> <class 'datetime.datetime'>
datetime.strptime = <global> <built-in method strptime of type object at 0x7fa39064dda0>
timestr = <local> b'20191106202117Z'
timeformat = <local> '%Y%m%d%H%M%S'
TypeError: strptime() argument 1 must be str, not bytes
pyOpenSSL OpenSSL.crypto.X509.get_notAfter() returns bytes:
https://www.pyopenssl.org/en/stable/api/crypto.html#OpenSSL.crypto.X509.get_notAfter
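A minimal sketch of the fix implied by the traceback: decode the bytes returned by get_notAfter() before handing them to strptime (the sample value and format string are copied from the log above).

```python
from datetime import datetime

# pyOpenSSL's X509.get_notAfter() returns an ASN.1 GENERALIZEDTIME as bytes
timestr = b"20191106202117Z"
timeformat = "%Y%m%d%H%M%S"

# Decoding to str first avoids:
#   TypeError: strptime() argument 1 must be str, not bytes
res = datetime.strptime(timestr.decode("ascii"), timeformat + "Z")
```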
Otherwise non-UTF-8 codes are misinterpreted:
>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
Force-pushed from 8732142 to 5f3e58d
It seemed like rebasing on master after the latest changes, including the pdfminer tox fix, was going to be so simple, but alas enabling the sitemap test resulted in another failure. My available time has been taken up with #335 (which is included here to get the green tick on this one) - hopefully I will be able to respond to other public and private questions before too long.
Good catch about the sitemap parser being broken!
Apologies for creating extra work for you, but the regression test I thought I enabled wasn't actually testing the thing I thought it was testing -- passing unicode data to the expat parser on Python 2 "works" as long as that unicode data contains only characters in the ASCII subset.
EDIT: I thought I was reviewing #335, but actually GitHub was showing me only the most recent changes of #249. Oops.
data = url_data.get_content()
isfinal = True
try:
    self.parser.Parse(data, isfinal)
This explodes on Python 2 as soon as the sitemap.xml contains any non-ASCII characters. Try this:
diff --git a/tests/checker/data/sitemap.xml b/tests/checker/data/sitemap.xml
index 30e23048..b0ae9a20 100644
--- a/tests/checker/data/sitemap.xml
+++ b/tests/checker/data/sitemap.xml
@@ -6,4 +6,10 @@
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
+ <url>
+ <loc>http://www.example.com/?ascii=nø</loc>
+ <lastmod>2005-01-01</lastmod>
+ <changefreq>monthly</changefreq>
+ <priority>0.8</priority>
+ </url>
 </urlset>
Enabling more tests is great (and has proved its value). I came here to help get Python 3 supported, it's all part of the same effort.
URLs within a sitemap file were not being captured.
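An illustrative sketch (not linkchecker's actual parser class) of what the sitemap fix amounts to: collect the text of each `<loc>` element with expat. Feeding the parser bytes rather than decoded str sidesteps the Python 2 ASCII-only pitfall discussed above.

```python
import xml.parsers.expat

def sitemap_urls(data):
    """Extract the URLs inside <loc> elements from sitemap XML bytes."""
    urls = []
    in_loc = [False]
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if name == "loc":
            in_loc[0] = True
            urls.append("")  # accumulate: expat may split character data

    def end(name):
        if name == "loc":
            in_loc[0] = False

    def chars(text):
        if in_loc[0]:
            urls[-1] += text

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)
    return urls
```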
Force-pushed from f723b97 to 4f7bc55
I tried to push to https://github.com/linkchecker/linkchecker.git and got a permission denied response. I'm keen not to be the sole "keeper" of these big branches (and rate limiter), so am happy to store them here. To date there has been a fair bit of rebasing going on, which could have made sharing difficult, but I think I am pretty much done with this one now (there is hope!).
So, being a member of the organisation does not grant one write access to the org's repos! Interesting. I've created a new team ('developers'), invited you and @anarcat, and granted all team members write access to this repository. |
That worked. I've done one last rebase to remove the sitemap commit that is now in master and pushed the branch to https://github.com/linkchecker/linkchecker.git. What to do with this PR? Is it not possible to change the source branch? I haven't updated it. Is a new PR needed to run CI on the new branch? (GitHub does show a couple of links to this PR when you select the new branch.)
Yes, a new PR is needed. GitHub lets you change the target branch of the PR, but not the source branch/repo.
OK, created #337 and closing this one. |
For merging after the python3 branch.
This is derived from the work by @PetrDlouhy in #119 and, like that PR, also incorporates the work in #125.
It is based on the python3 branch in #210 and, indeed, does not work on Python 2. A CI test pass here indicates success on Python 3.5 and Python 3.6. Python 3.7 results are included, but the telnet tests will likely fail because miniboa is not Python 3.7 compatible.
Currently there is a test failure in the python3 branch (#210), described in #230.
However, I have been able to install and run this under Python 3.7.