[WIP] Convert HtmlParser to use Beautiful Soup by cjmayo · Pull Request #249 · linkchecker/linkchecker

cjmayo · 2019-04-21T19:03:49Z

For merging after python3 branch

This is derived from the work by @PetrDlouhy in #119 and like that PR also the work in #125.

It is based on the python3 branch in #210, and indeed does not work on Python 2. A CI test pass here indicates success on Python 3.5 and Python 3.6. Python 3.7 results are included but telnet tests will likely fail due to miniboa not being Python 3.7 compatible.

Currently there is a test fail in the python3 branch #210 described in #230.
However, I have been able to install and run this under Python 3.7.

anarcat · 2019-04-21T21:08:41Z

this begs the question of how long we actually want to support python 2. or is your goal eventually to backport this to py2 as well?

thanks for your work!

cjmayo · 2019-04-22T11:11:51Z

this begs the question of how long we actually want to support python 2. or is your goal eventually to backport this to py2 as well?

No, I think this is Python 3 only. There shouldn't be too many commits to review when it is all tidied up.
For the python3 branch I think (at the moment!) we still have to go with Python 2 & 3, else we will have no CI testing per commit.

I'm going to have more of a look at the parsing. This uses Beautiful Soup which is using an HTML parser and then creates a parser for the Beautiful Soup output. Maybe there can be fewer layers and get back the line numbers.

mgedmin · 2019-04-23T09:25:08Z

this begs the question of how long we actually want to support python 2.

I personally plan to stop supporting Python 2 when it gets EOLed by upstream, which AFAIR is going to happen on January 1st 2020.

cjmayo · 2019-04-29T19:32:18Z

I've got line numbers working. Patch to Beautiful Soup submitted here:
https://bugs.launchpad.net/beautifulsoup/+bug/1742921

I'll update this branch in the coming days.

cjmayo · 2019-04-30T19:34:10Z

Well, working with my patched Beautiful Soup (Python 3.7 Travis run), cloned to GitHub for use with tox, and easier to view:
https://github.com/cjmayo/beautifulsoup/commit/21f6f34307129d12f686b72e1c8b565fff4af922

Need to look at compatibility with current Beautiful Soup.

cjmayo · 2019-05-01T19:24:26Z

Tests passing. I've also addressed the outstanding comments in #119, in a commit on the end for now.

I think that's it for this one for the time being, need to see what happens with Beautiful Soup line numbers and get the python3 branch merged.

There will be tidying up and reviewing to do. I've also selected the patched Beautiful Soup for Python 3.7 in requirements.txt, because it was easy to do. If Beautiful Soup is modified there will need to be another way of testing with old and new versions for a while.

SchoolGuy

I am not able to see any critical changes. Since CI is green I am satisfied.

anarcat · 2019-07-18T09:34:11Z

here we need to decide what we do with our patched beautifulsoup. it's the only reason why the test suite is passing and i don't know if we want to carry a fork of this in the long term. upstream seemed open to the feature but hasn't responded to the patch since april, so i'm worried.

SchoolGuy · 2019-07-18T09:52:18Z

@anarcat I could do a plus one and comment on that patch to bring it to their attention list. Just ping me the PR.

anarcat · 2019-07-18T10:07:12Z

it's listed above, but hidden behind a "load more" button, here it is again: https://bugs.launchpad.net/beautifulsoup/+bug/1742921

cjmayo · 2019-07-22T19:18:41Z

Beautiful Soup now supports line numbers (not yet released):
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/changes/516?start_revid=516

Code on this PR tweaked to use it.

anarcat · 2019-07-29T13:49:59Z

oh my, that's amazing! now the question is when will this be released in bs4, and do we need to wait for this to happen...

why do tests pass here? is it because we patch to use the latest bs4 from bzr upstream? or that we took out the line numbers from the test suite?

it would be nice to still be able to run with older bs4 releases, is that possible? that way we could release with this change rightaway without either breaking the test suite or having to wait for the upstream release....

cjmayo · 2019-07-29T19:01:16Z

I've created a GitHub mirror with the latest bs4 code + a modified version number, tox is using that for Python 3.7 tests:

linkchecker/tox.ini

Line 26 in 24e8b89

    
           https://github.com/cjmayo/beautifulsoup/archive/ad776316f7e0d80b5c3ee07d2a083cb884e8017c.tar.gz

but not Python 3.5 or 3.6 (hence the modified release number so that you can tell from the Travis log).

test_all_parts.py is checking whether bs4 supports line numbers:

linkchecker/tests/checker/test_all_parts.py

Line 26 in 24e8b89

bs_has_linenos = BeautifulSoup("<a>", "html.parser").a.sourceline is not None

and using the appropriate results file by skipping a test. Maybe there is a better way to code that.

The results file used if bs4 does not have line number support shows that in this case you get values of None :

linkchecker/tests/checker/data/all_parts.html.result

Line 6 in 9f824cf

line None

anarcat · 2019-07-29T19:23:53Z

this looks pretty good, although it would be better to source directly from the bs4 source tree than your fork... do they mirror on github or provide dev snapshots in some way?

awesome work.

cjmayo · 2019-07-29T19:33:49Z

I haven't researched, the issue being that bzr still has:

__version__ = "4.8.0"

so you wouldn't be able to tell from the Travis log which version you had downloaded.

Cheers.

cjmayo · 2019-09-04T19:10:13Z

OK #230 in. Votes please for one of:

lots more individual PRs
tidying up this one and merging in
tidying up [WIP] Python3 Support #210 and merging before this one

anarcat · 2019-09-04T20:35:45Z

not sure i know very well the impact of each enough for my vote to be relevant but...

wouldn't merging this before py3 make the latter easier to do since we'd be removing a shit ton of code?

then again, bs4 is not absolutely necessary for our survival, in the short term: py3 is, however. so i'd be tempted to do that first...

so I'd go #210.

mgedmin · 2019-09-05T08:18:39Z

I like lots of tiny PRs, those are fun and easy to review.

UnicodeDammit input has to be non-unicode to trigger character set detection.

Read-only property with new Beautiful Soup parser.

Needs miniboa >= 1.0.8 for telnet test on Python 3.7. Test with older Beautiful Soup without line number support on Python 3.5. Resolve tox deprecation warning: Matching undeclared envs is deprecated. Be sure all the envs that Tox should run are declared in the tox config.

Shown every time linkchecker is run: /usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used. 'The soupsieve package is not installed. CSS selectors cannot be used.'

File "/usr/lib/python3.7/site-packages/linkcheck/httputil.py", line 68, in asn1_generaltime_to_seconds line: res = datetime.strptime(timestr, timeformat + 'Z') locals: res = <local> None datetime = <global> <class 'datetime.datetime'> datetime.strptime = <global> <built-in method strptime of type object at 0x7fa39064dda0> timestr = <local> b'20191106202117Z' timeformat = <local> '%Y%m%d%H%M%S' TypeError: strptime() argument 1 must be str, not bytes pyOpenSSL OpenSSL.crypto.X509.get_notAfter() returns bytes: https://www.pyopenssl.org/en/stable/api/crypto.html#OpenSSL.crypto.X509.get_notAfter

Else non-UTF-8 codes are misinterpreted: >>> from urllib import parse >>> parse.unquote("%FF") '�' >>> parse.unquote("%FF", "latin1") 'ÿ'

cjmayo · 2019-10-26T19:02:54Z

It seemed like it was going to be so simple to rebase on master after the latest changes including the pdfminer tox fix, but alas enabling the sitemap test resulted in another failure. My available time has been taken up with #335 (which is included here to get the green tick on this one) - hopefully I will be able to respond to other public and private questions before too long.

mgedmin

Good catch about the sitemap parser being broken!

Apologies for creating extra work for you, but the regression test I thought I enabled wasn't actually testing the thing I thought it was testing -- passing unicode data to the expat parser on Python 2 "works" as long as that unicode data contains only characters in the ASCII subset.

EDIT: I thought I was reviewing #335, but actually GitHub was showing me only the most recent changes of #249. Oops.

mgedmin · 2019-10-28T07:31:20Z

linkcheck/parser/sitemap.py

+        data = url_data.get_content()
        isfinal = True
        try:
            self.parser.Parse(data, isfinal)


This explodes on Python 2 as soon as the sitemap.xml contains any non-ASCII characters. Try this:

diff --git a/tests/checker/data/sitemap.xml b/tests/checker/data/sitemap.xml index 30e23048..b0ae9a20 100644 --- a/tests/checker/data/sitemap.xml +++ b/tests/checker/data/sitemap.xml @@ -6,4 +6,10 @@ <changefreq>monthly</changefreq> <priority>0.8</priority> </url> + <url> + <loc>http://www.example.com/?ascii=nø</loc> + <lastmod>2005-01-01</lastmod> + <changefreq>monthly</changefreq> + <priority>0.8</priority> + </url> </urlset>

Apologies for creating extra work for you, but the regression test I thought I enabled wasn't actually testing the thing I thought it was testing -- passing unicode data to the expat parser on Python 2 "works" as long as that unicode data contains only characters in the ASCII subset.

Enabling more tests is great (and has proved its value). I came here to help get Python 3 supported, it's all part of the same effort.

URLs within a sitemap file were not being captured.

cjmayo · 2019-10-28T19:53:00Z

We've added you as a member to the linkchecker organization on GH a while back; does that give you enough permissions to push to this repo? If so, it would be convenient to have you PRs based on branches here, so we could all collaborate. If not, then that's something we ought to fix.

I tried to push to https://github.com/linkchecker/linkchecker.git and got a permission denied response.

I'm keen to not be the sole "keeper" of these big branches (and rate limiter). So am happy to store them here. To date there has been a fair bit of rebasing going on which could have made sharing difficult but I think I am pretty much done on this one now (there is hope!).

mgedmin · 2019-10-29T13:50:02Z

I tried to push to https://github.com/linkchecker/linkchecker.git and got a permission denied response.

So, being a member of the organisation does not grant one write access to the org's repos! Interesting.

I've created a new team ('developers'), invited you and @anarcat, and granted all team members write access to this repository.

cjmayo · 2019-10-29T19:53:22Z

I've created a new team ('developers'), invited you and @anarcat, and granted all team members write access to this repository.

That worked. I've done one last rebase to remove the sitemap commit that is now in master and pushed the branch to https://github.com/linkchecker/linkchecker.git.

What to do with this PR? It is not possible to change the source branch? I haven't updated it. Is a new PR needed to run CI on the new branch? (GitHub does show a couple of links to this PR when you select the new branch).

mgedmin · 2019-10-29T20:14:43Z

Yes, a new PR is needed. GitHub lets you change the target branch of the PR, but not the source branch/repo.

cjmayo · 2019-10-30T19:56:38Z

OK, created #337 and closing this one.

cjmayo force-pushed the htmlparser-beautifulsoup branch from 04e2ca8 to 400208f Compare April 21, 2019 19:09

cjmayo force-pushed the htmlparser-beautifulsoup branch 2 times, most recently from 5d1c236 to 66d8370 Compare April 22, 2019 10:58

cjmayo force-pushed the htmlparser-beautifulsoup branch from 66d8370 to 60394eb Compare April 30, 2019 19:27

cjmayo force-pushed the htmlparser-beautifulsoup branch from 60394eb to 8885622 Compare May 1, 2019 19:14

cjmayo force-pushed the htmlparser-beautifulsoup branch 3 times, most recently from b02a5a0 to 05dde78 Compare May 4, 2019 19:25

cjmayo mentioned this pull request May 10, 2019

{python3_20} Python3: decode parts before submitting them to urllib.quote() #230

Merged

cjmayo mentioned this pull request Jul 16, 2019

[WIP] Python3 Support #210

Closed

SchoolGuy approved these changes Jul 18, 2019

View reviewed changes

cjmayo force-pushed the htmlparser-beautifulsoup branch from 05dde78 to 24e8b89 Compare July 22, 2019 18:59

cjmayo mentioned this pull request Sep 3, 2019

consider removing dependency on meliae #268

Open

PetrDlouhy and others added 17 commits October 25, 2019 19:42

html parser: use name instead of peeking

c28889a

fix parser tests after parser change

f97064b

fix http test after parser change

3df215f

change test_parse encoding to UTF-8

0eab6c7

fix parser encoding tests after change of parser

4d67754

UnicodeDammit input has to be non-unicode to trigger character set detection.

Don't set parser.encoding

6a4ebd2

Read-only property with new Beautiful Soup parser.

Support Beautiful Soup line numbers

3ba547a

Improve pretty printed comments

f965e4a

Hide Beautiful Soup soupsieve warning

af763f7

Shown every time linkchecker is run: /usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The soupsieve package is not installed. CSS selectors cannot be used. 'The soupsieve package is not installed. CSS selectors cannot be used.'

Reuse soup object used for detecting encoding in the HTML parser

89db471

Pass encoding when unquoting

47bde1a

Else non-UTF-8 codes are misinterpreted: >>> from urllib import parse >>> parse.unquote("%FF") '�' >>> parse.unquote("%FF", "latin1") 'ÿ'

Update strformat.ascii_safe() because paths are now strings

dcaa0f9

Update fileutil.pathencode() because paths are now strings

e4b454b

Always use utf-8 encoding when quoting

5934bd2

Add new tests for URL quoting

5f3e58d

cjmayo force-pushed the htmlparser-beautifulsoup branch from 8732142 to 5f3e58d Compare October 25, 2019 18:42

mgedmin requested changes Oct 28, 2019

View reviewed changes

Fix XmlTagUrlParser and make Python 3 compatible

4f7bc55

URLs within a sitemap file were not being captured.

cjmayo force-pushed the htmlparser-beautifulsoup branch from f723b97 to 4f7bc55 Compare October 28, 2019 19:20

cjmayo mentioned this pull request Oct 28, 2019

Fix XmlTagUrlParser and make Python 3 compatible #335

Merged

cjmayo mentioned this pull request Oct 30, 2019

Change HtmlParser to use Beautiful Soup #337

Merged

cjmayo closed this Oct 30, 2019

Conversation

cjmayo commented Apr 21, 2019

Uh oh!

anarcat commented Apr 21, 2019

Uh oh!

cjmayo commented Apr 22, 2019

Uh oh!

mgedmin commented Apr 23, 2019

Uh oh!

cjmayo commented Apr 29, 2019

Uh oh!

cjmayo commented Apr 30, 2019

Uh oh!

cjmayo commented May 1, 2019

Uh oh!

SchoolGuy left a comment

Choose a reason for hiding this comment

Uh oh!

anarcat commented Jul 18, 2019

Uh oh!

SchoolGuy commented Jul 18, 2019

Uh oh!

anarcat commented Jul 18, 2019

Uh oh!

cjmayo commented Jul 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anarcat commented Jul 29, 2019

Uh oh!

cjmayo commented Jul 29, 2019

Uh oh!

anarcat commented Jul 29, 2019

Uh oh!

cjmayo commented Jul 29, 2019

Uh oh!

cjmayo commented Sep 4, 2019

Uh oh!

anarcat commented Sep 4, 2019

Uh oh!

mgedmin commented Sep 5, 2019

Uh oh!

cjmayo commented Oct 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgedmin left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgedmin Oct 28, 2019

Choose a reason for hiding this comment

Uh oh!

cjmayo Oct 28, 2019

Choose a reason for hiding this comment

Uh oh!

cjmayo commented Oct 28, 2019

Uh oh!

mgedmin commented Oct 29, 2019

Uh oh!

cjmayo commented Oct 29, 2019

Uh oh!

mgedmin commented Oct 29, 2019

Uh oh!

cjmayo commented Oct 30, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

cjmayo commented Jul 22, 2019 •

edited

Loading

cjmayo commented Oct 26, 2019 •

edited

Loading

mgedmin left a comment •

edited

Loading