Skip to content

[WIP] Convert HtmlParser to use Beautiful Soup#249

Closed
cjmayo wants to merge 21 commits intolinkchecker:masterfrom
cjmayo:htmlparser-beautifulsoup
Closed

[WIP] Convert HtmlParser to use Beautiful Soup#249
cjmayo wants to merge 21 commits intolinkchecker:masterfrom
cjmayo:htmlparser-beautifulsoup

Conversation

@cjmayo
Copy link
Copy Markdown
Contributor

@cjmayo cjmayo commented Apr 21, 2019

For merging after python3 branch

This is derived from the work by @PetrDlouhy in #119 and like that PR also the work in #125.

It is based on the python3 branch in #210, and indeed does not work on Python 2. A CI test pass here indicates success on Python 3.5 and Python 3.6. Python 3.7 results are included but telnet tests will likely fail due to miniboa not being Python 3.7 compatible.

Currently there is a test fail in the python3 branch #210 described in #230.
However, I have been able to install and run this under Python 3.7.

@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch from 04e2ca8 to 400208f Compare April 21, 2019 19:09
@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Apr 21, 2019

this begs the question of how long we actually want to support python 2. or is your goal eventually to backport this to py2 as well?

thanks for your work!

@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch 2 times, most recently from 5d1c236 to 66d8370 Compare April 22, 2019 10:58
@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Apr 22, 2019

this begs the question of how long we actually want to support python 2. or is your goal eventually to backport this to py2 as well?

No, I think this is Python 3 only. There shouldn't be too many commits to review when it is all tidied up.
For the python3 branch I think (at the moment!) we still have to go with Python 2 & 3, else we will have no CI testing per commit.

I'm going to have more of a look at the parsing. This uses Beautiful Soup which is using an HTML parser and then creates a parser for the Beautiful Soup output. Maybe there can be fewer layers and get back the line numbers.

@mgedmin
Copy link
Copy Markdown
Contributor

mgedmin commented Apr 23, 2019

this begs the question of how long we actually want to support python 2.

I personally plan to stop supporting Python 2 when it gets EOLed by upstream, which AFAIR is going to happen on January 1st 2020.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Apr 29, 2019

I've got line numbers working. Patch to Beautiful Soup submitted here:
https://bugs.launchpad.net/beautifulsoup/+bug/1742921

I'll update this branch in the coming days.

@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch from 66d8370 to 60394eb Compare April 30, 2019 19:27
@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Apr 30, 2019

Well, working with my patched Beautiful Soup (Python 3.7 Travis run), cloned to GitHub for use with tox, and easier to view:
https://github.com/cjmayo/beautifulsoup/commit/21f6f34307129d12f686b72e1c8b565fff4af922

Need to look at compatibility with current Beautiful Soup.

@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch from 60394eb to 8885622 Compare May 1, 2019 19:14
@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented May 1, 2019

Tests passing. I've also addressed the outstanding comments in #119, in a commit on the end for now.

I think that's it for this one for the time being, need to see what happens with Beautiful Soup line numbers and get the python3 branch merged.

There will be tidying up and reviewing to do. I've also selected the patched Beautiful Soup for Python 3.7 in requirements.txt, because it was easy to do. If Beautiful Soup is modified there will need to be another way of testing with old and new versions for a while.

@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch 3 times, most recently from b02a5a0 to 05dde78 Compare May 4, 2019 19:25
@cjmayo cjmayo mentioned this pull request Jul 16, 2019
Copy link
Copy Markdown

@SchoolGuy SchoolGuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not able to see any critical changes. Since CI is green I am satisfied.

@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Jul 18, 2019

here we need to decide what we do with our patched beautifulsoup. it's the only reason why the test suite is passing and i don't know if we want to carry a fork of this in the long term. upstream seemed open to the feature but hasn't responded to the patch since april, so i'm worried.

@SchoolGuy
Copy link
Copy Markdown

@anarcat I could do a plus one and comment on that patch to bring it to their attention list. Just ping me the PR.

@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Jul 18, 2019

it's listed above, but hidden behind a "load more" button, here it is again: https://bugs.launchpad.net/beautifulsoup/+bug/1742921

@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch from 05dde78 to 24e8b89 Compare July 22, 2019 18:59
@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Jul 22, 2019

Beautiful Soup now supports line numbers (not yet released):
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/changes/516?start_revid=516

Code on this PR tweaked to use it.

@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Jul 29, 2019

oh my, that's amazing! now the question is when will this be released in bs4, and do we need to wait for this to happen...

why do tests pass here? is it because we patch to use the latest bs4 from bzr upstream? or that we took out the line numbers from the test suite?

it would be nice to still be able to run with older bs4 releases, is that possible? that way we could release with this change rightaway without either breaking the test suite or having to wait for the upstream release....

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Jul 29, 2019

I've created a GitHub mirror with the latest bs4 code + a modified version number, tox is using that for Python 3.7 tests:

https://github.com/cjmayo/beautifulsoup/archive/ad776316f7e0d80b5c3ee07d2a083cb884e8017c.tar.gz

but not Python 3.5 or 3.6 (hence the modified release number so that you can tell from the Travis log).

test_all_parts.py is checking whether bs4 supports line numbers:

bs_has_linenos = BeautifulSoup("<a>", "html.parser").a.sourceline is not None

and using the appropriate results file by skipping a test. Maybe there is a better way to code that.

The results file used if bs4 does not have line number support shows that in this case you get values of None :

@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Jul 29, 2019

this looks pretty good, although it would be better to source directly from the bs4 source tree than your fork... do they mirror on github or provide dev snapshots in some way?

awesome work.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Jul 29, 2019

I haven't researched, the issue being that bzr still has:

__version__ = "4.8.0"

so you wouldn't be able to tell from the Travis log which version you had downloaded.

Cheers.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Sep 4, 2019

OK #230 in. Votes please for one of:

  1. lots more individual PRs
  2. tidying up this one and merging in
  3. tidying up [WIP] Python3 Support #210 and merging before this one

@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Sep 4, 2019

not sure i know very well the impact of each enough for my vote to be relevant but...

wouldn't merging this before py3 make the latter easier to do since we'd be removing a shit ton of code?

then again, bs4 is not absolutely necessary for our survival, in the short term: py3 is, however. so i'd be tempted to do that first...

so I'd go #210.

@mgedmin
Copy link
Copy Markdown
Contributor

mgedmin commented Sep 5, 2019

I like lots of tiny PRs, those are fun and easy to review.

PetrDlouhy and others added 17 commits October 25, 2019 19:42
UnicodeDammit input has to be non-unicode to trigger character set
detection.
Read-only property with new Beautiful Soup parser.
Needs miniboa >= 1.0.8 for telnet test on Python 3.7.
Test with older Beautiful Soup without line number support on
Python 3.5.

Resolve tox deprecation warning:
  Matching undeclared envs is deprecated. Be sure all the envs that Tox
  should run are declared in the tox config.
Shown every time linkchecker is run:

/usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The
soupsieve package is not installed. CSS selectors cannot be used.
  'The soupsieve package is not installed. CSS selectors cannot be used.'
  File "/usr/lib/python3.7/site-packages/linkcheck/httputil.py", line 68, in asn1_generaltime_to_seconds
    line: res = datetime.strptime(timestr, timeformat + 'Z')
    locals:
      res = <local> None
      datetime = <global> <class 'datetime.datetime'>
      datetime.strptime = <global> <built-in method strptime of type object at 0x7fa39064dda0>
      timestr = <local> b'20191106202117Z'
      timeformat = <local> '%Y%m%d%H%M%S'
TypeError: strptime() argument 1 must be str, not bytes

pyOpenSSL OpenSSL.crypto.X509.get_notAfter() returns bytes:
https://www.pyopenssl.org/en/stable/api/crypto.html#OpenSSL.crypto.X509.get_notAfter
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
@cjmayo cjmayo force-pushed the htmlparser-beautifulsoup branch from 8732142 to 5f3e58d Compare October 25, 2019 18:42
@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Oct 26, 2019

It seemed like it was going to be so simple to rebase on master after the latest changes including the pdfminer tox fix, but alas enabling the sitemap test resulted in another failure. My available time has been taken up with #335 (which is included here to get the green tick on this one) - hopefully I will be able to respond to other public and private questions before too long.

Copy link
Copy Markdown
Contributor

@mgedmin mgedmin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch about the sitemap parser being broken!

Apologies for creating extra work for you, but the regression test I thought I enabled wasn't actually testing the thing I thought it was testing -- passing unicode data to the expat parser on Python 2 "works" as long as that unicode data contains only characters in the ASCII subset.

EDIT: I thought I was reviewing #335, but actually GitHub was showing me only the most recent changes of #249. Oops.

data = url_data.get_content()
isfinal = True
try:
self.parser.Parse(data, isfinal)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explodes on Python 2 as soon as the sitemap.xml contains any non-ASCII characters. Try this:

diff --git a/tests/checker/data/sitemap.xml b/tests/checker/data/sitemap.xml
index 30e23048..b0ae9a20 100644
--- a/tests/checker/data/sitemap.xml
+++ b/tests/checker/data/sitemap.xml
@@ -6,4 +6,10 @@
       <changefreq>monthly</changefreq>
       <priority>0.8</priority>
    </url>
+   <url>
+      <loc>http://www.example.com/?ascii=nø</loc>
+      <lastmod>2005-01-01</lastmod>
+      <changefreq>monthly</changefreq>
+      <priority>0.8</priority>
+   </url>
 </urlset> 

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for creating extra work for you, but the regression test I thought I enabled wasn't actually testing the thing I thought it was testing -- passing unicode data to the expat parser on Python 2 "works" as long as that unicode data contains only characters in the ASCII subset.

Enabling more tests is great (and has proved its value). I came here to help get Python 3 supported, it's all part of the same effort.

URLs within a sitemap file were not being captured.
@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Oct 28, 2019

We've added you as a member to the linkchecker organization on GH a while back; does that give you enough permissions to push to this repo? If so, it would be convenient to have you PRs based on branches here, so we could all collaborate. If not, then that's something we ought to fix.

I tried to push to https://github.com/linkchecker/linkchecker.git and got a permission denied response.

I'm keen to not be the sole "keeper" of these big branches (and rate limiter). So am happy to store them here. To date there has been a fair bit of rebasing going on which could have made sharing difficult but I think I am pretty much done on this one now (there is hope!).

@mgedmin
Copy link
Copy Markdown
Contributor

mgedmin commented Oct 29, 2019

I tried to push to https://github.com/linkchecker/linkchecker.git and got a permission denied response.

So, being a member of the organisation does not grant one write access to the org's repos! Interesting.

I've created a new team ('developers'), invited you and @anarcat, and granted all team members write access to this repository.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Oct 29, 2019

I've created a new team ('developers'), invited you and @anarcat, and granted all team members write access to this repository.

That worked. I've done one last rebase to remove the sitemap commit that is now in master and pushed the branch to https://github.com/linkchecker/linkchecker.git.

What to do with this PR? It is not possible to change the source branch? I haven't updated it. Is a new PR needed to run CI on the new branch? (GitHub does show a couple of links to this PR when you select the new branch).

@mgedmin
Copy link
Copy Markdown
Contributor

mgedmin commented Oct 29, 2019

Yes, a new PR is needed. GitHub lets you change the target branch of the PR, but not the source branch/repo.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Oct 30, 2019

OK, created #337 and closing this one.

@cjmayo cjmayo closed this Oct 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants