Skip to content

{python3_20} Python3: decode parts before submitting them to urllib.quote()#230

Merged
anarcat merged 1 commit intolinkchecker:masterfrom
cjmayo:python3_20
Sep 4, 2019
Merged

{python3_20} Python3: decode parts before submitting them to urllib.quote()#230
anarcat merged 1 commit intolinkchecker:masterfrom
cjmayo:python3_20

Conversation

@cjmayo
Copy link
Copy Markdown
Contributor

@cjmayo cjmayo commented Apr 10, 2019

No description provided.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Apr 19, 2019

Apply after #227, this is rebased on python3_17 which is the first commit in this PR.

@cjmayo cjmayo force-pushed the python3_20 branch 2 times, most recently from a296b71 to abedcd1 Compare April 23, 2019 19:06
query = query.encode(encoding, 'ignore')
# if ? is in the query, split it off, seen at msdn.microsoft.com
append = ""
query = decode_for_unquote(query)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels wrong. We've only just encoded it to encoding or url_encoding (which is not necessarily UTF-8) three lines up from here, and now we're decoding as UTF-8.

Copy link
Copy Markdown
Contributor Author

@cjmayo cjmayo Apr 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried taking it out and got 67 failures including:

_____________________________________________ TestMailBad.test_error_mail ______________________________________________
[gw14] linux -- Python 3.6.8 /usr/bin/python3.6

self = <tests.checker.test_mail_bad.TestMailBad testMethod=test_error_mail>

    def test_error_mail (self):
        # too long or too short
>       self.mail_error(u"mailto:@")

tests/checker/test_mail_bad.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/checker/__init__.py:279: in mail_error
    return self.mail_test(addr, u"error", **kwargs)
tests/checker/__init__.py:283: in mail_test
    url = self.norm(addr)
tests/checker/__init__.py:191: in norm
    return linkcheck.url.url_norm(url, encoding=encoding)[0]
linkcheck/url.py:334: in url_norm
    urlparts[3] = url_parse_query(urlparts[3], encoding=encoding)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

query = b'', encoding = None

    def url_parse_query (query, encoding=None):
        """Parse and re-join the given CGI query."""
        if isinstance(query, str_text):
            if encoding is None:
                encoding = url_encoding
            query = query.encode(encoding, 'ignore')
        # if ? is in the query, split it off, seen at msdn.microsoft.com
        append = ""
>       while '?' in query:
E       TypeError: a bytes-like object is required, not 'str'

linkcheck/url.py:278: TypeError

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand removing the section above:

--- a/linkcheck/url.py
+++ b/linkcheck/url.py
@@ -269,10 +269,6 @@ def url_fix_wayback_query(path):
 
 def url_parse_query (query, encoding=None):
     """Parse and re-join the given CGI query."""
-    if isinstance(query, str_text):
-        if encoding is None:
-            encoding = url_encoding
-        query = query.encode(encoding, 'ignore')
     # if ? is in the query, split it off, seen at msdn.microsoft.com
     append = ""
     query = decode_for_unquote(query)

and the python3 branch passes tests for 2.7 and 3.6.
Although there is a warning with 2.7:

linkcheck/checker/ftpurl.py:140
  /var/tmp/portage/net-analyzer/linkchecker-9999-r100/work/linkchecker-9999/linkcheck/checker/ftpurl.py:140: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    if self.filename in files:

I'm tempted to ignore that though and just press on bearing in mind we will go Python 3 only at the end of this.

Copy link
Copy Markdown
Contributor Author

@cjmayo cjmayo Apr 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although there is a warning with 2.7:

linkcheck/checker/ftpurl.py:140
  /var/tmp/portage/net-analyzer/linkchecker-9999-r100/work/linkchecker-9999/linkcheck/checker/ftpurl.py:140: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    if self.filename in files:

Actually looks like that is nothing to do with this, but a consequence of enabling the FTP test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to ignore that though and just press on bearing in mind we will go Python 3 only at the end of this.

I'm tempted to do that as well since I cannot muster the energy to analyze all the sources to figure out what's the right thing to do....

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Apr 28, 2019

Rebased on master and add a commit removing the encoding which I will squash in if agreed.

@yarikoptic
Copy link
Copy Markdown
Contributor

FWIW testing this merged into rebased (on master) version of #194 which adds tests of anchors (merge tagged/pushed to my fork https://github.com/yarikoptic/linkchecker/releases/tag/9.4.0.anchorfix2%2Bpy3fix if of need to reproduce), and getting

(git)hopa:~/proj/misc/linkchecker[undefined]git
$> python -m pytest -s -v --tb=short --pdb tests/checker/test_anchor.py
======================================= test session starts ========================================
platform linux -- Python 3.7.3rc1, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /home/yoh/proj/misc/linkchecker/venvs/dev3/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/yoh/proj/misc/linkchecker/.hypothesis/examples')
rootdir: /home/yoh/proj/misc/linkchecker, inifile:
plugins: localserver-0.5.0, hypothesis-3.71.11
collected 2 items                                                                                  

tests/checker/test_anchor.py::TestAnchor::test_anchor FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tests/checker/test_anchor.py:31: in test_anchor
    nurl = self.norm(url)
tests/checker/__init__.py:191: in norm
    return linkcheck.url.url_norm(url, encoding=encoding)[0]
linkcheck/url.py:353: in url_norm
    if url.endswith('#') and not urlparts[4]:
E   TypeError: endswith first arg must be bytes or a tuple of bytes, not str
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /home/yoh/proj/misc/linkchecker/linkcheck/url.py(353)url_norm()
-> if url.endswith('#') and not urlparts[4]:
(Pdb) p url
b'file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchor.html'
(Pdb) up
> /home/yoh/proj/misc/linkchecker/tests/checker/__init__.py(191)norm()
-> return linkcheck.url.url_norm(url, encoding=encoding)[0]
(Pdb) p url
'file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchor.html'

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented May 10, 2019

if url.endswith('#') and not urlparts[4]:

E TypeError: endswith first arg must be bytes or a tuple of bytes, not str

I think you're testing:
https://github.com/yarikoptic/linkchecker/blob/7a231adb713a2a01f1ff0822a71c747ab85870a0/linkcheck/url.py#L353

Despite the name none of these python3_nn branches work on Python 3. For this specific error I guess the commit "Python3: fix url.endswith in url.py" 1528b4c is needed.

I test Python 3 against the htmlparser-beautifulsoup branch in #249. Sounds like something worth pursuing with the anchors. But beware this branch does get rebased.

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented May 10, 2019

OK. Squashed into one commit for merging.

@mgedmin mgedmin mentioned this pull request May 24, 2019
@cjmayo cjmayo mentioned this pull request Jul 16, 2019
Copy link
Copy Markdown

@SchoolGuy SchoolGuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is a good fix for having internally the Py2 Byte-Strings and then converting to Unicode when needed. If we have a stable Py3 version I would suggest that we use the new Py3 str implementation as much as possible.

@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Jul 18, 2019

@SchoolGuy did you test the new code at all? @yarikoptic had problems with it... @yarikoptic do you think your errors are related to #194 or inherent to this PR?

@SchoolGuy
Copy link
Copy Markdown

@anarcat I wasn't able to even get the C stuff to work with the provided docs. The tests are always failing because they are not able to find the C extensions

@yarikoptic
Copy link
Copy Markdown
Contributor

@anarcat : seems to be unrelated to #194 since fails in current head of this PR (click to expand)
(git)hopa:~/proj/misc/linkchecker[pull/origin/230/head]git
$> git describe
fatal: No annotated tags can describe 'a6643034fbb442452092353b8ffcb8ae77cf8279'.
However, there were unannotated tags: try --tags.
(dev3) 1 11512 ->128 [1].....................................:Thu 18 Jul 2019 10:21:45 AM EDT:.
(git)hopa:~/proj/misc/linkchecker[pull/origin/230/head]git
$> git describe --tags
v9.4.0-132-ga6643034
(dev3) 1 11513 [1].....................................:Thu 18 Jul 2019 10:21:47 AM EDT:.
(git)hopa:~/proj/misc/linkchecker[pull/origin/230/head]git
$> python -m pytest -s -v --tb=short --pdb tests/checker/test_anchor.py
====================================================== test session starts ======================================================
platform linux -- Python 3.7.3rc1, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /home/yoh/proj/misc/linkchecker/venvs/dev3/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/yoh/proj/misc/linkchecker/.hypothesis/examples')
rootdir: /home/yoh/proj/misc/linkchecker, inifile:
plugins: localserver-0.5.0, hypothesis-3.71.11
collected 1 item                                                                                                                

tests/checker/test_anchor.py::TestAnchor::test_anchor FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tests/checker/test_anchor.py:31: in test_anchor
    nurl = self.norm(url)
tests/checker/__init__.py:191: in norm
    return linkcheck.url.url_norm(url, encoding=encoding)[0]
linkcheck/url.py:353: in url_norm
    if url.endswith('#') and not urlparts[4]:
E   TypeError: endswith first arg must be bytes or a tuple of bytes, not str
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /home/yoh/proj/misc/linkchecker/linkcheck/url.py(353)url_norm()
-> if url.endswith('#') and not urlparts[4]:
(Pdb) 

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Jul 18, 2019

@anarcat I wasn't able to even get the C stuff to work with the provided docs. The tests are always failing because they are not able to find the C extensions

I think the simplest way to test is to type tox.

I can't use that inside a Gentoo ebuild so have added the function:

python_test() {
	cp "${BUILD_DIR}"/lib/linkcheck/HtmlParser/htmlsax*.so linkcheck/HtmlParser/ || die
	cp "${BUILD_DIR}"/lib/linkcheck/network/_network*.so linkcheck/network/ || die
	PYTHONMALLOC=malloc pytest -n $(makeopts_jobs) tests || die
}

@cjmayo
Copy link
Copy Markdown
Contributor Author

cjmayo commented Jul 18, 2019

@anarcat : seems to be unrelated to #194 since fails in current head of this PR (click to expand)

$> python -m pytest -s -v --tb=short --pdb tests/checker/test_anchor.py
====================================================== test session starts ======================================================
platform linux -- Python 3.7.3rc1, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /home/yoh/proj/misc/linkchecker/venvs/dev3/bin/python

I don't expect any of the {python3_nn} PRs to pass the tests with Python 3. Some may pass specific tests - I guess the ones with tests named in the titles.

I have been using #210 and #249 for Python 3 testing, updating against master to check off these {python3_nn} ones - N.B. that does mean they get rebased.

@SchoolGuy
Copy link
Copy Markdown

@anarcat I wasn't able to even get the C stuff to work with the provided docs. The tests are always failing because they are not able to find the C extensions

I think the simplest way to test is to type tox.

I can't use that inside a Gentoo ebuild so have added the function:

python_test() {
	cp "${BUILD_DIR}"/lib/linkcheck/HtmlParser/htmlsax*.so linkcheck/HtmlParser/ || die
	cp "${BUILD_DIR}"/lib/linkcheck/network/_network*.so linkcheck/network/ || die
	PYTHONMALLOC=malloc pytest -n $(makeopts_jobs) tests || die
}

Okay. Even though I find it wierd to use a testing tool to build the project I will try that tomorrow at work.

Copy link
Copy Markdown
Contributor

@mgedmin mgedmin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YOLO, let's merge this and see what happens.

@anarcat anarcat merged commit 59ab064 into linkchecker:master Sep 4, 2019
@anarcat
Copy link
Copy Markdown
Contributor

anarcat commented Sep 4, 2019

fire in the hole.

@cjmayo cjmayo deleted the python3_20 branch November 21, 2019 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants