Do include anchors within cache_url, mutex around the parsing #194
yarikoptic wants to merge 6 commits into linkchecker:master
Conversation
anarcat left a comment:
This looks good to me, as long as tests are adjusted, naturally.
Thanks @anarcat, while falling asleep I was still wondering about a more proper solution -- urls' content should indeed be cached per page, since otherwise with this PR there might be a severe performance/bandwidth hit for pages with multiple anchored urls to the same external page... so I would like to look into perhaps adding something like "full_url" (what I made cache_url now) and using it instead of cache_url in the places where it matters... not sure yet when I will get to it, so if anyone does it instead of me -- I would not mind. Otherwise I will come back some time ;-)
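The proposed split can be sketched as follows. This is a hypothetical illustration, not linkchecker's actual class: the names `full_url`/`cache_url` follow the comment above, while `UrlInfo` and its shape are assumptions.

```python
from urllib.parse import urldefrag


class UrlInfo:
    """Hypothetical sketch: keep the anchor out of the cache key."""

    def __init__(self, url):
        # full_url keeps the fragment so plugins like AnchorCheck still see it
        self.full_url = url
        # cache_url drops the fragment so one page is fetched only once
        self.cache_url, self.anchor = urldefrag(url)


a = UrlInfo("https://example.com/page#sec1")
b = UrlInfo("https://example.com/page#sec2")
# both urls share one cache entry but retain their own anchors
assert a.cache_url == b.cache_url
assert a.anchor != b.anchor
```

With this split, the cache stays keyed per page while anchor-aware plugins read `full_url`.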
```python
if self.anchor:
    # and bring anchor (now to full url) back since otherwise those
    # urls would not be considered e.g. by AnchorCheck plugins
    self.cache_url += '#%s' % self.anchor
```
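A minimal illustration of the cache-invalidation concern raised here: once the anchor is folded into `cache_url`, each anchored url gets its own cache key, so the same physical page is fetched repeatedly. The dict-based cache is a stand-in for linkchecker's real cache, not its actual API.

```python
# Stand-in cache keyed by cache_url, as in the patch above
cache = {}
downloads = 0


def fetch(cache_url):
    """Simulate downloading a page once per distinct cache key."""
    global downloads
    if cache_url not in cache:
        downloads += 1  # a real download would happen here
        cache[cache_url] = "content of " + cache_url.split("#")[0]
    return cache[cache_url]


# With the anchor folded into cache_url, one page yields two cache keys
fetch("https://example.com/page#a")
fetch("https://example.com/page#b")
assert downloads == 2  # two downloads for a single physical page
```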
Hmm... I guess I didn't realize this would invalidate the cache... we should definitely cache the page only once per anchor.
Maybe the problem is in the AnchorCheck plugin?
I believe it doesn't even see some of those, as the analysis by others in #179 showed.
I think this is close, and only needed if using AnchorCheck.
That leads to fixing up the anchors analysis, and probably other issues such as the floating number of found urls, etc.
BTW, just pushed a few commits with an extended test for anchors. Apparently, even with my crude and cruel cache_url-based approach, there is one outstanding issue in threaded mode -- it fails to properly treat the future id in some (but not all) runs. I left that last commit unfinished as far as adjusting the diff goes -- it needs to disregard the order of the output lines (I wish captured records were not just output lines), since that order is not guaranteed ATM in threaded execution.
linkcheck/parser/__init__.py (outdated)

```python
import threading
parse_mutex = threading.Lock()
```
This was extracted into #198, so it could have been merged without waiting for this one. I will rebase this one whenever that one is merged.
Just a safety measure, not yet proven to be required, but it makes sense overall.
Otherwise, depending on the order of url arrival etc., many of the anchors (urls with anchors) would not even be passed to the AnchorCheck plugin.
This file should be used to test that all bad references are detected, which even as of the current v9.4.0-38-gc9cbc276 (which I thought fixed all issues) is not complete: at times there is a false positive for use of an anchor ahead of the definition of the element (i.e. here, anchor1). TODO: figure out how to feed linkchecker this entire file instead of a single url for checking, as was done in test_anchor.
Force-pushed from 551041a to e4371d9.
Now this PR sits on top of #198, dedicated to the mutex, with TODOs adjusted on top to note the necessity of avoiding fetching the same page twice.
I've created #460 for the anchor checking issue. Hopefully the threading problems are no longer relevant because we are using BeautifulSoup now. I did try a version of the test in the original message and it seems OK. Thanks!
Closes #179

Please see details of the "investigation" in wummel/linkchecker#557 (comment)

TODO: add a full_url record (in addition) instead of populating the anchor into cache_url, which makes it download the same page multiple times.

Printouts of failed runs, and here is my command to run the test until failure: