Skip to content

Do include anchors within cache_url, mutex around the parsing#194

Closed
yarikoptic wants to merge 6 commits intolinkchecker:masterfrom
yarikoptic:bf-anchors
Closed

Do include anchors within cache_url, mutex around the parsing#194
yarikoptic wants to merge 6 commits intolinkchecker:masterfrom
yarikoptic:bf-anchors

Conversation

@yarikoptic
Copy link
Copy Markdown
Contributor

@yarikoptic yarikoptic commented Oct 31, 2018

Closes #179

Please see details of the "investigation" in wummel/linkchecker#557 (comment)

  • consider another approach (may be adding full_url record in addition) instead of populating anchor into cache_url which makes it to download them same page multiple times
  • introduced test demonstrates an outstanding problem in threaded mode that in some runs anchor pointing to an element defined later in the document would be considered broken (I guess because parsing of the page is not finished yet whenever anchor is verified). It actually can manifest it in two ways:
printouts of failed runs
tests/checker/__init__.py:232: in fail_unicode
    self.fail(msg)
E   AssertionError: Differences found testing file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html (records might switch order which is ok)
E   @@ -2,7 +2,7 @@
E    cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#bad1
E    real url file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html
E    name first local bad one
E   -warning Anchor `bad1' not found. Available anchors: `anchor1'.
E   +warning Anchor `bad1' not found. Available anchors: -.
E    valid
E    url #bad2
E    cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#bad2

and

E   AssertionError: Differences found testing file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html (records might switch order which is ok)
E   @@ -1,8 +1,8 @@
E   -url #bad1
E   -cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#bad1
E   +url #anchor1
E   +cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#anchor1
E    real url file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html
E   -name first local bad one
E   -warning Anchor `bad1' not found. Available anchors: `anchor1'.
E   +name good anchor used before actual element is defined
E   +warning Anchor `anchor1' not found. Available anchors: -.
E    valid
E    url #bad2
E    cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#bad2
E   @@ -10,6 +10,12 @@
E    name second local bad one
E    warning Anchor `bad2' not found. Available anchors: `anchor1'.
E    valid
E   +url #bad1
E   +cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#bad1
E   +real url file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html
E   +name first local bad one
E   +warning Anchor `bad1' not found. Available anchors: `anchor1'.
E   +valid
E    url anchors.html#bad3
E    cache key file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html#bad3
E    real url file:///home/yoh/proj/misc/linkchecker/tests/checker/data/anchors.html

here is my command to run the test until failure:

( set -e; for f in {1..100}; do echo $f; python -m pytest -s -v tests/checker/test_anchor.py::TestAnchor::test_anchors; done; )

Copy link
Copy Markdown
Contributor

@anarcat anarcat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, as long as tests are adjusted, naturally.

@yarikoptic
Copy link
Copy Markdown
Contributor Author

Thanks @anarcat ,

while falling asleep I had still wondered about a more proper solution -- urls content should indeed be cached per page otherwise with this PR there might be sever performance/bandwidth hit for pages with multiple anchored urls to the same external page... so I would like to look either may be adding smth like "full_url" (what I made cache_url now) and using it instead of cache_url in the places where it matters... not sure yet when I get to it yet, so if anyone does it instead of me -- I would not mind. Otherwise I will come back some time ;-)

if self.anchor:
# and bring anchor (now to full url) back since otherwise those
# urls would not be considered e.g. by AnchorCheck plugins
self.cache_url += '#%s' % self.anchor
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... i guess i didn't realize this would invalidate the cache... we should definitely cache the page only once per anchor.

maybe the problem is in the AnchorCheck plugin?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it doesn't even see some of those as an analysis in the #179 by others showed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is close, and only needed if using AnchorCheck.

That leads to fix up of anchors analysis and probably other issues
such as floating number of found urls etc
@yarikoptic yarikoptic mentioned this pull request Nov 6, 2018
@yarikoptic
Copy link
Copy Markdown
Contributor Author

BTW, just pushed few commits with an extended test for anchors. Apparently even with my crude and cruel cache_url based one, there is one outstanding issue in the threaded mode -- it fails to properly treat to the future id in some (but not all) runs. Left that last commit not finished for adjusting the diff -- needed to disregard the order in outputs (I wish captured records were not just output lines) since that one is not guaranteed ATM in threaded execution



import threading
parse_mutex = threading.Lock()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was extracted into #198 so it could have been merged without waiting for this one. I will rebase this one whenever that one is merged

Just a safety measure, not yet proven to be required but overall
makes sense
Otherwise, depending on the order urls arrival etc many of the anchors
(urls with anchors) would not be even passed to the AnchorCheck plugin.
This file should be used to test for all bad references to be detected,
which is even as of current v9.4.0-38-gc9cbc276 which I thought fixed all issues,
is not complete:  at times there is a false positive for use of an anchor ahead
of the definition of the element (i.e. here the anchor1.

TODO: figure out how to feed linkchecker this entire file instead of a single
url for checking as it was done in the test_anchor
@yarikoptic
Copy link
Copy Markdown
Contributor Author

now this PR sits on top of #198 dedicated to the mutex, adjusting TODOs on top to add necessity to avoid fetching the same page twice

@cjmayo
Copy link
Copy Markdown
Contributor

cjmayo commented Jul 27, 2020

I've created #460 for the anchor checking issue.

Hopefully the threading problems are no longer relevant because we are using BeautifulSoup now. I did try a version of the test in the orignal message and seems OK. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

link anchors checks broken

3 participants