Do include anchors within cache_url, mutex around the parsing #194
yarikoptic wants to merge 6 commits into linkchecker:master
Conversation
anarcat left a comment:
This looks good to me, as long as tests are adjusted, naturally.
Thanks @anarcat, while falling asleep I was still wondering about a more proper solution -- urls' content should indeed be cached per page, since otherwise with this PR there might be a severe performance/bandwidth hit for pages with multiple anchored urls to the same external page... so I would like to look into perhaps adding something like "full_url" (what I made cache_url now) and using it instead of cache_url in the places where it matters... not sure yet when I will get to it, so if anyone does it instead of me -- I would not mind. Otherwise I will come back some time ;-)
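The proposed split can be sketched as follows. This is a hypothetical illustration, not linkchecker's actual class: the names `full_url`/`cache_url` follow the comment above, while `UrlInfo` and its shape are assumptions.

```python
from urllib.parse import urldefrag


class UrlInfo:
    """Hypothetical sketch: keep the anchor out of the cache key."""

    def __init__(self, url):
        # full_url keeps the fragment so plugins like AnchorCheck still see it
        self.full_url = url
        # cache_url drops the fragment so one page is fetched only once
        self.cache_url, self.anchor = urldefrag(url)


a = UrlInfo("https://example.com/page#sec1")
b = UrlInfo("https://example.com/page#sec2")
# both urls share one cache entry but retain their own anchors
assert a.cache_url == b.cache_url
assert a.anchor != b.anchor
```

With this split, the cache stays keyed per page while anchor-aware plugins read `full_url`.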
```python
if self.anchor:
    # and bring anchor (now to full url) back since otherwise those
    # urls would not be considered e.g. by AnchorCheck plugins
    self.cache_url += '#%s' % self.anchor
```
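A minimal illustration of the cache-invalidation concern raised here: once the anchor is folded into `cache_url`, each anchored url gets its own cache key, so the same physical page is fetched repeatedly. The dict-based cache is a stand-in for linkchecker's real cache, not its actual API.

```python
# Stand-in cache keyed by cache_url, as in the patch above
cache = {}
downloads = 0


def fetch(cache_url):
    """Simulate downloading a page once per distinct cache key."""
    global downloads
    if cache_url not in cache:
        downloads += 1  # a real download would happen here
        cache[cache_url] = "content of " + cache_url.split("#")[0]
    return cache[cache_url]


# With the anchor folded into cache_url, one page yields two cache keys
fetch("https://example.com/page#a")
fetch("https://example.com/page#b")
assert downloads == 2  # two downloads for a single physical page
```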
Hmm... I guess I didn't realize this would invalidate the cache... we should definitely cache the page only once per anchor.
Maybe the problem is in the AnchorCheck plugin?
I believe it doesn't even see some of those, as the analysis by others in #179 showed.
I think this is close, and only needed if using AnchorCheck.
That leads to fixing up the anchors analysis, and probably other issues such as the floating number of found urls, etc.
BTW, just pushed a few commits with an extended test for anchors. Apparently, even with my crude and cruel cache_url-based approach, there is one outstanding issue in threaded mode -- it fails to properly treat the future id in some (but not all) runs. I left that last commit unfinished as far as adjusting the diff goes -- it needs to disregard the order of the output lines (I wish captured records were not just output lines), since that order is not guaranteed ATM in threaded execution.
linkcheck/parser/__init__.py (outdated)

```python
import threading
parse_mutex = threading.Lock()
```
This was extracted into #198, so it could have been merged without waiting for this one. I will rebase this one whenever that one is merged.
Just a safety measure, not yet proven to be required, but it makes sense overall.
Otherwise, depending on the order of url arrival etc., many of the anchors (urls with anchors) would not even be passed to the AnchorCheck plugin.
This file should be used to test that all bad references are detected, which even as of the current v9.4.0-38-gc9cbc276 (which I thought fixed all issues) is not complete: at times there is a false positive for use of an anchor ahead of the definition of the element (i.e. here, anchor1). TODO: figure out how to feed linkchecker this entire file instead of a single url for checking, as was done in test_anchor.
Force-pushed from 551041a to e4371d9.
Now this PR sits on top of #198, dedicated to the mutex, with TODOs adjusted on top to note the necessity of avoiding fetching the same page twice.
I've created #460 for the anchor checking issue. Hopefully the threading problems are no longer relevant because we are using BeautifulSoup now. I did try a version of the test in the original message and it seems OK. Thanks!
Closes #179

Please see details of the "investigation" in wummel/linkchecker#557 (comment)

TODO: add a full_url record (in addition) instead of populating the anchor into cache_url, which makes it download the same page multiple times.

Printouts of failed runs, and here is my command to run the test until failure: