-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Describe the bug
Running make linkcheck on a document that contains an external link to a website may report the link is broken when a web browser may successfully open the link. Specifically, if the website closes its connection when receiving the HTTP HEAD request method, then linkcheck.py will receive a ConnectionError exception, which bypasses the logic that would otherwise have it make an HTTP GET request.
A specific example of a website exhibiting this behaviour is the US Patent and Trademark Office
To Reproduce
Steps to reproduce the behaviour:
$ sphinx-quickstart # accept all the default options
$ echo '\n\nThis is `a link to the US Patent Website <https://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=7840660&OS=7840660&RS=7840660>`_.\n' >> index.rst
$ make html linkcheck - Observe linkcheck reporting a broken link:
Running Sphinx v4.0.2
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [linkcheck]: targets for 1 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
preparing documents... done
writing output... [100%] index
( index: line 22) broken https://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=7840660&OS=7840660&RS=7840660 - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
build finished with problems.
make: *** [linkcheck] Error 1
- Open
_build/html/index.htmlin your browser - Click "a link to the US Patent Website"
- Observe the link opening and rendering normally
Expected behavior
If a link is valid, and the website is returning valid content for an HTTP GET request, make linkcheck should not report the link as broken.
Internally, in sphinx/builders/linkcheck.py, if a call to requests.head() raises a requests.exceptions.ConnectionError exception, it should attempt a requests.get() just like it does with HTTPError and TooManyredirects.
Your project
sphinx-bug-linkcheck-reports-broken-link.zip
Environment info
- OS:
macOS 11.3.1(but this does not appear to be OS-dependent) - Python version:
3.8.10and3.9.5 - Sphinx version:
v3.5.4andv4.0.2 - Sphinx extensions: none
- Extra tools: any common web browser
Additional context
Using curl, we can see that this particular website closes connections when receiving HTTP HEAD:
$ curl -v -L -A "Sphinx/1.0" "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9,942,040.PN.&OS=PN/9,942,040&RS=PN/9,942,040"- Observe that this returns the expected HTML content
curl --head -v -L -A "Sphinx/1.0" "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9,942,040.PN.&OS=PN/9,942,040&RS=PN/9,942,040"- Observe that this fails to return any content from the server