Linkcheck Reports Broken Link when Remote Server Closes on HEAD Request #9306

@jamathews

Describe the bug
Running make linkcheck on a document that contains an external link may report the link as broken even though a web browser opens it successfully. Specifically, if the website closes its connection on receiving an HTTP HEAD request, linkcheck.py receives a ConnectionError exception, which bypasses the logic that would otherwise fall back to an HTTP GET request.

A specific example of a website exhibiting this behaviour is the US Patent and Trademark Office.

To Reproduce
Steps to reproduce the behaviour:

$ sphinx-quickstart  # accept all the default options
$ echo '\n\nThis is `a link to the US Patent Website <https://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=7840660&OS=7840660&RS=7840660>`_.\n' >> index.rst
$ make html linkcheck 
  • Observe linkcheck reporting a broken link:
Running Sphinx v4.0.2
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [linkcheck]: targets for 1 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
preparing documents... done
writing output... [100%] index

(           index: line   22) broken    https://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=7840660&OS=7840660&RS=7840660 - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
build finished with problems.
make: *** [linkcheck] Error 1
  • Open _build/html/index.html in your browser
  • Click "a link to the US Patent Website"
  • Observe the link opening and rendering normally

Expected behavior
If a link is valid, and the website is returning valid content for an HTTP GET request, make linkcheck should not report the link as broken.

Internally, in sphinx/builders/linkcheck.py, if a call to requests.head() raises a requests.exceptions.ConnectionError exception, it should attempt a requests.get(), just as it does for HTTPError and TooManyRedirects.
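The fallback described above could look roughly like the sketch below. This is not the actual linkcheck.py code: the function name check_with_fallback, its parameters, and the FakeConnectionError stand-in are hypothetical, and the HEAD and GET calls are passed in as plain callables so the example runs without network access or third-party packages.

```python
from typing import Callable, Tuple, Type


def check_with_fallback(
    url: str,
    head: Callable[[str], object],
    get: Callable[[str], object],
    retry_on: Tuple[Type[BaseException], ...],
) -> object:
    """Try a HEAD-style request first; fall back to GET on selected errors.

    Hypothetical helper for illustration only, not Sphinx's implementation.
    """
    try:
        return head(url)
    except retry_on:
        # Some servers reject or drop HEAD requests but answer GET normally.
        return get(url)


class FakeConnectionError(Exception):
    """Stand-in for a dropped connection (illustrative, not a real library class)."""


def dropping_head(url: str) -> object:
    # Simulates a server that closes the connection on HEAD.
    raise FakeConnectionError("Remote end closed connection without response")


def working_get(url: str) -> object:
    # Simulates the same server answering GET successfully.
    return "200 OK"


result = check_with_fallback(
    "https://example.invalid/",
    dropping_head,
    working_get,
    retry_on=(FakeConnectionError,),
)
print(result)
```

With requests, one would presumably pass requests.head and requests.get as the callables and include requests.exceptions.ConnectionError in retry_on alongside HTTPError and TooManyRedirects. Exceptions not listed in retry_on still propagate, so genuinely unexpected failures are not masked.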

Your project
sphinx-bug-linkcheck-reports-broken-link.zip

Environment info

  • OS: macOS 11.3.1 (but this does not appear to be OS-dependent)
  • Python version: 3.8.10 and 3.9.5
  • Sphinx version: v3.5.4 and v4.0.2
  • Sphinx extensions: none
  • Extra tools: any common web browser

Additional context
Using curl, we can see that this particular website closes connections when receiving HTTP HEAD:

$ curl -v -L -A "Sphinx/1.0" "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9,942,040.PN.&OS=PN/9,942,040&RS=PN/9,942,040"
  • Observe that this returns the expected HTML content
$ curl --head -v -L -A "Sphinx/1.0" "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=9,942,040.PN.&OS=PN/9,942,040&RS=PN/9,942,040"
  • Observe that this fails to return any content from the server
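The same server behaviour can be imitated locally without relying on the remote site. The sketch below uses only the Python standard library rather than curl or requests; the handler name DropHeadHandler and the probe helper are hypothetical, and the local server is only an approximation of what the remote server appears to do.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DropHeadHandler(BaseHTTPRequestHandler):
    """Local test server: answers GET, drops the connection on HEAD."""

    def do_HEAD(self):
        # Send nothing at all; the connection is closed once this returns.
        self.close_connection = True

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo output quiet.
        pass


def probe(method: str, port: int):
    """Return the HTTP status, or the exception class name if the peer hangs up."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, "/")
        return conn.getresponse().status
    except ConnectionError as exc:
        # http.client.RemoteDisconnected is a subclass of ConnectionError.
        return type(exc).__name__
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), DropHeadHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

head_result = probe("HEAD", port)
get_result = probe("GET", port)
print("HEAD:", head_result)
print("GET:", get_result)

server.shutdown()
server.server_close()
```

GET returns a normal status while HEAD surfaces a connection-level exception, which matches the difference between the two curl commands above.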
