HTTP error codes in command-line inputs are ignored #1883

@katrinafyi

If you pass a URL on the command line that has a valid domain name but returns an HTTP error code, lychee ignores the error code and uses the returned body anyway. In most cases, this body is a meaningless error page.

For example:

$ cargo run --  http://example.com/this-link-does-not-exist   
  1/1 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links 
✅ 0 OK 🚫 0 Errors 🔀 1 Redirects

$ curl -I http://example.com/this-link-does-not-exist
HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Content-Type: text/html

This can be very surprising if a web server changes its behaviour based on things like the user agent.

For example, Wikipedia returns a 403 when no user agent is set, and lychee doesn't set a user agent by default. As a result, lychee's behaviour can appear to diverge from other tools, and there is no warning that this is happening.

$ cargo run --  https://en.wikipedia.org/wiki/Tony_Hoare 
  0/0 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors

$ curl -I https://en.wikipedia.org/wiki/Tony_Hoare     
HTTP/2 200 
date: Fri, 24 Oct 2025 08:53:25 GMT
server: ATS/9.2.11
content-type: text/html; charset=UTF-8
content-length: 235766

$ curl https://en.wikipedia.org/wiki/Tony_Hoare | cargo run --  -  --offline
  169/169 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 169 Total (in 0s) ✅ 0 OK 🚫 0 Errors 👻 169 Excluded

With some debug printing, we can see that the input content lychee receives contains no links:

InputContent { source: RemoteUrl(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("en.wikipedia.org")), port: None, path: "/wiki/Tony_Hoare", query: None, fragment: None }), file_type: Html, content: "Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.\n" }

Finally, by passing an empty user agent to curl, we can reproduce the same response:

$ curl -i --user-agent '' https://en.wikipedia.org/wiki/Tony_Hoare
HTTP/2 403 
content-length: 92
content-type: text/plain
x-analytics: 
x-request-id: 26087a9d-3c9b-4ab3-9d7b-0f2d251eea50
server: HAProxy
x-cache: cp5020 int
x-cache-status: int-tls

Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.

What lychee can do (and maybe should do) is tell the user when one of their inputs receives an HTTP 4xx code. Open questions:

  • Should it be a warning or an error?
  • What if the user legitimately wants to check links on their 404 page? That should still be possible.
  • Should it go through the status code --accept logic? What if you want different accept codes for link checking vs. user input links?
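
To make the open questions concrete, here is a minimal sketch of one possible policy. It is not lychee's actual code: `InputDecision`, `classify_input_status`, the `strict` flag, and the idea of a separate accept list for inputs are all hypothetical names and assumptions made for illustration.

```rust
use std::ops::RangeInclusive;

/// What to do with the body returned for a user-supplied input URL.
/// (Hypothetical type, not part of lychee.)
#[derive(Debug, PartialEq, Eq)]
enum InputDecision {
    /// Extract links from the body without comment.
    Use,
    /// Extract links, but tell the user which status code was returned.
    UseWithWarning(u16),
    /// Do not extract links; report the status code as an error.
    Reject(u16),
}

/// Decide how to treat an input response based on its HTTP status code.
///
/// Assumptions in this sketch:
/// - `accepted` is a list of status ranges the user explicitly allows for
///   inputs (kept separate from the accept list used for link checking).
/// - `strict` selects "error" instead of "warning" for everything else.
fn classify_input_status(
    status: u16,
    accepted: &[RangeInclusive<u16>],
    strict: bool,
) -> InputDecision {
    let is_success = (200..=299).contains(&status);
    let is_accepted = accepted.iter().any(|range| range.contains(&status));

    if is_success || is_accepted {
        InputDecision::Use
    } else if strict {
        InputDecision::Reject(status)
    } else {
        InputDecision::UseWithWarning(status)
    }
}
```

Under this sketch, the Wikipedia response above would yield `UseWithWarning(403)` with an empty accept list, while a user who wants to check the links on their 404 page could pass `404..=404` in the accept list and still get `Use`.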

Labels: bug (Something isn't working)