If you pass a URL on the command line whose domain is valid but which returns an HTTP error code, lychee ignores the error code and uses the returned body anyway. In most cases, this body is a meaningless error page.
For example:
$ cargo run -- http://example.com/this-link-does-not-exist
1/1 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
✅ 0 OK 🚫 0 Errors 🔀 1 Redirects
$ curl -I http://example.com/this-link-does-not-exist
HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Content-Type: text/html
This can be very surprising if a web server switches its behaviour based on things like the user agent.
For example, Wikipedia returns a 403 when no user agent is set, and lychee doesn't set a user agent by default. As a result, lychee's behaviour can appear to diverge from other tools, and there is no warning that this is happening.
$ cargo run -- https://en.wikipedia.org/wiki/Tony_Hoare
0/0 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors
$ curl -I https://en.wikipedia.org/wiki/Tony_Hoare
HTTP/2 200
date: Fri, 24 Oct 2025 08:53:25 GMT
server: ATS/9.2.11
content-type: text/html; charset=UTF-8
content-length: 235766
$ curl https://en.wikipedia.org/wiki/Tony_Hoare | cargo run -- - --offline
169/169 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 169 Total (in 0s) ✅ 0 OK 🚫 0 Errors 👻 169 Excluded
With some debug printing, we can see the input content that lychee receives, which contains no links:
InputContent { source: RemoteUrl(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("en.wikipedia.org")), port: None, path: "/wiki/Tony_Hoare", query: None, fragment: None }), file_type: Html, content: "Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.\n" }
And finally, by passing an empty user agent to curl, we can see the same response:
$ curl -i --user-agent '' https://en.wikipedia.org/wiki/Tony_Hoare
HTTP/2 403
content-length: 92
content-type: text/plain
x-analytics:
x-request-id: 26087a9d-3c9b-4ab3-9d7b-0f2d251eea50
server: HAProxy
x-cache: cp5020 int
x-cache-status: int-tls
Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.
What lychee can do (and maybe should do) is tell the user if one of their inputs receives an HTTP 4xx code. Open questions:
- Should it be a warning or an error?
- What if the user legitimately wants to check links on their 404 page? That should still be possible.
- Should it go through the status code --accept logic? What if you want different accept codes for link checking vs. user input links?
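To make the proposal concrete, here is a minimal sketch of what such a check could look like. It is not lychee's actual code: `InputStatus`, `check_input_status`, and the idea of a separate accepted-codes set for inputs are all hypothetical names and assumptions made for illustration. It treats a 4xx response on an input as a warning unless the user has explicitly accepted that code, which would keep the "check links on my 404 page" case possible.

```rust
use std::collections::HashSet;

/// Hypothetical result of fetching a user-supplied input URL.
/// (Not a type from lychee; invented for this sketch.)
#[derive(Debug, PartialEq)]
enum InputStatus {
    /// The body can be used for link extraction without comment.
    Ok,
    /// The body is still used, but the user is told about the status code.
    Warn(String),
}

/// Decide whether an input's HTTP status deserves a warning.
/// `accepted` is an assumed set of status codes the user has opted into
/// for inputs (e.g. 404 when deliberately checking an error page).
fn check_input_status(url: &str, status: u16, accepted: &HashSet<u16>) -> InputStatus {
    let is_client_error = (400..500).contains(&status);
    if is_client_error && !accepted.contains(&status) {
        InputStatus::Warn(format!(
            "input {url} returned HTTP {status}; its body is probably an error page"
        ))
    } else {
        InputStatus::Ok
    }
}

fn main() {
    // No codes accepted for inputs: the Wikipedia 403 case produces a warning.
    let accepted: HashSet<u16> = HashSet::new();
    let result = check_input_status("https://en.wikipedia.org/wiki/Tony_Hoare", 403, &accepted);
    println!("{result:?}");
}
```

Whether the `Warn` case should instead be a hard error, and whether `accepted` should be shared with the existing --accept option or configured separately, are exactly the open questions above; the sketch only shows that the decision can be isolated in one small function.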