HTTP error codes in command-line inputs are ignored #1883

If you pass a URL on the command line that has a valid domain name but returns an HTTP error code, lychee ignores the error code and uses the returned body anyway. In most cases, this body is a meaningless error page.

For example:
```console
$ cargo run -- http://example.com/this-link-does-not-exist
  1/1 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links 
✅ 0 OK 🚫 0 Errors 🔀 1 Redirects

$ curl -I http://example.com/this-link-does-not-exist
HTTP/1.1 404 Not Found
Accept-Ranges: bytes
Content-Type: text/html
```

This can be very surprising if a web server changes its behaviour based on things like the user agent.

For example, Wikipedia returns a 403 when no user agent is set, and lychee doesn't set a user agent by default. As a result, lychee's behaviour can _appear_ to diverge from other tools, with no warning that this is happening:
```console
$ cargo run -- https://en.wikipedia.org/wiki/Tony_Hoare
  0/0 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors

$ curl -I https://en.wikipedia.org/wiki/Tony_Hoare     
HTTP/2 200 
date: Fri, 24 Oct 2025 08:53:25 GMT
server: ATS/9.2.11
content-type: text/html; charset=UTF-8
content-length: 235766

$ curl https://en.wikipedia.org/wiki/Tony_Hoare | cargo run -- - --offline
  169/169 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 169 Total (in 0s) ✅ 0 OK 🚫 0 Errors 👻 169 Excluded
```
With some debug printing, we can see that the input content lychee receives contains no links:
```
InputContent { source: RemoteUrl(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("en.wikipedia.org")), port: None, path: "/wiki/Tony_Hoare", query: None, fragment: None }), file_type: Html, content: "Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.\n" }
```
And finally, by passing an empty user agent to curl, we can see the same response:
```console
$ curl -i --user-agent '' https://en.wikipedia.org/wiki/Tony_Hoare
HTTP/2 403 
content-length: 92
content-type: text/plain
x-analytics: 
x-request-id: 26087a9d-3c9b-4ab3-9d7b-0f2d251eea50
server: HAProxy
x-cache: cp5020 int
x-cache-status: int-tls

Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.
```

What lychee can do (and maybe should do) is tell the user when one of their inputs receives an HTTP 4xx status code. Open questions:
- Should it be a warning or an error?
- What if the user legitimately wants to check links on their 404 page? That should still be possible.
- Should it go through the status code `--accept` logic? What if you want different accept codes for checked links vs. input URLs?
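To make the trade-offs above concrete, here is a minimal sketch of one possible policy. It is not lychee's actual code: the `InputCheck` enum, the `check_input_status` function, and the idea of a separate accept set for inputs are all hypothetical names and assumptions made for illustration. The sketch treats a non-2xx input as an error unless the user has explicitly accepted that status, in which case the body is still used but a warning is produced.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of fetching a URL given as a command-line input.
/// (Illustrative only; lychee has no such type.)
#[derive(Debug, PartialEq)]
enum InputCheck {
    /// 2xx response: extract links from the body as usual.
    Proceed,
    /// Non-2xx response that the user explicitly accepted:
    /// still extract links, but surface a warning.
    ProceedWithWarning(String),
    /// Non-2xx response that was not accepted: report an error
    /// instead of silently treating the error page as input.
    Reject(String),
}

/// Decide what to do with an input URL based on its HTTP status.
/// `input_accepted` is an assumed, input-specific accept set, kept
/// separate from the accept codes used for the links being checked.
fn check_input_status(url: &str, status: u16, input_accepted: &HashSet<u16>) -> InputCheck {
    if (200..=299).contains(&status) {
        return InputCheck::Proceed;
    }
    let message = format!("input {} returned HTTP {}", url, status);
    if input_accepted.contains(&status) {
        InputCheck::ProceedWithWarning(message)
    } else {
        InputCheck::Reject(message)
    }
}
```

Under this sketch, someone who wants to check the links on their own 404 page would add 404 to the input accept set and get a warning rather than an error, while the Wikipedia 403 above would be rejected with a visible message. Whether the input accept set should be shared with `--accept` is exactly the open question in the last bullet.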
