Skip to content

set user-agent when fetching remote input URLs#2099

Closed
mre wants to merge 2 commits into
masterfrom
fix/user-agent-cli-inputs
Closed

set user-agent when fetching remote input URLs#2099
mre wants to merge 2 commits into
masterfrom
fix/user-agent-cli-inputs

Conversation

@mre

@mre mre commented Mar 25, 2026

Copy link
Copy Markdown
Member

This is a veeery conservative attempt at fixing #1886.

Previously, the reqwest::Client used by the UrlContentResolver (which fetches CLI URL inputs) was built without a user-agent. This meant that passing a URL directly as a CLI argument resulted in a request with no user-agent header, unlike links discovered during checking which correctly use the configured user-agent.

This caused subtle bugs such as Wikipedia returning a 403 for CLI inputs (since it requires a user-agent), resulting in lychee finding zero links on pages that actually contain hundreds.

Fixed by storing the configured user-agent on the Collector and using it to build the resolver's reqwest::Client.

Let me say that I believe the collector is already doing way too much and that this is probably not the right approach going forward. The fact that the collector is responsible for building a reqwest::Client for the resolver is a sign of tight coupling and a violation of separation of concerns. The collector should ideally only be responsible for collecting links, while the resolver should be responsible for fetching content.

That said, I opened the PR to get some early feedback and perhaps to fix the immediate issue. I also wanted to avoid a larger refactor that would touch many parts of the codebase and potentially cause more disruption in the short term. (See below.)

As an alternative, I considered to unify the two code paths where we deal with "building request clients." Instead of the Collector using its own bare reqwest::Client, we can make it reuse the same fully-configured lychee_lib::Client (with rate limiting, retries, per-host config, TLS settings, etc.) that's used for link checking. The original issue already notes that unifying these paths would also help with recursion support. The downside is it's a much larger refactor touching the Client API.
But it's probably still worth it? I can create another PR for that and we can compare.
I'm a bit hesitant to do it because large PRs have caused some churn in the past and it took a while to get them merged. But I'm willing to do it if we think it's the right move. I just don't want to cause too much disruption in the short term.

Let me know what you think!

@katrinafyi @thomas-zahner @cristiklein feedback welcome.

mre added 2 commits March 25, 2026 00:57
Previously, the reqwest::Client used by the UrlContentResolver
(which fetches CLI URL inputs) was built without a user-agent.
This meant that passing a URL directly as a CLI argument resulted
in a request with no user-agent header, unlike links discovered
during checking which correctly use the configured user-agent.

This caused subtle bugs such as Wikipedia returning a 403 for
CLI inputs (since it requires a user-agent), resulting in lychee
finding zero links on pages that actually contain hundreds.

Fix by storing the configured user-agent on the Collector and
using it to build the resolver's reqwest::Client.

Fixes #1886
@thomas-zahner

Copy link
Copy Markdown
Member

I think we're all agreeing to go your less conservative approach proposed in #2100.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants