set user-agent when fetching remote input URLs#2099
Closed
mre wants to merge 2 commits into
Closed
Conversation
Previously, the reqwest::Client used by the UrlContentResolver (which fetches CLI URL inputs) was built without a user-agent. This meant that passing a URL directly as a CLI argument resulted in a request with no user-agent header, unlike links discovered during checking which correctly use the configured user-agent. This caused subtle bugs such as Wikipedia returning a 403 for CLI inputs (since it requires a user-agent), resulting in lychee finding zero links on pages that actually contain hundreds. Fix by storing the configured user-agent on the Collector and using it to build the resolver's reqwest::Client. Fixes #1886
Member
|
I think we're all agreeing to go your less conservative approach proposed in #2100. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a veeery conservative attempt at fixing #1886.
Previously, the
reqwest::Clientused by theUrlContentResolver(which fetches CLI URL inputs) was built without a user-agent. This meant that passing a URL directly as a CLI argument resulted in a request with no user-agent header, unlike links discovered during checking which correctly use the configured user-agent.This caused subtle bugs such as Wikipedia returning a 403 for CLI inputs (since it requires a user-agent), resulting in lychee finding zero links on pages that actually contain hundreds.
Fixed by storing the configured user-agent on the
Collectorand using it to build the resolver'sreqwest::Client.Let me say that I believe the
collectoris already doing way too much and that this is probably not the right approach going forward. The fact that thecollectoris responsible for building areqwest::Clientfor the resolver is a sign of tight coupling and a violation of separation of concerns. Thecollectorshould ideally only be responsible for collecting links, while the resolver should be responsible for fetching content.That said, I opened the PR to get some early feedback and perhaps to fix the immediate issue. I also wanted to avoid a larger refactor that would touch many parts of the codebase and potentially cause more disruption in the short term. (See below.)
As an alternative, I considered to unify the two code paths where we deal with "building request clients." Instead of the
Collectorusing its own barereqwest::Client, we can make it reuse the same fully-configuredlychee_lib::Client(with rate limiting, retries, per-host config, TLS settings, etc.) that's used for link checking. The original issue already notes that unifying these paths would also help with recursion support. The downside is it's a much larger refactor touching theClientAPI.But it's probably still worth it? I can create another PR for that and we can compare.
I'm a bit hesitant to do it because large PRs have caused some churn in the past and it took a while to get them merged. But I'm willing to do it if we think it's the right move. I just don't want to cause too much disruption in the short term.
Let me know what you think!
@katrinafyi @thomas-zahner @cristiklein feedback welcome.