Skip to content

Remove deduplication of Request after source tracking#2075

Merged
thomas-zahner merged 6 commits into
lycheeverse:masterfrom
rina-forks:remove-hashset
Mar 5, 2026
Merged

Remove deduplication of Request after source tracking#2075
thomas-zahner merged 6 commits into
lycheeverse:masterfrom
rina-forks:remove-hashset

Conversation

@katrinafyi

Copy link
Copy Markdown
Member

After source tracking, adding a Request to a HashSet is ineffective because it contains a unique location for each request.

As a side-effect, removing the deduplication means URLs will be sent to the stream in file order, which is nice for consistency.

After source tracking, adding a Request to a HashSet is ineffective
because it contains a unique location for each request.

As a side-effect, removing the deduplication means URLs will be sent to
the stream in file order, which is nice for consistency.
@thomas-zahner

thomas-zahner commented Mar 5, 2026

Copy link
Copy Markdown
Member

Thank you. This reduces unnecessary complexity for a bit.

Basically this duplication was useless as of now (it might have been useful in the past) because a Request contains data like span and source. The "real" deduplication happens at the lower level in host.rs, independently of span, source and other details, irrelevant to performing actual link checks efficiently.

Comment thread lychee-lib/src/utils/request.rs Outdated
/// request errors are not deduplicated.
/// Create requests out of the collected URLs. Returns a vector of valid URLs
/// and errors. URLs are not deduplicated because repeated URLs may occur at
/// different source locations.

@thomas-zahner thomas-zahner Mar 5, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be worth mentioning that we don't do deduplication as it happens on a much lower level? (in host.rs) Also Returns a vector of valid URLs and errors is a bit confusing.

We could simplify it maybe something in the direction of:

Suggested change
/// different source locations.
/// Create requests out of the collected URLs.
/// [`Request`]s are not deduplicated.
/// Caching and deduplication happens on a lower level.

+/// Requests are not deduplicated because repeated URLs may occur at different
+/// source locations. Caching and deduplication happens elsewhere (e.g., in the
+/// per-host HostCache and the top-level persistent cache).
+/// Create requests out of the collected URLs. Maps each [`RawUri`] to a parsed
+/// [`Request`] or a [`RequestError`], according to the given options.
@katrinafyi

Copy link
Copy Markdown
Member Author

Let me know what you think after those commits.

@thomas-zahner thomas-zahner merged commit 7b06c31 into lycheeverse:master Mar 5, 2026
7 checks passed
@thomas-zahner

Copy link
Copy Markdown
Member

Looks great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants