
Introduce extractor pool #55

@mre

Description


Introduction

As of now, we send each URI we'd like to check to a client pool in main.rs. The code is here:

lychee/src/main.rs

Lines 128 to 135 in d2e349c

tokio::spawn(async move {
for link in links {
if let Some(pb) = &bar {
pb.set_message(&link.to_string());
};
send_req.send(link).await.unwrap();
}
});

This is not ideal for a few reasons:

  • All links get extracted on startup. This is a slow process that can take up to a few seconds for long link lists.
    It's not necessary to block the client during this step, since we could lazy-load the links on demand from the inputs.
  • There is no clear separation of concerns between main and the link extraction. Ideally the responsibilities should be split up to make testing and refactoring easier.

We already use a channel for sending the links to check to the client pool. We could use the same abstraction for extracting the links, too, in the form of an extractor pool.
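A minimal sketch of what such an extractor pool could look like. This uses std channels and threads so it is self-contained; the real implementation would use tokio::sync::mpsc and tokio::spawn like the client pool does. All names here (`spawn_extractor_pool`, `extract_links`, `send_req`) are illustrative, not lychee's actual API:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Placeholder for lychee's real extraction logic; here we just pick out
// whitespace-separated tokens that look like URLs.
fn extract_links(input: &str) -> Vec<String> {
    input
        .split_whitespace()
        .filter(|w| w.starts_with("http"))
        .map(str::to_string)
        .collect()
}

// Spawn `workers` extractor threads. Each takes the next pending input
// from the shared receiver, extracts its links, and forwards them to the
// channel the client pool reads from.
fn spawn_extractor_pool(
    inputs: mpsc::Receiver<String>,
    send_req: mpsc::Sender<String>,
    workers: usize,
) -> Vec<thread::JoinHandle<()>> {
    let inputs = Arc::new(Mutex::new(inputs));
    (0..workers)
        .map(|_| {
            let inputs = Arc::clone(&inputs);
            let send_req = send_req.clone();
            thread::spawn(move || loop {
                // Take the next input from the shared receiver.
                let msg = inputs.lock().unwrap().recv();
                match msg {
                    Ok(doc) => {
                        for link in extract_links(&doc) {
                            // Links flow to the client pool as they are
                            // found, instead of all being extracted up front.
                            send_req.send(link).unwrap();
                        }
                    }
                    Err(_) => break, // input channel closed: no more work
                }
            })
        })
        .collect()
}
```

The key property is that extraction happens concurrently with checking: the client pool starts receiving links as soon as the first input is processed.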

In the future this would allow implementing some advanced features in an extensible way:

  • Recursively check links: Push newly discovered websites into the input channel of the extractor pool.
  • Skip duplicate URLs: Filter input links with a HashSet or even a Bloom filter (for constant memory usage) that is maintained by the extractor pool before sending them to the client pool.
  • Request throttling: Group requests per website and apply some throttling to avoid overloading the server.
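To illustrate the deduplication idea, here is a hedged sketch of a filter the extractor pool could maintain. `SeenFilter` and `first_sighting` are hypothetical names; a Bloom filter could replace the HashSet to bound memory, at the cost of occasionally skipping a link that was never actually seen (a false positive):

```rust
use std::collections::HashSet;

// Hypothetical dedup filter owned by the extractor pool: a link is
// forwarded to the client pool only the first time it is seen.
struct SeenFilter {
    seen: HashSet<String>,
}

impl SeenFilter {
    fn new() -> Self {
        SeenFilter {
            seen: HashSet::new(),
        }
    }

    // Returns true if the link is new and should be sent to the client pool.
    // `HashSet::insert` already returns false for duplicates, so the filter
    // is a single call per link.
    fn first_sighting(&mut self, link: &str) -> bool {
        self.seen.insert(link.to_string())
    }
}
```

Because the filter lives inside the extractor pool, recursive checking composes naturally with it: rediscovered pages are dropped before they ever reach the client pool.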

How to contribute

  1. Create an extractor pool similar to our client pool.
  2. Spawn the pool inside main on startup, pass the channel to the pool, and start processing the inputs.

(The other end of the channel is already passed to the client pool.)
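The wiring described in the steps above could look roughly like this end to end, again sketched with std channels instead of tokio ones and with illustrative names (`run_pipeline`, `extract`). The extractor end produces on the same channel whose receiving end the client pool already owns:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for lychee's link extraction.
fn extract(doc: &str) -> Vec<String> {
    doc.split_whitespace()
        .filter(|w| w.starts_with("http"))
        .map(str::to_string)
        .collect()
}

fn run_pipeline(docs: Vec<String>) -> Vec<String> {
    let (input_tx, input_rx) = mpsc::channel::<String>();
    let (send_req, recv_req) = mpsc::channel::<String>();

    // Extractor end: consumes inputs, emits links on `send_req`.
    let extractor = thread::spawn(move || {
        for doc in input_rx {
            for link in extract(&doc) {
                send_req.send(link).unwrap();
            }
        }
        // `send_req` is dropped here, closing the channel for the client pool.
    });

    // Client-pool end: already owns `recv_req`; here it just collects the
    // links instead of checking them.
    let client = thread::spawn(move || recv_req.into_iter().collect::<Vec<_>>());

    for doc in docs {
        input_tx.send(doc).unwrap();
    }
    drop(input_tx); // signal that there are no more inputs

    extractor.join().unwrap();
    client.join().unwrap()
}
```

Note that main only owns the input side: once both pools are spawned, it feeds inputs and waits, which is exactly the separation of concerns the issue asks for.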
