Replace Wayback Availability API with direct snapshot URLs by mre · Pull Request #2167 · lycheeverse/lychee

mre · 2026-04-24T22:22:19Z

I came across this internal Wikipedia page and learned about Wayback Machine URL formats. Based on this, I believe we can simplify our code.

We can replace the call to https://archive.org/wayback/available with a client-side construction of https://web.archive.org/web/0/<url>.

The '0' timestamp tells Wayback to redirect to the newest available snapshot, which is far more reliable than the Availability JSON API. That API is heavily rate-limited, is a frequent source of frustration due to flakiness, and sometimes returns empty archived_snapshots for pages that are clearly archived.

Edit: turns out "0" means "give me the first snapshot". What I was looking for was https://web.archive.org/web/<url>, without the /0/, which returns the latest snapshot. Thanks for the correction.
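For illustration, the two URL shapes discussed here can be built client-side as in the minimal sketch below. The function names are hypothetical and not taken from lychee, and the sketch assumes the target URL is appended verbatim, without extra encoding.

```rust
/// Base of the Wayback Machine snapshot endpoint discussed in this thread.
const WAYBACK_BASE: &str = "https://web.archive.org/web";

/// Hypothetical helper: URL that resolves to the *latest* snapshot.
/// The timestamp segment is omitted entirely.
fn latest_snapshot_url(target: &str) -> String {
    format!("{WAYBACK_BASE}/{target}")
}

/// Hypothetical helper: URL that resolves to the *oldest* snapshot.
/// A `0` timestamp selects the first capture, as observed in this thread.
fn oldest_snapshot_url(target: &str) -> String {
    format!("{WAYBACK_BASE}/0/{target}")
}
```

Calling `latest_snapshot_url("http://example.com/")` produces the form without `/0/`, which is the one this PR ends up using.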

thomas-zahner · 2026-04-30T08:35:52Z

> The '0' timestamp tells Wayback to redirect to the newest available snapshot, which is far more reliable than the Availability JSON API (which is heavily rate-limited and is a frequent source of frustration due to flakiness. It also sometimes returns empty archived_snapshots for pages that are clearly archived).

Hmm are you really sure? On my machine this does not seem to be the case. curl 'https://web.archive.org/web/0/http://example.com/' -v yields a snapshot from 2002 with accordingly dated website contents.

Ah the /0/ in the path seems to mean "oldest" version, i.e. snapshot number 0. Getting rid of it should work as intended.

Side note: Wikipedia also links to a CLI tool called waybackpy, which can be used to obtain the latest snapshots. This tool seems to use a different API, the Wayback CDX Server API, which could also be investigated. But I assume the simplest approach is the way to go, especially since I wouldn't expect the CDX Server API to be more reliable than the direct approach.

katrinafyi

Mostly good! Hopefully it's more reliable.

I observe the same thing as Thomas: to link to the latest snapshot, the URL should omit the /0/. This seems to match what the Wikipedia help page says: https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine#Latest_archive_copy

The previous implementation queried https://archive.org/wayback/available, parsed its JSON response, and extracted the closest snapshot URL. That API is heavily rate-limited and frequently returns empty 'archived_snapshots' for pages that are clearly archived, leading to false 'no suggestion' results.

Instead, request https://web.archive.org/web/<url> directly. Wayback resolves the latest snapshot server-side and either:

- responds 302 with a Location pointing at the actual timestamped snapshot (e.g. /web/20260530081255/http://example.com/). We read the Location header directly, without following the redirect, so users see the capture date in the suggested link without downloading the archived page.
- responds 404 when no snapshot exists, in which case we return None so lychee doesn't suggest a dead-end 'page not archived' link.

Note: the timestamp is omitted from the request URL. A '0' timestamp would resolve to the *oldest* snapshot, whereas omitting it yields the latest capture.

Other improvements:

- Share a single LazyLock<reqwest::Client> across all suggestion lookups so they reuse the connection pool, TLS session cache, and DNS resolver. The client uses redirect::Policy::none() since we read Location ourselves.
- Drop the timeout parameter from Archive::get_archive_snapshot. With a shared static client, baking a sensible default (20s) is simpler than threading the value through callers. This is a breaking change to the public lychee-lib API.
- Drop the InternetArchiveResponse / ArchivedSnapshots / Closest structs and the custom StatusCode deserializer.
- Replace the wiremock-based test, the API-docs scrape, and the two ignored real-network tests with two deterministic real-network tests covering the 302-with-Location and 404-becomes-None cases. The 'unknown URL' test still tolerates 503s as transient.
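The 302/404 handling described above can be sketched as a pure function over the status code and the Location header, with no HTTP client involved. This is only an illustration under stated assumptions: the names `snapshot_from_response` and `absolutize` are hypothetical, and whether Wayback sends a relative or an absolute Location value is assumed rather than confirmed, so the sketch accepts both.

```rust
/// Hypothetical helper: turn the status code and `Location` header of an
/// un-followed Wayback response into an optional suggestion URL.
fn snapshot_from_response(status: u16, location: Option<&str>) -> Option<String> {
    match (status, location) {
        // 302 with a Location: the header already names the timestamped snapshot.
        (302, Some(loc)) => Some(absolutize(loc)),
        // 404 (no snapshot) and every other case: suggest nothing.
        _ => None,
    }
}

/// Assumption: the Location value may be path-relative, protocol-relative,
/// or absolute. Relative forms are resolved against web.archive.org.
fn absolutize(location: &str) -> String {
    if location.starts_with("//") {
        format!("https:{location}")
    } else if location.starts_with('/') {
        format!("https://web.archive.org{location}")
    } else {
        location.to_string()
    }
}
```

Keeping this mapping free of I/O means the redirect and not-found cases can be checked without touching the network, while the real request still needs a client that does not follow redirects.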

mre · 2026-05-30T15:04:37Z

I've rebased and fixed the bug. :) Using the latest snapshot now, not the oldest. So basically dropped the /0/.

The two real-network tests each ran on their own #[tokio::test] runtime while sharing the process-wide static reqwest::Client. The client's connection-pool tasks bind to whichever runtime initializes it first, so the sibling test could reuse a pooled connection whose runtime was already torn down, failing intermittently with 'runtime dropped the dispatch task' (DispatchGone). Run both cases in a single test (one runtime) to keep the shared-client design intact while making the tests deterministic.
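As a loose, std-only analogy for the "whichever runtime initializes it first" effect, the sketch below uses threads in place of tokio runtimes. It does not reproduce reqwest or tokio behaviour. `SharedClient` and `observe_from_thread` are hypothetical names, and the only point is that a process-wide `LazyLock` keeps whatever context its first user captured.

```rust
use std::sync::LazyLock;
use std::thread;

/// Hypothetical stand-in for the shared client: it remembers which thread
/// happened to construct it.
struct SharedClient {
    initialized_by: String,
}

/// Process-wide static, built lazily by whichever thread touches it first.
static CLIENT: LazyLock<SharedClient> = LazyLock::new(|| SharedClient {
    initialized_by: thread::current().name().unwrap_or("unnamed").to_string(),
});

/// Access the shared static from a freshly spawned, named thread and report
/// which thread originally initialized it.
fn observe_from_thread(name: &str) -> String {
    thread::Builder::new()
        .name(name.to_string())
        .spawn(|| CLIENT.initialized_by.clone())
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

If `observe_from_thread` is called from two differently named threads in sequence, both report the name of the first one, even though that thread has already exited. That mirrors why running both cases under a single runtime avoids the stale-owner problem.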

Souradip121

@mre I've added my review. Please take a look and let me know what you think.

mre · 2026-06-01T21:30:51Z

Hey @Souradip121, thanks for your review comments. I've applied the changes now. 👍

mre

So this turned out to not be much shorter than the previous version, but I'd argue it's still an improvement because we've got fewer moving parts, and link suggestion is quicker because we no longer download the full page. (We read the location from the response headers.) Besides, many changes were just comments, so it's fine.

katrinafyi

Without at least one test for the "real" API, we wouldn't know if the Wayback Machine makes breaking changes that break this feature. I'd hoped that with this new approach, combined with the status code filtering, the test would be much more reliable and less flaky so we could leave it enabled.

It seems a bit hasty to preemptively ignore the test without a trial period to see if it's really flaky. That said, I think the Wayback Machine, by its nature, is very unlikely to make breaking changes so I'm not tooo worried. The ignoring just seems like a change made in haste.

mre · 2026-06-02T11:18:15Z

Good point. I'll bring the test back in a new PR.

* Reinstate real-network Wayback test

  PR #2167 preemptively marked the live Wayback Machine test as #[ignore]. As discussed in review, this leaves us without any guard against the upstream API changing its redirect/404 behavior in a way that silently breaks suggestions. The new direct-snapshot approach combined with the 503 tolerance should be reliable enough to run in CI, so un-ignore it.

* Bring back the Wayback Machine test
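The 503 tolerance mentioned above could look roughly like the sketch below for the "unknown URL" case. This is an assumption about the shape of the check, not the actual test code: the `Verdict` enum and `judge_unknown_url` are hypothetical names introduced only for illustration.

```rust
/// Hypothetical outcome of the live "unknown URL" check.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Wayback answered as expected for a page with no snapshot.
    Pass,
    /// Upstream was temporarily unavailable, so the result is inconclusive.
    Skip,
    /// Anything else suggests the upstream behaviour has changed.
    Fail,
}

/// Classify the status code returned for a URL that should have no snapshot.
fn judge_unknown_url(status: u16) -> Verdict {
    match status {
        404 => Verdict::Pass,
        503 => Verdict::Skip,
        _ => Verdict::Fail,
    }
}
```

Treating only 503 as inconclusive keeps the guard meaningful: any other unexpected status still fails, which is the kind of upstream change the live test is meant to surface.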

mre · 2026-06-02T13:13:47Z

This is fixed now in #2226

thomas-zahner · 2026-06-07T06:22:58Z

Awesome 🚀
I do think (and hope) that the endpoint we're using now is more stable than the API endpoint we used previously.

mre requested review from katrinafyi and thomas-zahner April 24, 2026 22:22

mre commented Apr 24, 2026

thomas-zahner reviewed Apr 30, 2026

katrinafyi reviewed May 9, 2026

mre force-pushed the wayback-direct-url branch from dc8085b to d8f9fb0 Compare May 30, 2026 14:59

mre force-pushed the wayback-direct-url branch from d8f9fb0 to d1d54ca Compare May 30, 2026 15:02

update comment

daeb050

mre requested review from katrinafyi and thomas-zahner May 30, 2026 15:11

Souradip121 reviewed May 31, 2026

Reinstate timeouts and update comments

8c93df2

cosmetic changes

7279eb0

mre force-pushed the wayback-direct-url branch from 5528044 to 7279eb0 Compare June 1, 2026 21:43

mre commented Jun 1, 2026

mre merged commit 6333b81 into master Jun 1, 2026
8 checks passed

mre deleted the wayback-direct-url branch June 1, 2026 21:46

mre mentioned this pull request Jun 1, 2026

chore: release v0.25.0 #2217

Open

katrinafyi reviewed Jun 2, 2026

mre mentioned this pull request Jun 2, 2026

Wayback reinstate network test #2226

Merged
