Skip to content

Replace Wayback Availability API with direct snapshot URLs#2167

Merged
mre merged 5 commits into
masterfrom
wayback-direct-url
Jun 1, 2026
Merged

Replace Wayback Availability API with direct snapshot URLs#2167
mre merged 5 commits into
masterfrom
wayback-direct-url

Conversation

@mre

@mre mre commented Apr 24, 2026

Copy link
Copy Markdown
Member

I came across this internal Wikipedia page and learned about Wayback Machine URL formats. Based on this, I believe we can simplify our code.

We can replace the call to https://archive.org/wayback/available with a client-side construction of https://web.archive.org/web/0/.

The '0' timestamp tells Wayback to redirect to the newest available snapshot, which is far more reliable than the Availability JSON API (which is heavily rate-limited and is a frequent source of frustration due to flakiness. It also sometimes returns empty archived_snapshots for pages that are clearly archived).

Edit: turns out "0" means "give me the first snapshot". What I was looking for was https://web.archive.org/web/ without the /0/, which returns the latest snapshot. Thanks for the correction.

@mre mre requested review from katrinafyi and thomas-zahner April 24, 2026 22:22
Comment thread lychee-lib/src/archive/wayback/mod.rs Outdated
@thomas-zahner

thomas-zahner commented Apr 30, 2026

Copy link
Copy Markdown
Member

The '0' timestamp tells Wayback to redirect to the newest available snapshot, which is far more reliable than the Availability JSON API (which is heavily rate-limited and is a frequent source of frustration due to flakiness. It also sometimes returns empty archived_snapshots for pages that are clearly archived).

Hmm are you really sure? On my machine this does not seem to be the case. curl 'https://web.archive.org/web/0/http://example.com/' -v yields a snapshot from 2002 with accordingly dated website contents.

Ah the /0/ in the path seems to mean "oldest" version, i.e. snapshot number 0. Getting rid of it should work as intended.

Sidenote; Wikipedia also links a CLI tool called waybackpy which can be used to obtain latest snapshots. This tool seems to use a different API called Wayback CDX Server API. This could also be investigated. But I assume using the simplest approach is the way to go. Especially since I wouldn't expect this CDX Server API to be more reliable than the direct approach.

Comment thread lychee-lib/src/archive/wayback/mod.rs Outdated

@katrinafyi katrinafyi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good! Hopefully it's more reliable.

I also observe the same thing as Thomas - to link to the latest snapshot, the URL should omit the /0/. This seems to match what the wikipedia help page says: https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine#Latest_archive_copy

Comment thread lychee-lib/src/archive/wayback/mod.rs Outdated
Comment thread lychee-lib/src/archive/wayback/mod.rs Outdated
@mre mre force-pushed the wayback-direct-url branch from dc8085b to d8f9fb0 Compare May 30, 2026 14:59
The previous implementation queried https://archive.org/wayback/available,
parsed its JSON response, and extracted the closest snapshot URL. That API
is heavily rate-limited and frequently returns empty 'archived_snapshots'
for pages that are clearly archived, leading to false 'no suggestion'
results.

Instead, request https://web.archive.org/web/<url> directly. Wayback
resolves the latest snapshot server-side and either:

- responds 302 with a Location pointing at the actual timestamped
  snapshot (e.g. /web/20260530081255/http://example.com/). We read the
  Location header directly, without following the redirect, so users see
  the capture date in the suggested link without downloading the archived
  page.
- responds 404 when no snapshot exists, in which case we return None so
  lychee doesn't suggest a dead-end 'page not archived' link.

Note: the timestamp is omitted from the request URL. A '0' timestamp
would resolve to the *oldest* snapshot, whereas omitting it yields the
latest capture.

Other improvements:

- Share a single LazyLock<reqwest::Client> across all suggestion lookups
  so they reuse the connection pool, TLS session cache, and DNS resolver.
  The client uses redirect::Policy::none() since we read Location ourselves.
- Drop the timeout parameter from Archive::get_archive_snapshot. With a
  shared static client, baking a sensible default (20s) is simpler than
  threading the value through callers. This is a breaking change to the
  public lychee-lib API.
- Drop the InternetArchiveResponse / ArchivedSnapshots / Closest structs
  and the custom StatusCode deserializer.
- Replace the wiremock-based test, the API-docs scrape, and the two
  ignored real-network tests with two deterministic real-network tests
  covering the 302-with-Location and 404-becomes-None cases. The
  'unknown URL' test still tolerates 503s as transient.
@mre mre force-pushed the wayback-direct-url branch from d8f9fb0 to d1d54ca Compare May 30, 2026 15:02
@mre

mre commented May 30, 2026

Copy link
Copy Markdown
Member Author

I've rebased and fixed the bug. :) Using the latest snapshot now, not the oldest. So basically dropped the /0/.

@mre mre requested review from katrinafyi and thomas-zahner May 30, 2026 15:11
The two real-network tests each ran on their own #[tokio::test] runtime while sharing the process-wide static reqwest::Client. The client's connection-pool tasks bind to whichever runtime initializes it first, so the sibling test could reuse a pooled connection whose runtime was already torn down, failing intermittently with 'runtime dropped the dispatch task' (DispatchGone).

Run both cases in a single test (one runtime) to keep the shared-client design intact while making the tests deterministic.

@Souradip121 Souradip121 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mre Added my review, kindly check and let me know what do you think.

Comment thread lychee-lib/src/archive/wayback/mod.rs Outdated
Comment thread lychee-lib/src/archive/wayback/mod.rs
Comment thread lychee-lib/src/archive/wayback/mod.rs
@mre

mre commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Hey @Souradip121, thanks for your review comments. I've applied the changes now. 👍

@mre mre force-pushed the wayback-direct-url branch from 5528044 to 7279eb0 Compare June 1, 2026 21:43

@mre mre left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this turned out to not be much shorter than the previous version, but I'd argue it's still an improvement because we've got less moving parts, and we make link suggestion quicker, because we don't download the full page anymore. (We read the location from the response headers.) Besides, many changes were just comments, so it's fine.

@mre mre merged commit 6333b81 into master Jun 1, 2026
8 checks passed
@mre mre deleted the wayback-direct-url branch June 1, 2026 21:46
@mre mre mentioned this pull request Jun 1, 2026

@katrinafyi katrinafyi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without at least one test for the "real" API, we wouldn't know if Wayback machine makes breaking changes that break this feature. I'd hoped that with this new approach, combined with the status code filtering, the test would be much more reliable and less flaky so we could leave it enabled.

It seems a bit hasty to preemptively ignore the test without a trial period to see if it's really flaky. That said, I think Wayback machine, by its nature, is very unlikely to make breaking changes so I'm not tooo worried. The ignoring just seems like a change made in haste.

@mre

mre commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Good point. I'll bring the test back in a new PR.

mre added a commit that referenced this pull request Jun 2, 2026
* Reinstate real-network Wayback test

PR #2167 preemptively marked the live Wayback Machine test as #[ignore].
As discussed in review, this leaves us without any guard against the
upstream API changing its redirect/404 behavior in a way that silently
breaks suggestions. The new direct-snapshot approach combined with the
503 tolerance should be reliable enough to run in CI, so un-ignore it.

* Bring back the Wayback Machine test
@mre

mre commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

This is fixed now in #2226

@thomas-zahner

Copy link
Copy Markdown
Member

Awesome 🚀
I do think (and hope) that this endpoint we're using now is more stable than their API endpoint we've previously used

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants