Replace Wayback Availability API with direct snapshot URLs#2167
Conversation
Hmm are you really sure? On my machine this does not seem to be the case. Ah the Sidenote; Wikipedia also links a CLI tool called waybackpy which can be used to obtain latest snapshots. This tool seems to use a different API called Wayback CDX Server API. This could also be investigated. But I assume using the simplest approach is the way to go. Especially since I wouldn't expect this CDX Server API to be more reliable than the direct approach. |
katrinafyi
left a comment
There was a problem hiding this comment.
Mostly good! Hopefully it's more reliable.
I also observe the same thing as Thomas - to link to the latest snapshot, the URL should omit the /0/. This seems to match what the wikipedia help page says: https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine#Latest_archive_copy
The previous implementation queried https://archive.org/wayback/available, parsed its JSON response, and extracted the closest snapshot URL. That API is heavily rate-limited and frequently returns empty 'archived_snapshots' for pages that are clearly archived, leading to false 'no suggestion' results. Instead, request https://web.archive.org/web/<url> directly. Wayback resolves the latest snapshot server-side and either: - responds 302 with a Location pointing at the actual timestamped snapshot (e.g. /web/20260530081255/http://example.com/). We read the Location header directly, without following the redirect, so users see the capture date in the suggested link without downloading the archived page. - responds 404 when no snapshot exists, in which case we return None so lychee doesn't suggest a dead-end 'page not archived' link. Note: the timestamp is omitted from the request URL. A '0' timestamp would resolve to the *oldest* snapshot, whereas omitting it yields the latest capture. Other improvements: - Share a single LazyLock<reqwest::Client> across all suggestion lookups so they reuse the connection pool, TLS session cache, and DNS resolver. The client uses redirect::Policy::none() since we read Location ourselves. - Drop the timeout parameter from Archive::get_archive_snapshot. With a shared static client, baking a sensible default (20s) is simpler than threading the value through callers. This is a breaking change to the public lychee-lib API. - Drop the InternetArchiveResponse / ArchivedSnapshots / Closest structs and the custom StatusCode deserializer. - Replace the wiremock-based test, the API-docs scrape, and the two ignored real-network tests with two deterministic real-network tests covering the 302-with-Location and 404-becomes-None cases. The 'unknown URL' test still tolerates 503s as transient.
|
I've rebased and fixed the bug. :) Using the latest snapshot now, not the oldest. So basically dropped the |
The two real-network tests each ran on their own #[tokio::test] runtime while sharing the process-wide static reqwest::Client. The client's connection-pool tasks bind to whichever runtime initializes it first, so the sibling test could reuse a pooled connection whose runtime was already torn down, failing intermittently with 'runtime dropped the dispatch task' (DispatchGone). Run both cases in a single test (one runtime) to keep the shared-client design intact while making the tests deterministic.
Souradip121
left a comment
There was a problem hiding this comment.
@mre Added my review, kindly check and let me know what do you think.
|
Hey @Souradip121, thanks for your review comments. I've applied the changes now. 👍 |
mre
left a comment
There was a problem hiding this comment.
So this turned out to not be much shorter than the previous version, but I'd argue it's still an improvement because we've got less moving parts, and we make link suggestion quicker, because we don't download the full page anymore. (We read the location from the response headers.) Besides, many changes were just comments, so it's fine.
There was a problem hiding this comment.
Without at least one test for the "real" API, we wouldn't know if Wayback machine makes breaking changes that break this feature. I'd hoped that with this new approach, combined with the status code filtering, the test would be much more reliable and less flaky so we could leave it enabled.
It seems a bit hasty to preemptively ignore the test without a trial period to see if it's really flaky. That said, I think Wayback machine, by its nature, is very unlikely to make breaking changes so I'm not tooo worried. The ignoring just seems like a change made in haste.
|
Good point. I'll bring the test back in a new PR. |
* Reinstate real-network Wayback test PR #2167 preemptively marked the live Wayback Machine test as #[ignore]. As discussed in review, this leaves us without any guard against the upstream API changing its redirect/404 behavior in a way that silently breaks suggestions. The new direct-snapshot approach combined with the 503 tolerance should be reliable enough to run in CI, so un-ignore it. * Bring back the Wayback Machine test
|
This is fixed now in #2226 |
|
Awesome 🚀 |
I came across this internal Wikipedia page and learned about Wayback Machine URL formats. Based on this, I believe we can simplify our code.
We can replace the call to https://archive.org/wayback/available with a client-side construction of https://web.archive.org/web/0/.
The '0' timestamp tells Wayback to redirect to the newest available snapshot, which is far more reliable than the Availability JSON API (which is heavily rate-limited and is a frequent source of frustration due to flakiness. It also sometimes returns empty archived_snapshots for pages that are clearly archived).Edit: turns out "0" means "give me the first snapshot". What I was looking for was https://web.archive.org/web/ without the
/0/, which returns the latest snapshot. Thanks for the correction.