How to Get Cached Pages From Wayback Machine API
Archive.org’s Wayback Machine is a staple in the SEO industry for examining cached historical web pages. Each cached page is called a snapshot. It’s useful for tracking progress, troubleshooting issues, or—if you’re lucky—recovering data. The Wayback Machine GUI can be slow or frustrating. The steps below show how to use Python to call the free API and return the nearest snapshot for a given date. This is helpful when you don’t know the exact date of the cached page you’re looking for.
I’m not aware of any call limits, but please be considerate and only request what you need. See the Wayback API documentation for more information.
Requirements and Assumptions
- Python 3 is installed and basic Python syntax is understood
- Access to a Linux installation (I recommend Ubuntu) or a Jupyter/Colab notebook
Starting the Script
First, install the fake-useragent package, which supplies realistic user-agent strings and reduces the chance of being denied by the Wayback API.
pip3 install fake-useragent
Next, import the required modules: requests and fake_useragent for calling the API, json and re (regular expressions) to handle the response, and datetime to build the timestamp.
import requests
import json
import re
from datetime import datetime
from fake_useragent import UserAgent
Craft the Wayback API Call
In the code below we grab the current timestamp, but you can adjust it to find the nearest snapshot for any point in history (since the Wayback Machine’s inception). The availability API accepts timestamps of 1 to 14 digits in YYYYMMDDhhmmss format, so a partial value such as 20130901 also works. Be sure to replace the URL below with your site or page.
url = "https://www.importsem.com"
ua = UserAgent()
headers = {"user-agent": ua.chrome}
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
wburl = "https://archive.org/wayback/available?url=" + url + "&timestamp=" + timestamp
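Plain string concatenation works for a simple URL, but pages whose addresses contain characters like `?` or `&` should be escaped. A minimal sketch using the standard library’s urllib.parse (the fixed date here is only for illustration):

```python
from urllib.parse import urlencode

# Build the availability endpoint with properly encoded query parameters
params = {"url": "https://www.importsem.com", "timestamp": "20240101"}
wburl = "https://archive.org/wayback/available?" + urlencode(params)
print(wburl)
```

urlencode percent-escapes the target URL, so it survives the trip as a single query parameter.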
Make the Wayback API Call
Now we are ready to use the requests module to make the API call. The API call is a simple query-string request. We send the URL and headers; in this example certificate verification is disabled (it can sometimes trip up the call). Then we load the JSON response into the data variable.
response = requests.get(wburl, headers=headers, verify=False)
data = response.json()
Process the Wayback API Response
Here’s the example response from the API documentation:
{
"archived_snapshots": {
"closest": {
"available": true,
"url": "http://web.archive.org/web/20130919044612/http://example.com/",
"timestamp": "20130919044612",
"status": "200"
}
}
}
We need to extract the url property, and there are two ways to do it. The first is to read it directly from the parsed dictionary. The second is to convert the response to a string and search it with a regular expression, which can help when a response is unusually structured. Either method works. Note that the API may return no snapshots for a URL, so use try/except to catch the failure and report it.
Option 1
The response.json() call above already parsed the response into a Python dictionary, so you can index into it directly. Try to access the url property; if it exists, use it. If not, set the value to “n/a” and print an error message.
try:
    wayback = data['archived_snapshots']['closest']['url']
except KeyError:
    wayback = "n/a"
    print("No snapshot URL returned")
Option 2
Convert the JSON object to a Python string, then use a regular expression to search for the snapshot URL. If a match is found, load it; otherwise set the value to “n/a” and print an error.
jsonstr = json.dumps(data)
matchResult = re.search(r'http://web\.archive\.org[^"]*', jsonstr)
try:
    wayback = matchResult[0]
except TypeError:
    wayback = "n/a"
    print("No snapshot URL returned")
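The two options above can be wrapped into a small helper that accepts the parsed response and falls back to “n/a”. A sketch (the function name is mine, not part of the API), exercised against the documented sample response:

```python
def closest_snapshot(data):
    """Return the closest snapshot URL from a parsed Wayback
    availability response, or 'n/a' if none is present."""
    try:
        return data["archived_snapshots"]["closest"]["url"]
    except (KeyError, TypeError):
        return "n/a"

# Example using the sample response from the API documentation
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
}
print(closest_snapshot(sample))                      # the snapshot URL
print(closest_snapshot({"archived_snapshots": {}}))  # "n/a"
```

Catching both KeyError and TypeError covers an empty archived_snapshots object as well as a null closest value.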
Conclusion
From here you can use the information however you like. You can automate this script to store snapshot URLs in a database over time. For example, you could load a CSV from a Screaming Frog crawl and retrieve the last snapshot date for every URL on your site, but again be considerate of the API.
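As a sketch of that batch idea, the lookup can be factored into a function that takes any iterable of URLs (for example, a column read from a crawl CSV with csv.DictReader). The function name, the injectable fetch parameter, and the one-second pause are my assumptions, not part of the API:

```python
import time
import requests

def closest_snapshots(urls, fetch=None, pause=1.0):
    """Map each URL to its closest snapshot URL ('n/a' when none).
    `fetch` takes a URL and returns the parsed API response;
    by default it calls the live availability endpoint."""
    if fetch is None:
        def fetch(url):
            resp = requests.get(
                "https://archive.org/wayback/available",
                params={"url": url},
            )
            return resp.json()
    results = {}
    for url in urls:
        data = fetch(url)
        try:
            results[url] = data["archived_snapshots"]["closest"]["url"]
        except KeyError:
            results[url] = "n/a"
        time.sleep(pause)  # be considerate of the free API
    return results
```

Injecting a fake fetch function also makes the loop easy to test without touching the network.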
This example shows that with a small amount of Python code you can retrieve cached pages from the Wayback Machine even without knowing the exact snapshot date. Try it out! Follow me on Twitter and let me know your Wayback Machine API applications and ideas!