
How to Get Cached Pages From Wayback Machine API


Archive.org’s Wayback Machine is a staple in the SEO industry for examining cached historical web pages. Each cached page is called a snapshot. It’s useful for tracking progress, troubleshooting issues, or—if you’re lucky—recovering data. The Wayback Machine GUI can be slow or frustrating. The steps below show how to use Python to call the free API and return the nearest snapshot for a given date. This is helpful when you don’t know the exact date of the cached page you’re looking for.

I’m not aware of any call limits, but please be considerate and only request what you need. See the Wayback API documentation for more information.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax is understood
  • Access to a Linux installation (I recommend Ubuntu) or a Jupyter/Colab notebook

Starting the Script

First, install the fake-useragent package, which supplies realistic user-agent strings and reduces the chance of being denied by the Wayback API.

pip3 install fake-useragent

Next, import the required modules: requests and fake_useragent for calling the API, datetime for building the timestamp, plus json and re (regular expressions) to handle the response.

import requests
import json
import re
from datetime import datetime
from fake_useragent import UserAgent

Craft the Wayback API Call

In the code below we build a timestamp for today, but you can adjust it to find the nearest snapshot for any point in history (since the Wayback Machine's inception). Note that the availability API expects timestamps in YYYYMMDDhhmmss format (1 to 14 digits), not epoch seconds, so we format the current date accordingly. Be sure to replace the URL below with your site or page.

url = "https://www.importsem.com"
ua = UserAgent()
headers = {"user-agent": ua.chrome}

# the availability API expects YYYYMMDDhhmmss, not an epoch timestamp
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
wburl = "https://archive.org/wayback/available?url=" + url + "&timestamp=" + timestamp

Make the Wayback API Call

Now we are ready to use the requests module to make the API call, which is a simple query-string request. We send the URL and headers; in this example certificate verification is disabled because it can occasionally trip up the call (note that this is insecure, so leave verification enabled when it works). Then we decode the JSON response into the data variable.

response = requests.get(wburl, headers=headers, verify=False)
data = response.json()

Process the Wayback API Response

Here’s the example response from the API documentation:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

We need to extract the url property, and there are two ways to do it. The first is to read it directly from the parsed JSON; while the response is usually simple, some responses can be complex or malformed, so an alternative can help. The second is to convert the JSON object to a string and search it with a regular expression. Either method works. Note that the API may not find any snapshot for a URL, so use try/except to catch the missing value and report it.

Option 1

Since response.json() already returned a Python dictionary (an associative array-like structure), you can parse it directly. Try to access the url property; if it exists, use it. If not, set the value to "n/a" and print an error message.

try:
    wayback = data['archived_snapshots']['closest']['url']
except KeyError:
    wayback = "n/a"
    print("No snapshot URL returned")

Option 2

Convert the JSON object to a Python string, then use a regular expression to search for the snapshot URL. If a match is found, load it; otherwise set the value to “n/a” and print an error.

jsonstr = json.dumps(data)
matchResult = re.search(r'http://web\.archive\.org[^"]*', jsonstr)

try:
    wayback = matchResult[0]
except TypeError:
    wayback = "n/a"
    print("No snapshot URL returned")

Conclusion

From here you can use the information however you like. You can automate this script to store snapshot URLs in a database over time. For example, you could load a CSV from a Screaming Frog crawl and retrieve the last snapshot date for every URL on your site, but again be considerate of the API.
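As a sketch of that batch idea: the snippet below reads page URLs from a Screaming Frog-style CSV and builds one availability-API query per page. The "Address" column name and the sample rows are assumptions; swap the sample string for your exported file and send each query with requests as shown earlier.

```python
import csv
import io
from urllib.parse import quote

API = "https://archive.org/wayback/available"

def build_wayback_query(page_url, timestamp=""):
    """Assemble the availability-API call for one page; timestamp is optional."""
    query = API + "?url=" + quote(page_url, safe="")
    if timestamp:
        query += "&timestamp=" + timestamp
    return query

# stand-in for open("crawl.csv"); "Address" is the column Screaming Frog exports
sample_csv = "Address\nhttps://www.importsem.com/\nhttps://www.importsem.com/about/"

for row in csv.DictReader(io.StringIO(sample_csv)):
    print(build_wayback_query(row["Address"], "20230101"))
```

Percent-encoding the page URL with quote() keeps query strings with parameters of their own from confusing the API.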

This example shows that with a small amount of Python code you can retrieve cached pages from the Wayback Machine even without knowing the exact snapshot date. Try it out! Follow me on Twitter and let me know your Wayback Machine API applications and ideas!

Wayback Machine FAQ

What is the Wayback Machine API, and how does it work for retrieving cached pages?

The Wayback Machine API is provided by the Internet Archive and lets users access historical snapshots of websites. You request snapshots by sending HTTP requests to the API endpoint with the target URL and optional timestamp parameters.

How can I use the Wayback Machine API to get cached pages programmatically?

Make HTTP requests to the Wayback Machine API endpoint with the target URL and the desired timestamp. The API responds with metadata for the closest archived snapshot to that date, including a link to the archived content.

Can I retrieve cached pages for any website using the Wayback Machine API?

Yes, the Wayback Machine API allows you to retrieve cached pages for many websites. However, not all sites will have complete archives, and some content may be missing.

What format does the response from the Wayback Machine API come in?

The availability endpoint itself returns JSON metadata describing the closest snapshot, including its URL, timestamp, and status. Fetching that snapshot URL then returns the archived page's HTML, which you can parse and process as needed for your application.
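To illustrate the two layers, the helper below pulls the snapshot URL out of the availability API's JSON metadata; fetching that URL with requests, as in the tutorial above, is what returns the archived HTML. The payload here is the example response from the API documentation, and the helper name is ours.

```python
import json

def closest_snapshot_url(payload):
    """Return the closest snapshot URL from availability-API JSON, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

# the example response from the API documentation
payload = json.loads("""{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}""")

print(closest_snapshot_url(payload))
# http://web.archive.org/web/20130919044612/http://example.com/
```

Using .get() with defaults means a URL with no snapshots (an empty archived_snapshots object) falls through to None instead of raising.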

Are there any limitations or rate limits when using the Wayback Machine API?

Yes, the Wayback Machine API has rate limits to prevent abuse. Review the API documentation for details on rate limits and usage policies to ensure compliance.

How far back in time can I retrieve cached pages using the Wayback Machine API?

The availability of historical snapshots depends on the specific website and the frequency of archiving by the Wayback Machine. Some sites have extensive archives while others have limited snapshots.

Can I retrieve only specific elements or data from the cached pages using the Wayback Machine API?

The Wayback Machine API primarily provides full HTML content for archived pages. To extract specific elements or data, parse the HTML response to retrieve the desired information.
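As a minimal sketch of that parsing step using only the standard library's html.parser (many would reach for BeautifulSoup instead), the class below extracts the page title from archived HTML. The sample string stands in for HTML downloaded from a snapshot URL.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text inside the <title> tag of an archived page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# stand-in for HTML fetched from a snapshot URL
snapshot_html = "<html><head><title>Example Domain</title></head><body></body></html>"

parser = TitleExtractor()
parser.feed(snapshot_html)
print(parser.title)  # Example Domain
```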

Are there any authentication requirements to use the Wayback Machine API?

As of the last update, the Wayback Machine API does not require authentication for basic usage. Check the API documentation for any changes or updates to authentication requirements.

Can I use the Wayback Machine API for commercial purposes?

The Wayback Machine API is generally free to use for non-commercial purposes. For commercial or high-volume usage, review the Internet Archive’s terms of service and consider contacting them for specific agreements.

Where can I find more detailed documentation on using the Wayback Machine API?

You can find detailed documentation, including API endpoints, parameters, and examples, on the official Internet Archive website. Refer to the Wayback Machine API documentation (https://archive.org/help/wayback_api.php) for comprehensive information.
Greg Bernhardt