WebPastMachine is a powerful tool that lets you explore the history of any website through the Internet Archive's Wayback Machine. It helps you discover all archived URLs for a domain, analyze the types of content that were archived, and export the results for further analysis.
- 🔍 Search for all archived URLs of any domain
- 📊 Analyze file types and their distribution
- 🔎 Filter results by file extension
- 💾 Export results to a file
- 🎨 Colored terminal output for better readability
- ⚡ Fast and efficient processing
- 🛠️ Easy to use command-line interface
- Clone this repository:

  ```bash
  git clone https://github.com/Shac0x/WebPastMachine
  cd WebPastMachine
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

Note: The tool will work without additional packages, but installing colorama provides a better visual experience with colored terminal output.
Search for all archived URLs of a domain:

```bash
python WebPastMachine.py example.com
```

Filter by file extension:

```bash
python WebPastMachine.py example.com -e pdf
```

Show only a summary without listing individual URLs:

```bash
python WebPastMachine.py example.com -s
```

Export to a file (combine summary mode with output):

```bash
python WebPastMachine.py example.com -s -o results.json
```

Combine filtering and export:

```bash
python WebPastMachine.py example.com -e pdf -o pdfs.json
```

| Argument | Description | Example |
|---|---|---|
| domain | The domain to search (required) | example.com |
| -e, --extension | Filter by file extension | -e pdf |
| -o, --output | Output file to save results | -o results.json |
| -s, --summary | Show only summary without listing individual URLs | -s |
| -h, --help | Show help message | -h |
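The project's own argument handling isn't reproduced here, but a minimal sketch of how this interface could be wired up with Python's argparse (a hypothetical reconstruction; the real WebPastMachine.py may differ):

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of the CLI described in the table above;
    # not the project's actual code.
    parser = argparse.ArgumentParser(
        prog="WebPastMachine.py",
        description="Explore archived URLs for a domain via the Wayback Machine.",
    )
    parser.add_argument("domain", help="The domain to search (required), e.g. example.com")
    parser.add_argument("-e", "--extension", help="Filter results by file extension, e.g. pdf")
    parser.add_argument("-o", "--output", help="Output file to save results, e.g. results.json")
    parser.add_argument("-s", "--summary", action="store_true",
                        help="Show only a summary without listing individual URLs")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.domain, args.extension, args.output, args.summary)
```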
Example output:

```text
╔════════════════════════════════════════════════════════════════╗
║          Searching archived URLs for example.com...             ║
╚════════════════════════════════════════════════════════════════╝

Processing URLs...
Processed 500/1200 URLs

Analysis of file types found:
--------------------------------------------------
*.html: 150 files
*.php: 45 files
*.jpg: 30 files
*.pdf: 25 files
*.js: 20 files

Total unique URLs found: 270

------------------------------------
URL: https://example.com/page.html
First capture: 2010-01-15 14:25:10
Archive link: http://web.archive.org/web/20100115142510/https://example.com/page.html
------------------------------------
```
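Note that the archive link embeds the capture time as a 14-digit Wayback Machine timestamp (YYYYMMDDhhmmss), which maps directly to the "First capture" date shown. A quick sketch of the conversion:

```python
from datetime import datetime

# Wayback Machine snapshot URLs embed a 14-digit timestamp (YYYYMMDDhhmmss).
ts = "20100115142510"
captured = datetime.strptime(ts, "%Y%m%d%H%M%S")
print(captured.strftime("%Y-%m-%d %H:%M:%S"))  # -> 2010-01-15 14:25:10
```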
With colorama installed, the output will be nicely colorized, making it easier to read and distinguish between different types of information.
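Since colorama is optional, the tool presumably degrades gracefully when it is missing. A minimal sketch of that pattern, assuming a hypothetical highlight() helper rather than the project's actual code:

```python
# Minimal sketch of optional colorama support, as described above;
# the actual WebPastMachine color scheme may differ.
try:
    from colorama import Fore, Style, init
    init(autoreset=True)  # reset colors after each print, works cross-platform
    HAS_COLOR = True
except ImportError:
    HAS_COLOR = False

def highlight(label, value):
    # Colorize the label when colorama is available; plain text otherwise.
    if HAS_COLOR:
        return f"{Fore.CYAN}{label}:{Style.RESET_ALL} {value}"
    return f"{label}: {value}"

print(highlight("URL", "https://example.com/page.html"))
```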
The exported file will contain all URLs with their capture dates and archive links in a clean, readable format.
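The exact export schema isn't documented here; purely as an illustration, a JSON export might carry one record per URL with fields like these (field names are assumptions, not the tool's documented format):

```python
import json

# Hypothetical record layout for results.json; the field names are
# illustrative, not the tool's documented schema.
results = [
    {
        "url": "https://example.com/page.html",
        "first_capture": "2010-01-15 14:25:10",
        "archive_link": "http://web.archive.org/web/20100115142510/https://example.com/page.html",
    }
]

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
```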
- 📚 Research: Investigate the history of websites
- 🔒 Security: Find old versions of sensitive pages
- 🎨 Design: Track website design evolution
- 📊 Analysis: Study content distribution over time
- 🔍 Discovery: Find lost or removed content
- Uses the Wayback Machine CDX API (see the sketch after this list)
- Implements efficient URL deduplication
- Handles rate limiting and timeouts
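For context, here is a minimal sketch of a CDX query with requests, deduplicating by original URL and applying a timeout. This approximates the approach described above rather than reproducing the tool's internals:

```python
import requests

# Minimal sketch of a Wayback Machine CDX API query; it approximates the
# behavior described above, not WebPastMachine's actual implementation.
CDX_API = "http://web.archive.org/cdx/search/cdx"

def fetch_archived_urls(domain, timeout=30):
    params = {
        "url": f"{domain}/*",        # all paths under the domain
        "output": "json",
        "fl": "original,timestamp",  # fields: original URL, capture time
        "collapse": "urlkey",        # server-side dedup by normalized URL
    }
    resp = requests.get(CDX_API, params=params, timeout=timeout)
    resp.raise_for_status()
    rows = resp.json()
    seen, results = set(), []
    for original, timestamp in rows[1:]:  # first row is the field-name header
        if original not in seen:          # client-side dedup as a safety net
            seen.add(original)
            results.append((original, timestamp))
    return results

if __name__ == "__main__":
    for url, ts in fetch_archived_urls("example.com")[:5]:
        print(ts, url)
```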
This project is licensed under the MIT License - see the LICENSE file for details.
- Internet Archive for providing the Wayback Machine
