A polite web scraper for archiving the Canadian Register of Historic Places before the site goes offline.
Target: https://www.historicplaces.ca/en/rep-reg/place-lieu.aspx?id=[ID]
- Polite scraping: Configurable delays between requests (default 1-2 seconds)
- Resume capability: Automatically continues from where it left off if interrupted
- Structured output: Exports to JSON (individual + combined) and CSV
- Raw HTML backup: Optionally saves raw HTML for each record
- Robust error handling: Retries failed requests, handles 404s gracefully
- Progress tracking: Real-time logging and statistics
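The retry behaviour listed above could be sketched as follows (a minimal illustration; the helper name `fetch_with_retries` and the backoff formula are assumptions, not taken from `scraper.py`):

```python
import time

def fetch_with_retries(fetch, url, retries=3, backoff=2.0):
    """Call fetch(url); on an exception, retry with linear-doubling backoff.

    A 404 should be handled inside `fetch` as a normal "record does not
    exist" result, not raised as an error.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the failure
            time.sleep(backoff * (2 ** attempt))
```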
Install dependencies:

```bash
pip install -r requirements.txt
```

Before running the full scrape, examine a few sample pages to verify the HTML selectors:
```bash
# Fetch and analyze a single page
python inspect_page.py --id 10001

# Save the HTML for offline inspection
python inspect_page.py --id 10001 --save

# Analyze a locally saved HTML file
python inspect_page.py --file sample_10001.html
```

To scrape a single ID as a quick test:

```bash
python scraper.py --single 10001
```

To run the full scrape:

```bash
# Default: scrape IDs 1-30000
python scraper.py

# Custom range
python scraper.py --start 1000 --end 5000

# More polite (longer delays)
python scraper.py --delay-min 2.0 --delay-max 4.0
```

```
usage: scraper.py [-h] [--start START] [--end END] [--output OUTPUT]
                  [--delay-min DELAY_MIN] [--delay-max DELAY_MAX]
                  [--no-resume] [--no-html] [--timeout TIMEOUT]
                  [--single SINGLE] [-v]
```
```
Options:
  --start START    Starting ID (default: 1)
  --end END        Ending ID (default: 30000)
  --output, -o     Output directory (default: data)
  --delay-min      Minimum delay between requests (default: 1.0s)
  --delay-max      Maximum delay between requests (default: 2.0s)
  --no-resume      Start fresh, ignoring previous progress
  --no-html        Don't save raw HTML files (saves disk space)
  --timeout        Request timeout in seconds (default: 30)
  --single ID      Scrape a single ID (for testing)
  -v, --verbose    Enable debug logging
```
```
data/
├── json/
│   ├── 1.json
│   ├── 2.json
│   └── ...
├── html/                    # Raw HTML (unless --no-html is set)
│   ├── 1.html
│   └── ...
├── historic_places.json     # Combined JSON (all records)
├── historic_places.csv      # CSV export
├── progress.json            # Resume tracking
└── stats.json               # Scraping statistics
```
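The resume mechanism relies on `data/progress.json`. A minimal sketch of how the completed-ID set might be loaded and used (the file's actual schema, with a `completed_ids` key, is an assumption here):

```python
import json
import os

def load_completed_ids(path="data/progress.json"):
    """Return the set of already-scraped IDs, or an empty set on a fresh run."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        # Assumed schema: {"completed_ids": [1, 2, ...]}
        return set(json.load(f).get("completed_ids", []))

def remaining_ids(start, end, completed):
    """IDs still to scrape, in ascending order."""
    return [i for i in range(start, end + 1) if i not in completed]
```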
Each place record contains:
| Field | Description |
|---|---|
| `id` | Database ID |
| `name` | Official place name |
| `other_names` | Alternative/former names |
| `location` | General location description |
| `address` | Street address |
| `province_territory` | Province or territory |
| `municipality` | City/town |
| `latitude` / `longitude` | Geographic coordinates |
| `recognition_type` | Type of heritage designation |
| `recognition_date` | When designated |
| `recognition_authority` | Designating body |
| `designation_status` | Current status |
| `description_of_place` | Physical description |
| `heritage_value` | Statement of significance |
| `character_defining_elements` | Key heritage features |
| `construction_date` | When built |
| `architect_designer` | Creator |
| `significant_events` | Associated historical events |
| `themes` | Heritage themes/categories |
| `image_urls` | Photos and images |
| `scraped_at` | Timestamp of scrape |
| `source_url` | Original page URL |
The scraper uses multiple strategies to find data:
- Definition lists (`<dl>`/`<dt>`/`<dd>`)
- Tables with label/value rows
- Elements with semantic class names
- Elements with IDs matching field names

If the default selectors don't work well, use `inspect_page.py` to analyze the actual HTML structure, then modify the `_parse_page()` method in `scraper.py`.
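As a stdlib-only sketch of the first strategy (definition lists), pair each `<dt>` label with the `<dd>` value that follows it. This is an illustration, not the actual `_parse_page()` implementation:

```python
from html.parser import HTMLParser

class DLExtractor(HTMLParser):
    """Collect {label: value} pairs from <dl><dt>...</dt><dd>...</dd></dl>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._tag = None    # "dt" or "dd" while inside one
        self._label = None  # pending <dt> text awaiting its <dd>

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "dt":
            self._label = text
        elif self._tag == "dd" and self._label:
            self.fields[self._label] = text
            self._label = None

def parse_definition_list(html):
    parser = DLExtractor()
    parser.feed(html)
    return parser.fields
```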
The scraper automatically saves progress to `data/progress.json`. If interrupted:

```bash
# Resume from where you left off
python scraper.py

# Start fresh (ignore previous progress)
python scraper.py --no-resume
```

To be respectful to the server:
- Default: 1-2 second random delay between requests
- Increase delays if you notice issues: `--delay-min 3.0 --delay-max 5.0`
- The scraper identifies itself with a standard browser User-Agent
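The delay behaviour above amounts to drawing a uniform random wait between the two bounds; a minimal sketch (the helper name is illustrative, not from `scraper.py`):

```python
import random

def next_delay(delay_min=1.0, delay_max=2.0):
    """Seconds to wait before the next request, drawn uniformly at random."""
    return random.uniform(delay_min, delay_max)
```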
With default settings (1-2s delay) for 30,000 IDs:
- Best case (all 404s): ~8-17 hours
- Typical (50% valid): ~12-25 hours
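These figures can be sanity-checked from the delay alone (ignoring request latency and parsing time, which push the upper estimates higher):

```python
def estimated_hours(num_ids, delay_min, delay_max):
    """Lower/upper bound on total wall-clock hours from request delays only."""
    low = num_ids * delay_min / 3600
    high = num_ids * delay_max / 3600
    return low, high

low, high = estimated_hours(30_000, 1.0, 2.0)  # roughly 8.3 to 16.7 hours
```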
Consider running in a `screen` or `tmux` session.
Two workflow options are available for running the scraper via GitHub Actions.
Best for smaller ranges or testing. Manually triggered.
- Go to Actions → Scrape Historic Places
- Click Run workflow
- Configure parameters:
  - `start_id` / `end_id`: ID range to scrape
  - `delay_min` / `delay_max`: Request delays
  - `save_html`: Whether to save raw HTML
Limitations:
- 6-hour max runtime (~5,000-7,000 IDs per run at default delays)
- Progress is saved between runs via artifacts
Splits the work across multiple parallel jobs. Faster but use with caution.
- Go to Actions → Scrape Historic Places (Parallel)
- Click Run workflow
- Configure:
  - `total_start` / `total_end`: Full ID range
  - `chunks`: Number of parallel workers (2-6)
- Important: Increase delays for parallel runs (3-5 s recommended)
After all chunks complete, a final job combines the data.
Caution: Running multiple parallel scrapers may trigger rate limiting or IP blocks. Monitor the first run carefully.
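The chunking step of the parallel workflow could look like the following sketch. The actual splitting logic used by the workflow is not shown in this document, so treat this as an assumption:

```python
def split_range(total_start, total_end, chunks):
    """Divide [total_start, total_end] into `chunks` contiguous sub-ranges.

    Any remainder is spread one ID at a time across the earliest chunks,
    so chunk sizes differ by at most one.
    """
    total = total_end - total_start + 1
    size, extra = divmod(total, chunks)
    ranges, start = [], total_start
    for i in range(chunks):
        end = start + size - 1 + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each `(start, end)` pair then becomes one worker's `--start`/`--end` arguments.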
After a workflow completes:
- Go to the workflow run
- Scroll to Artifacts
- Download the ZIP files containing JSON/CSV data
Artifacts are retained for 90 days.
For the full 30,000 ID range:
- Test with a small range first: `--start 1 --end 100`
- Run in batches of ~5,000 IDs using the single workflow
- Or use 3-4 parallel chunks with 4-5 second delays
The site may block automated requests. Try:
- Increase delays: `--delay-min 5.0 --delay-max 10.0`
- Use a VPN or different IP
- Run during off-peak hours

If records come back with missing or empty fields:
- Run `inspect_page.py` on a sample page
- Check the HTML structure
- Adjust selectors in `scraper.py`
Use `--no-html` to skip saving raw HTML files.
This tool is for archival/research purposes. Please respect the site's terms of service and robots.txt.