shawngraham/historicplaces-scraper

Historic Places Canada Scraper

A polite web scraper for archiving the Canadian Register of Historic Places before the site goes offline.

Target: https://www.historicplaces.ca/en/rep-reg/place-lieu.aspx?id=[ID]

Features

  • Polite scraping: Configurable delays between requests (default 1-2 seconds)
  • Resume capability: Automatically continues from where it left off if interrupted
  • Structured output: Exports to JSON (individual + combined) and CSV
  • Raw HTML backup: Optionally saves raw HTML for each record
  • Robust error handling: Retries failed requests, handles 404s gracefully
  • Progress tracking: Real-time logging and statistics

Installation

pip install -r requirements.txt

Quick Start

1. Inspect the HTML structure first

Before running the full scrape, examine a few sample pages to verify the HTML selectors:

# Fetch and analyze a single page
python inspect_page.py --id 10001

# Save the HTML for offline inspection
python inspect_page.py --id 10001 --save

# Analyze a locally saved HTML file
python inspect_page.py --file sample_10001.html

2. Test with a single ID

python scraper.py --single 10001

3. Run the full scrape

# Default: scrape IDs 1-30000
python scraper.py

# Custom range
python scraper.py --start 1000 --end 5000

# More polite (longer delays)
python scraper.py --delay-min 2.0 --delay-max 4.0

Command-Line Options

usage: scraper.py [-h] [--start START] [--end END] [--output OUTPUT]
                  [--delay-min DELAY_MIN] [--delay-max DELAY_MAX]
                  [--no-resume] [--no-html] [--timeout TIMEOUT]
                  [--single SINGLE] [-v]

Options:
  --start START       Starting ID (default: 1)
  --end END           Ending ID (default: 30000)
  --output, -o        Output directory (default: data)
  --delay-min         Minimum delay between requests (default: 1.0s)
  --delay-max         Maximum delay between requests (default: 2.0s)
  --no-resume         Start fresh, ignoring previous progress
  --no-html           Don't save raw HTML files (saves disk space)
  --timeout           Request timeout in seconds (default: 30)
  --single ID         Scrape a single ID (for testing)
  -v, --verbose       Enable debug logging

Output Structure

data/
├── json/
│   ├── 1.json
│   ├── 2.json
│   └── ...
├── html/                    # Raw HTML (if --no-html not set)
│   ├── 1.html
│   └── ...
├── historic_places.json     # Combined JSON (all records)
├── historic_places.csv      # CSV export
├── progress.json            # Resume tracking
└── stats.json               # Scraping statistics

Data Model

Each place record contains:

| Field | Description |
| --- | --- |
| id | Database ID |
| name | Official place name |
| other_names | Alternative/former names |
| location | General location description |
| address | Street address |
| province_territory | Province or territory |
| municipality | City/town |
| latitude / longitude | Geographic coordinates |
| recognition_type | Type of heritage designation |
| recognition_date | When designated |
| recognition_authority | Designating body |
| designation_status | Current status |
| description_of_place | Physical description |
| heritage_value | Statement of significance |
| character_defining_elements | Key heritage features |
| construction_date | When built |
| architect_designer | Creator |
| significant_events | Associated historical events |
| themes | Heritage themes/categories |
| image_urls | Photos and images |
| scraped_at | Timestamp of scrape |
| source_url | Original page URL |
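The field table above can be sketched as a lightweight record type. This is an illustration only: the field names come from the table (a subset is shown), but the scraper's actual internal representation, and the `PlaceRecord` name itself, are assumptions.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class PlaceRecord:
    """A subset of the record fields, sketched as a dataclass.
    Field names follow the table above; this is not the scraper's
    actual data structure."""
    id: int
    name: str = ""
    province_territory: str = ""
    recognition_type: str = ""
    image_urls: list = field(default_factory=list)
    source_url: str = ""

# asdict() gives the dict shape you would expect in each per-ID JSON file
record = PlaceRecord(id=10001, name="Example Place")
```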

Customizing Selectors

The scraper uses multiple strategies to find data:

  1. Definition lists (<dl>/<dt>/<dd>)
  2. Tables with label/value rows
  3. Elements with semantic class names
  4. Elements with IDs matching field names

If the default selectors don't work well, use inspect_page.py to analyze the actual HTML structure, then modify the _parse_page() method in scraper.py.
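As an illustration of strategy 1, a minimal extractor for `<dl>`/`<dt>`/`<dd>` pairs might look like the following. This is a sketch, not the actual `_parse_page()` code, and it uses a deliberately simple regex rather than a full HTML parser:

```python
import re

def parse_definition_list(html):
    """Extract label/value pairs from a <dl> definition list.
    Simplified sketch of selector strategy 1; the real scraper
    may use an HTML parser such as BeautifulSoup instead."""
    pairs = {}
    # Match each <dt>label</dt> followed by its <dd>value</dd>
    for label, value in re.findall(
        r"<dt[^>]*>(.*?)</dt>\s*<dd[^>]*>(.*?)</dd>", html, re.S
    ):
        # Strip any inner tags, then normalize the label to a field name
        key = re.sub(r"<[^>]+>", "", label).strip().lower().replace(" ", "_")
        pairs[key] = re.sub(r"<[^>]+>", "", value).strip()
    return pairs
```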

Resuming Interrupted Scrapes

The scraper automatically saves progress to data/progress.json. If interrupted:

# Resume from where you left off
python scraper.py

# Start fresh (ignore previous progress)
python scraper.py --no-resume
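The resume mechanics can be sketched roughly as follows. The actual schema of `data/progress.json` is an assumption here; the sketch stores completed IDs under a `completed` key:

```python
import json
from pathlib import Path

def load_progress(path):
    """Return the set of already-scraped IDs, or an empty set on first run.
    Sketch only: the real progress.json schema may differ."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text()).get("completed", []))

def save_progress(path, completed):
    """Persist the completed IDs so an interrupted run can resume."""
    Path(path).write_text(json.dumps({"completed": sorted(completed)}))
```

On startup the scraper would skip any ID already in the loaded set, which is why re-running `python scraper.py` simply continues.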

Rate Limiting

To be respectful to the server:

  • Default: 1-2 second random delay between requests
  • Increase delays if you notice issues: --delay-min 3.0 --delay-max 5.0
  • The scraper identifies itself with a standard browser User-Agent
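A minimal version of the delay logic might look like this (the `polite_sleep` name is illustrative, not the scraper's actual API; the defaults match the documented 1-2 s range):

```python
import random
import time

def polite_sleep(delay_min=1.0, delay_max=2.0):
    """Sleep for a random interval between requests, so traffic
    doesn't arrive at a fixed, machine-like cadence."""
    delay = random.uniform(delay_min, delay_max)
    time.sleep(delay)
    return delay
```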

Estimated Time

With default settings (1-2s delay) for 30,000 IDs:

  • Best case (all 404s): ~8-17 hours
  • Typical (50% valid): ~12-25 hours

Consider running in a screen or tmux session.
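The estimates above follow from simple arithmetic: (average delay + per-request time) x number of IDs. A quick sketch, where the 0.5 s per-request figure is an assumption:

```python
def estimate_hours(num_ids, delay_min, delay_max, request_seconds):
    """Rough runtime estimate: average random delay plus the time
    each HTTP request itself takes, summed over all IDs."""
    avg_delay = (delay_min + delay_max) / 2
    return num_ids * (avg_delay + request_seconds) / 3600

# 30,000 IDs at the default 1-2 s delay, assuming ~0.5 s per request
print(estimate_hours(30000, 1.0, 2.0, 0.5))  # ~16.7 hours
```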

GitHub Actions (Automated Scraping)

Two workflow options are available for running the scraper via GitHub Actions.

Option 1: Single Run (scrape.yml)

Best for smaller ranges or testing. Manually triggered.

  1. Go to Actions → Scrape Historic Places
  2. Click Run workflow
  3. Configure parameters:
    • start_id / end_id: ID range to scrape
    • delay_min / delay_max: Request delays
    • save_html: Whether to save raw HTML

Limitations:

  • 6-hour max runtime (~5,000-7,000 IDs per run at default delays)
  • Progress is saved between runs via artifacts, so successive runs pick up where the last one stopped

Option 2: Parallel Chunks (scrape-parallel.yml)

Splits the work across multiple parallel jobs. Faster but use with caution.

  1. Go to Actions → Scrape Historic Places (Parallel)
  2. Click Run workflow
  3. Configure:
    • total_start / total_end: Full ID range
    • chunks: Number of parallel workers (2-6)
    • Important: Increase delays for parallel runs (3-5s recommended)

After all chunks complete, a final job combines the data.

Caution: Running multiple parallel scrapers may trigger rate limiting or IP blocks. Monitor the first run carefully.
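Splitting the ID range into near-equal chunks, as the parallel workflow presumably does, can be sketched like this (the function and its interface are illustrative, not taken from the workflow files):

```python
def split_range(start, end, chunks):
    """Split an inclusive ID range into near-equal (start, end) chunks.
    Earlier chunks absorb the remainder when the range doesn't divide evenly."""
    total = end - start + 1
    size, extra = divmod(total, chunks)
    out, lo = [], start
    for i in range(chunks):
        hi = lo + size - 1 + (1 if i < extra else 0)
        out.append((lo, hi))
        lo = hi + 1
    return out

# e.g. split_range(1, 30000, 3) yields three 10,000-ID chunks
```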

Downloading Results

After a workflow completes:

  1. Go to the workflow run
  2. Scroll to Artifacts
  3. Download the ZIP files containing JSON/CSV data

Artifacts are retained for 90 days.

Recommended Strategy

For the full 30,000 ID range:

  1. Test with a small range first: --start 1 --end 100
  2. Run in batches of ~5,000 IDs using the single workflow
  3. Or use 3-4 parallel chunks with 4-5 second delays

Troubleshooting

403 Forbidden errors

The site may block automated requests. Try:

  1. Increase delays: --delay-min 5.0 --delay-max 10.0
  2. Use a VPN or different IP
  3. Run during off-peak hours

Missing fields

  1. Run inspect_page.py on a sample page
  2. Check the HTML structure
  3. Adjust selectors in scraper.py

Out of disk space

Use --no-html to skip saving raw HTML files.

License

This tool is for archival/research purposes. Please respect the site's terms of service and robots.txt.
