shawngraham/historicplaces-scraper

Historic Places Canada Scraper

A polite web scraper for archiving the Canadian Register of Historic Places before the site goes offline.

Target: https://www.historicplaces.ca/en/rep-reg/place-lieu.aspx?id=[ID]

Features

  • Polite scraping: Configurable delays between requests (default 1-2 seconds)
  • Resume capability: Automatically continues from where it left off if interrupted
  • Structured output: Exports to JSON (individual + combined) and CSV
  • Raw HTML backup: Optionally saves raw HTML for each record
  • Robust error handling: Retries failed requests, handles 404s gracefully
  • Progress tracking: Real-time logging and statistics

Installation

pip install -r requirements.txt

Quick Start

1. Inspect the HTML structure first

Before running the full scrape, examine a few sample pages to verify the HTML selectors:

# Fetch and analyze a single page
python inspect_page.py --id 10001

# Save the HTML for offline inspection
python inspect_page.py --id 10001 --save

# Analyze a locally saved HTML file
python inspect_page.py --file sample_10001.html

2. Test with a single ID

python scraper.py --single 10001

3. Run the full scrape

# Default: scrape IDs 1-30000
python scraper.py

# Custom range
python scraper.py --start 1000 --end 5000

# More polite (longer delays)
python scraper.py --delay-min 2.0 --delay-max 4.0

Command-Line Options

usage: scraper.py [-h] [--start START] [--end END] [--output OUTPUT]
                  [--delay-min DELAY_MIN] [--delay-max DELAY_MAX]
                  [--no-resume] [--no-html] [--timeout TIMEOUT]
                  [--single SINGLE] [-v]

Options:
  --start START       Starting ID (default: 1)
  --end END           Ending ID (default: 30000)
  --output, -o        Output directory (default: data)
  --delay-min         Minimum delay between requests (default: 1.0s)
  --delay-max         Maximum delay between requests (default: 2.0s)
  --no-resume         Start fresh, ignoring previous progress
  --no-html           Don't save raw HTML files (saves disk space)
  --timeout           Request timeout in seconds (default: 30)
  --single ID         Scrape a single ID (for testing)
  -v, --verbose       Enable debug logging

Output Structure

data/
├── json/
│   ├── 1.json
│   ├── 2.json
│   └── ...
├── html/                    # Raw HTML (if --no-html not set)
│   ├── 1.html
│   └── ...
├── historic_places.json     # Combined JSON (all records)
├── historic_places.csv      # CSV export
├── progress.json            # Resume tracking
└── stats.json               # Scraping statistics

Data Model

Each place record contains:

| Field | Description |
| --- | --- |
| id | Database ID |
| name | Official place name |
| other_names | Alternative/former names |
| location | General location description |
| address | Street address |
| province_territory | Province or territory |
| municipality | City/town |
| latitude / longitude | Geographic coordinates |
| recognition_type | Type of heritage designation |
| recognition_date | When designated |
| recognition_authority | Designating body |
| designation_status | Current status |
| description_of_place | Physical description |
| heritage_value | Statement of significance |
| character_defining_elements | Key heritage features |
| construction_date | When built |
| architect_designer | Creator |
| significant_events | Associated historical events |
| themes | Heritage themes/categories |
| image_urls | Photos and images |
| scraped_at | Timestamp of scrape |
| source_url | Original page URL |
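The field table above can be sketched as a lightweight record type. This is an illustration only: the field names come from the table (a subset is shown), but the scraper's actual internal representation, and the `PlaceRecord` name itself, are assumptions.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class PlaceRecord:
    """A subset of the record fields, sketched as a dataclass.
    Field names follow the table above; this is not the scraper's
    actual data structure."""
    id: int
    name: str = ""
    province_territory: str = ""
    recognition_type: str = ""
    image_urls: list = field(default_factory=list)
    source_url: str = ""

# asdict() gives the dict shape you would expect in each per-ID JSON file
record = PlaceRecord(id=10001, name="Example Place")
```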

Customizing Selectors

The scraper uses multiple strategies to find data:

  1. Definition lists (<dl>/<dt>/<dd>)
  2. Tables with label/value rows
  3. Elements with semantic class names
  4. Elements with IDs matching field names

If the default selectors don't work well, use inspect_page.py to analyze the actual HTML structure, then modify the _parse_page() method in scraper.py.
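As an illustration of strategy 1, a minimal extractor for `<dl>`/`<dt>`/`<dd>` pairs might look like the following. This is a sketch, not the actual `_parse_page()` code, and it uses a deliberately simple regex rather than a full HTML parser:

```python
import re

def parse_definition_list(html):
    """Extract label/value pairs from a <dl> definition list.
    Simplified sketch of selector strategy 1; the real scraper
    may use an HTML parser such as BeautifulSoup instead."""
    pairs = {}
    # Match each <dt>label</dt> followed by its <dd>value</dd>
    for label, value in re.findall(
        r"<dt[^>]*>(.*?)</dt>\s*<dd[^>]*>(.*?)</dd>", html, re.S
    ):
        # Strip any inner tags, then normalize the label to a field name
        key = re.sub(r"<[^>]+>", "", label).strip().lower().replace(" ", "_")
        pairs[key] = re.sub(r"<[^>]+>", "", value).strip()
    return pairs
```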

Resuming Interrupted Scrapes

The scraper automatically saves progress to data/progress.json. If interrupted:

# Resume from where you left off
python scraper.py

# Start fresh (ignore previous progress)
python scraper.py --no-resume
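The resume mechanics can be sketched roughly as follows. The actual schema of `data/progress.json` is an assumption here; the sketch stores completed IDs under a `completed` key:

```python
import json
from pathlib import Path

def load_progress(path):
    """Return the set of already-scraped IDs, or an empty set on first run.
    Sketch only: the real progress.json schema may differ."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text()).get("completed", []))

def save_progress(path, completed):
    """Persist the completed IDs so an interrupted run can resume."""
    Path(path).write_text(json.dumps({"completed": sorted(completed)}))
```

On startup the scraper would skip any ID already in the loaded set, which is why re-running `python scraper.py` simply continues.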

Rate Limiting

To be respectful to the server:

  • Default: 1-2 second random delay between requests
  • Increase delays if you notice issues: --delay-min 3.0 --delay-max 5.0
  • The scraper identifies itself with a standard browser User-Agent
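A minimal version of the delay logic might look like this (the `polite_sleep` name is illustrative, not the scraper's actual API; the defaults match the documented 1-2 s range):

```python
import random
import time

def polite_sleep(delay_min=1.0, delay_max=2.0):
    """Sleep for a random interval between requests, so traffic
    doesn't arrive at a fixed, machine-like cadence."""
    delay = random.uniform(delay_min, delay_max)
    time.sleep(delay)
    return delay
```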

Estimated Time

With default settings (1-2s delay) for 30,000 IDs:

  • Best case (all 404s): ~8-17 hours
  • Typical (50% valid): ~12-25 hours

Consider running in a screen or tmux session.
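The estimates above follow from simple arithmetic: (average delay + per-request time) x number of IDs. A quick sketch, where the 0.5 s per-request figure is an assumption:

```python
def estimate_hours(num_ids, delay_min, delay_max, request_seconds):
    """Rough runtime estimate: average random delay plus the time
    each HTTP request itself takes, summed over all IDs."""
    avg_delay = (delay_min + delay_max) / 2
    return num_ids * (avg_delay + request_seconds) / 3600

# 30,000 IDs at the default 1-2 s delay, assuming ~0.5 s per request
print(estimate_hours(30000, 1.0, 2.0, 0.5))  # ~16.7 hours
```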

GitHub Actions (Automated Scraping)

Two workflow options are available for running the scraper via GitHub Actions.

Option 1: Single Run (scrape.yml)

Best for smaller ranges or testing. Manually triggered.

  1. Go to Actions → Scrape Historic Places
  2. Click Run workflow
  3. Configure parameters:
    • start_id / end_id: ID range to scrape
    • delay_min / delay_max: Request delays
    • save_html: Whether to save raw HTML

Limitations:

  • 6-hour max runtime (~5,000-7,000 IDs per run at default delays)
  • Progress is saved between runs via artifacts, so successive runs pick up where the last one stopped

Option 2: Parallel Chunks (scrape-parallel.yml)

Splits the work across multiple parallel jobs. Faster but use with caution.

  1. Go to Actions → Scrape Historic Places (Parallel)
  2. Click Run workflow
  3. Configure:
    • total_start / total_end: Full ID range
    • chunks: Number of parallel workers (2-6)
    • Important: Increase delays for parallel runs (3-5s recommended)

After all chunks complete, a final job combines the data.

Caution: Running multiple parallel scrapers may trigger rate limiting or IP blocks. Monitor the first run carefully.
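Splitting the ID range into near-equal chunks, as the parallel workflow presumably does, can be sketched like this (the function and its interface are illustrative, not taken from the workflow files):

```python
def split_range(start, end, chunks):
    """Split an inclusive ID range into near-equal (start, end) chunks.
    Earlier chunks absorb the remainder when the range doesn't divide evenly."""
    total = end - start + 1
    size, extra = divmod(total, chunks)
    out, lo = [], start
    for i in range(chunks):
        hi = lo + size - 1 + (1 if i < extra else 0)
        out.append((lo, hi))
        lo = hi + 1
    return out

# e.g. split_range(1, 30000, 3) yields three 10,000-ID chunks
```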

Downloading Results

After a workflow completes:

  1. Go to the workflow run
  2. Scroll to Artifacts
  3. Download the ZIP files containing JSON/CSV data

Artifacts are retained for 90 days.

Recommended Strategy

For the full 30,000 ID range:

  1. Test with a small range first: --start 1 --end 100
  2. Run in batches of ~5,000 IDs using the single workflow
  3. Or use 3-4 parallel chunks with 4-5 second delays

Troubleshooting

403 Forbidden errors

The site may block automated requests. Try:

  1. Increase delays: --delay-min 5.0 --delay-max 10.0
  2. Use a VPN or different IP
  3. Run during off-peak hours

Missing fields

  1. Run inspect_page.py on a sample page
  2. Check the HTML structure
  3. Adjust selectors in scraper.py

Out of disk space

Use --no-html to skip saving raw HTML files.

License

This tool is for archival/research purposes. Please respect the site's terms of service and robots.txt.
