A lightweight scaffold for scraping the AIDR disaster events API, normalizing data, and persisting it with a CLI orchestrator.
- Create a virtual environment and install dependencies:

```
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# macOS/Linux
source .venv/bin/activate
pip install -r requirements.txt
```

- Make the `src` package importable and load environment settings:
```
# Windows PowerShell
$env:PYTHONPATH = "src"
copy .env.example .env
# macOS/Linux
export PYTHONPATH=src
cp .env.example .env
```

- Run the scraper pipeline (fetch -> normalize -> store):
```
python -m aidr_scraper.main scrape --start-year 2005 --end-year 2025
```

- Refresh analytics/materialized views and show category counts:
```
python -m aidr_scraper.main refresh-views
python -m aidr_scraper.main analytics
```

- Preview a few rows as CSV (defaults to stdout, or pass --output to save):
```
python -m aidr_scraper.main sample-csv --limit 5
python -m aidr_scraper.main sample-csv --limit 10 --output data/sample.csv
```

Configuration is read from environment variables (loaded from `.env`; a sample file is sketched after this list):

- `DATABASE_URL` (optional): SQLAlchemy database URL. Defaults to `sqlite:///data/aidr.db`.
- `AIDR_API_URL` (optional): Override the AIDR resource search endpoint.
- `AIDR_TIMEOUT` (optional): Request timeout in seconds (default 30).
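Putting those together, a sample `.env` might look like the following; the values are illustrative, and the endpoint URL is a placeholder rather than the real AIDR API address:

```
# Sample .env — every setting is optional; values shown are illustrative defaults.
DATABASE_URL=sqlite:///data/aidr.db
# AIDR_API_URL=https://api.example.org/aidr/resources/search  # placeholder endpoint
AIDR_TIMEOUT=30
```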
Project layout:

- `src/aidr_scraper/` - package with fetch, normalize, storage, transform, and CLI orchestration.
- `migrations/` - SQL DDL scripts to bootstrap the database schema.
- `web_scraper.py` - reference script the scaffold was based on.
Notes:

- The CLI uses Typer for ergonomics and python-dotenv to load `.env` automatically; a minimal wiring sketch follows these notes.
- BeautifulSoup is used to safely strip any HTML fragments in summaries returned by the API (see the stripping sketch below).
- The storage layer uses SQLAlchemy with an idempotent upsert to deduplicate events (see the upsert sketch below).
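As a rough illustration of the Typer + python-dotenv combination, here is a minimal sketch of what the CLI wiring could look like; the structure and the command body are assumptions for illustration, not the scaffold's actual `main.py`:

```python
# Minimal sketch of a Typer CLI with python-dotenv (illustrative, not the scaffold's code).
import typer
from dotenv import load_dotenv

load_dotenv()  # reads .env so settings like DATABASE_URL land in os.environ

app = typer.Typer(help="AIDR scraper pipeline")

@app.command()
def scrape(start_year: int = 2005, end_year: int = 2025) -> None:
    """Run fetch -> normalize -> store for the given year range."""
    # The real command would call the fetch/normalize/storage modules here.
    typer.echo(f"Scraping events from {start_year} to {end_year}")

if __name__ == "__main__":
    app()
```

Typer derives the `--start-year`/`--end-year` flags from the parameter names and defaults, which is how the `scrape` invocation shown earlier gets its options.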
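The HTML stripping boils down to parsing each summary fragment and keeping only its text. A sketch of that idea, where `strip_html` and the sample fragment are made up for illustration:

```python
from bs4 import BeautifulSoup

def strip_html(fragment: str) -> str:
    """Return plain text from an HTML fragment, e.g. an API summary field."""
    # "html.parser" is the stdlib-backed parser, so no extra dependency is needed.
    return BeautifulSoup(fragment, "html.parser").get_text(separator=" ", strip=True)

print(strip_html("<p>Flooding in <b>Region X</b> &amp; nearby areas</p>"))
# -> "Flooding in Region X & nearby areas"
```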
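For the idempotent upsert, one plausible shape on the default SQLite backend is SQLAlchemy's dialect-level `ON CONFLICT DO UPDATE` support; the `events` table and its columns below are hypothetical stand-ins for the real schema in `migrations/`:

```python
# Sketch of an idempotent upsert on SQLite (hypothetical schema, not the real one).
from sqlalchemy import Column, MetaData, String, Table, create_engine
from sqlalchemy.dialects.sqlite import insert

metadata = MetaData()
events = Table(
    "events",
    metadata,
    Column("event_id", String, primary_key=True),  # natural key used for deduplication
    Column("title", String),
)

# In-memory DB for the demo; the scaffold defaults to sqlite:///data/aidr.db.
engine = create_engine("sqlite://")
metadata.create_all(engine)

def upsert_event(row: dict) -> None:
    # INSERT ... ON CONFLICT(event_id) DO UPDATE: re-running the scraper
    # updates the existing row instead of inserting a duplicate.
    stmt = insert(events).values(**row)
    stmt = stmt.on_conflict_do_update(
        index_elements=[events.c.event_id],
        set_={"title": stmt.excluded.title},
    )
    with engine.begin() as conn:
        conn.execute(stmt)

upsert_event({"event_id": "evt-1", "title": "Flooding in Region X"})
upsert_event({"event_id": "evt-1", "title": "Flooding in Region X (updated)"})  # no duplicate row
```

Keying the conflict on the event's natural identifier is what makes repeated `scrape` runs safe: the same year range can be fetched again without inflating the table.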