✈️ Weather-Flight Delay Monitor (`flight-wx`)

A data engineering pipeline that links U.S. domestic flight performance data with the corresponding hourly weather conditions and enriches each flight with aircraft metadata (manufacturer, model) from the FAA registry. The goal is to understand which aircraft, routes, or airports consistently operate in adverse weather and how that correlates with delay metrics.

✅ Problem Motivation

Airlines frequently experience delays due to weather — but which flights operate reliably even in bad weather? This project builds a clean, reproducible pipeline to ingest, enrich, and join:

✈️ Flight performance logs (U.S. BTS On-Time Reporting)
🌤️ Weather observations (NOAA ISD-Lite hourly measurements)
🛩️ Aircraft info via tail number (FAA Aircraft Registry)

The final dataset can be used to:

Visualize airport-level weather impact
Model delay risk per aircraft or route
Track performance under adverse meteorological conditions

🔁 Step 1: Ingest & Join

Run with either an IATA code or a free-text airport name:

python step1.py 2023 12 JFK
python step1.py 2023 12 "new york"

🔽 What it does

Flight Performance Data:
- Downloads monthly BTS zip (Reporting → fallback to Marketing)
- Extracts key fields: FL_DATE, DEP_DELAY, ARR_DELAY, ORIGIN, TAIL_NUM, etc.
Weather Data:
- Uses isd-history.csv to map airport → (USAF, WBAN)
- Downloads NOAA ISD-Lite gz files for all airports used in that month
- Flags "bad weather" hours using thresholds:
  - Wind speed ≥ 25 knots
  - Precipitation (mm) > 0
  - Cloud ceiling below 3000 ft
FAA Aircraft Metadata:
- Pulls tail number → Manufacturer / Model
- Uses FAA aircraft registry CSV export (via direct URL)
- Maps TAIL_NUM to MFR_NAME + MODEL_CODE
Join Everything:
- Merges flights × weather (on date/hour)
- Merges aircraft metadata using tail number
- Stores output to joined_sample_<IATA>_<YYYY>_<MM>.parquet
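
The "bad weather" thresholds listed above can be sketched in pandas as follows. This is a minimal illustration rather than the code in step1.py: the column names `wind_kt`, `precip_mm`, and `ceiling_ft` are assumptions, and the sketch presumes the readings have already been converted to knots, millimetres, and feet.

```python
import pandas as pd

# Thresholds taken from the list above.
WIND_KT_MIN = 25        # wind speed >= 25 knots
CEILING_FT_MAX = 3000   # cloud ceiling below 3000 ft


def flag_bad_weather(wx: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of wx with a 0/1 bad_wx_flag column.

    The input column names (wind_kt, precip_mm, ceiling_ft) are hypothetical.
    """
    out = wx.copy()
    windy = out["wind_kt"] >= WIND_KT_MIN
    wet = out["precip_mm"] > 0
    low_ceiling = out["ceiling_ft"] < CEILING_FT_MAX
    # Comparisons against NaN evaluate to False, so a missing reading
    # never triggers the flag on its own.
    out["bad_wx_flag"] = (windy | wet | low_ceiling).astype(int)
    return out


# Toy hourly observations (made-up values).
hours = pd.DataFrame(
    {
        "wind_kt": [10.0, 30.0, 5.0, None],
        "precip_mm": [0.0, 0.0, 1.2, 0.0],
        "ceiling_ft": [8000.0, 8000.0, 8000.0, 2500.0],
    }
)
flagged = flag_bad_weather(hours)
print(flagged["bad_wx_flag"].tolist())
```

Combining the three conditions with a logical OR means any single threshold is enough to mark the hour as bad; whether step1.py combines them the same way is not stated here.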

🧠 Features & Enhancements

✅ Dynamic IATA resolution via fuzzy match ("los angeles" → LAX)
✅ Auto-fallback from Reporting to Marketing BTS files
✅ Resilient ISD download: skips missing .gz without failure
✅ FAA tail registry fallback if download times out
✅ Clear download progress / count of stations fetched
✅ Select from top-k IATA matches interactively or via --pick
✅ Caches large lookups (FAA, airport-codes)
✅ Supports ICAO, IATA, and free-text
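
One way the IATA / ICAO / free-text resolution could work is sketched below using only the standard library. The `AIRPORTS` table, the `resolve_airport` helper, and the matching order (exact code, then substring, then `difflib` fuzzy match) are all assumptions made for illustration; the matching logic in step1.py may differ.

```python
import difflib

# Tiny hypothetical lookup; the real pipeline caches a full airport-codes table.
AIRPORTS = {
    "JFK": "new york john f kennedy",
    "LGA": "new york laguardia",
    "EWR": "newark liberty (new york area)",
    "LAX": "los angeles",
}


def resolve_airport(query: str, k: int = 3) -> list[str]:
    """Return up to k candidate IATA codes for a code or free-text query."""
    code = query.strip().upper()
    if code in AIRPORTS:                      # exact IATA, e.g. "JFK"
        return [code]
    # ICAO codes for airports in the contiguous U.S. are "K" + IATA.
    if len(code) == 4 and code[0] == "K" and code[1:] in AIRPORTS:
        return [code[1:]]

    text = query.strip().lower()
    # Prefer names that contain the query verbatim.
    hits = [iata for iata, name in AIRPORTS.items() if text in name]
    if hits:
        return hits[:k]
    # Otherwise fall back to approximate matching on the names.
    by_name = {name: iata for iata, name in AIRPORTS.items()}
    close = difflib.get_close_matches(text, list(by_name), n=k, cutoff=0.6)
    return [by_name[name] for name in close]


print(resolve_airport("KJFK"))
print(resolve_airport("new york"))
print(resolve_airport("los angelos"))  # misspelled on purpose
```

Returning a ranked list instead of a single code is what makes an interactive prompt or a `--pick` index possible when several airports match.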

Environment setup

conda env create -f environment.yml
conda activate flight-wx

Required packages:

pandas, requests, pyarrow, duckdb
(Optional: plotly, superset, spark for later stages)

📊 Example Output

After a successful run:

ARR_DELAY     False  True
bad_wx_flag
0            505093  49087
1             14356   1858

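The columns indicate whether a flight arrived more than 30 minutes late and the rows give bad_wx_flag, so the table counts on-time and delayed flights in good vs. bad weather. In this run, about 8.9% of good-weather flights (49,087 of 554,180) were delayed, compared with about 11.5% of bad-weather flights (1,858 of 16,214).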
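
A table of this shape is what `pandas.crosstab` produces when a boolean "delayed" series is tabulated against `bad_wx_flag`. The sketch below uses six made-up flights and assumes the delay indicator is `ARR_DELAY > 30`; it is an illustration, not the exact code that generated the output above.

```python
import pandas as pd

# Toy data (made-up values), three flights per weather group.
flights = pd.DataFrame(
    {
        "ARR_DELAY": [5.0, 45.0, -3.0, 120.0, 31.0, 12.0],
        "bad_wx_flag": [0, 0, 0, 1, 1, 1],
    }
)

# True when the flight arrived more than 30 minutes late.
delayed = flights["ARR_DELAY"] > 30

# Rows: weather flag. Columns: delayed False / True.
counts = pd.crosstab(flights["bad_wx_flag"], delayed)
print(counts)

# Share of delayed flights within each weather group.
rates = delayed.groupby(flights["bad_wx_flag"]).mean()
print(rates)
```

Comparing the per-group rates rather than the raw counts matters because good-weather hours are far more common than bad-weather hours.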

🏗️ Planned Extensions

🧩 Step 2: Real-time Ingestion

Integrate with FAA SWIM or FlightAware API for live flight data
Track near-real-time impact of weather

📈 Step 3: ML Modeling

Build classification models for delay likelihood
Use weather, airline, route, aircraft type as features
Output delay-risk scores per tail / route / carrier

📊 Step 4: Dashboard

Visualize which aircraft models fly most in bad weather
Heatmaps of airport-level weather impact
Tail-level reliability charts

⚙️ Setup Instructions

1. Clone + Create Conda Env

git clone https://github.com/Amaan165/flight-wx.git
cd flight-wx
conda env create -f environment.yml
conda activate flight-wx

2. Run First Ingest

You can run the ingestion for any airport and month in several flexible ways:

Using Exact IATA or ICAO Code

python step1.py 2023 12 JFK      # Standard IATA
python step1.py 2023 12 KJFK     # ICAO-style

Using Natural Language (fuzzy match)

python step1.py 2023 12 "new york"

If multiple matching airports are found (e.g. JFK, LGA, EWR), you'll be prompted to pick one interactively.

To skip the prompt and select a specific match automatically:

python step1.py 2023 12 "new york" --pick 2
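
The command-line shape shown above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions: the argument names, the `build_parser` helper, and the choice to treat `--pick` as a 1-based index are hypothetical and not confirmed by step1.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring `step1.py <year> <month> <airport> [--pick N]`."""
    parser = argparse.ArgumentParser(prog="step1.py")
    parser.add_argument("year", type=int, help="four-digit year, e.g. 2023")
    parser.add_argument(
        "month", type=int, choices=range(1, 13), metavar="month",
        help="month number, 1-12",
    )
    parser.add_argument("airport", help="IATA code, ICAO code, or free text")
    parser.add_argument(
        "--pick", type=int, default=None,
        help="index of the match to use instead of prompting (assumed 1-based)",
    )
    return parser


args = build_parser().parse_args(["2023", "12", "new york", "--pick", "2"])

# Candidate list as it might come back from the fuzzy match (illustrative).
candidates = ["JFK", "LGA", "EWR"]
chosen = candidates[args.pick - 1] if args.pick is not None else None
print(args.year, args.month, args.airport, chosen)
```

When `--pick` is omitted, `chosen` stays `None`, which is where an interactive prompt would take over.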

The script will:

Download BTS flight data for that month
Resolve airports dynamically from input
Fetch ISD-Lite weather logs for all departure airports in the month
Join flights + weather + tail-number metadata
Output to: files/joined_sample_<IATA>_<YYYY>_<MM>.parquet
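
The join step can be pictured as two left merges in pandas. The sketch below is illustrative only: the tail numbers and values are made up, the `hour` column and join keys are assumptions, and it ignores the fact that BTS departure times are local while NOAA ISD observations are in UTC, which a real join has to reconcile.

```python
import pandas as pd

flights = pd.DataFrame(
    {
        "FL_DATE": ["2023-12-01", "2023-12-01"],
        "ORIGIN": ["JFK", "JFK"],
        "DEP_TIME": ["0915", "1740"],        # local HHMM
        "TAIL_NUM": ["N123AA", "N456BB"],    # made-up tail numbers
    }
)
weather = pd.DataFrame(
    {
        "ORIGIN": ["JFK", "JFK"],
        "FL_DATE": ["2023-12-01", "2023-12-01"],
        "hour": [9, 17],
        "bad_wx_flag": [0, 1],
    }
)
aircraft = pd.DataFrame({"TAIL_NUM": ["N123AA"], "mfr_name": ["BOEING"]})

# Derive the departure hour from HHMM so it lines up with hourly weather.
flights["hour"] = flights["DEP_TIME"].str[:2].astype(int)

joined = (
    flights
    # Left joins keep every flight even if weather or registry data is missing.
    .merge(weather, on=["ORIGIN", "FL_DATE", "hour"], how="left")
    .merge(aircraft, on="TAIL_NUM", how="left")
)
print(joined)
```

Using left joins means a flight with an unknown tail number still appears in the output, with an empty `mfr_name`.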

3. Data Sources

✈️ BTS On-Time Performance — monthly flight logs
🌦️ NOAA ISD-Lite — hourly station weather
🛩️ FAA Registry — N-Number → Manufacturer, Model
🌍 OpenFlights Airport Metadata — location info
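
For orientation, an ISD-Lite record is a whitespace-separated line of integers: year, month, day, hour, then air temperature, dew point, sea-level pressure, wind direction, wind speed, sky-cover code, and 1-hour and 6-hour precipitation, with -9999 marking a missing value and most measurements scaled by 10. The parser below is a minimal sketch based on that layout; the function name, the output keys, and the sample line are made up for illustration, and the field positions should be checked against NOAA's format document before relying on them.

```python
MISSING = -9999
MS_TO_KNOTS = 1.943844  # metres per second to knots


def parse_isd_lite_line(line: str) -> dict:
    """Parse one ISD-Lite record into a small dict (hypothetical helper)."""
    f = [int(tok) for tok in line.split()]

    def scaled(value: int, factor: int = 10):
        # Stored values are integers scaled by 10; -9999 means missing.
        return None if value == MISSING else value / factor

    wind_ms = scaled(f[8])
    return {
        "year": f[0],
        "month": f[1],
        "day": f[2],
        "hour": f[3],
        "temp_c": scaled(f[4]),
        # Convert to knots so the 25-knot threshold can be applied directly.
        "wind_kt": None if wind_ms is None else wind_ms * MS_TO_KNOTS,
        "precip_mm": scaled(f[10]),
    }


# Made-up sample record: 7.2 C, wind 15.0 m/s, precipitation missing.
sample = "2023 12 01 00    72    33 10211   290   150     0 -9999 -9999"
row = parse_isd_lite_line(sample)
print(row)
```

Keeping missing values as `None` instead of coercing them to zero avoids silently treating an absent precipitation reading as a dry hour.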

4. Output Schema

The final joined dataset includes:

| Column | Description |
|---|---|
| `FL_DATE` | Flight date |
| `ORIGIN` | Origin airport IATA |
| `DEP_DELAY` | Departure delay (min) |
| `ARR_DELAY` | Arrival delay (min) |
| `DEP_TIME` | Actual departure (local HHMM) |
| `TAIL_NUM` | Aircraft tail number (N-code) |
| `mfr_name` | Manufacturer (Boeing, Airbus, etc.) |
| `wx_score` | Computed weather severity score |
| `bad_wx_flag` | 1 if weather was "bad" at departure |

🛠️ Future Goals

Add unit tests for weather scoring
Add DuckDB dashboard preview
Parallelize station downloads across CPUs
Add step2.py for real-time ingestion
Integrate with Airflow or Dagster pipeline
