Automated Data Collection: A case study with weather data and GitHub Actions

Millions of Americans rely on weather alerts to keep them informed and safe, and part of that reliability is the knowledge that alerts are sent 24/7, whether people are awake or not.

Recently, in collaboration with the Texas Disaster Information System (TDIS), January Advisors created an MVP for an automated weather alert map for the state of Texas, showing each county and its active weather alerts at any given time. The map runs 24 hours a day and works much like any official alerting service.

The map evolved based on user requirements, and it can now be found on the newly launched TDIS portal site. However, this was a useful case study in how and why automated data collection works, so let’s walk through the process of how we made that happen. 

How does weather data work?

In the US, there are 122 National Weather Service (NWS) forecast offices. Each office is responsible for issuing forecasts and warnings for a defined area of the country or its territories. Here’s a map showing the current offices.

Map of the 122 National Weather Service forecast office territories.

Although it’s hard to determine how many weather alerts are sent out each day, a project from weather.com was able to scrape weather alerts over a 10-year period, retrieving about 300,000 alerts in that time. Just think about the sheer volume of information that represents.

When the weather gets rough, NWS forecasters issue official alerts at three levels of severity: warnings, watches, and advisories.

  • Warning – Immediate danger; severe weather is happening or imminent. Example: tornado on the ground, flash flooding already occurring.
  • Watch – Conditions are favorable for dangerous weather to develop, but it’s not happening yet. Example: severe thunderstorm watch, winter storm watch.
  • Advisory – Less severe but still potentially disruptive weather. Example: heat advisory, dense fog advisory.

Once an office issues an alert, the message is distributed through multiple channels in order to reach as many people as possible. The most widely used channel is Wireless Emergency Alerts (WEA) – the loud notifications that pop up on your phone. These are sent to smartphones via nearby cell towers and are geo-targeted, ensuring only people in the affected area receive them.

How does automation play into this process?

Alerts also appear through popular online sources and APIs. This is how we were able to create the weather alerts map with TDIS.

First, we tapped into the National Weather Service API. This rich data source provides real-time alerts for all 3,000+ counties across the U.S. Since our work centered on Texas, we narrowed the API call to the state’s 254 counties.

Each alert is tied to a specific county, so we used the county information to geocode the alerts and display them directly on the Texas map. We also included other helpful weather data, like humidity and chance of precipitation, which the same API provides. It was easy to add, and it gives viewers more context with minimal extra effort.
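As a rough sketch of that matching step (our production code differs, and the sample alert below is trimmed to just the relevant fields), each alert’s properties include a geocode block with SAME codes, which embed the state and county FIPS codes:

```python
# A trimmed example of one alert feature as returned by the NWS API.
sample_alert = {
    "properties": {
        "event": "Flash Flood Warning",
        # SAME code = "0" + state FIPS ("48" = Texas) + county FIPS ("201" = Harris)
        "geocode": {"SAME": ["048201"]},
    }
}

def county_fips_for_alert(alert):
    """Return the 5-digit state+county FIPS codes an alert applies to."""
    same_codes = alert["properties"].get("geocode", {}).get("SAME", [])
    return {code[1:] for code in same_codes}  # drop the leading "0"

print(county_fips_for_alert(sample_alert))  # {'48201'}
```

With FIPS codes in hand, joining alerts to a county boundary file keyed on FIPS becomes a straightforward lookup.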

The first step was to create a script that could fetch the data from the API. Luckily, the NWS maintains an open API, so all we needed to do was send our request along with a valid email address.
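A minimal version of that fetch script, assuming the requests library and using an illustrative contact address and output path, might look like this:

```python
import json
import os

import requests

# The NWS asks callers to identify themselves in the User-Agent header,
# ideally with a contact email – no API key is required.
HEADERS = {"User-Agent": "tx-weather-alerts-map (you@example.com)"}

# Active alerts, filtered to Texas with the "area" parameter.
URL = "https://api.weather.gov/alerts/active?area=TX"

response = requests.get(URL, headers=HEADERS, timeout=30)
response.raise_for_status()

alerts = response.json()["features"]
print(f"Fetched {len(alerts)} active alerts for Texas")

# Save a snapshot for the map to read (example path).
os.makedirs("data", exist_ok=True)
with open("data/tx_alerts.json", "w") as f:
    json.dump(alerts, f)
```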

To keep the data fresh, we automated the script to run every hour through GitHub Actions.

GitHub Actions runs several automated data collection tasks.

GitHub Actions lets us run code automatically on GitHub’s hosted infrastructure, a bit like other, more heavyweight deployment platforms. Most weather alerts expire within 24 hours unless renewed, so the hourly update ensures our map stays current without needing manual intervention.

To set up a GitHub Action, you start by creating a .yml (YAML) file that defines what the workflow should do. Inside this file, you organize your workflow into a series of steps, and each step has a descriptive name. These steps tell GitHub what commands to run – such as executing Python scripts or other code files in your project – much like running commands directly in a terminal. By structuring your workflow this way, you can track the progress of each Action as it’s happening, and even incorporate your own quality assurance checks to make sure everything runs smoothly and consistently.
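For illustration, a stripped-down workflow along these lines (the file name, script name, and commit step are examples, not our exact production setup) could look like:

```yaml
# .github/workflows/fetch-alerts.yml (illustrative name)
name: Fetch NWS alerts

on:
  schedule:
    - cron: "0 * * * *"  # top of every hour, UTC
  workflow_dispatch:     # allow manual runs for debugging

jobs:
  fetch:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install requests

      - name: Fetch the latest Texas alerts
        run: python fetch_alerts.py  # hypothetical script name

      - name: Commit the refreshed data
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data/
          git commit -m "Update alerts data" || echo "No changes to commit"
          git push
```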

Challenges and Improvements

While automation can save time and reduce manual effort, it comes with its own set of challenges. Automated data collection systems are not immune to errors, and when something goes wrong, the failure can cascade.

One of the most common issues when using platforms like GitHub Actions is failed jobs. These can happen for any number of reasons: an expired API key, missing environment variables, changes in a data source’s format, or simply a network timeout. A single failed job can stop the entire workflow, leaving gaps in your dataset or preventing your analysis from running altogether.

Even though automation implies “set it and forget it,” most automated systems still require regular check-ins to ensure that dependencies are up to date and that data pipelines haven’t broken. Authentication methods change, new rate limits appear, data formats shift – and any of these changes can cause scripts that once worked to start returning errors or incomplete data.

There’s also the issue of data reliability. Automated systems collect data continuously, but without oversight, they can end up storing duplicates, corrupted entries, or data that doesn’t meet quality standards. Over time, these small issues compound, affecting the accuracy of any analysis built on top of them.

Lastly, when data collection is entirely handled by scripts and cloud-based workflows, an extra layer of opacity is introduced, and it can become harder to pinpoint where a problem occurred. Proper logging and error handling are critical, and even then, debugging an automated process can be time-consuming.
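As a small sketch of that logging point (reusing the hypothetical fetch from earlier), explicit logging plus a non-zero exit code makes failures show up in the workflow dashboard instead of passing silently:

```python
import logging
import sys

import requests

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

URL = "https://api.weather.gov/alerts/active?area=TX"
HEADERS = {"User-Agent": "tx-weather-alerts-map (you@example.com)"}  # illustrative contact

try:
    response = requests.get(URL, headers=HEADERS, timeout=30)
    response.raise_for_status()
except requests.RequestException as exc:
    # Exiting non-zero marks the GitHub Actions job as failed, so the
    # error surfaces in the workflow dashboard instead of quietly
    # producing an empty or stale dataset.
    logging.error("Fetch failed: %s", exc)
    sys.exit(1)

alerts = response.json()["features"]
if not alerts:
    logging.warning("Zero active alerts returned – verify this is real, not an API issue.")
logging.info("Fetched %d active alerts", len(alerts))
```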

Juweek Adolphe
