Project Overview
We approached this project from the perspective of a farmer deciding which land to buy for crop production. A farmer faces uncertainty because weather conditions vary each year, and some locations perform better than others even under similar conditions. Our goal was to use historical weather and yield data to identify counties that consistently perform well relative to weather expectations, which helps inform better land investment and risk management decisions.
Ideation Process
We started by asking a practical question: if you were a farmer choosing where to buy land, which counties would offer the most reliable yield given historical weather patterns? We considered predicting yield directly, but realized that identifying locations that outperform expectations provides more actionable insight. This led us to compare the yield expected from weather with the yield actually observed, so we could find strong and weak performers.
Development Process
We collected historical crop yield data from the USDA Risk Management Agency and weather observations from the NOAA Global Historical Climatology Network dataset. We cleaned the datasets, filtered relevant years, and standardized fields. We then mapped counties to nearby weather stations using geographic coordinates so that each county could be associated with local weather conditions. After aggregating temperature and precipitation by month and year, we built a linear regression model to estimate expected yield based on weather variables. We compared predicted values with actual yields and calculated residual differences to identify counties that consistently overperformed or underperformed relative to weather conditions. We visualized these results using scatter plots and anomaly tables to highlight patterns.
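The residual idea can be illustrated with a minimal sketch. It is not the project's notebook code: the records, county names, and helper functions (`fit_ols`, `predict`, `solve_linear_system`) are hypothetical, the numbers are invented for illustration, and a single yearly temperature and precipitation value stands in for the monthly aggregates. It fits an ordinary least squares model of yield on weather, then averages each county's residual (actual minus predicted).

```python
from collections import defaultdict
from statistics import mean

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(features, targets):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefs, feature_row):
    """Expected yield for one row of weather features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))

# Invented records: (county, year, mean temperature, total precipitation, yield)
records = [
    ("County A", 2018, 20.0, 300.0, 145.0),
    ("County A", 2019, 22.0, 250.0, 146.0),
    ("County A", 2020, 24.0, 350.0, 162.0),
    ("County A", 2021, 21.0, 400.0, 158.0),
    ("County B", 2018, 20.0, 300.0, 135.0),
    ("County B", 2019, 22.0, 250.0, 136.0),
    ("County B", 2020, 24.0, 350.0, 152.0),
    ("County B", 2021, 21.0, 400.0, 148.0),
]

features = [(temp, precip) for _, _, temp, precip, _ in records]
targets = [yld for *_, yld in records]
coefs = fit_ols(features, targets)

# Residual = actual yield minus the yield expected from weather alone.
residuals_by_county = defaultdict(list)
for county, _, temp, precip, yld in records:
    residuals_by_county[county].append(yld - predict(coefs, (temp, precip)))

# A consistently positive mean residual marks an overperforming county.
mean_residual = {c: mean(r) for c, r in residuals_by_county.items()}
ranked = sorted(mean_residual, key=mean_residual.get, reverse=True)
```

In this toy data both counties see identical weather, so the model attributes the gap between them entirely to the residual, and `ranked` lists the overperforming county first. With real data the residual also absorbs noise, so the year-to-year consistency of the sign matters as much as the mean.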
How We Used Databricks
We used Databricks to ingest large datasets, perform data cleaning, and join multiple sources efficiently using SQL and notebooks. Unity Catalog tables allowed us to organize intermediate and final datasets. Databricks enabled fast iteration when building feature tables, running analysis, and generating visualizations.
What Was Great About Databricks
Databricks handled large datasets smoothly and made complex joins manageable. The notebook workflow supported quick experimentation and debugging. Integrated SQL tools simplified data transformation and aggregation steps.
What Was Challenging or Frustrating
We encountered challenges with permissions, schema loading errors, and geographic mapping between counties and weather stations. Understanding table locations and catalog structure required additional learning. Debugging joins and ensuring correct mappings also took time.
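One way to make the county-to-station mapping easier to verify is to record the distance to the chosen station and flag matches that are suspiciously far away. The sketch below assumes each county is represented by a single latitude and longitude (for example a centroid) and uses the haversine great-circle distance. The coordinates, identifiers, and the 50 km threshold are hypothetical, and this is not necessarily how our notebooks did the matching.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearest_station(county, stations):
    """Return (station_id, distance_km) for the closest station."""
    distances = [
        (haversine_km(county["lat"], county["lon"], s["lat"], s["lon"]), s["id"])
        for s in stations
    ]
    distance_km, station_id = min(distances)
    return station_id, distance_km

# Hypothetical coordinates and identifiers, for illustration only.
counties = [
    {"name": "County A", "lat": 41.0, "lon": -93.0},
    {"name": "County B", "lat": 45.0, "lon": -100.0},
]
stations = [
    {"id": "STATION_1", "lat": 41.1, "lon": -93.0},
    {"id": "STATION_2", "lat": 43.0, "lon": -95.0},
]

MAX_DISTANCE_KM = 50.0  # assumed threshold; tune to station density

mapping = {c["name"]: nearest_station(c, stations) for c in counties}

# Counties whose nearest station is too far to represent local weather.
flagged = [name for name, (_, dist) in mapping.items() if dist > MAX_DISTANCE_KM]
```

Reviewing the `flagged` list before joining weather to yield catches counties whose "nearest" station is too distant to be meaningful, which is cheaper than discovering the problem later as an unexplained residual.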
External Tools and Resources
We used NOAA Global Historical Climatology Network data for weather observations and USDA Risk Management Agency data for crop yield. We used linear regression methods for modeling and Databricks notebooks and SQL for analysis.
Conclusion
From a farmer’s perspective, this project provides a data-driven way to evaluate land performance under different weather conditions. We found that temperature has a stronger relationship with yield variability than precipitation, and that some counties consistently outperform expectations, which suggests local advantages such as irrigation or soil quality. These insights help guide land purchasing decisions and risk planning. In the future, we would extend this approach by adding soil data, humidity, and forecasting models to support earlier decision making.
Built With
- data