Inspiration
The challenge asked us to find the most compelling "spurious" correlation between unrelated datasets. Anyone can find a funny correlation, we thought, so what if we could turn one into something that looks like a legitimate quant trading signal? We wanted to blur the line between absurd data alchemy and real financial methodology.
What it does
We systematically scan 15+ public datasets using z-score product analysis to find hidden three-way correlations. Our best discovery:
$$S(t) = Z_{\text{coal_max}}(t) \cdot Z_{\text{uk_crash_casualties}}(t)$$
This signal predicts UK humidity with r = 0.99, which we then link to agricultural commodity prices (r = 0.75) through humidity's documented effect on crop yields. The result is a complete pipeline from coal mines and car crashes to wheat futures.
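As a minimal sketch of how a signal of this form can be computed (not the project's actual code), the snippet below standardises two annual series, multiplies them year by year, and correlates the product with a third series. The function names and the choice of population standard deviation are assumptions, and the example numbers are made up purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore(values):
    """Standardise a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series has no z-score")
    return [(v - mu) / sigma for v in values]


def zscore_product(a, b):
    """S(t) = Z_a(t) * Z_b(t), computed year by year."""
    if len(a) != len(b):
        raise ValueError("series must cover the same years")
    return [x * y for x, y in zip(zscore(a), zscore(b))]


def pearson_r(x, y):
    """Plain Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    var_x = sum((p - mx) ** 2 for p in x)
    var_y = sum((q - my) ** 2 for q in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical six-year inputs (NOT the project's data).
coal_max = [41.0, 38.5, 36.0, 30.2, 25.1, 22.4]
crash_casualties = [203.0, 195.7, 183.7, 194.5, 186.2, 181.4]
humidity = [80.1, 79.4, 78.8, 79.9, 79.0, 80.3]

signal = zscore_product(coal_max, crash_casualties)
print(f"r(signal, humidity) = {pearson_r(signal, humidity):.2f}")
```

Because both inputs are standardised first, the product is large and positive whenever the two series sit on the same side of their means in a given year, which is what lets it track a third variable that neither input tracks alone.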
How we built it
- Collected 15+ public datasets (coal production, UK road accidents, weather, asylum data, chess games, flight data, and more)
- Aggregated everything to annual frequency and ran an exhaustive pairwise correlation scan
- Introduced z-score products, multiplying normalised variables from different datasets and correlating the result with a third
- Slid 6-year windows across all combinations to find the strongest signals
- Validated the winning signal against agricultural commodity futures (DBA, wheat, corn, sugar, coffee all r > 0.70)
- Wrote it up as a proper research paper with methodology, formulas, and figures
Built with Python, pandas, matplotlib, and yfinance.
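The scan described above can be sketched roughly as follows. This is an illustrative, standard-library reconstruction under stated assumptions, not the code we shipped: the `Hit` tuple, the `scan_triples` name, the dict-of-dicts input layout, and the choice to recompute z-scores inside each window are all hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev
from typing import NamedTuple


class Hit(NamedTuple):
    r: float          # Pearson r between the z-score product and the target
    a: str            # first input dataset
    b: str            # second input dataset
    target: str       # dataset the product is compared against
    start_year: int   # first year of the window


def zscore(values):
    mu, sigma = mean(values), pstdev(values)
    return None if sigma == 0 else [(v - mu) / sigma for v in values]


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    denom = sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))
    return None if denom == 0 else cov / denom


def scan_triples(datasets, window=6):
    """datasets maps name -> {year: value}. Returns hits sorted by |r|, strongest first."""
    hits = []
    names = sorted(datasets)
    for a, b in combinations(names, 2):
        for target in names:
            if target in (a, b):
                continue
            # Only years present in all three datasets can be compared.
            years = sorted(set(datasets[a]) & set(datasets[b]) & set(datasets[target]))
            for i in range(len(years) - window + 1):
                span = years[i:i + window]
                if span[-1] - span[0] != window - 1:
                    continue  # skip windows containing a gap year
                za = zscore([datasets[a][y] for y in span])
                zb = zscore([datasets[b][y] for y in span])
                if za is None or zb is None:
                    continue  # constant series inside this window
                signal = [p * q for p, q in zip(za, zb)]
                r = pearson_r(signal, [datasets[target][y] for y in span])
                if r is not None:
                    hits.append(Hit(r, a, b, target, span[0]))
    return sorted(hits, key=lambda h: abs(h.r), reverse=True)
```

Calling `scan_triples(datasets)[:10]` would list the ten strongest candidate signals. Note that the number of tests grows with the number of dataset pairs times the number of targets times the number of windows, which is the source of the false-positive risk discussed later.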
Challenges we ran into
- Data overlap. Most datasets covered different year ranges, with only 4-6 years of overlap: enough to find correlations, but too few for them to be statistically robust.
- GitHub file limits. Several datasets exceeded 100MB, requiring us to gitignore large files and restructure the repo multiple times.
- The Y4 problem. Our signal and humidity tracked perfectly, but the agriculture ETF diverged in one year because of the 2012 US drought, a supply shock from the wrong continent.
Accomplishments that we're proud of
- Achieving r = 0.99 between the product of two completely unrelated variables (coal × car crashes) and a physical quantity (humidity)
- Building a complete chain from absurd inputs through a real physical mechanism to a tradeable asset
- The signal correlates with six different agricultural commodities above r = 0.70, not just one lucky match
- Presenting it with genuine quant methodology (z-scores, sliding windows, robustness checks) that makes it look disturbingly credible
What we learned
- With enough datasets and flexible windowing, you can find near-perfect correlations between literally anything
- Z-score products are a dangerously powerful tool for generating false positives: the combinatorial explosion guarantees you'll find something
- A plausible narrative (coal -> emissions -> humidity -> crops) makes a "spurious" correlation far more convincing, which is exactly why data mining is so valuable and dangerous in real quant finance
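The false-positive point is easy to reproduce with pure noise. The sketch below is an illustration under assumed parameters (15 random Gaussian series, one fixed 6-year window, no sliding), not a rerun of our pipeline: it forms every z-score product of two series, correlates it with every remaining series, and reports the largest |r| found. The `max_spurious_r` name and its defaults are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def zscore(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    denom = sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))
    return 0.0 if denom == 0 else cov / denom


def max_spurious_r(n_series=15, n_years=6, seed=0):
    """Return (largest |r|, number of tests) over all z-score-product triples of pure noise."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_years)] for _ in range(n_series)]
    z = [zscore(s) for s in series]
    best, n_tests = 0.0, 0
    for i, j in combinations(range(n_series), 2):
        signal = [p * q for p, q in zip(z[i], z[j])]
        for k in range(n_series):
            if k in (i, j):
                continue  # the target must be a third, independent series
            best = max(best, abs(pearson_r(signal, series[k])))
            n_tests += 1
    return best, n_tests


best, n_tests = max_spurious_r()
print(f"strongest |r| across {n_tests} noise-only tests: {best:.3f}")
```

With the defaults this runs 105 pairs × 13 targets = 1,365 tests on six data points each, so a very high best-case correlation is expected even though every series is independent random noise. Adding sliding windows multiplies the number of tests further.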
What's next for Deanos
- Test out-of-sample with updated coal and crash data (2021+) to see if the signal holds or collapses
- Try sub-annual trading using monthly crash data with coal held constant
- Explore whether the methodology generalises: can z-score products systematically uncover hidden climate proxies from non-climate data?