Inspiration

The challenge asked us to find the most compelling "spurious" correlation between unrelated datasets. Anyone can find a funny correlation, we thought, so what if we could turn one into something that looks like a legitimate quant trading signal? We wanted to blur the line between absurd data alchemy and real financial methodology.

What it does

We systematically scan 15+ public datasets using z-score product analysis to find hidden three-way correlations. Our best discovery:

$$S(t) = Z_{\text{coal_max}}(t) \cdot Z_{\text{uk_crash_casualties}}(t)$$

This signal predicts UK humidity with r = 0.99, which we then link to agricultural commodity prices (r = 0.75) through humidity's documented effect on crop yields. The result is a complete pipeline from coal mines and car crashes to wheat futures.
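A minimal sketch of how a z-score product signal like S(t) can be computed and correlated with a third series. It uses only the Python standard library; the function names are hypothetical, and the choice of population standard deviation is an assumption (the correlation is unaffected by that choice, since it only rescales the signal by a constant).

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardise a series to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series has no z-score")
    return [(v - mu) / sigma for v in values]


def zscore_product(a, b):
    """Element-wise product of the z-scores of two equally long series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return [x * y for x, y in zip(zscore(a), zscore(b))]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
```

With annual series `coal_max`, `crash_casualties`, and `humidity` aligned on the same years (placeholder names, not the project's actual variables), the signal would be `zscore_product(coal_max, crash_casualties)` and its fit to humidity `pearson_r(signal, humidity)`.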

How we built it

  1. Collected 15+ public datasets (coal production, UK road accidents, weather, asylum data, chess games, flight data, and more)
  2. Aggregated everything to annual frequency and ran an exhaustive pairwise correlation scan
  3. Introduced z-score products, multiplying normalised variables from different datasets and correlating the result with a third
  4. Slid 6-year windows across all combinations to find the strongest signals
  5. Validated the winning signal against agricultural commodity futures (DBA, wheat, corn, sugar, coffee all r > 0.70)
  6. Wrote it up as a proper research paper with methodology, formulas, and figures

Built with Python, pandas, matplotlib, and yfinance.
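The scan in steps 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it uses the standard library rather than pandas, assumes each dataset is a `{year: value}` dict, computes z-scores inside each window, and the name `scan_triples` is hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore(values):
    """Return z-scores, or None if the series is constant."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return None
    return [(v - mu) / sigma for v in values]


def pearson_r(x, y):
    """Return Pearson r, or None if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def scan_triples(datasets, window=6):
    """Correlate z(a) * z(b) with every third series over each sliding window.

    datasets maps a name to a {year: value} dict.
    Returns (r, a, b, target, start_year) tuples, strongest |r| first.
    """
    hits = []
    names = sorted(datasets)
    for a, b in combinations(names, 2):
        for target in names:
            if target in (a, b):
                continue
            shared = sorted(set(datasets[a]) & set(datasets[b]) & set(datasets[target]))
            for i in range(len(shared) - window + 1):
                years = shared[i:i + window]
                if years[-1] - years[0] != window - 1:
                    continue  # skip windows with missing years
                za = zscore([datasets[a][y] for y in years])
                zb = zscore([datasets[b][y] for y in years])
                if za is None or zb is None:
                    continue
                signal = [p * q for p, q in zip(za, zb)]
                r = pearson_r(signal, [datasets[target][y] for y in years])
                if r is not None:
                    hits.append((r, a, b, target, years[0]))
    hits.sort(key=lambda h: abs(h[0]), reverse=True)
    return hits
```

The number of candidates grows quickly: with N datasets there are N(N-1)/2 pairs, each tested against N-2 targets and every available window, which is what makes a near-perfect hit so likely.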

Challenges we ran into

  • Data overlap. Most datasets covered different year ranges, leaving only 4-6 years of overlap: enough to find correlations, but too few for them to be statistically robust.
  • GitHub file limits. Several datasets exceeded 100MB, requiring us to gitignore large files and restructure the repo multiple times.
  • The Y4 problem. Our signal and humidity tracked perfectly, but the agriculture ETF diverged in one year due to the 2012 US drought, a supply shock from the wrong continent.
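To put a number on how fragile a correlation over so few years is, here is a small sketch of a Fisher z-transform confidence interval for Pearson's r. It is a textbook approximation that assumes roughly bivariate-normal data and independent observations, neither of which is guaranteed for annual time series; the function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r from n paired points.

    Uses the Fisher z-transform: atanh(r) is treated as roughly normal
    with standard error 1 / sqrt(n - 3). z_crit=1.96 gives about 95%.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)
```

For example, `fisher_ci(0.75, 6)` returns an interval that runs from slightly below zero to about 0.97, so under these assumptions an r of 0.75 over six annual points is compatible with no relationship at all.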

Accomplishments that we're proud of

  • Achieving r = 0.99 between the product of two completely unrelated variables (coal × car crashes) and a physical quantity (humidity)
  • Building a complete chain from absurd inputs through a real physical mechanism to a tradeable asset
  • The signal correlates with six different agricultural commodities above r = 0.70, not just one lucky match
  • Presenting it with genuine quant methodology (z-scores, sliding windows, robustness checks) that makes it look disturbingly credible

What we learned

  • With enough datasets and flexible windowing, you can find near-perfect correlations between literally anything
  • Z-score products are a dangerously powerful tool for generating false positives: the combinatorial explosion guarantees you'll find something
  • A plausible narrative (coal -> emissions -> humidity -> crops) makes a "spurious" correlation far more convincing, which is exactly why data mining is so valuable and dangerous in real quant finance
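The first two lessons can be illustrated with pure noise. The sketch below generates independent Gaussian series (no real data, and no relationship between any of them), forms every z-score product, and reports the strongest correlation found against a third series. The sizes of 15 series and 6 points mirror the numbers above; the function name and seed are arbitrary choices for the example.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def zscore(values):
    """Standardise a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_r(n_series=15, n_years=6, seed=0):
    """Return (largest |r| found, number of triples tested) on random noise."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_years)] for _ in range(n_series)]
    z = [zscore(s) for s in series]
    best, tested = 0.0, 0
    for i, j in combinations(range(n_series), 2):
        signal = [p * q for p, q in zip(z[i], z[j])]
        for k in range(n_series):
            if k in (i, j):
                continue
            tested += 1
            best = max(best, abs(pearson_r(signal, series[k])))
    return best, tested


if __name__ == "__main__":
    best, tested = best_spurious_r()
    print(f"strongest |r| among {tested} noise triples: {best:.3f}")
```

Even a single window gives 105 pairs times 13 targets, or 1,365 triples; sliding windows over longer series multiply that further.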

What's next for Deanos

  • Test out-of-sample with updated coal and crash data (2021+) to see if the signal holds or collapses
  • Try sub-annual trading using monthly crash data with coal held constant
  • Explore whether the methodology generalises: can z-score products systematically uncover hidden climate proxies from non-climate data?
