Inspiration
LLMs are increasingly used in longevity research to interpret biomarker data and assist clinical decision-making. But no benchmark existed to measure whether these models actually understand epigenetic clock biology, a core pillar of aging science. Models produce fluent, confident responses while fundamentally misinterpreting clock values, conflating Horvath with GrimAge, or missing critical discordance patterns that carry real clinical significance. You can't improve what you can't measure. We built the measurement tool.
What it does
Epigenetic Clock Reasoning Bench (ECRB) is a three-stage pipeline:
Stage 1 - Simulate: A Mesa agent-based simulation models 900 cells aging over 200 months. Each cell tracks DNA methylation at 5 CpG sites calibrated against real GEO data (GSE40279, 656 blood samples), along with telomere length, oxidative damage, and senescence state. Senescent cells emit SASP (senescence-associated secretory phenotype) inflammatory signals to their neighbors, creating biologically realistic aging cascades.
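A minimal sketch of the per-cell update described above. It is written in plain Python rather than against the Mesa API so that it stays self-contained, and it is not the ECRB implementation. The 30 x 30 grid layout, every rate and threshold constant, and the names `Cell`, `neighbors`, and `simulate` are illustrative assumptions; the calibrated values are not reproduced here.

```python
import random
from dataclasses import dataclass, field

# Illustrative constants only (assumptions, not ECRB's calibrated values).
N_SITES = 5                # CpG sites tracked per cell (from the write-up)
DRIFT_PER_MONTH = 0.0005   # assumed mean methylation drift per site per month
TELOMERE_LOSS = 0.004      # assumed mean telomere loss per month
DAMAGE_GAIN = 0.003        # assumed mean oxidative damage gained per month
SASP_BOOST = 0.002         # assumed extra damage per senescent neighbor
SENESCENCE_DAMAGE = 0.8    # assumed damage level that triggers senescence
SENESCENCE_TELOMERE = 0.3  # assumed telomere length that triggers senescence


@dataclass
class Cell:
    methylation: list = field(default_factory=lambda: [0.2] * N_SITES)
    telomere: float = 1.0
    damage: float = 0.0
    senescent: bool = False

    def step(self, senescent_neighbors, rng):
        # Methylation drifts upward with noise and is clamped to [0, 1].
        self.methylation = [
            min(1.0, max(0.0, m + rng.gauss(DRIFT_PER_MONTH, DRIFT_PER_MONTH)))
            for m in self.methylation
        ]
        self.telomere = max(0.0, self.telomere - TELOMERE_LOSS * 2 * rng.random())
        # SASP exposure from senescent neighbors adds to baseline damage.
        self.damage += DAMAGE_GAIN * 2 * rng.random() + SASP_BOOST * senescent_neighbors
        # Senescence is treated as irreversible in this sketch.
        if self.damage >= SENESCENCE_DAMAGE or self.telomere <= SENESCENCE_TELOMERE:
            self.senescent = True


def neighbors(i, width, height):
    """Yield indices of the up-to-8 grid neighbors of cell i (no wraparound)."""
    x, y = i % width, i // width
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if (dx or dy) and 0 <= nx < width and 0 <= ny < height:
                yield ny * width + nx


def simulate(width=30, height=30, months=200, seed=0):
    """Run the toy model; return the cells and the monthly senescent count."""
    rng = random.Random(seed)
    cells = [Cell() for _ in range(width * height)]
    history = []
    for _ in range(months):
        # Use last month's senescence states so update order does not matter.
        snapshot = [c.senescent for c in cells]
        for i, cell in enumerate(cells):
            exposed = sum(snapshot[j] for j in neighbors(i, width, height))
            cell.step(exposed, rng)
        history.append(sum(c.senescent for c in cells))
    return cells, history
```

Calling `simulate()` with the defaults runs 900 cells for 200 months. The returned `history` list is the kind of trajectory that a vignette generator could summarize.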
Stage 2 - Generate: Simulation outputs become clinical vignettes via the Claude API. Ground truth labels are computed algorithmically from simulation parameters, so the benchmark needs no hand annotation and can scale to as many scenarios as the simulation can produce.
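To illustrate what "computed algorithmically" can look like, here is a hedged sketch that derives labels from a simulated clock profile. The `ClockProfile` fields, the `label_profile` function, and both thresholds are hypothetical; the write-up says ECRB's thresholds come from published literature but does not list them.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, for illustration only.
ACCEL_THRESHOLD_YEARS = 5.0  # clock age minus chronological age
PACE_THRESHOLD = 1.1         # DunedinPACE: biological years per calendar year


@dataclass(frozen=True)
class ClockProfile:
    chronological_age: float
    horvath_age: float
    grimage_age: float
    dunedin_pace: float


def label_profile(p):
    """Derive ground truth flags directly from simulated clock values."""
    flags = {
        "horvath_accelerated": p.horvath_age - p.chronological_age >= ACCEL_THRESHOLD_YEARS,
        "grimage_accelerated": p.grimage_age - p.chronological_age >= ACCEL_THRESHOLD_YEARS,
        "fast_pace": p.dunedin_pace >= PACE_THRESHOLD,
    }
    # The clocks are discordant when they do not all agree.
    flags["discordant"] = len(set(flags.values())) > 1
    return flags
```

Because the labels are a pure function of the simulated values, regenerating the benchmark with new seeds yields new scenarios and new labels with no manual step.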
Stage 3 - Evaluate: LLMs answer structured clinical questions about Horvath, GrimAge, and DunedinPACE clock profiles. Responses are scored across 5 dimensions, including clock discordance detection, intervention reasoning, and confounder awareness.
We evaluated Claude Sonnet 4.6 and BioLLM (Longevity-Tuned) across 200 scenarios in 4 task categories:
- Type A: Clock interpretation
- Type B: Intervention reasoning
- Type C: Multi-tissue discordance
- Type D: Confounders and artifacts
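The evaluation step can be sketched roughly as follows, assuming models are asked to answer in JSON. The dictionary keys, the exact-match scoring rule, the equal weighting, and the choice to score a parse failure as zero are all assumptions made for illustration. Only the three dimensions named above are included, since the other two are not listed here.

```python
import json

# The three scoring dimensions named in the write-up; key names are assumed.
DIMENSIONS = ("clock_discordance", "intervention_reasoning", "confounder_awareness")


def score_response(raw, truth):
    """Return (per-dimension scores, parsed_ok) for one model response."""
    zeros = {d: 0.0 for d in DIMENSIONS}
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return zeros, False  # counted as a parse failure
    if not isinstance(answer, dict):
        return zeros, False
    # Exact match against the algorithmic label; 1.0 if equal, else 0.0.
    return {d: float(answer.get(d) == truth[d]) for d in DIMENSIONS}, True


def summarize(results):
    """results: iterable of (task_type, scores, parsed_ok) tuples."""
    by_type = {}
    failures = 0
    for task_type, scores, parsed_ok in results:
        failures += not parsed_ok
        by_type.setdefault(task_type, []).append(sum(scores.values()) / len(scores))
    # Mean score per task type, expressed as a percentage.
    accuracy = {t: 100 * sum(v) / len(v) for t, v in by_type.items()}
    return accuracy, failures
```

Tracking parse failures separately from accuracy keeps a model that answers correctly but in the wrong format distinguishable from one that reasons incorrectly.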
Key Finding
BioLLM outperforms Claude on intervention reasoning (45.0% vs 40.0%), suggesting domain-specific training advantages for longitudinal clinical trajectory tasks. However, Claude dominates overall (77.2% vs 55.4%) and produces zero parse failures vs BioLLM's 16/200. Both models struggle most with confounder awareness — a critical gap for clinical deployment.
How we built it
- Mesa 3.0 for agent-based cell simulation
- GEOparse + pandas to download and calibrate drift rates from real NCBI GEO methylation datasets
- Claude Haiku to generate natural language clinical vignettes from simulation outputs
- Claude Sonnet 4.6 + BioLLM as evaluated models
- SALib for sensitivity analysis of scoring dimensions
- React + deck.gl for the visualization dashboard showing the live aging simulation and leaderboard
Challenges
Calibrating simulation parameters against real methylation data required downloading and processing 7GB+ of GEO SOFT files and filtering 473k CpG sites down to 102k age-correlated sites (r² > 0.02). Getting Mesa 3.0 running on Python 3.14 required workarounds for breaking API changes. Designing ground truth labels that are both algorithmically derivable and clinically meaningful required careful threshold selection from the published literature.
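The age-correlation filter can be sketched as below. This version uses only the standard library for clarity, while the real pipeline used GEOparse and pandas. Treating the drift rate as the least-squares slope of methylation beta value against age is an assumption, and the function names and site identifiers are hypothetical.

```python
def fit_site(ages, betas):
    """Return (pearson_r, slope) of methylation beta against age for one site."""
    n = len(ages)
    mean_age, mean_beta = sum(ages) / n, sum(betas) / n
    sxy = sum((a - mean_age) * (b - mean_beta) for a, b in zip(ages, betas))
    sxx = sum((a - mean_age) ** 2 for a in ages)
    syy = sum((b - mean_beta) ** 2 for b in betas)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # constant series: no correlation, no drift
    return sxy / (sxx * syy) ** 0.5, sxy / sxx


def age_correlated_sites(ages, betas_by_site, min_r2=0.02):
    """Keep sites with r^2 above the cutoff; map each to its drift per year."""
    kept = {}
    for site, betas in betas_by_site.items():
        r, slope = fit_site(ages, betas)
        if r * r > min_r2:
            kept[site] = slope
    return kept


# Toy data with made-up site names, not real GEO values.
ages = [20, 40, 60, 80]
toy = {
    "site_rising": [0.2, 0.3, 0.4, 0.5],
    "site_flat": [0.5, 0.5, 0.5, 0.5],
}
drift = age_correlated_sites(ages, toy)
```

In this toy input only `site_rising` survives the filter, because `site_flat` has no variance and therefore no correlation with age.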
Accomplishments
200 benchmark scenarios with fully algorithmic ground truth. Real GEO-calibrated drift rates. A surprising research finding (BioLLM beats Claude on Type B). Zero-annotation scalability to thousands of scenarios. A live deck.gl visualization showing cellular senescence spreading through tissue in real time.
What we learned
Domain-specific LLM training matters for longitudinal biological reasoning even when general capability is lower. Confounder awareness is the most critical gap in current LLMs for clinical epigenetics applications. Agent-based modeling captures spatially local, SASP-driven senescence cascades that well-mixed differential equation models cannot represent.
What's next
Expand to 1000+ scenarios. Evaluate GPT-4o and Gemini. Validate against longitudinal methylation data from the MESA cohort (Multi-Ethnic Study of Atherosclerosis, via dbGaP; unrelated to the Mesa simulation library) and TwinsUK. Contribute to the LongevityBench framework. Add GrimAge v2 and DunedinPACE v2 task categories.