Epigenetic Clock Reasoning Bench

Inspiration

LLMs are increasingly used in longevity research to interpret biomarker data and assist clinical decision-making. But no benchmark existed to measure whether these models actually understand epigenetic clock biology, a core pillar of aging science. Models can produce fluent, confident responses while fundamentally misinterpreting clock values, conflating Horvath with GrimAge, or missing critical discordance patterns that carry real clinical significance. You can't improve what you can't measure. We built the measurement tool.

What it does

Epigenetic Clock Reasoning Bench (ECRB) is a three-stage pipeline:

Stage 1: Simulate. A MESA agent-based simulation models 900 cells aging over 200 months. Each cell tracks telomere length, oxidative damage, senescence state, and DNA methylation at 5 CpG sites, with methylation drift calibrated against real GEO data (GSE40279, 656 blood samples). Senescent cells emit SASP inflammatory signals to their neighbors, creating biologically realistic aging cascades.
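
A minimal sketch of the Stage 1 idea, written in plain Python rather than MESA so it runs without dependencies. The 30 x 30 grid layout, every rate constant, and both senescence thresholds are illustrative assumptions, not the GEO-calibrated values used in the project.

```python
import random
from dataclasses import dataclass

GRID = 30      # 30 x 30 = 900 cells (the square layout is an assumption)
MONTHS = 200
N_CPG = 5

# Hypothetical per-month rates, NOT the project's calibrated values.
DRIFT_PER_MONTH = 0.001
TELOMERE_LOSS = 0.004
DAMAGE_GAIN = 0.003
SASP_DAMAGE = 0.002          # extra damage per senescent neighbor
SENESCENCE_DAMAGE = 0.6      # damage level that triggers senescence
SENESCENCE_TELOMERE = 0.3    # telomere level that triggers senescence


@dataclass
class Cell:
    methylation: list        # beta values in [0, 1], one per CpG site
    telomere: float = 1.0
    damage: float = 0.0
    senescent: bool = False


def neighbors(x, y):
    """Yield the up-to-8 adjacent coordinates on a non-wrapping grid."""
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if (dx or dy) and 0 <= nx < GRID and 0 <= ny < GRID:
                yield nx, ny


def step(grid, rng):
    """Advance every cell by one month."""
    # Snapshot senescence so all cells react to last month's neighbors.
    was_senescent = [[c.senescent for c in row] for row in grid]
    for x in range(GRID):
        for y in range(GRID):
            cell = grid[x][y]
            sasp = sum(was_senescent[nx][ny] for nx, ny in neighbors(x, y))
            # rng.random() * 2 is a noise factor with mean 1.
            cell.methylation = [
                min(1.0, m + DRIFT_PER_MONTH * rng.random() * 2)
                for m in cell.methylation
            ]
            cell.telomere = max(0.0, cell.telomere - TELOMERE_LOSS * rng.random() * 2)
            cell.damage = min(
                1.0,
                cell.damage + DAMAGE_GAIN * rng.random() * 2 + SASP_DAMAGE * sasp,
            )
            if cell.damage >= SENESCENCE_DAMAGE or cell.telomere <= SENESCENCE_TELOMERE:
                cell.senescent = True   # senescence is irreversible here


def run(seed=0, months=MONTHS):
    """Return the final grid and the senescent-cell count after each month."""
    rng = random.Random(seed)
    grid = [
        [Cell([rng.uniform(0.2, 0.4) for _ in range(N_CPG)]) for _ in range(GRID)]
        for _ in range(GRID)
    ]
    history = []
    for _ in range(months):
        step(grid, rng)
        history.append(sum(c.senescent for row in grid for c in row))
    return grid, history
```

Calling `run(seed=1)` returns the grid plus a month-by-month count of senescent cells. Because senescent neighbors add damage, the count can accelerate once the first cells cross a threshold, which is the cascade behavior described above.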

Stage 2: Generate. Simulation outputs become clinical vignettes via the Claude API. Ground truth labels are computed algorithmically from simulation parameters, so the benchmark needs no hand annotation and scales to as many scenarios as the simulation can produce.
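
As a rough illustration of algorithmic ground truth, the sketch below derives labels from a simulated clock profile with fixed thresholds. The `ClockProfile` fields, the label names, and all three threshold values are hypothetical; the project chose its own thresholds from published literature.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only.
ACCEL_THRESHOLD_YEARS = 5.0        # clock age minus chronological age
DISCORDANCE_THRESHOLD_YEARS = 7.0  # gap between Horvath and GrimAge
FAST_PACE_THRESHOLD = 1.1          # DunedinPACE; 1.0 is the reference pace


@dataclass(frozen=True)
class ClockProfile:
    chronological_age: float
    horvath_age: float
    grimage_age: float
    dunedin_pace: float  # biological years per calendar year


def derive_labels(p: ClockProfile) -> dict:
    """Compute ground truth labels deterministically from clock values."""
    horvath_accel = p.horvath_age - p.chronological_age
    grimage_accel = p.grimage_age - p.chronological_age
    return {
        "horvath_accelerated": horvath_accel > ACCEL_THRESHOLD_YEARS,
        "grimage_accelerated": grimage_accel > ACCEL_THRESHOLD_YEARS,
        "clocks_discordant": abs(p.horvath_age - p.grimage_age)
        > DISCORDANCE_THRESHOLD_YEARS,
        "fast_pace": p.dunedin_pace > FAST_PACE_THRESHOLD,
    }
```

Because the labels are a pure function of the simulated values, regenerating a scenario always reproduces the same answer key.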

Stage 3 — Evaluate: LLMs answer structured clinical questions about Horvath, GrimAge, and DunedinPACE clock profiles. Responses are scored across 5 dimensions including clock discordance detection, intervention reasoning, and confounder awareness.
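
One way the scoring step could look, assuming the model is asked to reply in JSON and each dimension is graded by exact match against the ground truth. Both of those choices are assumptions made for this sketch, as are the function names. Only the three dimensions named above appear in the example; the other two are not listed in this write-up.

```python
import json


def score_response(raw: str, truth: dict):
    """Return (per-dimension scores, parsed_ok) for one model response."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        answer = None
    if not isinstance(answer, dict):
        # A parse failure scores zero on every dimension and is flagged.
        return {dim: 0.0 for dim in truth}, False
    scores = {
        dim: float(answer.get(dim) == expected) for dim, expected in truth.items()
    }
    return scores, True


def summarize(scored):
    """Aggregate a list of score_response results into headline numbers."""
    per_item = [sum(s.values()) / len(s) for s, _ in scored]
    return {
        "overall": sum(per_item) / len(per_item),
        "parse_failures": sum(1 for _, ok in scored if not ok),
    }
```

Counting parse failures separately from wrong answers keeps formatting problems visible instead of folding them into the accuracy score.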

We evaluated Claude Sonnet 4.6 and BioLLM (Longevity-Tuned) across 200 scenarios in 4 task categories:

Type A: Clock interpretation
Type B: Intervention reasoning
Type C: Multi-tissue discordance
Type D: Confounders and artifacts

Key Finding

BioLLM outperforms Claude on intervention reasoning (45.0% vs 40.0%), suggesting domain-specific training advantages for longitudinal clinical trajectory tasks. However, Claude leads overall (77.2% vs 55.4%) and produces zero parse failures, compared with 16 of 200 for BioLLM. Both models struggle most with confounder awareness, a critical gap for clinical deployment.

How we built it

MESA 3.0 for agent-based cell simulation
GEOparse + pandas to download and calibrate drift rates from real NCBI GEO methylation datasets
Claude Haiku to generate natural language clinical vignettes from simulation outputs
Claude Sonnet 4.6 + BioLLM as evaluated models
SALib for sensitivity analysis of scoring dimensions
React + deck.gl for the visualization dashboard showing the live aging simulation and leaderboard

Challenges

Calibrating simulation parameters against real methylation data required downloading and processing over 7 GB of GEO SOFT files and filtering 473k CpG sites down to 102k age-correlated sites (r² > 0.02). Getting MESA 3.0 running on Python 3.14 required workarounds for breaking API changes. Designing ground truth labels that are both algorithmically derivable and clinically meaningful required careful threshold selection from published literature.

Accomplishments

200 benchmark scenarios with fully algorithmic ground truth. Real GEO-calibrated drift rates. A surprising research finding (BioLLM beats Claude on Type B). Zero-annotation scalability to thousands of scenarios. A live deck.gl visualization showing cellular senescence spreading through tissue in real time.

What we learned

Domain-specific LLM training matters for longitudinal biological reasoning even when general capability is lower. Confounder awareness is the most critical gap in current LLMs for clinical epigenetics applications. Agent-based modeling captures SASP-driven senescence cascades that differential equations fundamentally cannot.

What's next

Expand to 1000+ scenarios. Evaluate GPT-4o and Gemini. Validate against the MESA cohort (dbGaP) and TwinsUK longitudinal methylation data. Contribute to the LongevityBench framework. Add GrimAge v2 and DunedinPACE v2 task categories.

Built With

anthropic-claude-api (haiku for scenario generation, sonnet-4.6 for evaluation)
biollm (huggingface longevity llm)
deck.gl
fastapi
geoparse
github
mesa (agent-based simulation)
ncbi-geo (gse40279, gse51057)
numpy
pandas
pyaging
python
react
recharts
salib (sensitivity analysis)
scipy
tailwind-css
typescript

Updates

Varun Nair started this project — May 24, 2026 01:43 PM EDT
