Necessary Links
| Description | Link |
|---|---|
| CardioFusion v1.0.0 Installer | https://drive.google.com/file/d/1wADID_NSUhKmgAudOM3drTsgVCxWLqOf/view?usp=sharing |
| CardioFusion Paper | https://drive.google.com/file/d/1AktE9ngK7x6H5chFmFgB2xKiXY5dosrZ/view?usp=sharing |
Inspiration
Cardiovascular disease (CVD) is the leading cause of mortality globally, accounting for approximately 17.9 million deaths annually. While early risk stratification can save lives, traditional clinical risk scores rely on a small set of manually curated measurements and completely ignore the rich, temporal data hidden in raw electrocardiogram (ECG) signals.
More importantly, a massive global disparity exists: many hospitals in low- and middle-income countries (LMICs) lack the expensive GPU infrastructure required to run modern medical AI. I wanted to build a system that extracts deep, multi-modal insights from cardiac data but is intentionally constrained to run entirely on standard CPUs to ensure globally equitable deployment.
What it does
CardioFusion is a research-grade, self-supervised multi-modal framework for cardiovascular risk assessment. It performs four main tasks:
- ECG Representation Learning: Learns cardiac patterns from raw, unlabelled 1D ECG signals.
- Cross-Modal Fusion: Dynamically aligns these ECG embeddings with tabular clinical features (e.g., age, blood pressure) so the modalities can "attend" to each other.
- Calibrated Uncertainty: Instead of outputting a single, overconfident probability, it uses split conformal prediction to generate statistically guaranteed "prediction sets," telling clinicians exactly when the model is uncertain.
- Clinical Interpretability: Grounds every prediction in visual evidence using 1D Grad-CAM, SHAP waterfall plots, and cross-modal attention maps.
Crucially, the entire framework operates locally with fewer than 600K parameters, ensuring absolute patient data privacy without needing cloud compute.
How we built it
The architecture was designed from the ground up to run efficiently on CPU without sacrificing state-of-the-art methodology:
- Temporal Masked Autoencoder (TMAE): Adapted the Vision Masked Autoencoder (MAE) to 1D cardiac time series. By masking 75% of the ECG signal and forcing the model to reconstruct it, the encoder internalises the global rhythmic structure of the heart without needing a single label.
- Cross-Modal Attention Fusion (CMAF): Implemented bidirectional multi-head attention to fuse the tabular and ECG latent streams.
- Tabular Engine: Used LightGBM, which leverages Gradient-based One-Side Sampling (GOSS) to handle 70,000 patient records highly efficiently.
- Unsupervised Discovery: Combined UMAP and HDBSCAN on the TMAE latent space to cluster patients based purely on cardiac morphology.
- Conformal Prediction: Applied a Bonferroni-style finite-sample correction to calibration scores to guarantee ≥ 90% empirical coverage.
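The TMAE's 75% masking step can be sketched in a few lines. This is a minimal NumPy illustration of MAE-style patch masking, not the project's actual code; the patch length and signal length are assumed for the toy example:

```python
import numpy as np

def mask_ecg_patches(signal, patch_len=25, mask_ratio=0.75, seed=0):
    """Split a 1D ECG into fixed-length patches and randomly mask 75% of them.

    The encoder sees only the visible patches; the reconstruction loss is
    computed only on the masked ones.
    """
    rng = np.random.default_rng(seed)
    n_patches = len(signal) // patch_len
    patches = signal[: n_patches * patch_len].reshape(n_patches, patch_len)
    n_masked = int(round(mask_ratio * n_patches))
    order = rng.permutation(n_patches)
    masked_idx = np.sort(order[:n_masked])
    visible_idx = np.sort(order[n_masked:])
    return patches, visible_idx, masked_idx

def masked_mse(reconstruction, patches, masked_idx):
    # Loss restricted to masked patches, as in MAE-style pretraining.
    diff = reconstruction[masked_idx] - patches[masked_idx]
    return float(np.mean(diff ** 2))

# Toy usage: a 1,000-sample trace -> 40 patches, 30 masked, 10 visible.
ecg = np.sin(np.linspace(0, 20 * np.pi, 1000))
patches, vis, msk = mask_ecg_patches(ecg)
```

Restricting the loss to masked patches is what makes the task hard: the model cannot score well by copying its input, which is why the masking ratio matters so much in practice.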
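One direction of the CMAF fusion can be illustrated with single-head cross-attention in NumPy (dimensions and weights here are hypothetical; the actual module is bidirectional and multi-head):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """One direction of cross-modal attention: `queries` (e.g. tabular
    feature tokens) attend over `context` (e.g. ECG patch embeddings)."""
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1: the attention map
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
tab = rng.normal(size=(4, d))    # 4 tabular feature tokens (illustrative)
ecg = rng.normal(size=(12, d))   # 12 ECG latent tokens (illustrative)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused, attn = cross_attention(tab, ecg, Wq, Wk, Wv)
```

The bidirectional version simply runs the same operation a second time with the roles of the two modalities swapped; the returned `attn` matrix is exactly the cross-modal attention map used for interpretability.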
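For the tabular engine, GOSS is enabled through LightGBM's standard parameters. The values below are illustrative, not the project's actual configuration; `top_rate` and `other_rate` are LightGBM's GOSS sampling knobs:

```python
# Hypothetical LightGBM parameter set for the tabular engine.
lgbm_params = {
    "objective": "binary",
    "boosting_type": "goss",  # Gradient-based One-Side Sampling
    "top_rate": 0.2,          # keep the 20% of rows with the largest gradients
    "other_rate": 0.1,        # plus a 10% random sample of the remainder
    "num_leaves": 31,
    "learning_rate": 0.05,
    "metric": "auc",
}
```

GOSS keeps every high-gradient (poorly fit) row while subsampling the rest, which is what makes training on 70,000 records fast on CPU without a large accuracy cost.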
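The split conformal step can be sketched in pure Python. This uses the standard nonconformity score 1 − p(true class) and the standard finite-sample quantile correction; the project's exact correction may differ, and the calibration scores below are toy numbers:

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-corrected quantile of calibration nonconformity scores.

    Using the ceil((n+1)(1-alpha)) order statistic guarantees >= 1-alpha
    marginal coverage for exchangeable data (split conformal prediction).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(rank, n) - 1]

def prediction_set(class_probs, qhat):
    # Include every label whose nonconformity score 1 - p(y) is <= qhat.
    return {y for y, p in class_probs.items() if 1 - p <= qhat}

# Toy usage: nonconformity scores from 9 held-out calibration patients.
cal = [0.02, 0.05, 0.07, 0.10, 0.15, 0.22, 0.30, 0.45, 0.60]
qhat = conformal_quantile(cal, alpha=0.1)
probs = {"low_risk": 0.55, "high_risk": 0.44, "uncertain": 0.01}
labels = prediction_set(probs, qhat)
```

When the model is confident, the set collapses to a single label; when two outcomes are plausible (as above, where the top two probabilities are close), both appear in the set, which is the signal a clinician acts on.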
Challenges we ran into
The most significant architectural challenge was a dataset constraint: the ECG corpus (528 patients) and the tabular clinical dataset (70,000 patients) contained disjoint patient populations. To build and test the CMAF module, I had to sample simulated ECG embeddings from the TMAE latent distribution as a proxy for true alignment.
On the engineering side, ensuring the entire pipeline remained strictly CPU-deployable led to some deep debugging. I encountered a torch._dynamo incompatibility on Windows/CPU, which I had to bypass by writing a custom ManualAdamFixed optimiser. Additionally, tuning the TMAE required finding the exact 75% masking "sweet spot"—lower ratios allowed trivial interpolation, while higher ratios starved the encoder of context.
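The ManualAdamFixed code itself isn't shown here, but a hand-rolled Adam update written in plain Python ops, sidestepping compiled optimizer kernels, looks roughly like this (a sketch over flat float lists; the real optimiser would operate on torch tensors):

```python
import math

def adam_step(params, grads, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One manual Adam update.

    state holds the step count t and per-parameter first/second moment
    estimates m and v; writing the update by hand avoids any dependence on
    torch._dynamo-compiled optimizer paths on CPU-only platforms.
    """
    state["t"] += 1
    t = state["t"]
    for i, (p, g) in enumerate(zip(params, grads)):
        m = state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        v = state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)  # bias correction for zero-initialised moments
        v_hat = v / (1 - b2 ** t)
        params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy usage: a single weight with gradient 0.5; the first Adam step moves
# the parameter by roughly lr regardless of gradient magnitude.
state = {"t": 0, "m": [0.0], "v": [0.0]}
w = adam_step([1.0], [0.5], state, lr=0.1)
```

The bias-corrected first step of magnitude ≈ lr is a useful sanity check when validating a manual optimiser against the library implementation.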
Accomplishments that we're proud of
- High Performance on Tabular Baselines: Achieved an AUC-ROC of 0.801 [95% CI: 0.790-0.811] on the 70,000-patient Cardiac Failure dataset, fully backed by bootstrap resampling and 5-fold cross-validation.
- Unsupervised Morphological Discovery: The TMAE achieved an 86.1% relative reduction in reconstruction MSE. When passed through UMAP and HDBSCAN, the latent space naturally segregated into 11 morphologically distinct sub-phenotypes—without a single medical label.
- Extreme Efficiency: Successfully kept the entire framework under 600K trainable parameters. It trains in ~15 minutes on a standard laptop CPU and achieves an inference latency of under 5 milliseconds per patient.
What we learned
This project proved that you do not need billion-parameter foundation models to solve critical healthcare problems. Forcing a lightweight encoder to solve a hard, self-supervised reconstruction task (75% masking) leads it to learn complex clinical representations naturally.
I also learned the practical necessity of conformal prediction in medical AI. Clinicians cannot blindly trust point estimates; generating sets of plausible labels provides the honest mathematical uncertainty that safety-critical environments demand.
What's next for CardioFusion
The immediate next step is scaling the framework to a matched dataset like MIMIC-IV-ECG (which contains over 800,000 matched ECG and EHR records) to move the cross-modal fusion from simulated embeddings to genuine patient alignment.
Architecturally, I plan to expand the TMAE from encoding a single lead (Lead-0) to jointly encoding full 12-lead ECGs via inter-lead cross-attention, matching standard clinical practice. Finally, I aim to collaborate with a clinical cardiologist to formally validate the 11 unsupervised sub-phenotypes discovered by the model.