CAS CDx — Complement Activation Signature Companion Diagnostic
Inspiration
My partner, Dr. Rafael Bayarri-Olmos, and I are working to commercialize targeted bifusion complement therapy, with a focus on orphan drugs and underserved indications. Rare diseases are notoriously hard to fund: small patient pools, long enrollment timelines, and ineffective screening compound costs and undermine data collection. Complement-mediated rare diseases sit squarely in this gap, and their trial economics struggle the same way. Complement-mediated prevalence can be below 30% of samples in some indications. We enroll those patients anyway and burn time, sanity, and capital. Hackrare inspired me to try using existing algorithms to support my own preference for disproportionate and positive impact. And so Yassin Mudawi, Niels Newlin, and I built CAS CDx.
What it does
Three components.
Patient Classifier. You put in a patient's complement protein levels (SomaScan or equivalent), it gives you a probability that their pathology is complement-driven, plus a 90% bootstrap confidence interval. Five proteins spanning central, terminal, and alternative complement pathways:
- C3a anaphylatoxin (central pathway activation)
- C5 (terminal pathway consumption)
- Component C6 (terminal pathway)
- C5a anaphylatoxin (terminal pathway activation)
- Factor B (alternative pathway)
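A minimal sketch of how the bootstrap interval can work, assuming an ensemble of classifiers refit on bootstrap resamples of the training set; the function name and the sklearn-style `predict_proba` interface are illustrative, not our production API:

```python
import numpy as np

def bootstrap_prediction_ci(boot_models, x, alpha=0.10):
    """Illustrative 90% bootstrap CI for one patient's probability:
    `boot_models` is a list of classifiers refit on bootstrap resamples,
    each exposing sklearn-style predict_proba. The interval is the
    percentile range of their predicted probabilities."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    probs = np.array([m.predict_proba(x)[0, 1] for m in boot_models])
    lo, hi = np.quantile(probs, [alpha / 2, 1 - alpha / 2])
    return float(np.median(probs)), (float(lo), float(hi))
```

The point estimate reported is the median across the ensemble, and the 90% interval comes from the 5th and 95th percentiles.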
Trial Cost Simulator.
The classifier gives a continuous score, not a binary yes/no. So the threshold is tunable. You set your enrollment target, cost per patient, screening cost, prevalence estimate for an indication — it sweeps across every threshold and shows you cost savings in real time. When your patient pool is already small, that flexibility matters.
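The sweep logic reduces to a few lines. A sketch under our own variable names, with one assumption: the number screened equals enrollees divided by the screen-positive rate at that threshold:

```python
import numpy as np

def best_threshold(thresholds, sens, spec, n_standard, prev, c_patient, c_screen):
    """Illustrative threshold sweep: at each candidate threshold, compute
    the enriched-trial cost from that threshold's sensitivity/specificity
    and return the cheapest operating point."""
    sens, spec = np.asarray(sens, float), np.asarray(spec, float)
    pos_rate = sens * prev + (1 - spec) * (1 - prev)   # P(classifier positive)
    ppv = sens * prev / pos_rate                       # P(complement-driven | positive)
    n_enrolled = n_standard * prev / ppv               # enrollees to keep power
    n_screened = n_enrolled / pos_rate                 # candidates screened (assumption)
    cost = n_enrolled * c_patient + n_screened * c_screen
    i = int(np.argmin(cost))
    return thresholds[i], float(cost[i])
```

A stricter threshold trades fewer (cheaper) enrollees against more screening, which is why the optimum depends on the cost ratio and prevalence.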
Validation Dashboard.
Cross-validated performance, permutation testing, SHAP feature importance, calibration curves — all interactive. A reviewer can interrogate the model directly instead of trusting a static table.
How we built it
Data
Zurich Long COVID cohort (Cervia-Hasler et al., Science 2024). 113 patients, SomaScan 7k proteomics at 6 months. SomaScan measures ~7,000 proteins via aptamer fluorescence and reports RFU on a log₁₀ scale. Preprocessing converts log₁₀ to log₂: ( x_{\text{log2}} = \frac{x_{\text{log10}}}{\log_{10}(2)} ). Then median normalization per sample against a global reference (computed from the training split), and MICE imputation for missing values (20 iterations, fit on train only). Healthy controls are excluded: the classifier distinguishes complement-driven Long COVID from non-complement-driven, not sick from healthy. The question is treatment selection, not diagnosis.
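The base-change and normalization steps can be sketched as follows. The additive-shift form of the median normalization is an assumption, and `reference_median` stands for the global reference from the training split:

```python
import numpy as np

def log10_to_log2(x_log10):
    """Base change: log2(v) = log10(v) / log10(2)."""
    return np.asarray(x_log10, dtype=float) / np.log10(2)

def median_normalize(X, reference_median):
    """Per-sample median normalization on the log scale: shift each row
    (one sample across proteins) so its median equals a reference median
    computed on the training split. Additive shift is our assumption."""
    X = np.asarray(X, dtype=float)
    return X - (np.median(X, axis=1, keepdims=True) - reference_median)
```

Fitting the reference (and the MICE imputer) on the training split only is what keeps the test fold honest.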
Feature Selection
Started with 54 complement proteins. Mann-Whitney U tests with Benjamini-Hochberg FDR correction, then correlation filtering at Spearman ( |\rho| > 0.9 ). Converged on five. More on why that matters under Challenges.
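A sketch of that two-stage filter, with a hand-rolled Benjamini-Hochberg adjustment; the greedy keep-the-better-ranked rule and the exact thresholds are illustrative:

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj, running = np.empty(n), 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running = min(running, p[i] * n / rank)
        adj[i] = running
    return adj

def rank_and_decorrelate(X, y, rho_max=0.9):
    """Rank features by BH-adjusted Mann-Whitney U p-value, then greedily
    drop any feature whose |Spearman rho| with a better-ranked kept
    feature exceeds rho_max."""
    p = np.array([mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(X.shape[1])])
    p_adj = bh_adjust(p)
    kept = []
    for j in np.argsort(p_adj):
        rhos = [abs(spearmanr(X[:, j], X[:, k])[0]) for k in kept]
        if all(r <= rho_max for r in rhos):
            kept.append(int(j))
    return kept, p_adj
```

Ranking survives even when nothing clears the FDR line, which matters here since no single marker does (see Challenges).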
Model
XGBoost, binary:logistic objective, AUC metric. RandomizedSearchCV, 50 iterations, 5-fold stratified CV over n_estimators, max_depth, learning_rate, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda. Decision threshold set by Youden's J on the test set.
Validation
- Repeated CV (10 × 5 folds): AUC ( 0.940 \pm 0.043 )
- Held-out test AUC: 0.935
- Permutation test (1,000 shuffles): ( p < 0.001 )
- Bootstrap CIs (2,000 resamples)
- Calibration curves, Brier score, ECE
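Two of these pieces are easy to sketch: the Youden threshold and the permutation test. Note the permutation sketch shuffles labels against fixed scores, a cheaper approximation of the full retrain-per-shuffle version:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def youden_threshold(y_true, y_score):
    """Threshold maximizing Youden's J = TPR - FPR along the ROC curve."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    return thr[int(np.argmax(tpr - fpr))]

def permutation_pvalue(y_true, y_score, n_perm=1000, seed=0):
    """Label-permutation p-value for the observed AUC, with the standard
    +1 correction so p is never exactly zero."""
    rng = np.random.default_rng(seed)
    obs = roc_auc_score(y_true, y_score)
    y = np.asarray(y_true)
    hits = sum(roc_auc_score(rng.permutation(y), y_score) >= obs
               for _ in range(n_perm))
    return obs, (hits + 1) / (n_perm + 1)
```

With 1,000 shuffles the smallest reportable p-value is about 0.001, which is exactly the bound quoted above.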
Trial Enrichment Math
PPV from sensitivity, specificity, and prevalence:

( \text{PPV} = \frac{\text{Sens} \times \text{Prev}}{\text{Sens} \times \text{Prev} + (1 - \text{Spec})(1 - \text{Prev})} )

Enriched enrollment:

( N_{\text{enriched}} = N_{\text{standard}} \times \frac{\text{Prev}}{\text{PPV}} )

Cost:

( C_{\text{enriched}} = N_{\text{enriched}} \cdot c_{\text{patient}} + N_{\text{screened}} \cdot c_{\text{screen}} )

The simulator runs a bootstrap-smoothed threshold sweep (500 resamples, 5% steps) to maximize ( C_{\text{standard}} - C_{\text{enriched}} ).
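Transcribed into code, with one assumption the formulas leave implicit: ( N_{\text{screened}} ) is the enrolled count divided by the screen-positive rate:

```python
def ppv(sens, spec, prev):
    """PPV = Sens*Prev / (Sens*Prev + (1-Spec)(1-Prev))."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def enriched_cost(n_standard, prev, sens, spec, c_patient, c_screen):
    """Enriched-trial cost per the formulas above. Assumption: every
    screened candidate incurs c_screen, every classifier-positive
    enrollee incurs c_patient."""
    p = ppv(sens, spec, prev)
    n_enriched = n_standard * prev / p
    pos_rate = sens * prev + (1 - spec) * (1 - prev)
    n_screened = n_enriched / pos_rate
    return n_enriched * c_patient + n_screened * c_screen
```

A perfect classifier collapses enrollment to the true-positive count while screening cost stays, which is the trade-off the simulator surfaces.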
Stack
Backend is FastAPI + XGBoost + SHAP, deployed on Render. Frontend is Next.js 16 / React 19 / Tailwind / Recharts on Vercel. The model registry holds 11 pre-trained models for different biomarker subsets; all require C3a, and the API picks the best match for whatever markers a patient has.
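The registry lookup can be sketched like this; the record shape (`markers`, `auc`) is our illustration, not the actual schema:

```python
def pick_model(registry, available_markers):
    """Among pre-trained models whose required marker set is covered by
    what the patient has (all require C3a), return the one using the
    most markers, breaking ties by validation AUC."""
    have = set(available_markers)
    candidates = [m for m in registry
                  if "C3a" in m["markers"] and set(m["markers"]) <= have]
    if not candidates:
        return None
    return max(candidates, key=lambda m: (len(m["markers"]), m["auc"]))
```

Preferring the largest covered marker set means a patient with a full panel always gets the five-protein model, and partial panels degrade gracefully.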
Challenges
We started with 54 proteins, tried dozens of feature combinations, and trained and tested hundreds of configs. SHAP analysis (Shapley values from game theory, applied to ML feature attribution) ranked what the model actually uses across all combinations. After all that work it converged on five known markers. Overengineering made things worse; stripping it back corroborated what immunologists already knew about which complement proteins matter. The model didn't find anything new. It packaged existing knowledge into something you can deploy in a trial protocol.

Second: 113 patients, one center. Enough to show the approach works, not enough to claim generalizability. No individual biomarker passes FDR (best adjusted ( p = 0.126 )); the signal is multivariate. Independent cohort validation is required before this goes anywhere clinical.

Third: making threshold flexibility understandable. A continuous score is more powerful than a binary cutoff, but it pushes complexity onto the user. The trial simulator exists to make that visible: the full cost-savings landscape, so you can pick your operating point knowing what you're trading off.

What we learned

Simplicity that confirms existing biology beats complexity that overfits. A 5-protein panel matching known complement markers, packaged into a deployable classifier, is more useful than a 54-feature model chasing noise. The small dataset forced that discipline.

The economics are the actual leverage point. The classifier alone isn't enough; what makes rare disease trials viable is the cost reduction from enriched enrollment. Even a 30–40% reduction in screening costs changes an indication from "not worth it" to "economically viable."
What's next
Train on data from indications of varying rarity:
- Cold Agglutinin Disease
- IgA Nephropathy
- Paroxysmal Nocturnal Hemoglobinuria
- Buerger's Disease
- Diabetic Nephropathy
More diseases, more patients, subtler activation patterns. Move from a general complement classifier to indication-specific stratification. A CDx that makes enrollment tractable can make a rare indication worth pursuing.
Built With
- python
- render
- typescript
- vercel