About OriGen

## Inspiration

Advances in genomic sequencing have made detecting genetic variation routine, yet rare disease diagnosis remains limited by interpretation, not data generation.

Despite whole-exome and whole-genome sequencing, rare disease diagnostic yield remains only 25–30%.

$$ \text{Rare Disease Diagnostic Yield} \approx 25\text{–}30\% $$

Through discussions with clinicians and researchers, we repeatedly encountered the same bottleneck: Variants of Uncertain Significance (VUS) — mutations that are confidently detected but cannot be biologically interpreted.

Three systemic challenges drive this problem:


### **1. Variant Overload**

Modern sequencing identifies hundreds of rare candidate variants per patient genome. Long-read studies report roughly 700 rare structural or sequence variants per genome, far more than clinicians can evaluate manually.

As sequencing improves:

$$ \text{Detected Variants} \uparrow \Rightarrow \text{Interpretation Difficulty} \uparrow $$

Diagnosis increasingly fails at the interpretation stage.


### **2. Dependence on Statistical Evidence**

Current clinical tools such as SIFT, PolyPhen-2, CADD, and REVEL prioritize mutations using evolutionary conservation or statistical similarity to previously observed pathogenic variants.

These methods assume:

  • functionally important residues are evolutionarily conserved
  • pathogenic mutations resemble known disease variants

However, rare disease mutations are often:

  • novel,
  • family-specific,
  • or previously unseen.

When the number of previously observed comparable variants is small, \(N \rightarrow 0\), statistical predictors lose reliability and the variant is frequently left classified as a VUS.


### **3. Functional Validation Does Not Scale**

Experimental assays that demonstrate functional disruption are slow and resource-intensive. As a result, most detected variants lack mechanistic evidence connecting:

$$ \text{Genotype} \rightarrow \text{Molecular Dysfunction} \rightarrow \text{Phenotype} $$

Clinicians therefore receive probabilistic scores rather than biological explanations.


## What OriGen Does

OriGen is a mechanistic variant interpretation platform that evaluates how missense mutations alter protein structure, stability, and functional interactions.

### **Input**

  • Gene symbol
  • Protein HGVS variant (e.g., p.Arg408Trp)
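
As an illustration of the input step, here is a minimal sketch of how a protein-level HGVS string such as `p.Arg408Trp` could be split into reference residue, position, and alternate residue. The names `parse_missense` and `MissenseVariant` are hypothetical and not taken from OriGen's code; the sketch assumes three-letter amino acid codes and handles simple missense substitutions only.

```python
import re
from typing import NamedTuple

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


class MissenseVariant(NamedTuple):
    """Hypothetical container for a parsed missense variant."""
    ref: str       # one-letter reference residue
    position: int  # 1-based position in the protein sequence
    alt: str       # one-letter alternate residue


# Matches e.g. "p.Arg408Trp" or the predicted form "p.(Arg408Trp)".
_HGVS_MISSENSE = re.compile(r"^p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?$")


def parse_missense(hgvs_p: str) -> MissenseVariant:
    """Parse a three-letter protein HGVS missense string."""
    match = _HGVS_MISSENSE.match(hgvs_p.strip())
    if match is None:
        raise ValueError(f"Not a simple missense HGVS string: {hgvs_p!r}")
    ref3, pos, alt3 = match.groups()
    # Rejects non-amino-acid codes such as "Ter" (stop gained).
    if ref3 not in THREE_TO_ONE or alt3 not in THREE_TO_ONE:
        raise ValueError(f"Unsupported residue code in {hgvs_p!r}")
    if ref3 == alt3:
        raise ValueError(f"Reference and alternate residues are identical in {hgvs_p!r}")
    return MissenseVariant(THREE_TO_ONE[ref3], int(pos), THREE_TO_ONE[alt3])
```

Rejecting anything that is not a plain substitution keeps the later structural reasoning restricted to missense changes, which is the scope described above.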

### **Mechanistic Interpretation Pipeline**

OriGen integrates structural and biochemical evidence:

  1. Retrieve canonical protein sequence using UniProt
  2. Extract functional annotations:
    • active sites
    • binding interfaces
    • domains
  3. Align AlphaFold structural models to UniProt residue numbering
  4. Compute mechanistic evidence signals:
    • BLOSUM62 substitution severity
    • Chemical property disruption (charge, polarity, aromaticity)
    • 3D distance to functional residues
    • AlphaFold confidence score (pLDDT)
    • REVEL score (when available)
    • ClinVar annotation (when available)
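
To make the chemical property signal concrete, the following is a rough sketch of how charge, polarity, and aromaticity changes might be flagged for a substitution. The residue groupings are a deliberately simplified assumption (histidine is treated as neutral and non-aromatic, for example), and `chemical_disruption` is a hypothetical helper, not OriGen's actual scoring function.

```python
# Simplified residue classes (an assumption; textbooks differ on edge cases
# such as His, Cys, and Tyr).
POSITIVE = set("KR")
NEGATIVE = set("DE")
AROMATIC = set("FWY")
POLAR = set("STNQCYHKRDE")


def charge(aa: str) -> int:
    """Approximate side-chain charge at physiological pH."""
    if aa in POSITIVE:
        return 1
    if aa in NEGATIVE:
        return -1
    return 0


def chemical_disruption(ref: str, alt: str) -> dict:
    """Flag which coarse chemical properties differ between two residues."""
    return {
        "charge_change": charge(ref) != charge(alt),
        "polarity_change": (ref in POLAR) != (alt in POLAR),
        "aromaticity_change": (ref in AROMATIC) != (alt in AROMATIC),
    }


# Arg -> Trp loses a positive charge, loses polarity, and gains an aromatic ring.
flags = chemical_disruption("R", "W")
```

Under these groupings a conservative change such as Leu to Ile raises no flags, while Arg to Trp raises all three.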

Instead of asking:

“Has this mutation been seen before?”

OriGen asks:

  • Is the mutation structurally near a functional region?
  • Does the amino acid change disrupt chemistry or stability?
  • Is structural confidence sufficient?
  • What evidence is missing?
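
The first and third questions can be sketched as simple geometric and confidence checks. This example assumes that Cα coordinates and per-residue pLDDT values have already been extracted from the AlphaFold model into dictionaries keyed by UniProt residue number. The function names are hypothetical, and the pLDDT cutoff of 70 follows AlphaFold's usual "confident" band rather than any threshold stated by OriGen.

```python
import math
from typing import Optional


def nearest_functional_residue(
    variant_pos: int,
    ca_coords: dict,
    functional_positions: list,
) -> Optional[tuple]:
    """Return (position, distance in angstroms) of the closest functional
    residue, or None when the distance cannot be computed."""
    if variant_pos not in ca_coords:
        return None
    origin = ca_coords[variant_pos]
    candidates = [
        (math.dist(origin, ca_coords[pos]), pos)
        for pos in functional_positions
        if pos in ca_coords
    ]
    if not candidates:
        return None
    distance, position = min(candidates)
    return position, distance


def structure_is_confident(variant_pos: int, plddt: dict, min_plddt: float = 70.0) -> bool:
    """True when the model's pLDDT at the variant position meets the cutoff."""
    return plddt.get(variant_pos, 0.0) >= min_plddt


# Toy coordinates for illustration only (not a real structure).
toy_coords = {10: (0.0, 0.0, 0.0), 20: (3.0, 4.0, 0.0), 30: (10.0, 0.0, 0.0)}
print(nearest_functional_residue(10, toy_coords, [20, 30]))  # (20, 5.0)
```

Returning `None` instead of a default distance keeps missing structural evidence visible to the caller rather than silently treating it as "far from any functional site".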

### **Output**

OriGen generates:

  • Impact classification (Low / Moderate / High)
  • Explicit uncertainty estimation
  • Mechanistic explanation of disruption
  • Recommended next diagnostic action

This reframes variant interpretation from statistical correlation to mechanistic plausibility.


## How We Built It

OriGen integrates multiple public biological resources:

  • UniProt REST API for sequence and functional annotations
  • AlphaFold DB for predicted protein structures
  • MyVariant.info for REVEL and ClinVar aggregation
  • Streamlit for an interactive clinician-facing interface
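
As a sketch of the first integration, the snippet below retrieves a canonical sequence in FASTA format and checks that the variant's reference residue matches the sequence at the stated position. The URL pattern is the public UniProt REST endpoint as we understand it and should be verified against current UniProt documentation. The helper names are hypothetical, and the network call is shown but not exercised here.

```python
from urllib.request import urlopen

# Assumed URL pattern for UniProt's FASTA download; verify before relying on it.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession: str) -> str:
    """Download the FASTA record for a UniProt accession (requires network access)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_fasta_sequence(fasta_text: str) -> str:
    """Join all sequence lines of a single-record FASTA, skipping the header."""
    lines = (line.strip() for line in fasta_text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def reference_matches(sequence: str, position: int, ref: str) -> bool:
    """Check that the 1-based position exists and holds the expected residue."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == ref


# Toy record for illustration only (not a real protein).
toy_fasta = ">toy|hypothetical\nMKR\nWLA\n"
toy_sequence = parse_fasta_sequence(toy_fasta)
```

A mismatch between the HGVS reference residue and the canonical sequence usually means the variant was described against a different isoform, so it is worth surfacing before any structural analysis runs.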

Example system logic:

if distance_to_active_site < threshold and chemical_disruption:
    impact = "High"
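
The rule above covers only the "High" case. One way it might extend to all three impact levels is sketched below. The 8 Å default threshold and the "one signal means Moderate" rule are illustrative assumptions, not OriGen's actual parameters, and `classify_impact` is a hypothetical name.

```python
from typing import Optional


def classify_impact(
    distance_to_active_site: Optional[float],
    chemical_disruption: bool,
    threshold: float = 8.0,  # angstroms; hypothetical default
) -> str:
    """Combine proximity and chemistry into a coarse impact level."""
    # A missing distance is never counted as "near"; it should be reported
    # separately as missing evidence.
    near_site = (
        distance_to_active_site is not None
        and distance_to_active_site < threshold
    )
    signals = int(near_site) + int(bool(chemical_disruption))
    if signals == 2:
        return "High"
    if signals == 1:
        return "Moderate"
    return "Low"
```

For example, `classify_impact(4.2, True)` yields "High", while the same chemical change far from any annotated site yields "Moderate".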

## Challenges We Faced

### **Residue Mapping Instability**

Mapping UniProt residue numbering to experimentally resolved PDB structures proved unreliable due to:

- incomplete structural coverage  
- inconsistent residue indexing  
- API instability across structure repositories  

These issues introduced fragility into real-time variant interpretation pipelines.

**Solution**

We pivoted to an **AlphaFold-first architecture**, leveraging models already aligned to UniProt canonical sequences. This ensured:

- consistent residue indexing  
- improved structural coverage  
- stable downstream computation  

---

### **Incomplete Clinical Evidence**

Real-world diagnostic pipelines rarely contain complete biological evidence. Many variants lack:

- structural annotations  
- functional site annotations  
- population prediction scores  
- ClinVar classifications  
- experimental validation data  

Traditional systems often fail silently under missing inputs.

Instead, OriGen explicitly models uncertainty:

$$
\text{Confidence} \propto \text{Available Evidence}
$$

Missing evidence is surfaced directly to clinicians, allowing interpretation decisions to remain transparent rather than artificially definitive.
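
A minimal sketch of this idea treats confidence as the fraction of expected evidence sources that are actually present and returns the missing ones by name. The evidence keys, the equal weighting, and the function name `summarize_evidence` are assumptions made for illustration; the values in the example are invented.

```python
# Hypothetical evidence sources, equally weighted for simplicity.
EVIDENCE_KEYS = (
    "functional_sites",
    "alphafold_plddt",
    "revel_score",
    "clinvar_annotation",
)


def summarize_evidence(evidence: dict) -> tuple:
    """Return (confidence in [0, 1], list of missing evidence keys)."""
    # Only None or an absent key counts as missing, so a legitimate
    # score of 0.0 is still treated as available evidence.
    missing = [key for key in EVIDENCE_KEYS if evidence.get(key) is None]
    confidence = (len(EVIDENCE_KEYS) - len(missing)) / len(EVIDENCE_KEYS)
    return confidence, missing


# Illustrative values only.
confidence, missing = summarize_evidence(
    {"functional_sites": [408], "alphafold_plddt": 92.1, "revel_score": None}
)
# confidence == 0.5, missing == ["revel_score", "clinvar_annotation"]
```

Returning the missing keys alongside the number lets the interface state which evidence is absent instead of only showing a lower score.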

---

## What We Learned

- Structural context substantially improves variant interpretation clarity.
- Explicit uncertainty communication is more clinically valuable than overconfident classification.
- Decision-support framing better matches clinical workflows than binary pathogenic predictions.
- Engineering robustness is as critical as modeling sophistication in translational biomedical systems.

---

## Accomplishments

- Built an end-to-end prototype connecting genomic variants to protein structural reasoning.
- Enabled interpretation of previously unseen mutations without requiring prior observations of the same variant.
- Integrated multi-source biological data into a unified diagnostic workflow.
- Generated clinician-interpretable mechanistic evidence reports.

---

## What's Next for OriGen

- Incorporate phenotype similarity scoring using Human Phenotype Ontology (HPO).
- Add family segregation and inheritance modeling.
- Implement batch VCF processing for clinical-scale deployment.
- Perform retrospective validation on benchmark rare disease datasets.
- Introduce clinician-in-the-loop calibration workflows.
- Establish partnerships with hospitals and genomic diagnostic laboratories.
