Turning biological crime scene evidence into ranked suspect lists using ML — with confidence scores that model uncertainty, not just answers.
The Forensic Biological Evidence Analyzer is a machine learning system that takes biological evidence collected from a crime scene and ranks 30,000 suspects by match confidence. It models uncertainty through transparent confidence scores rather than claiming one correct answer — directly mirroring how real forensic science works.
Built in 24 hours for BACSA Hacks 2026 — Closed Challenge: Forensics Support System.
- 🧬 6 DNA STR markers — same standard used in real forensic labs
- 🩸 19 biological features — blood type, hair, eyes, fingerprint, height + DNA
- 🤖 Gradient Boosting ML model — 96.2% accuracy trained on 30,000 records
- 📊 Confidence scoring — combines ML probability + DNA overlap + physical match
- 📝 Written reasoning — explains WHY each suspect ranked high
- 📈 5-page Streamlit UI — interactive, dark-themed, production-grade
- ⬇️ Export results — download full ranked CSV of all 30,000 suspects
| Page | Description |
|---|---|
| 🏠 Home | Project overview and stats dashboard |
| 🧬 Enter Evidence | Input crime scene biological evidence manually or via CSV upload |
| 📂 Suspect Database | Browse and filter all 30,000 suspects |
| 📊 Analysis & Results | Ranked suspect list with confidence scores and reasoning |
| 📈 Visualizations | Feature importance, heatmaps, and database distribution charts |
Each suspect receives a final confidence score calculated as:

    Final Score = (0.5 × ML Model Probability)
                + (0.3 × DNA STR Allele Overlap %)
                + (0.2 × Physical Trait Match %)

| Score | Label |
|---|---|
| ≥ 70% | 🔴 HIGH |
| 40% to < 70% | 🟠 MEDIUM |
| < 40% | 🔵 LOW |
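
The weighted sum and the label thresholds can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names `final_score` and `confidence_label` are hypothetical, all three inputs are assumed to be fractions in [0, 1], and a score of exactly 70% is assumed to count as HIGH.

```python
# Illustrative sketch only; these function names are hypothetical, not taken from app.py.

def final_score(ml_probability: float, dna_overlap: float, physical_match: float) -> float:
    """Combine the three components (each assumed to be in [0, 1]) using the README weights."""
    return 0.5 * ml_probability + 0.3 * dna_overlap + 0.2 * physical_match


def confidence_label(score: float) -> str:
    """Map a score in [0, 1] to a label; 0.70 itself is assumed to be HIGH."""
    if score >= 0.70:
        return "HIGH"
    if score >= 0.40:
        return "MEDIUM"
    return "LOW"


# Example with made-up component values.
score = final_score(ml_probability=0.8, dna_overlap=0.5, physical_match=0.25)
print(f"{score:.0%} {confidence_label(score)}")
```

Because the weights sum to 1.0, the final score stays in the same 0 to 100% range as its inputs, which is what lets a single threshold table apply to it.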
| Tool | Purpose |
|---|---|
| Python | Core language |
| Pandas & NumPy | Data generation and feature engineering |
| Scikit-learn | Gradient Boosting Classifier |
| Streamlit | Interactive 5-page web app |
| Seaborn & Matplotlib | Visualizations |
Clone the repository and install the dependencies:

    git clone https://github.com/yourusername/forensic-analyzer.git
    cd forensic-analyzer
    pip install scikit-learn pandas numpy matplotlib seaborn streamlit

Generate the synthetic dataset:

    python generate_dataset.py

This creates:
- `data/suspects.csv`: 30,000 suspect records
- `data/crime_scene_evidence.csv`: sample crime scene evidence

Train the model:

    python train_model.py

This creates:
- `models/xgb_model.pkl`: trained model
- `models/metadata.pkl`: encoding metadata
- `models/feature_importance.csv`: feature weights

Launch the app:

    streamlit run app.py

Project layout:

    forensic-analyzer/
    │
    ├── app.py                        # Main Streamlit application
    ├── generate_dataset.py           # Synthetic dataset generator (30,000 records)
    ├── train_model.py                # Model training script
    │
    ├── data/
    │   ├── suspects.csv              # 30,000 suspect records (17 columns)
    │   └── crime_scene_evidence.csv  # Sample crime scene evidence
    │
    └── models/
        ├── xgb_model.pkl             # Trained Gradient Boosting model
        ├── metadata.pkl              # Feature encoding metadata
        └── feature_importance.csv    # Feature importance scores

| Feature | Type | Description |
|---|---|---|
| suspect_id | String | Unique ID (S00001–S30000) |
| blood_type | Categorical | A+, A-, B+, B-, AB+, AB-, O+, O- |
| hair_color | Categorical | Black, Brown, Blonde, Red, Gray, White |
| eye_color | Categorical | Brown, Blue, Green, Hazel, Gray, Amber |
| fingerprint_class | Categorical | Loop, Whorl, Arch, Tented Arch |
| height_cm | Integer | 150–200 cm |
| age | Integer | 18–65 years |
| prior_record | Binary | 0 = No, 1 = Yes |
| dna_marker_1–6 | String | STR allele pairs (e.g. "12,18") |
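
This README does not spell out how the DNA overlap and physical match percentages are derived from these columns. One plausible reading is sketched below, under stated assumptions: alleles are compared marker by marker as multisets, categorical traits must match exactly, and height counts as a match within a tolerance. The function names and the 5 cm tolerance are hypothetical choices for illustration.

```python
from collections import Counter

# Illustrative sketch only; the project's real matching logic may differ.

def parse_alleles(marker: str) -> Counter:
    """Parse an allele pair such as "12,18" into a multiset of allele labels."""
    return Counter(a.strip() for a in marker.split(","))


def allele_overlap(evidence: list[str], suspect: list[str]) -> float:
    """Fraction of evidence alleles that the suspect shares at the same marker."""
    shared = total = 0
    for ev, su in zip(evidence, suspect):
        ev_alleles, su_alleles = parse_alleles(ev), parse_alleles(su)
        shared += sum((ev_alleles & su_alleles).values())  # multiset intersection
        total += sum(ev_alleles.values())
    return shared / total if total else 0.0


def physical_match(evidence: dict, suspect: dict, height_tolerance_cm: int = 5) -> float:
    """Fraction of physical traits that agree (height within an assumed tolerance)."""
    traits = ["blood_type", "hair_color", "eye_color", "fingerprint_class"]
    hits = sum(evidence[t] == suspect[t] for t in traits)
    hits += abs(evidence["height_cm"] - suspect["height_cm"]) <= height_tolerance_cm
    return hits / (len(traits) + 1)


# Made-up evidence and suspect values, using the column names from the table.
evidence_dna = ["12,18", "9,9", "14,15", "7,11", "10,13", "16,17"]
suspect_dna = ["12,18", "9,10", "14,15", "7,12", "10,13", "16,17"]
evidence_traits = {"blood_type": "O+", "hair_color": "Brown", "eye_color": "Blue",
                   "fingerprint_class": "Loop", "height_cm": 178}
suspect_traits = {"blood_type": "O+", "hair_color": "Brown", "eye_color": "Green",
                  "fingerprint_class": "Loop", "height_cm": 175}

print(f"DNA overlap: {allele_overlap(evidence_dna, suspect_dna):.1%}")
print(f"Physical match: {physical_match(evidence_traits, suspect_traits):.1%}")
```

Treating each marker as a multiset means a homozygous evidence pair such as "9,9" only counts as fully shared when the suspect also carries two copies of that allele.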
"Success is measured not by finding one correct answer, but by how well teams model uncertainty, justify assumptions, and reason their forensic interpretations."
- ✅ Ranked suspect list — all 30,000 suspects ranked by confidence
- ✅ Confidence scores — multi-component scoring with uncertainty modeling
- ✅ Reasoning — written explanation for every suspect's ranking
- ✅ Assumption justification — feature importance shows what the model weighted
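
As a rough illustration of the ranked list and written reasoning deliverables, the sketch below sorts a few suspects by final score, attaches a one-line explanation to each, and writes the result as CSV. The records, the field names `ml`, `dna`, and `physical`, and the wording of the reasoning are all made up for illustration; the real app may structure this differently.

```python
import csv
import io

# Made-up component scores (fractions in [0, 1]) for three hypothetical suspects.
suspects = [
    {"suspect_id": "S00001", "ml": 0.20, "dna": 0.25, "physical": 0.40},
    {"suspect_id": "S00002", "ml": 0.90, "dna": 1.00, "physical": 0.80},
    {"suspect_id": "S00003", "ml": 0.50, "dna": 0.50, "physical": 0.60},
]


def explain(s: dict) -> str:
    """Name the component that contributes most to the weighted score."""
    contributions = {
        "ML probability": 0.5 * s["ml"],
        "DNA allele overlap": 0.3 * s["dna"],
        "physical trait match": 0.2 * s["physical"],
    }
    top = max(contributions, key=contributions.get)
    return (f"Driven mainly by {top}; ML {s['ml']:.0%}, "
            f"DNA {s['dna']:.0%}, physical {s['physical']:.0%}")


for s in suspects:
    s["score"] = round(0.5 * s["ml"] + 0.3 * s["dna"] + 0.2 * s["physical"], 4)
    s["reasoning"] = explain(s)

# Highest score first.
ranked = sorted(suspects, key=lambda s: s["score"], reverse=True)

# Write to an in-memory buffer; a real export would write to a file or download.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["rank", "suspect_id", "score", "reasoning"])
writer.writeheader()
for rank, s in enumerate(ranked, start=1):
    writer.writerow({"rank": rank, "suspect_id": s["suspect_id"],
                     "score": s["score"], "reasoning": s["reasoning"]})

print(buffer.getvalue())
```

Keeping the per-component values in the reasoning string lets a reader see why a suspect ranked where they did without re-running the model.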
Built with ❤️ for BACSA Hacks 2026 — Biotech and Computer Science Association, University of Toronto.
MIT License — free to use, modify, and distribute.