Charul04/Forensic-Biological-Evidence-Analyzer

🔬 Forensic Biological Evidence Analyzer

BACSA Hacks 2026 — University of Toronto

Turning biological crime scene evidence into ranked suspect lists using ML — with confidence scores that model uncertainty, not just answers.


📌 About

The Forensic Biological Evidence Analyzer is a machine learning system that takes biological evidence collected from a crime scene and ranks 30,000 suspects by match confidence. It models uncertainty through transparent confidence scores rather than claiming one correct answer — directly mirroring how real forensic science works.

Built in 24 hours for BACSA Hacks 2026 — Closed Challenge: Forensics Support System.


🎯 Features

  • 🧬 6 DNA STR markers: the same marker type that real forensic labs profile, though casework panels are larger (the US CODIS core set has 20 loci)
  • 🩸 19 biological features: blood type, hair, eyes, fingerprint, height, and DNA
  • 🤖 Gradient Boosting ML model: 96.2% accuracy, trained on 30,000 records
  • 📊 Confidence scoring: a weighted combination of ML probability, DNA allele overlap, and physical trait match
  • 📝 Written reasoning: explains why each suspect ranked where they did
  • 📈 5-page Streamlit UI: interactive, dark-themed, production-grade
  • ⬇️ Export results: download the full ranked CSV of all 30,000 suspects

🖥️ App Pages

Page | Description
🏠 Home | Project overview and stats dashboard
🧬 Enter Evidence | Input crime scene biological evidence manually or via CSV upload
📂 Suspect Database | Browse and filter all 30,000 suspects
📊 Analysis & Results | Ranked suspect list with confidence scores and reasoning
📈 Visualizations | Feature importance, heatmaps, and database distribution charts

🧠 How the Confidence Score Works

Each suspect receives a final confidence score calculated as:

Final Score = (0.5 × ML Model Probability)
            + (0.3 × DNA STR Allele Overlap %)
            + (0.2 × Physical Trait Match %)

Score | Label
≥ 70% | 🔴 HIGH
40% to < 70% | 🟠 MEDIUM
< 40% | 🔵 LOW
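
As a minimal sketch of the formula and labels above, assuming all three components are expressed as fractions between 0 and 1 (the function names `final_score` and `score_label` are illustrative and not taken from `app.py`):

```python
def final_score(ml_probability, dna_overlap, physical_match):
    """Weighted sum of the three components, each expected in the range 0.0-1.0."""
    for value in (ml_probability, dna_overlap, physical_match):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must be a fraction between 0 and 1")
    return 0.5 * ml_probability + 0.3 * dna_overlap + 0.2 * physical_match


def score_label(score):
    """Map a final score (0.0-1.0) to the HIGH / MEDIUM / LOW bands."""
    if score >= 0.70:
        return "HIGH"
    if score >= 0.40:
        return "MEDIUM"
    return "LOW"


# Example: strong model probability, 9 of 12 alleles shared, 3 of 5 traits matched.
example = final_score(0.9, 0.75, 0.6)
print(f"{example:.1%} -> {score_label(example)}")
```

Because the weights sum to 1, the final score stays in the same 0 to 1 range as its inputs, so the percentage bands apply directly.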

🛠️ Tech Stack

Tool | Purpose
Python | Core language
Pandas & NumPy | Data generation and feature engineering
Scikit-learn | Gradient Boosting Classifier
Streamlit | Interactive 5-page web app
Seaborn & Matplotlib | Visualizations

🚀 Getting Started

1. Clone the repo

git clone https://github.com/Charul04/Forensic-Biological-Evidence-Analyzer.git
cd Forensic-Biological-Evidence-Analyzer

2. Install dependencies

pip install scikit-learn pandas numpy matplotlib seaborn streamlit

3. Generate the dataset

python generate_dataset.py

Creates:

  • data/suspects.csv — 30,000 suspect records
  • data/crime_scene_evidence.csv — sample crime scene evidence
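
For orientation, here is a rough standard-library sketch of what generating such records could look like. It is not the contents of `generate_dataset.py`: the uniform sampling, the allele range of 8 to 20, and the helper names are all assumptions, and only the column names and value ranges come from the Dataset Features table in this README.

```python
import csv
import io
import random

BLOOD_TYPES = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]
HAIR_COLORS = ["Black", "Brown", "Blonde", "Red", "Gray", "White"]
EYE_COLORS = ["Brown", "Blue", "Green", "Hazel", "Gray", "Amber"]
FINGERPRINT_CLASSES = ["Loop", "Whorl", "Arch", "Tented Arch"]


def random_allele_pair(rng, low=8, high=20):
    """Return an STR allele pair such as "12,18" (the allele range is an assumption)."""
    first, second = sorted((rng.randint(low, high), rng.randint(low, high)))
    return f"{first},{second}"


def make_suspect(index, rng):
    """Build one synthetic suspect record using uniform sampling (an assumption)."""
    record = {
        "suspect_id": f"S{index:05d}",
        "blood_type": rng.choice(BLOOD_TYPES),
        "hair_color": rng.choice(HAIR_COLORS),
        "eye_color": rng.choice(EYE_COLORS),
        "fingerprint_class": rng.choice(FINGERPRINT_CLASSES),
        "height_cm": rng.randint(150, 200),
        "age": rng.randint(18, 65),
        "prior_record": rng.randint(0, 1),
    }
    for marker in range(1, 7):
        record[f"dna_marker_{marker}"] = random_allele_pair(rng)
    return record


def write_suspects(stream, count, seed=0):
    """Write `count` (at least 1) records as CSV to `stream` and return them."""
    rng = random.Random(seed)
    rows = [make_suspect(i, rng) for i in range(1, count + 1)]
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return rows


# Example: write three records to an in-memory buffer instead of data/suspects.csv.
buffer = io.StringIO()
sample_rows = write_suspects(buffer, count=3)
print(buffer.getvalue())
```

Seeding the generator keeps the synthetic database reproducible between runs, which makes later accuracy numbers easier to compare.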

4. Train the model

python train_model.py

Creates:

  • models/xgb_model.pkl: trained scikit-learn Gradient Boosting model (despite the xgb prefix in the filename)
  • models/metadata.pkl: encoding metadata
  • models/feature_importance.csv: feature weights

5. Run the app

streamlit run app.py

📁 Project Structure

forensic-analyzer/
│
├── app.py                    # Main Streamlit application
├── generate_dataset.py       # Synthetic dataset generator (30,000 records)
├── train_model.py            # Model training script
│
├── data/
│   ├── suspects.csv          # 30,000 suspect records (17 columns)
│   └── crime_scene_evidence.csv  # Sample crime scene evidence
│
└── models/
    ├── xgb_model.pkl         # Trained Gradient Boosting model
    ├── metadata.pkl          # Feature encoding metadata
    └── feature_importance.csv # Feature importance scores

📊 Dataset Features

Feature | Type | Description
suspect_id | String | Unique ID (S00001–S30000)
blood_type | Categorical | A+, A-, B+, B-, AB+, AB-, O+, O-
hair_color | Categorical | Black, Brown, Blonde, Red, Gray, White
eye_color | Categorical | Brown, Blue, Green, Hazel, Gray, Amber
fingerprint_class | Categorical | Loop, Whorl, Arch, Tented Arch
height_cm | Integer | 150–200 cm
age | Integer | 18–65 years
prior_record | Binary | 0 = No, 1 = Yes
dna_marker_1–6 | String | STR allele pairs (e.g. "12,18")
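
The two non-ML components of the confidence score can be derived from these columns. The sketch below shows one plausible way to do it and may differ from what `app.py` actually does: counting shared alleles per marker, treating height as a match within a 5 cm tolerance, and the function names are all assumptions.

```python
def parse_alleles(pair):
    """Turn an STR allele pair such as "12,18" into a sorted list of ints."""
    return sorted(int(value) for value in pair.split(","))


def shared_alleles(evidence_pair, suspect_pair):
    """Count alleles shared at one marker (0, 1, or 2), respecting duplicates."""
    remaining = parse_alleles(suspect_pair)
    shared = 0
    for allele in parse_alleles(evidence_pair):
        if allele in remaining:
            remaining.remove(allele)
            shared += 1
    return shared


def dna_overlap(evidence, suspect, markers=6):
    """Fraction of alleles shared across dna_marker_1 .. dna_marker_<markers>."""
    total = sum(
        shared_alleles(evidence[f"dna_marker_{i}"], suspect[f"dna_marker_{i}"])
        for i in range(1, markers + 1)
    )
    return total / (2 * markers)


PHYSICAL_TRAITS = ("blood_type", "hair_color", "eye_color", "fingerprint_class")


def physical_match(evidence, suspect, height_tolerance_cm=5):
    """Fraction of physical traits that match; the height tolerance is an assumption."""
    hits = sum(evidence[trait] == suspect[trait] for trait in PHYSICAL_TRAITS)
    if abs(int(evidence["height_cm"]) - int(suspect["height_cm"])) <= height_tolerance_cm:
        hits += 1
    return hits / (len(PHYSICAL_TRAITS) + 1)
```

Removing each matched allele from the suspect's list means a homozygous evidence pair such as "12,12" only counts twice against a suspect who also carries two copies of allele 12.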

🏆 Judging Criteria Addressed

"Success is measured not by finding one correct answer, but by how well teams model uncertainty, justify assumptions, and reason their forensic interpretations."

  • Ranked suspect list — all 30,000 suspects ranked by confidence
  • Confidence scores — multi-component scoring with uncertainty modeling
  • Reasoning — written explanation for every suspect's ranking
  • Assumption justification — feature importance shows what the model weighted
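
To illustrate how a ranked list with written reasoning and a CSV export might fit together, here is a small sketch. The input field names, the wording of the reasoning sentence, and the output columns are hypothetical; only the 0.5 / 0.3 / 0.2 weights come from this README.

```python
import csv
import io

# Weights from the confidence score formula in this README.
WEIGHTS = {"ml_probability": 0.5, "dna_overlap": 0.3, "physical_match": 0.2}


def score(candidate):
    """Weighted confidence score for one candidate (components are 0.0-1.0)."""
    return sum(weight * candidate[name] for name, weight in WEIGHTS.items())


def reasoning(candidate):
    """One-sentence explanation naming the component that contributed most."""
    contributions = {name: weight * candidate[name] for name, weight in WEIGHTS.items()}
    top = max(contributions, key=contributions.get)
    top_name = top.replace("_", " ")
    return (
        f"Score {score(candidate):.1%}, driven mostly by {top_name} "
        f"({candidate[top]:.0%})."
    )


def rank_suspects(candidates):
    """Return rows sorted by descending score, with rank and reasoning attached."""
    ordered = sorted(candidates, key=score, reverse=True)
    return [
        {
            "rank": position,
            "suspect_id": candidate["suspect_id"],
            "score": round(score(candidate), 4),
            "reasoning": reasoning(candidate),
        }
        for position, candidate in enumerate(ordered, start=1)
    ]


def export_csv(rows, stream):
    """Write ranked rows to any text stream (a file or an in-memory buffer)."""
    writer = csv.DictWriter(stream, fieldnames=["rank", "suspect_id", "score", "reasoning"])
    writer.writeheader()
    writer.writerows(rows)


# Example with two made-up candidates.
candidates = [
    {"suspect_id": "S00002", "ml_probability": 0.2, "dna_overlap": 0.5, "physical_match": 0.4},
    {"suspect_id": "S00001", "ml_probability": 0.9, "dna_overlap": 0.75, "physical_match": 0.6},
]
ranked = rank_suspects(candidates)
buffer = io.StringIO()
export_csv(ranked, buffer)
print(buffer.getvalue())
```

Keeping the per-component contributions next to the final score is what lets the explanation say which evidence drove a ranking, rather than reporting only the number.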

👨‍💻 Built By

Built with ❤️ for BACSA Hacks 2026 — Biotech and Computer Science Association, University of Toronto.


📄 License

MIT License — free to use, modify, and distribute.
