UChicago AI+Science Hackathon 2026 | Team: Data Geeks
We built a clinical knowledge graph (KG) extraction pipeline that takes doctor-patient transcripts and extracts structured entities and relationships. The composite score improved from a naive baseline of 0.562 to 0.874 on training data and 0.872 on a holdout set of 12 patients the model had never seen.
We also added three bonus features on top: automated differential diagnosis using a local LLM, full speech-to-text transcription with speaker diarization, and audio timestamp tagging that links every KG node back to the exact moment it was mentioned in the recording.
| Method | Composite | Entity F1 | Population | Relation | Schema |
|---|---|---|---|---|---|
| Naive baseline | 0.562 | 0.428 | 0.432 | 0.674 | 0.714 |
| Our pipeline (training, 20 patients) | 0.874 | 0.550 | 0.900 | 0.850 | 1.000 |
| Our pipeline (holdout, 12 unseen) | 0.872 | 0.488 | 1.000 | 1.000 | 1.000 |
The 0.002 gap between training and holdout suggests the pipeline is not overfitting to the provided transcripts.
The main KG extraction runs three passes per transcript:
- Pass 1 (GLM flash): entity extraction using curator-aligned vocabulary. Targets 22-26 nodes per transcript, including negated symptoms ("absent cough"), family history, and lifestyle factors.
- Pass 2 (GLM flash): enrichment pass that fills gaps in MEDICAL_HISTORY and LAB_RESULT nodes the first pass may have missed.
- Pass 3 (gpt-oss-20b): relation extraction over the finalized node set. Runs in parallel with Pass 2.
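The three-pass flow can be sketched as follows. The function bodies are illustrative stand-ins for the actual OpenRouter calls, not the pipeline's real API; only the orchestration (Pass 1 first, Passes 2 and 3 overlapping) comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real model calls; payloads are illustrative only.
def extract_entities(transcript):          # Pass 1: GLM flash
    return [{"id": "n1", "type": "SYMPTOM", "label": "absent cough"}]

def enrich_entities(transcript, nodes):    # Pass 2: GLM flash, fills gaps
    return nodes + [{"id": "n2", "type": "LAB_RESULT", "label": "WBC 12.1"}]

def extract_relations(transcript, nodes):  # Pass 3: gpt-oss-20b
    return [{"source": "n1", "target": "n2", "type": "ASSOCIATED_WITH"}]

def run_pipeline(transcript):
    nodes = extract_entities(transcript)            # Pass 1 must finish first
    with ThreadPoolExecutor(max_workers=2) as pool: # Passes 2 and 3 overlap
        enrich_f = pool.submit(enrich_entities, transcript, nodes)
        rel_f = pool.submit(extract_relations, transcript, nodes)
        return {"nodes": enrich_f.result(), "edges": rel_f.result()}
```

Running Pass 3 on the Pass-1 node set while Pass 2 enriches in parallel is what keeps per-patient latency near 1.3 minutes.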
After extraction, dump_graph.py merges all 20 per-patient KGs into a unified graph using BGE-M3 embeddings (0.85 cosine threshold) for entity resolution.
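A minimal sketch of threshold-based entity resolution, using toy 2-d vectors in place of real BGE-M3 embeddings; the greedy clustering shown here is an assumption, and the actual merge logic in dump_graph.py may differ.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_entities(labels, embeddings, threshold=0.85):
    """Greedy entity resolution: each label joins the first existing
    cluster whose representative embedding clears the cosine threshold."""
    clusters = []  # list of (representative_embedding, [member labels])
    for label, emb in zip(labels, embeddings):
        for rep, members in clusters:
            if cosine(rep, emb) >= threshold:
                members.append(label)
                break
        else:
            clusters.append((emb, [label]))
    return [members for _, members in clusters]

labels = ["shortness of breath", "dyspnea", "fever"]
vecs = [np.array([0.9, 0.1]), np.array([0.88, 0.15]), np.array([0.1, 0.95])]
merge_entities(labels, vecs)  # "dyspnea" merges into "shortness of breath"
```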
Cost: ~$0.012 per patient | Time: ~1.3 min per patient | Total for 20 patients: ~$0.24, ~26 min
```bash
git clone https://github.com/Estella-Hu/Data-Geeks.git
cd Data-Geeks/Clinical_KG_OS_LLM

# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
uv sync
cp api_keys_example.json api_keys.json
# Add your OpenRouter key to api_keys.json
```

```bash
# Step 1: KG extraction (all 20 patients)
uv run python -m Clinical_KG_OS_LLM.kg_extraction --output ./my_kg

# Step 2: Entity resolution and merging
uv run python -m Clinical_KG_OS_LLM.dump_graph --input ./my_kg --output ./my_kg_unified

# Step 3: Score against the curated baseline
uv run python -m Clinical_KG_OS_LLM.kg_similarity_scorer \
    --student ./my_kg_unified/unified_graph_my_kg.json \
    --baseline ./data/human_curated/unified_graph_curated.json
```

To test on a subset first:

```bash
uv run python -m Clinical_KG_OS_LLM.kg_extraction --output ./my_kg --res-ids RES0198 RES0199 RES0200
```

The automated diagnosis module reads the unified KG and generates a differential diagnosis with probabilities for each patient, using llama3.1:8b running locally via Ollama. No API calls are needed for this step.
Requires Ollama running locally (ollama serve) with llama3.1:8b pulled.
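The call into Ollama can be sketched like this. The prompt format and node schema are assumptions (the notebook's actual prompt isn't shown here); only the REST endpoint and request shape follow Ollama's documented generate API.

```python
import json
import urllib.request

def build_prompt(nodes):
    """Assemble a differential-diagnosis prompt from KG nodes (illustrative format)."""
    findings = "\n".join(f"- {n['type']}: {n['label']}" for n in nodes)
    return (
        "Given these findings from a clinical knowledge graph, list the most "
        "likely diagnoses with probabilities summing to 100%:\n" + findings
    )

def diagnose(nodes, model="llama3.1:8b", url="http://localhost:11434/api/generate"):
    """Call the local Ollama REST API (requires `ollama serve` to be running)."""
    payload = json.dumps({"model": model, "prompt": build_prompt(nodes), "stream": False})
    req = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```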
```bash
# Open the notebook
jupyter lab automatized_diagnosis/demo_diagnosis.ipynb
```

Pre-computed results for all 20 patients are saved in automatized_diagnosis/results/.
Sample output for RES0200:

```
Most likely: COPD Exacerbation (60%)
Differential:
  1. COPD Exacerbation  60%
  2. Pneumonia          30%
  3. Asthma             10%
```
The ASR pipeline goes directly from .mp3 to a KG-ready transcript using OpenAI Whisper with a clinical vocabulary prompt and heuristic speaker diarization (pause detection assigns [D-N]/[P-N] speaker labels).
Requires openai-whisper and ffmpeg.
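The pause-based heuristic might look like the following sketch. The one-second threshold and the [D-N]/[P-N] numbering scheme are assumptions, as is the convention that the doctor speaks first; asr_pipeline.py's actual rules may differ.

```python
def diarize(segments, pause_threshold=1.0):
    """Heuristic diarization over Whisper segments: a silence longer than
    pause_threshold seconds between segments flips the speaker and starts
    a new numbered turn."""
    speakers = ("D", "P")  # assumed: doctor speaks first
    idx, turn = 0, 1
    lines, prev_end = [], None
    for seg in segments:
        if prev_end is not None and seg["start"] - prev_end > pause_threshold:
            idx = 1 - idx  # long pause -> other speaker, new turn
            turn += 1
        lines.append(f"[{speakers[idx]}-{turn}] {seg['text'].strip()}")
        prev_end = seg["end"]
    return lines

segs = [
    {"start": 0.0, "end": 2.0, "text": "What brings you in today?"},
    {"start": 3.5, "end": 6.0, "text": "I've had a cough for two weeks."},
]
diarize(segs)  # the 1.5 s gap flips D -> P
```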
```bash
pip install openai-whisper
# ffmpeg: brew install ffmpeg (Mac) / apt install ffmpeg (Linux)

# Transcribe all 20 patients
python asr/asr_pipeline.py --all --model large-v3

# Evaluate WER vs ground truth transcripts
python asr/evaluate_wer.py
```

Baseline to beat: WER 16.7% (Whisper base model).
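evaluate_wer.py's exact text normalization isn't shown here, but WER itself is the standard word-level edit distance divided by reference length:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    computed via dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

wer("the patient denies chest pain", "the patient denies pain")  # -> 0.2
```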
The timestamp tagger annotates every KG node with the exact second it was mentioned in the audio recording. It uses faster-whisper with word_timestamps=True and matches each node's evidence text against the word stream using normalized word overlap and sequence similarity.
This makes the KG interactive: clicking a node can jump to that moment in the recording.
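A sketch of the matching step, assuming faster-whisper-style word dicts with word/start/end keys; the 50/50 blend of overlap and sequence similarity is illustrative, not the pipeline's exact weighting.

```python
from difflib import SequenceMatcher

def norm(text):
    return [w.strip(".,!?").lower() for w in text.split()]

def locate_evidence(evidence, words, overlap_weight=0.5):
    """Slide a window the length of the evidence across the word stream and
    score each window by normalized word overlap blended with sequence
    similarity; return the (start, end) seconds of the best window."""
    ev = norm(evidence)
    n = len(ev)
    best, best_span = -1.0, None
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        win_tokens = [w["word"].strip(".,!?").lower() for w in window]
        overlap = len(set(ev) & set(win_tokens)) / n
        seq = SequenceMatcher(None, ev, win_tokens).ratio()
        score = overlap_weight * overlap + (1 - overlap_weight) * seq
        if score > best:
            best, best_span = score, (window[0]["start"], window[-1]["end"])
    return best_span

words = [
    {"word": "She", "start": 41.9, "end": 42.1},
    {"word": "denies", "start": 42.3, "end": 42.8},
    {"word": "chest", "start": 43.0, "end": 43.6},
    {"word": "pain", "start": 43.8, "end": 45.1},
    {"word": "today", "start": 45.2, "end": 45.6},
]
locate_evidence("denies chest pain", words)  # -> (42.3, 45.1)
```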
```bash
pip install faster-whisper

# Step 1: extract word-level timestamps from audio
python audio_timestamps/extract_timestamps.py --all --model base

# Step 2: tag KG nodes with timestamps
python audio_timestamps/tag_kg_nodes.py --all \
    --kg-dir my_kg \
    --timestamps-dir audio_timestamps/timestamps \
    --output-dir my_kg_timestamped
```

Each node gets two new fields:
```json
{
  "audio_start": 42.3,
  "audio_end": 45.1
}
```

Repository layout:

```
Clinical_KG_OS_LLM/
├── src/Clinical_KG_OS_LLM/
│   ├── kg_extraction.py         # main V7f extraction pipeline
│   ├── dump_graph.py            # entity resolution and KG merging
│   ├── kg_similarity_scorer.py  # composite score evaluation
│   └── paths.py                 # path utilities
├── asr/
│   ├── asr_pipeline.py          # mp3 to transcript with speaker diarization
│   ├── evaluate_wer.py          # WER vs ground truth, comparison to baseline
│   └── create_splits.py         # train/holdout split management
├── automatized_diagnosis/
│   ├── demo_diagnosis.ipynb     # differential diagnosis notebook
│   └── results/                 # pre-computed diagnoses for all 20 patients
├── audio_timestamps/
│   ├── extract_timestamps.py    # faster-whisper word-level timestamp extraction
│   └── tag_kg_nodes.py          # links KG node evidence to audio timestamps
└── data/
    ├── transcripts/             # 20 patient folders with txt and mp3
    ├── human_curated/           # reference KG for scoring
    ├── my_kg/                   # per-patient KG output files
    └── holdout_eval/            # holdout evaluation results
```
All extraction uses OpenRouter. No local GPU needed for the KG pipeline.
| Stage | Model | Purpose |
|---|---|---|
| Pass 1 + 2 | z-ai/glm-4.7-flash | Entity extraction and enrichment |
| Pass 3 | openai/gpt-oss-20b | Relation extraction |
| Diagnosis | llama3.1:8b (Ollama, local) | Differential diagnosis |
| ASR | Whisper large-v3 (local) | Speech-to-text |
| Timestamps | faster-whisper base (local) | Word-level timestamps |
The submission JSON is my_kg_unified/unified_graph_Data_Geeks.json.