UChicago AI+Science Hackathon 2026 | Team: Data Geeks
We built a clinical knowledge graph (KG) extraction pipeline that takes doctor-patient transcripts and extracts structured entities and relationships. The composite score improved from a naive baseline of 0.562 to 0.874 on training data and 0.872 on a holdout set of 12 patients the model had never seen.
We also added three bonus features on top: automated differential diagnosis using a local LLM, full speech-to-text transcription with speaker diarization, and audio timestamp tagging that links every KG node back to the exact moment it was mentioned in the recording.
| Method | Composite | Entity F1 | Population | Relation | Schema |
|---|---|---|---|---|---|
| Naive baseline | 0.562 | 0.428 | 0.432 | 0.674 | 0.714 |
| Our pipeline (training, 20 patients) | 0.874 | 0.550 | 0.900 | 0.850 | 1.000 |
| Our pipeline (holdout, 12 unseen) | 0.872 | 0.488 | 1.000 | 1.000 | 1.000 |
The 0.002 gap between training and holdout suggests the pipeline is not overfitting to the provided transcripts.
The main KG extraction runs three passes per transcript:
- Pass 1 (GLM flash): entity extraction using curator-aligned vocabulary. Targets 22-26 nodes per transcript, including negated symptoms ("absent cough"), family history, and lifestyle factors.
- Pass 2 (GLM flash): enrichment pass that fills gaps in MEDICAL_HISTORY and LAB_RESULT nodes the first pass may have missed.
- Pass 3 (gpt-oss-20b): relation extraction over the finalized node set. Runs in parallel with Pass 2.
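The three-pass flow can be sketched as follows. The function bodies are illustrative stand-ins for the actual OpenRouter calls, not the pipeline's real API; only the orchestration (Pass 1 first, Passes 2 and 3 overlapping) comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real model calls; payloads are illustrative only.
def extract_entities(transcript):          # Pass 1: GLM flash
    return [{"id": "n1", "type": "SYMPTOM", "label": "absent cough"}]

def enrich_entities(transcript, nodes):    # Pass 2: GLM flash, fills gaps
    return nodes + [{"id": "n2", "type": "LAB_RESULT", "label": "WBC 12.1"}]

def extract_relations(transcript, nodes):  # Pass 3: gpt-oss-20b
    return [{"source": "n1", "target": "n2", "type": "ASSOCIATED_WITH"}]

def run_pipeline(transcript):
    nodes = extract_entities(transcript)            # Pass 1 must finish first
    with ThreadPoolExecutor(max_workers=2) as pool: # Passes 2 and 3 overlap
        enrich_f = pool.submit(enrich_entities, transcript, nodes)
        rel_f = pool.submit(extract_relations, transcript, nodes)
        return {"nodes": enrich_f.result(), "edges": rel_f.result()}
```

Running Pass 3 on the Pass-1 node set while Pass 2 enriches in parallel is what keeps per-patient latency near 1.3 minutes.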
After extraction, dump_graph.py merges all 20 per-patient KGs into a unified graph using BGE-M3 embeddings (0.85 cosine threshold) for entity resolution.
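A minimal sketch of threshold-based entity resolution, using toy 2-d vectors in place of real BGE-M3 embeddings; the greedy clustering shown here is an assumption, and the actual merge logic in dump_graph.py may differ.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_entities(labels, embeddings, threshold=0.85):
    """Greedy entity resolution: each label joins the first existing
    cluster whose representative embedding clears the cosine threshold."""
    clusters = []  # list of (representative_embedding, [member labels])
    for label, emb in zip(labels, embeddings):
        for rep, members in clusters:
            if cosine(rep, emb) >= threshold:
                members.append(label)
                break
        else:
            clusters.append((emb, [label]))
    return [members for _, members in clusters]

labels = ["shortness of breath", "dyspnea", "fever"]
vecs = [np.array([0.9, 0.1]), np.array([0.88, 0.15]), np.array([0.1, 0.95])]
merge_entities(labels, vecs)  # "dyspnea" merges into "shortness of breath"
```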
Cost: ~$0.012 per patient | Time: ~1.3 min per patient | Total for 20 patients: ~$0.24, ~26 min
```bash
git clone https://github.com/Estella-Hu/Data-Geeks.git
cd Data-Geeks/Clinical_KG_OS_LLM

# Install uv first: https://docs.astral.sh/uv/getting-started/installation/
uv sync
cp api_keys_example.json api_keys.json
# Add your OpenRouter key to api_keys.json
```

```bash
# Step 1: KG extraction (all 20 patients)
uv run python -m Clinical_KG_OS_LLM.kg_extraction --output ./my_kg

# Step 2: Entity resolution and merging
uv run python -m Clinical_KG_OS_LLM.dump_graph --input ./my_kg --output ./my_kg_unified

# Step 3: Score against the curated baseline
uv run python -m Clinical_KG_OS_LLM.kg_similarity_scorer \
    --student ./my_kg_unified/unified_graph_my_kg.json \
    --baseline ./data/human_curated/unified_graph_curated.json
```

To test on a subset first:

```bash
uv run python -m Clinical_KG_OS_LLM.kg_extraction --output ./my_kg --res-ids RES0198 RES0199 RES0200
```

The automated diagnosis module reads the unified KG and generates a differential diagnosis with probabilities for each patient, using llama3.1:8b running locally via Ollama. No API calls are needed for this step.
Requires Ollama running locally (ollama serve) with llama3.1:8b pulled.
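The call into Ollama can be sketched like this. The prompt format and node schema are assumptions (the notebook's actual prompt isn't shown here); only the REST endpoint and request shape follow Ollama's documented generate API.

```python
import json
import urllib.request

def build_prompt(nodes):
    """Assemble a differential-diagnosis prompt from KG nodes (illustrative format)."""
    findings = "\n".join(f"- {n['type']}: {n['label']}" for n in nodes)
    return (
        "Given these findings from a clinical knowledge graph, list the most "
        "likely diagnoses with probabilities summing to 100%:\n" + findings
    )

def diagnose(nodes, model="llama3.1:8b", url="http://localhost:11434/api/generate"):
    """Call the local Ollama REST API (requires `ollama serve` to be running)."""
    payload = json.dumps({"model": model, "prompt": build_prompt(nodes), "stream": False})
    req = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```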
```bash
# Open the notebook
jupyter lab automatized_diagnosis/demo_diagnosis.ipynb
```

Pre-computed results for all 20 patients are saved in automatized_diagnosis/results/.
Sample output for RES0200:

```
Most likely: COPD Exacerbation (60%)
Differential:
  1. COPD Exacerbation  60%
  2. Pneumonia          30%
  3. Asthma             10%
```
The ASR pipeline goes directly from .mp3 to a KG-ready transcript using OpenAI Whisper with a clinical vocabulary prompt and heuristic speaker diarization (pause detection assigns [D-N]/[P-N] speaker labels).
Requires openai-whisper and ffmpeg.
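The pause-based heuristic might look like the following sketch. The one-second threshold and the [D-N]/[P-N] numbering scheme are assumptions, as is the convention that the doctor speaks first; asr_pipeline.py's actual rules may differ.

```python
def diarize(segments, pause_threshold=1.0):
    """Heuristic diarization over Whisper segments: a silence longer than
    pause_threshold seconds between segments flips the speaker and starts
    a new numbered turn."""
    speakers = ("D", "P")  # assumed: doctor speaks first
    idx, turn = 0, 1
    lines, prev_end = [], None
    for seg in segments:
        if prev_end is not None and seg["start"] - prev_end > pause_threshold:
            idx = 1 - idx  # long pause -> other speaker, new turn
            turn += 1
        lines.append(f"[{speakers[idx]}-{turn}] {seg['text'].strip()}")
        prev_end = seg["end"]
    return lines

segs = [
    {"start": 0.0, "end": 2.0, "text": "What brings you in today?"},
    {"start": 3.5, "end": 6.0, "text": "I've had a cough for two weeks."},
]
diarize(segs)  # the 1.5 s gap flips D -> P
```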
```bash
pip install openai-whisper
# ffmpeg: brew install ffmpeg (Mac) / apt install ffmpeg (Linux)

# Transcribe all 20 patients
python asr/asr_pipeline.py --all --model large-v3

# Evaluate WER vs ground truth transcripts
python asr/evaluate_wer.py
```

Baseline to beat: WER 16.7% (Whisper base model).
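evaluate_wer.py's exact text normalization isn't shown here, but WER itself is the standard word-level edit distance divided by reference length:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    computed via dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

wer("the patient denies chest pain", "the patient denies pain")  # -> 0.2
```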
The timestamp tagger annotates every KG node with the exact second it was mentioned in the audio recording. It uses faster-whisper with word_timestamps=True and matches each node's evidence text against the word stream using normalized word overlap and sequence similarity.
This makes the KG interactive: clicking a node can jump to that moment in the recording.
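A sketch of the matching step, assuming faster-whisper-style word dicts with word/start/end keys; the 50/50 blend of overlap and sequence similarity is illustrative, not the pipeline's exact weighting.

```python
from difflib import SequenceMatcher

def norm(text):
    return [w.strip(".,!?").lower() for w in text.split()]

def locate_evidence(evidence, words, overlap_weight=0.5):
    """Slide a window the length of the evidence across the word stream and
    score each window by normalized word overlap blended with sequence
    similarity; return the (start, end) seconds of the best window."""
    ev = norm(evidence)
    n = len(ev)
    best, best_span = -1.0, None
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        win_tokens = [w["word"].strip(".,!?").lower() for w in window]
        overlap = len(set(ev) & set(win_tokens)) / n
        seq = SequenceMatcher(None, ev, win_tokens).ratio()
        score = overlap_weight * overlap + (1 - overlap_weight) * seq
        if score > best:
            best, best_span = score, (window[0]["start"], window[-1]["end"])
    return best_span

words = [
    {"word": "She", "start": 41.9, "end": 42.1},
    {"word": "denies", "start": 42.3, "end": 42.8},
    {"word": "chest", "start": 43.0, "end": 43.6},
    {"word": "pain", "start": 43.8, "end": 45.1},
    {"word": "today", "start": 45.2, "end": 45.6},
]
locate_evidence("denies chest pain", words)  # -> (42.3, 45.1)
```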
```bash
pip install faster-whisper

# Step 1: extract word-level timestamps from audio
python audio_timestamps/extract_timestamps.py --all --model base

# Step 2: tag KG nodes with timestamps
python audio_timestamps/tag_kg_nodes.py --all \
    --kg-dir my_kg \
    --timestamps-dir audio_timestamps/timestamps \
    --output-dir my_kg_timestamped
```

Each node gets two new fields:
```json
{
  "audio_start": 42.3,
  "audio_end": 45.1
}
```

Repository layout:

```
Clinical_KG_OS_LLM/
├── src/Clinical_KG_OS_LLM/
│   ├── kg_extraction.py         # main V7f extraction pipeline
│   ├── dump_graph.py            # entity resolution and KG merging
│   ├── kg_similarity_scorer.py  # composite score evaluation
│   └── paths.py                 # path utilities
├── asr/
│   ├── asr_pipeline.py          # mp3 to transcript with speaker diarization
│   ├── evaluate_wer.py          # WER vs ground truth, comparison to baseline
│   └── create_splits.py         # train/holdout split management
├── automatized_diagnosis/
│   ├── demo_diagnosis.ipynb     # differential diagnosis notebook
│   └── results/                 # pre-computed diagnoses for all 20 patients
├── audio_timestamps/
│   ├── extract_timestamps.py    # faster-whisper word-level timestamp extraction
│   └── tag_kg_nodes.py          # links KG node evidence to audio timestamps
└── data/
    ├── transcripts/             # 20 patient folders with txt and mp3
    ├── human_curated/           # reference KG for scoring
    ├── my_kg/                   # per-patient KG output files
    └── holdout_eval/            # holdout evaluation results
```
All extraction uses OpenRouter. No local GPU needed for the KG pipeline.
| Stage | Model | Purpose |
|---|---|---|
| Pass 1 + 2 | z-ai/glm-4.7-flash | Entity extraction and enrichment |
| Pass 3 | openai/gpt-oss-20b | Relation extraction |
| Diagnosis | llama3.1:8b (Ollama, local) | Differential diagnosis |
| ASR | Whisper large-v3 (local) | Speech-to-text |
| Timestamps | faster-whisper base (local) | Word-level timestamps |
The submission JSON is my_kg_unified/unified_graph_Data_Geeks.json.