Inspiration
In healthcare, a single missed drug interaction or an overlooked allergy can be fatal. Existing clinical decision support systems are either too rigid (simple rule engines that miss context) or too unreliable (LLMs that hallucinate clinical facts). We asked: what if we could build an AI agent that combines the reliability of deterministic rules with the intelligence of large language models — and prove that nothing is hallucinated?
The answer is ClinicalGuard: a 4-layer architecture where clinical safety is never left to an LLM's judgment alone.
What it does
ClinicalGuard is a 27-tool autonomous clinical safety agent that connects to a patient's FHIR health record and performs comprehensive medication safety screening in seconds.
What it screens for:
- 🔴 Drug-Allergy Conflicts — Cross-reactivity detection (e.g., penicillin → amoxicillin)
- 💊 Drug Interactions — 60+ severity-rated interaction pairs
- 👴 Beers Criteria — 75+ drugs inappropriate for elderly (AGS 2023)
- ❤️ QT Prolongation Risk — 28 QT-prolonging drugs + electrolyte screening (CredibleMeds/AHA 2023)
- ⚠️ Opioid + Serotonin Syndrome — FDA black box combinations (FDA REMS 2023)
- 🫁 NEWS2 Early Warning — Real-time clinical deterioration scoring (RCP 2017)
- 🧪 Renal Safety — CKD-EPI 2021 eGFR calculation with 34 drug dose adjustments
- 📋 Polypharmacy, Duplicate Therapy, Fall Risk, Sepsis Risk (qSOFA)
- 💉 HF Therapy Gaps (ACC/AHA 2022), Diabetic Care (HEDIS 2024), Immunization Gaps (ACIP 2024)
- 🔍 Data Completeness Scoring — Confidence levels for every safety screen
The critical difference: Every finding is deterministically computed by 738 lines of coded clinical rules — the LLM synthesizes results but never invents clinical facts. Then a second, independent AI model reviews everything.
How we built it
The 4-Layer Anti-Hallucination Architecture
Layer 0 — Semantic Router: A Python-based tool gating layer that analyzes patient demographics before running screens. A 25-year-old patient won't trigger Beers Criteria (age ≥65 only) — making every report bespoke to the patient.
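The gating idea can be sketched roughly as follows. This is a minimal illustration, not ClinicalGuard's actual router: the `PatientSummary` type, the screen names, and every predicate other than the age ≥65 rule for Beers Criteria are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientSummary:
    """Hypothetical demographic summary used only for gating decisions."""
    age: int
    medication_count: int


# Hypothetical screen names mapped to gating predicates. Only the
# age >= 65 condition for Beers Criteria is taken from the write-up.
SCREEN_GATES = {
    "drug_allergy": lambda p: p.medication_count > 0,
    "drug_interactions": lambda p: p.medication_count >= 2,
    "beers_criteria": lambda p: p.age >= 65 and p.medication_count > 0,
}


def select_screens(patient: PatientSummary) -> list[str]:
    """Return the names of the screens whose gate passes for this patient."""
    return [name for name, gate in SCREEN_GATES.items() if gate(patient)]
```

Keeping the gates in a plain dictionary means the decision to skip a screen is itself deterministic and easy to audit.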
Layer 1 — Truth Tools (10 FHIR fetchers):
All 10 FHIR resources (Patient, Medications, Conditions, Labs, Vitals, Allergies, Immunizations, Procedures, Encounters, Social History) are fetched in parallel using ThreadPoolExecutor(max_workers=10) in the before_model_callback, before the LLM even starts reasoning. This replaces ten sequential round trips with one parallel batch, removing what is often the largest latency cost in a multi-tool agent.
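A minimal sketch of the parallel prefetch pattern, assuming a stand-in `fetch_resource` function in place of real FHIR HTTP calls. The function names, the returned dictionary shape, and the per-resource error handling are illustrative assumptions; only the ten resource labels and `ThreadPoolExecutor(max_workers=10)` come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

# The ten resource labels listed in the write-up.
RESOURCES = [
    "Patient", "Medications", "Conditions", "Labs", "Vitals",
    "Allergies", "Immunizations", "Procedures", "Encounters", "Social History",
]


def fetch_resource(resource: str, patient_id: str) -> dict:
    """Placeholder fetcher; a real one would query the FHIR server."""
    return {"resource": resource, "patient_id": patient_id, "entries": []}


def prefetch_all(patient_id: str, fetch=fetch_resource) -> dict:
    """Fetch every resource concurrently and collect results by label."""
    results = {}
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {name: pool.submit(fetch, name, patient_id) for name in RESOURCES}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # Assumption: one failed fetch is recorded, not fatal to the batch.
                results[name] = {"resource": name, "error": str(exc), "entries": []}
    return results
```

With this shape, total wait time is bounded by the slowest single request rather than the sum of all ten.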
Layer 2 — Intelligence Tools (16 deterministic screens):
Pure Python. No LLM involvement. Every drug interaction and every Beers flag is a hardcoded lookup, and every eGFR value comes from a fixed formula, all defined across 14 clinical knowledge bases (738 lines of clinical_rules.py). Because no model generates anything in this layer, hallucination cannot occur here.
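As one example of what a deterministic screen looks like, here is a sketch of the published CKD-EPI 2021 creatinine equation (the race-free refit). The function name and signature are assumptions for illustration and are not taken from clinical_rules.py; the constants are those of the published equation.

```python
def egfr_ckd_epi_2021(serum_creatinine_mg_dl: float, age_years: float,
                      is_female: bool) -> float:
    """Estimate GFR in mL/min/1.73 m^2 with the CKD-EPI 2021 creatinine equation.

    eGFR = 142 * min(Scr/k, 1)^a * max(Scr/k, 1)^-1.200 * 0.9938^age
           (* 1.012 if female), with k = 0.7 / a = -0.241 for females
           and k = 0.9 / a = -0.302 for males.
    """
    if serum_creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")

    kappa = 0.7 if is_female else 0.9
    alpha = -0.241 if is_female else -0.302
    ratio = serum_creatinine_mg_dl / kappa

    egfr = (142.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years)
    if is_female:
        egfr *= 1.012
    return egfr
```

The same inputs always produce the same number, which is what lets the narrating model quote the result instead of estimating it.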
Layer 3 — Orchestration (Primary Model via LiteLLM): The primary LLM acts strictly as a narrator — it orchestrates tool calls, formats reports, and quotes exact deterministic rules. It practices:
- Negative space reporting: Explicitly states clean results ("Drug-allergy check: 5 meds vs 2 allergies — No conflicts detected")
- Semantic justification: Quotes the exact rule, never paraphrases ("Flagged by I5: Lisinopril + Spironolactone — Rule: Risk of Hyperkalemia")
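The two reporting habits can be sketched together. This is an illustrative reconstruction, not the project's code: the rule table holds only the I5 pair quoted above, the function names are hypothetical, and the output strings use a plain hyphen as separator.

```python
from itertools import combinations

# Single entry taken from the example above; a real table would hold many pairs.
INTERACTION_RULES = {
    frozenset({"lisinopril", "spironolactone"}): ("I5", "Risk of Hyperkalemia"),
}


def screen_interactions(medications: list[str]) -> list[tuple[str, str, str, str]]:
    """Return (rule_id, drug_a, drug_b, rule_text) for every matching pair."""
    meds = sorted({m.lower() for m in medications})
    findings = []
    for a, b in combinations(meds, 2):
        rule = INTERACTION_RULES.get(frozenset({a, b}))
        if rule is not None:
            findings.append((rule[0], a, b, rule[1]))
    return findings


def report_lines(medications: list[str]) -> list[str]:
    """Quote matched rules verbatim, or state the clean result explicitly."""
    findings = screen_interactions(medications)
    if not findings:
        count = len({m.lower() for m in medications})
        return [f"Drug interaction check: {count} meds screened - No interactions detected"]
    return [
        f"Flagged by {rule_id}: {a.title()} + {b.title()} - Rule: {text}"
        for rule_id, a, b, text in findings
    ]
```

Because the clean-result line is produced by the same code path as the flags, the report shows that the check ran even when nothing was found.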
Layer 4 — Cross-Model Verification + Arbitration (Independent Verifier Model): A separate verifier model independently reviews all findings and marks each one as:
- ✅ Verified — agrees with the finding
- ⚠️ Challenged — disputes it → triggers the Arbitration Loop
- 🔍 Missed — identifies additional concerns
The Arbitration Loop: If challenged, the primary model must either ACCEPT the critique (and correct its output) or REJECT it (and cite the specific clinical guideline from clinical_rules.py). Unresolved disputes are flagged as "⚠️ System Dispute — Manual Review Recommended", so the system surfaces its uncertainty to the clinician instead of presenting an overconfident answer.
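The resolution logic for a challenged finding might look like the sketch below. The enum, the function name, the outcome strings, and the idea of validating the citation against a set of known rule IDs are all assumptions for illustration; the dispute label is adapted from the one quoted above.

```python
from enum import Enum
from typing import Optional


class PrimaryResponse(Enum):
    ACCEPT = "accept"   # primary model accepts the critique and corrects itself
    REJECT = "reject"   # primary model rejects the critique and must cite a rule


# Adapted from the dispute label quoted in the write-up.
DISPUTE_LABEL = "System Dispute: Manual Review Recommended"


def resolve_challenge(response: Optional[PrimaryResponse],
                      cited_rule_id: Optional[str],
                      known_rule_ids: set[str]) -> str:
    """Decide the outcome of one challenged finding."""
    if response is PrimaryResponse.ACCEPT:
        return "corrected"
    if response is PrimaryResponse.REJECT and cited_rule_id in known_rule_ids:
        return f"upheld (cites {cited_rule_id})"
    # No response, or a rejection without a valid citation, stays unresolved.
    return DISPUTE_LABEL
```

Requiring the cited rule to exist in the deterministic rule set keeps a rejection from resting on a citation the model made up.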
Why LiteLLM & Model-Agnostic Architecture?
We built ClinicalGuard to be completely model-agnostic using LiteLLM. Our default configuration runs Gemini 2.5 Flash for fast 27-tool orchestration and the separate Gemini 2.5 Flash-Lite model for independent verification. When the two models agree, confidence in a finding is higher; when they disagree, the finding is flagged for human review, which is exactly what clinical safety requires.
Challenges we ran into
- FHIR server compatibility — Different FHIR implementations support different query parameters. We removed unsupported _sort fields and built graceful "clinical empty states" that return professional notes like "No allergies documented — confirm NKDA with patient."
- Model response format variability — Output formats vary wildly between foundation models. We built robust response parsers that handle every edge case to maintain our model-agnostic architecture.
- API rate limits & Model Quotas — Running 27 tools across multiple verification layers easily hits free-tier rate limits. We solved this by using LiteLLM to split the orchestration load and verification load across completely different model endpoints and quota pools.
- Token budget management — 16 tools can generate massive JSON payloads. We added intelligent tool gating and kept tool returns compact.
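A "clinical empty state" together with a compact tool return could be sketched like this. The function name, the default note, and the returned dictionary shape are hypothetical; the allergy note paraphrases the one quoted above, and `entry` is the standard array field of a FHIR Bundle.

```python
from typing import Optional

# Resource-specific notes; only the allergy wording is based on the write-up.
EMPTY_STATE_NOTES = {
    "Allergies": "No allergies documented. Confirm NKDA with patient.",
}

# Hypothetical fallback wording for resources without a specific note.
DEFAULT_NOTE = "No {resource} records documented. Confirm with patient."


def summarize_bundle(resource: str, bundle: Optional[dict]) -> dict:
    """Return a compact summary, with a clinical note when nothing was found."""
    entries = (bundle or {}).get("entry") or []
    if not entries:
        note = EMPTY_STATE_NOTES.get(resource, DEFAULT_NOTE.format(resource=resource))
        return {"resource": resource, "count": 0, "note": note}
    # Keep the return small: a count instead of the raw JSON payload.
    return {"resource": resource, "count": len(entries), "note": None}
```

Treating a missing, empty, or absent `entry` array the same way is what lets one code path cover servers that answer an empty search differently.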
Accomplishments that we're proud of
- 738 lines of deterministic clinical rules covering 14 knowledge bases, with no LLM generation anywhere in the rule layer
- 27 tools working together in a coherent 4-layer pipeline
- Parallel FHIR prefetch that cuts Layer 1 latency to roughly that of the slowest single request
- Arbitration loop that resolves model disagreements automatically or escalates them for manual review
- Negative space reporting — proving the system checked, not just reporting problems
- Building a completely model-agnostic architecture via LiteLLM capable of swapping foundation models on the fly
- The anti-hallucination guarantee: FHIR-only data → deterministic Python → constrained LLM → independent verification
What we learned
- In clinical AI, what you don't say is as important as what you do — negative space reporting builds trust
- Deterministic rules + LLM synthesis is far more trustworthy than pure LLM reasoning for safety-critical applications
- Cross-model verification catches errors that self-review misses
- The A2A protocol and Google ADK make it elegant to build multi-tool agentic systems
- Parallel prefetch is a simple optimization that dramatically improves agent responsiveness
What's next for ClinicalGuard
- MLflow integration for full audit logging of every clinical decision
- Domain-grouped synthesis to handle patients with 50+ medications without context loss
- SMART on FHIR authentication for direct EHR integration
- Real-time monitoring mode that watches vitals and alerts on NEWS2 score changes
- Expanding to pediatric safety screens and pregnancy contraindications
- Regulatory pathway toward clinical decision support certification
Built With
- a2aprotocol
- ai
- databricksaigateway
- googleadk
- litellm
- llama
- promptopinion
- python
- uvicorn

