Phoenix

PHOENIX: THE SELF-EVOLVING CRISIS VOICE AGENT

ELEVATOR PITCH Phoenix is an autonomous, self-optimizing crisis response agent designed to scale emergency triage during natural disasters. Using a closed-loop "Autotune" architecture, Phoenix does not just respond to callers: it programmatically iterates on its own internal logic to maximize de-escalation and safety accuracy without human intervention.

THE PROBLEM Emergency infrastructure, specifically FEMA and 911 dispatch, is a fundamentally unscalable human queue. During localized catastrophes, call volume spikes 100x, creating a lethal information bottleneck. Existing AI solutions are static; if a prompt fails to handle a specific regional crisis, fixing it requires manual developer intervention. In a crisis, we don't have hours to wait for a patch.

THE SOLUTION Phoenix is a system that evolves in production. It acts as a first-line triage agent that queries real-time weather and flood APIs to provide life-saving context. The core of Phoenix is a long-running Autotune Service: it analyzes its own performance traces, runs automated backtesting against "Golden" datasets, and promotes superior logic variants to the production environment in real time.

TECHNICAL ARCHITECTURE We have moved beyond static prompt engineering to a Continuous Evaluation and Promotion Pipeline:

TRACE INGESTION: Every call trace is ingested into Braintrust for full-stack observability.

STRATEGY PROPOSING: Gemini 1.5 Pro acts as a "Meta-Agent," analyzing traces to identify friction points and generating two candidate prompt variants.

AUTOMATED BACKTESTING: The Autotune Service executes these variants against a specialized train/test split derived from 911 emergency transcripts.

GATED PROMOTION: A candidate is only promoted to the ElevenLabs production environment if it clears a 0-regression gate across our technical metric suite.
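The gated promotion step can be sketched as a small comparison function. This is a minimal illustration, not the Autotune Service itself: it assumes each variant has already been scored on the held-out test split and reduced to one aggregate number per metric, and the names `compare`, `pick_promotion`, and the example scores are hypothetical. Requiring at least one strict improvement before promoting is also an assumption; the gate as described only requires zero regressions.

```python
# Direction of improvement for each metric in the evaluation suite.
HIGHER_IS_BETTER = {
    "calmer_end_state_binary": True,
    "emergency_services_when_needed_binary": True,
    "turns_to_emergency_services": False,  # lower index = faster escalation
    "turns_to_calm_state": False,          # lower index = faster de-escalation
}


def compare(baseline, candidate):
    """Return (regressed, improved) metric-name lists for one candidate."""
    regressed, improved = [], []
    for name, higher in HIGHER_IS_BETTER.items():
        delta = candidate[name] - baseline[name]
        if not higher:
            delta = -delta  # flip sign so positive always means "better"
        if delta < 0:
            regressed.append(name)
        elif delta > 0:
            improved.append(name)
    return regressed, improved


def pick_promotion(baseline, candidates):
    """Return the name of the candidate to promote, or None to keep production."""
    best_name, best_wins = None, 0
    for name, scores in candidates.items():
        regressed, improved = compare(baseline, scores)
        if regressed:
            continue  # zero-regression gate: any worse metric blocks promotion
        if len(improved) > best_wins:
            best_name, best_wins = name, len(improved)
    return best_name


# Hypothetical aggregate scores (illustrative values only, not real results).
baseline = {
    "calmer_end_state_binary": 0.70,
    "emergency_services_when_needed_binary": 0.80,
    "turns_to_emergency_services": 3.0,
    "turns_to_calm_state": 5.0,
}
candidates = {
    "variant_a": {**baseline, "calmer_end_state_binary": 0.75,
                  "turns_to_calm_state": 4.5},
    "variant_b": {**baseline, "calmer_end_state_binary": 0.90,
                  "emergency_services_when_needed_binary": 0.78},
}
print(pick_promotion(baseline, candidates))  # prints variant_a
```

In this example `variant_b` has the larger gain on calmness but is rejected because it regresses on escalation accuracy, which is the behavior a zero-regression gate is meant to enforce.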

TECHNICAL EVALUATION SUITE (TRAIN/TEST SPLIT) Success is quantified using four primary metrics to ensure grounded crisis protocols:

-- CALMER_END_STATE_BINARY: Score 1 if the user appears calmer at the end of the call than at the start. Evaluated by a Gemini judge over the conversation context.
-- EMERGENCY_SERVICES_WHEN_NEEDED_BINARY: Score 1 if the agent escalates when the case requires it (based on ground-truth metadata). Unnecessary escalation is penalized.
-- TURNS_TO_EMERGENCY_SERVICES: Integer index of the first escalation turn. We optimize for a lower index (N approaching 0) to ensure immediate safety.
-- TURNS_TO_CALM_STATE: Integer index of the turn where the user first appears calmer. This measures de-escalation velocity.
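To make the four metrics concrete, here is a minimal scoring sketch for a single call. It rests on several assumptions that are not part of the project description: the `Turn` type, a per-turn `distress` score in [0, 1] standing in for the Gemini judge's output, zero-based turn indices, and treating "calmer" as any distress score below the first turn's. The call is assumed to contain at least one turn.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Turn:
    distress: float  # hypothetical judge score in [0, 1]; higher = more distressed
    escalated: bool  # True if the agent escalated to emergency services on this turn


def first_index(flags: Iterable[bool]) -> Optional[int]:
    """Zero-based index of the first True flag, or None if there is none."""
    for i, flag in enumerate(flags):
        if flag:
            return i
    return None


def score_call(turns: list, needs_escalation: bool) -> dict:
    """Score one call; needs_escalation is the ground-truth metadata label."""
    start = turns[0].distress
    first_escalation = first_index(t.escalated for t in turns)
    first_calmer = first_index(t.distress < start for t in turns)
    did_escalate = first_escalation is not None
    return {
        "calmer_end_state_binary": int(turns[-1].distress < start),
        # 1 only when the escalation decision matches ground truth, so an
        # unnecessary escalation scores 0 just like a missed one.
        "emergency_services_when_needed_binary": int(did_escalate == needs_escalation),
        "turns_to_emergency_services": first_escalation,
        "turns_to_calm_state": first_calmer,
    }


# Hypothetical four-turn call (illustrative values only).
call = [Turn(0.9, False), Turn(0.8, False), Turn(0.6, True), Turn(0.4, False)]
print(score_call(call, needs_escalation=True))
```

The two turn-count metrics are `None` when the event never happens, so an aggregation step would need to decide how to handle those calls (for example, by excluding them or substituting the call length) before averaging.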

SPONSOR TOOL INTEGRATION
-- BRAINTRUST: Our CI/CD for LLMs. Braintrust manages our datasets, executes the evaluation runners, and handles the experiment versioning that drives the self-evolving loop.
-- GEMINI API: The Reasoning Core. Gemini powers the agent's logic and the meta-analysis required to propose architectural prompt improvements.
-- ELEVENLABS: The Empathetic Interface. We utilized the low-latency Conversational AI agent to provide human-grade voice synthesis to maintain trust in high-stress environments.
-- MODULATE: The Signal Integrity Layer. Integrated to ensure the agent’s tone remains professional and de-escalatory despite chaotic environmental noise.

THE IMPACT Phoenix represents a shift from "Chatbots" to Autonomous Systems. By automating the feedback loop between call performance and logic optimization, we ensure that during a crisis, the software is getting smarter as the situation gets harder. We are not just building an agent; we are building a scalable, self-correcting safety net.

Built With

braintrust
braintrustevals
elevenlabs
gemini
modulate

Updates

endy diaz started this project — Feb 21, 2026 04:38 PM EST
