Inspiration

Modern production systems fail in complex and unpredictable ways, yet incident response is still largely manual. Engineers jump between logs, dashboards, and codebases trying to understand what broke and how to fix it. We were inspired to build a system where AI doesn’t just assist but actively participates in the full incident lifecycle.

VoiceOps was born from a simple question: What if your system could detect its own failures, explain them out loud, fix them safely, and learn from the outcome?


What it does

VoiceOps is a self-improving AI incident response agent that:

  1. Detects production errors via Datadog logs
  2. Traces the issue back to the responsible file in the local codebase
  3. Uses Gemini (LLM) to generate a minimal, safe patch
  4. Applies the patch locally (no Git required)
  5. Verifies the fix automatically
  6. Logs structured evaluation metrics to Braintrust
  7. Generates a voice summary of the incident and resolution using ElevenLabs
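Step 2 depends on turning a log line into a file path. Below is a minimal sketch of one way to do that. It assumes the Datadog log message carries a standard CPython traceback and that the deepest frame inside the repository is the one to patch. The function name `locate_failing_file` and the sample paths are hypothetical and are not taken from VoiceOps.

```python
import re
from pathlib import Path

# Matches CPython traceback frame lines such as:
#   File "/srv/shop/app/stats.py", line 4, in average
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')


def locate_failing_file(log_message: str, repo_root: str):
    """Return (repo-relative path, line number) of the deepest traceback frame
    that lives inside repo_root, or None if no frame belongs to the repo."""
    root = Path(repo_root).resolve()
    candidate = None
    for match in FRAME_RE.finditer(log_message):
        path = Path(match.group("path")).resolve()
        try:
            rel = path.relative_to(root)
        except ValueError:
            continue  # frame is in the stdlib or a third-party package
        # Later frames are deeper in the call stack, so keep overwriting.
        candidate = (rel.as_posix(), int(match.group("line")))
    return candidate


SAMPLE_LOG = '''Traceback (most recent call last):
  File "/srv/shop/app/main.py", line 10, in handler
    return average(values)
  File "/srv/shop/app/stats.py", line 4, in average
    return sum(xs) / len(xs)
ZeroDivisionError: division by zero'''

print(locate_failing_file(SAMPLE_LOG, "/srv/shop"))  # ('app/stats.py', 4)
```

Picking the deepest in-repo frame skips library code, but the deepest frame is not always where the bug lives. This is why the file is only the *likely* failing file.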

This creates a closed loop:

Detect → Diagnose → Patch → Verify → Score → Learn → Speak

It transforms incident response from reactive debugging into an autonomous AI-assisted workflow.
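One way to picture the loop is as an ordered list of stages that share one state object. The sketch below is hypothetical. `IncidentState`, `run_loop`, and the stub stages are illustrative names. Each stub stands in for a real integration and does not call Datadog, Gemini, Braintrust, or ElevenLabs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class IncidentState:
    """State handed from stage to stage (a hypothetical shape)."""
    log_message: str
    data: Dict[str, object] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[IncidentState], None]


def run_loop(state: IncidentState, stages: List[Tuple[str, Stage]]) -> IncidentState:
    """Run every stage in order; each one reads and mutates the shared state."""
    for name, stage in stages:
        stage(state)
        state.completed.append(name)
    return state


# Stub stages standing in for the real agents.
def detect(s: IncidentState) -> None:
    s.data["error"] = s.log_message.splitlines()[-1]


def diagnose(s: IncidentState) -> None:
    s.data["file"] = "app/stats.py"


def patch(s: IncidentState) -> None:
    s.data["patched"] = True


def verify(s: IncidentState) -> None:
    s.data["verified"] = bool(s.data.get("patched"))


def score(s: IncidentState) -> None:
    s.data["scores"] = {"verified": 1.0 if s.data["verified"] else 0.0}


def learn(s: IncidentState) -> None:
    s.data.setdefault("score_history", []).append(s.data["scores"])


def speak(s: IncidentState) -> None:
    s.data["summary"] = f"Fixed {s.data['error']} in {s.data['file']}."


STAGES = [("Detect", detect), ("Diagnose", diagnose), ("Patch", patch),
          ("Verify", verify), ("Score", score), ("Learn", learn), ("Speak", speak)]

result = run_loop(IncidentState("Traceback ...\nZeroDivisionError: division by zero"), STAGES)
print(" -> ".join(result.completed))
print(result.data["summary"])
```

Keeping each stage as a plain function over shared state makes agents swappable. A real integration can replace a stub without changing the loop itself.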


How we built it

We designed VoiceOps as a modular multi-agent system:

  • Datadog Integration: Pulls real-time logs or simulates incidents for demo purposes.

  • Trace Agent: Analyzes log fingerprints and locates the likely failing file in the repository.

  • Gemini Patch Agent: Sends the incident summary and the file contents to Gemini and receives a minimal, safe patch.

  • Local Patch Engine: Writes the full-file patch generated by the LLM directly to the filesystem.

  • Evaluation Engine: Verifies correctness (e.g., that a division-by-zero guard is present), measures duplication (such as a guard inserted twice), and tracks patch metrics.

  • Braintrust Integration: Logs structured experiment data, including:

    • Inputs
    • Outputs
    • Verification results
    • Quantitative scores

  • ElevenLabs Integration: Converts the incident and the patch result into a natural-sounding voice summary.
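The Evaluation Engine's checks can be approximated with Python's `ast` module. This is a heuristic sketch, not the VoiceOps implementation. It assumes the patched file is Python and treats `if not x` or an `==`/`!=` comparison with `0` as a division-by-zero guard. It counts an identical guard repeated within one function as duplication. The score names `compiles`, `has_guard`, and `no_duplicate_guards` are hypothetical.

```python
import ast
from collections import Counter

DIVISION_OPS = (ast.Div, ast.FloorDiv, ast.Mod)


def _is_zero_guard(test: ast.expr) -> bool:
    """Heuristic: treat `not x`, `x == 0`, or `x != 0` (either operand order) as a guard."""
    if isinstance(test, ast.UnaryOp) and isinstance(test.op, ast.Not):
        return True
    if isinstance(test, ast.Compare) and len(test.ops) == 1:
        operands = (test.left, test.comparators[0])
        has_zero = any(
            isinstance(o, ast.Constant) and not isinstance(o.value, bool) and o.value == 0
            for o in operands
        )
        return has_zero and isinstance(test.ops[0], (ast.Eq, ast.NotEq))
    return False


def evaluate_patch(source: str) -> dict:
    """Score a patched file; every score is in [0, 1]."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {"compiles": 0.0, "has_guard": 0.0, "no_duplicate_guards": 0.0}

    divides = any(isinstance(n, ast.BinOp) and isinstance(n.op, DIVISION_OPS)
                  for n in ast.walk(tree))
    guarded = any(isinstance(n, ast.If) and _is_zero_guard(n.test) for n in ast.walk(tree))

    # The same guard repeated inside one function is the duplicate-insertion bug.
    duplicates = 0
    for fn in ast.walk(tree):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            seen = Counter(ast.dump(n.test) for n in ast.walk(fn)
                           if isinstance(n, ast.If) and _is_zero_guard(n.test))
            duplicates += sum(count - 1 for count in seen.values())

    return {
        "compiles": 1.0,
        # A file with no division needs no guard, so it passes this check trivially.
        "has_guard": 1.0 if guarded or not divides else 0.0,
        "no_duplicate_guards": 1.0 / (1 + duplicates),
    }


ORIGINAL = "def average(xs):\n    return sum(xs) / len(xs)\n"
GOOD = "def average(xs):\n    if not xs:\n        return 0.0\n    return sum(xs) / len(xs)\n"
DOUBLED = ("def average(xs):\n    if not xs:\n        return 0.0\n"
           "    if not xs:\n        return 0.0\n    return sum(xs) / len(xs)\n")

print(evaluate_patch(DOUBLED))
# {'compiles': 1.0, 'has_guard': 1.0, 'no_duplicate_guards': 0.5}
```

All scores stay between 0 and 1, which is the range Braintrust expects for score values. With experiment-based logging, a record such as `{"input": ..., "output": ..., "scores": evaluate_patch(patched)}` could be passed to the experiment's `log` call. Check that call against the SDK version in use, since the API shifted between versions.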

The UI was built in Streamlit for rapid prototyping and demo clarity.


Challenges we ran into

  • Braintrust SDK version inconsistencies: The initialization and logging APIs changed between SDK versions, so we switched to experiment-based logging.

  • Circular import errors in agents: We refactored the module structure to eliminate dependency cycles.

  • Duplicate patch insertion: Early patch logic inserted the same guard multiple times. We added evaluation metrics to detect this.

  • Ensuring safe local patching: We shifted from diff-based patching to full-file LLM regeneration, which made patches simpler and more reliable to apply.

  • Making the demo deterministic: We added a demo mode so judges always see a reproducible incident.
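Full-file regeneration moves the safety burden onto checking the model's output before it touches disk. The sketch below rests on several assumptions: the model's reply may be wrapped in a Markdown fence, the target is a Python file, and `extract_code` and `apply_full_file_patch` are hypothetical names. It validates the new source with `ast.parse`, keeps a `.bak` copy, and swaps the file in with `os.replace`.

```python
import ast
import os
import shutil
import tempfile
from pathlib import Path

FENCE = "`" * 3  # built this way so the sketch can sit inside a Markdown code block


def extract_code(llm_reply: str) -> str:
    """Strip a surrounding Markdown code fence if the model added one."""
    text = llm_reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence and language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines)
    return text + "\n"


def apply_full_file_patch(path: str, llm_reply: str) -> Path:
    """Replace the file at `path` with the regenerated source; return the backup path."""
    target = Path(path)
    new_source = extract_code(llm_reply)
    ast.parse(new_source)  # raises SyntaxError before anything on disk changes
    backup = target.with_suffix(target.suffix + ".bak")
    shutil.copy2(target, backup)  # keep the original so the change can be undone
    # Write to a temp file in the same directory, then swap it in.
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(new_source)
        shutil.copymode(target, tmp)  # mkstemp creates mode 0600; keep the original mode
        os.replace(tmp, target)
    except BaseException:
        os.unlink(tmp)
        raise
    return backup


# Usage: a fenced model reply applied to a throwaway repo.
with tempfile.TemporaryDirectory() as repo:
    target = Path(repo) / "stats.py"
    target.write_text("def average(xs):\n    return sum(xs) / len(xs)\n", encoding="utf-8")
    reply = (FENCE + "python\ndef average(xs):\n    if not xs:\n        return 0.0\n"
             "    return sum(xs) / len(xs)\n" + FENCE)
    backup = apply_full_file_patch(str(target), reply)
    patched_text = target.read_text(encoding="utf-8")
    original_kept = backup.read_text(encoding="utf-8")
    backup_name = backup.name
    try:
        apply_full_file_patch(str(target), "def broken(:")
        rejected = False
    except SyntaxError:
        rejected = True  # invalid output never reaches the file
```

The new content is written to a temporary file in the same directory and then moved into place with `os.replace`. On POSIX filesystems that rename is atomic within a single filesystem, so a crash mid-write leaves the original file intact.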


Accomplishments that we're proud of

  • Built a true closed-loop self-improving system in a hackathon timeframe.
  • Integrated four sponsor technologies meaningfully:

    • Datadog for observability
    • Gemini for intelligent patch generation
    • Braintrust for evaluation and learning
    • ElevenLabs for voice interface
  • Made the AI's output measurable through structured evaluation, instead of just generating text.

  • Designed a system that feels like a production-ready autonomous DevOps assistant.


What we learned

  • LLMs are strongest when paired with structured verification.
  • Evaluation frameworks like Braintrust are essential for AI systems that modify code.
  • Observability + AI + automated evaluation is a powerful combination.
  • Building multi-agent systems requires careful state handling and modular design.
  • The real value is not just generating fixes, but scoring and learning from them.

What's next for VoiceOps

  • Expand beyond division-by-zero into generalized bug classification.
  • Add automatic rollback if verification fails.
  • Integrate real CI pipelines instead of local patching.
  • Add reinforcement-style learning where Braintrust scores influence future patch prompts.
  • Introduce real-time voice conversations for interactive incident handling.
  • Support multi-file patch generation.
  • Deploy as a GitHub App or DevOps Copilot.

VoiceOps is evolving from a hackathon prototype into an autonomous AI reliability engineer.
