Inspiration

Voltaros was inspired by the need for reliable cloud systems: for some businesses, even a few minutes of downtime can be extremely costly. As a beginner in chaos engineering, I was fascinated by the idea of proactively breaking systems to make them stronger, like training a Pokémon to withstand battles. Reading about high-profile outages and about Google Cloud’s Agent Development Kit (ADK) sparked the vision for Voltaros: a tool that automates resilience testing with AI-driven agents and makes chaos engineering more accessible. The #adkhackathon’s focus on Google Cloud and multi-agent systems was the perfect opportunity to bring this idea to life!

What it does

Voltaros is an AI-powered chaos engineering orchestrator that stress-tests cloud applications to surface reliability weaknesses before real outages do. From a Next.js dashboard, users trigger pod crashes and network latency on a GKE-hosted microservice, while ADK agents autonomously execute the experiments, collect metrics from Cloud Monitoring into BigQuery, and generate plots in Vertex AI Workbench that are stored in Cloud Storage. Built for DevOps engineers, Voltaros automates resilience testing, turning complex chaos experiments into a few clicks.

How I built it

I built Voltaros using a modern, cloud-native stack, leveraging Google Cloud and the ADK Starter Pack:

  • Frontend: A Next.js app with Tailwind CSS, hosted on Vercel, provides a user-friendly dashboard. Users click buttons to trigger chaos, collect metrics, or view visualizations, with API routes (pages/api/chaos.ts) proxying requests to the backend.
  • Backend: A Python FastAPI app, containerized and deployed on Cloud Run, uses the ADK to orchestrate three agents:
    • Chaos Injector Agent (agents/chaos_injector.py): Fetches experiment files (experiment.json, latency_experiment.json) from Cloud Storage and applies pod crashes or 200ms latency to a GKE app using Chaos Toolkit (chaosgcp).
    • Monitor Agent (agents/monitor.py): Queries Cloud Monitoring for CPU and latency metrics, storing them in BigQuery (voltaros_dataset.metrics).
    • Reporter Agent (agents/reporter.py): Queries BigQuery, generates line plots using Vertex AI Workbench (Matplotlib), and saves images to Cloud Storage (voltaros-reports).
  • Target App: A sample microservice (voltaros-app) runs on GKE, with pods labeled app=voltaros-app, serving as the chaos testing target.
  • Google Cloud Services: GKE hosts the target app, BigQuery stores metrics and logs, Cloud Storage holds experiment files and plots, Cloud Monitoring provides real-time data, and Vertex AI visualizes results.
  • Development: I used Visual Studio for local testing, GitHub for version control, and Docker for containerization. The ADK’s AgentOrchestrator enabled seamless agent communication, while Chaos Toolkit was integrated via tools/chaos_toolkit.yaml.
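
To make the Chaos Injector's input more concrete, here is a minimal sketch of what a pod-crash experiment file such as experiment.json might contain, built as a Python dictionary and serialized to JSON. The top-level keys (title, description, method) follow the Chaos Toolkit experiment format. The choice of the chaosk8s terminate_pods action, the default namespace, and the helper name build_pod_crash_experiment are assumptions for illustration; the project's actual experiment files are not shown here and may look different.

```python
import json


def build_pod_crash_experiment(label_selector: str, namespace: str = "default") -> dict:
    """Return a Chaos Toolkit-style experiment that terminates one matching pod.

    Hypothetical helper: the real experiment.json may differ.
    """
    return {
        "version": "1.0.0",
        "title": f"Pod crash for {label_selector}",
        "description": "Terminate one pod and observe how the service recovers.",
        "method": [
            {
                "type": "action",
                "name": "terminate-one-pod",
                "provider": {
                    "type": "python",
                    # Assumed extension module; the write-up itself names chaosgcp.
                    "module": "chaosk8s.pod.actions",
                    "func": "terminate_pods",
                    "arguments": {
                        "label_selector": label_selector,
                        "ns": namespace,
                        "qty": 1,  # crash a single pod per run
                    },
                },
            }
        ],
    }


# Build the experiment for the target app's pod label and serialize it,
# roughly as it would be stored in Cloud Storage.
experiment = build_pod_crash_experiment("app=voltaros-app")
experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Keeping the experiment as plain data means the same agent code can run a pod crash or a latency test simply by loading a different file.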

Challenges I ran into

As a chaos engineering novice, I faced several hurdles:

  • Learning Chaos Engineering: Grasping concepts like pod crashes and latency injection was daunting. Studying Chaos Toolkit and chaosgcp documentation helped, but configuring experiments for GKE took trial and error.
  • ADK Integration: Setting up the ADK Starter Pack and orchestrating multiple agents was complex. I struggled with async messaging in main.py but resolved it by debugging with Firebase Studio.
  • Vertex AI Visualization: Replacing Looker Studio (unfamiliar to me) with Vertex AI Workbench required learning to generate plots programmatically. Managing Matplotlib in a serverless context was tricky, but saving images to Cloud Storage simplified delivery.
  • GKE Permissions: Ensuring the backend’s service account had roles/container.admin for GKE and roles/aiplatform.user for Vertex AI involved multiple IAM tweaks.
  • Time Constraints: Balancing frontend polish, backend logic, and demo prep in a hackathon timeframe was intense. I prioritized the pod crash trigger for the MVP, adding latency as a stretch goal.
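
The async messaging between agents mentioned above can be illustrated with a small, self-contained sketch. It uses only Python's standard asyncio library, not the ADK, and every name in it (chaos_injector, monitor, reporter, the queue-based hand-off, the sample CPU values) is hypothetical. It shows one way three stages can pass messages without blocking each other, not how main.py actually does it.

```python
import asyncio

DONE = None  # sentinel marking the end of a message stream


async def chaos_injector(out_q: asyncio.Queue) -> None:
    """Pretend to run experiments and announce each one downstream."""
    for name in ("pod-crash", "latency-200ms"):
        await asyncio.sleep(0)  # stand-in for the real chaos call
        await out_q.put(name)
    await out_q.put(DONE)


async def monitor(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Attach made-up metrics to each experiment event."""
    fake_cpu = {"pod-crash": 0.82, "latency-200ms": 0.41}
    while (event := await in_q.get()) is not DONE:
        await out_q.put({"experiment": event, "cpu": fake_cpu[event]})
    await out_q.put(DONE)


async def reporter(in_q: asyncio.Queue) -> list[str]:
    """Turn metric rows into report lines until the stream ends."""
    lines = []
    while (row := await in_q.get()) is not DONE:
        lines.append(f"{row['experiment']}: cpu={row['cpu']:.2f}")
    return lines


async def run_pipeline() -> list[str]:
    events, metrics = asyncio.Queue(), asyncio.Queue()
    # All three stages run concurrently; the queues carry the messages.
    _, _, report = await asyncio.gather(
        chaos_injector(events), monitor(events, metrics), reporter(metrics)
    )
    return report


report = asyncio.run(run_pipeline())
print(report)
```

The sentinel value lets each stage shut down cleanly once its upstream stage has finished, which is a common source of hangs when wiring async stages together.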

Accomplishments that I am proud of

  • Functional MVP: I built a working chaos engineering tool that triggers pod crashes and (optionally) latency, collects metrics, and visualizes results—all in a few days!
  • Google Cloud Integration: Combining GKE, BigQuery, Cloud Storage, Cloud Monitoring, and Vertex AI in one pipeline showcased my ability to leverage Google Cloud’s ecosystem.
  • ADK Mastery: Orchestrating three ADK agents (Chaos Injector, Monitor, Reporter) demonstrated multi-agent automation, a core hackathon goal.
  • User-Friendly UI: The Vercel-hosted Next.js dashboard is intuitive, making chaos engineering accessible to non-experts.

What I learned

  • Chaos Engineering: I learned how to simulate failures (pod crashes, latency) to improve system resilience, and why it’s critical for cloud apps.
  • Google Cloud: Deepened my skills in GKE (from prior use) and mastered BigQuery, Cloud Storage, and Vertex AI for data and visualization.
  • ADK: Gained hands-on experience with multi-agent systems, async Python, and the ADK Starter Pack.
  • Frontend Deployment: Discovered how easy Vercel makes deploying Next.js, which streamlined my frontend hosting.
  • Teamwork: Collaborating under hackathon pressure taught me to prioritize tasks and communicate effectively.

What's next for Voltaros

  • More Chaos Experiments: Add disk failure or resource exhaustion triggers to test broader resilience scenarios.
  • Vertex AI Enhancements: Train an anomaly detection model on BigQuery metrics to flag unusual patterns post-chaos.
  • Enterprise Features: Introduce user authentication, experiment scheduling, and multi-cluster support for production use.
  • Open Source: Release Voltaros on GitHub to empower the community to contribute chaos engineering tools.
  • Commercialization: Explore integrating Voltaros with Google Cloud Marketplace as a DevOps solution.
