Inspiration
Voltaros was inspired by the critical need for reliable cloud systems in today’s digital world, where even a minute of downtime can cost businesses millions. As a beginner in chaos engineering, I was fascinated by the idea of proactively breaking systems to make them stronger, like training a Pokémon to withstand battles. Reading about high-profile outages and Google Cloud’s Agent Development Kit (ADK) sparked the vision for Voltaros—a tool that automates resilience testing with AI-driven agents, making chaos engineering accessible to all. The #adkhackathon’s focus on Google Cloud and multi-agent systems was the perfect opportunity to bring this idea to life!
What it does
Voltaros is an AI-powered chaos engineering orchestrator that stress-tests cloud applications to expose reliability weaknesses before they cause outages. Users interact with a Next.js dashboard to trigger pod crashes and network latency on a GKE-hosted microservice, while ADK agents autonomously execute the experiments, collect metrics from Cloud Monitoring into BigQuery, and render plots with Matplotlib on Vertex AI Workbench, saving them to Cloud Storage. Built for DevOps engineers, Voltaros automates resilience testing, turning complex chaos experiments into a few clicks.
How I built it
I built Voltaros using a modern, cloud-native stack, leveraging Google Cloud and the ADK Starter Pack:
- Frontend: A Next.js app with Tailwind CSS, hosted on Vercel, provides a user-friendly dashboard. Users click buttons to trigger chaos, collect metrics, or view visualizations, with API routes (`pages/api/chaos.ts`) proxying requests to the backend.
- Backend: A Python FastAPI app, containerized and deployed on Cloud Run, uses the ADK to orchestrate three agents:
  - Chaos Injector Agent (`agents/chaos_injector.py`): Fetches experiment files (`experiment.json`, `latency_experiment.json`) from Cloud Storage and applies pod crashes or 200 ms latency to a GKE app using Chaos Toolkit (`chaosgcp`).
  - Monitor Agent (`agents/monitor.py`): Queries Cloud Monitoring for CPU and latency metrics, storing them in BigQuery (`voltaros_dataset.metrics`).
  - Reporter Agent (`agents/reporter.py`): Queries BigQuery, generates line plots using Vertex AI Workbench (Matplotlib), and saves images to Cloud Storage (`voltaros-reports`).
- Target App: A sample microservice (`voltaros-app`) runs on GKE, with pods labeled `app=voltaros-app`, serving as the chaos testing target.
- Google Cloud Services: GKE hosts the target app, BigQuery stores metrics and logs, Cloud Storage holds experiment files and plots, Cloud Monitoring provides real-time data, and Vertex AI visualizes results.
- Development: I used Visual Studio for local testing, GitHub for version control, and Docker for containerization. The ADK’s `AgentOrchestrator` enabled seamless agent communication, while Chaos Toolkit integrated via `tools/chaos_toolkit.yaml`.
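The three-agent flow described above can be sketched as a simple sequential pipeline. This is a minimal illustration in plain `asyncio`, not the ADK API and not the project's actual code: the functions `inject_chaos`, `collect_metrics`, and `build_report`, the `ExperimentResult` type, and the sample metric values are all hypothetical stand-ins for the Chaos Injector, Monitor, and Reporter agents.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ExperimentResult:
    """Hypothetical record handed from one stage to the next."""
    experiment: str
    target_label: str
    metrics: list[dict] = field(default_factory=list)
    report_path: str = ""


async def inject_chaos(experiment: str, target_label: str) -> ExperimentResult:
    # Stand-in for the Chaos Injector: the real agent would fetch the
    # experiment file from Cloud Storage and run it with Chaos Toolkit.
    await asyncio.sleep(0)  # placeholder for network I/O
    return ExperimentResult(experiment=experiment, target_label=target_label)


async def collect_metrics(result: ExperimentResult) -> ExperimentResult:
    # Stand-in for the Monitor: the real agent would query Cloud Monitoring
    # and store rows in BigQuery. The values below are made-up placeholders.
    await asyncio.sleep(0)
    result.metrics = [
        {"metric": "cpu_utilization", "value": 0.42},
        {"metric": "latency_ms", "value": 230.0},
    ]
    return result


async def build_report(result: ExperimentResult) -> ExperimentResult:
    # Stand-in for the Reporter: the real agent would plot the metrics and
    # upload the image. Here only an illustrative object path is recorded.
    await asyncio.sleep(0)
    result.report_path = f"voltaros-reports/{result.experiment}.png"
    return result


async def run_pipeline(experiment: str, target_label: str) -> ExperimentResult:
    # Each stage awaits the previous one, so metrics are collected only
    # after the chaos action has been triggered.
    result = await inject_chaos(experiment, target_label)
    result = await collect_metrics(result)
    return await build_report(result)


if __name__ == "__main__":
    final = asyncio.run(run_pipeline("pod_crash", "app=voltaros-app"))
    print(final.report_path)
```

Keeping each stage as its own coroutine means a stage can be swapped for a real implementation, or run on its own in a test, without touching the others.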
Challenges I ran into
As a chaos engineering novice, I faced several hurdles:
- Learning Chaos Engineering: Grasping concepts like pod crashes and latency injection was daunting. Studying Chaos Toolkit and `chaosgcp` documentation helped, but configuring experiments for GKE took trial and error.
- ADK Integration: Setting up the ADK Starter Pack and orchestrating multiple agents was complex. I struggled with async messaging in `main.py` but resolved it by debugging with Firebase Studio.
- Vertex AI Visualization: Replacing Looker Studio (unfamiliar to me) with Vertex AI Workbench required learning to generate plots programmatically. Managing Matplotlib in a serverless context was tricky, but saving images to Cloud Storage simplified delivery.
- GKE Permissions: Ensuring the backend’s service account had `roles/container.admin` for GKE and `roles/aiplatform.user` for Vertex AI involved multiple IAM tweaks.
- Time Constraints: Balancing frontend polish, backend logic, and demo prep in a hackathon timeframe was intense. I prioritized the pod crash trigger for the MVP, adding latency as a stretch goal.
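For readers new to Chaos Toolkit, the kind of pod-crash experiment file mentioned above can be sketched as follows. This is not the project's `experiment.json`. It assumes the `chaostoolkit-kubernetes` extension and its `terminate_pods` action rather than `chaosgcp`, and the helper names, the `default` namespace, and the title and description strings are hypothetical.

```python
import json


def build_pod_crash_experiment(label_selector: str, namespace: str = "default") -> dict:
    """Return a Chaos Toolkit experiment that terminates one matching pod."""
    return {
        "title": "Service survives the loss of one pod",
        "description": "Terminate a single pod and observe recovery.",
        "method": [
            {
                "type": "action",
                "name": "terminate-one-pod",
                "provider": {
                    "type": "python",
                    # Assumption: the chaostoolkit-kubernetes extension is
                    # installed; the project itself reports using chaosgcp.
                    "module": "chaosk8s.pod.actions",
                    "func": "terminate_pods",
                    "arguments": {
                        "label_selector": label_selector,
                        "ns": namespace,
                        "qty": 1,      # terminate a single pod
                        "rand": True,  # pick it at random among matches
                    },
                },
            }
        ],
    }


def missing_required_keys(experiment: dict) -> list[str]:
    """List the required top-level experiment keys that are absent."""
    required = ("title", "description", "method")
    return [key for key in required if key not in experiment]


if __name__ == "__main__":
    experiment = build_pod_crash_experiment("app=voltaros-app")
    assert not missing_required_keys(experiment)
    print(json.dumps(experiment, indent=2))
```

Saved to a file, a definition like this would typically be run with `chaos run experiment.json`. Checking the required keys before uploading the file is a cheap way to catch malformed experiments early.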
Accomplishments that I am proud of
- Functional MVP: I built a working chaos engineering tool that triggers pod crashes and (optionally) latency, collects metrics, and visualizes results—all in a few days!
- Google Cloud Integration: Combining GKE, BigQuery, Cloud Storage, Cloud Monitoring, and Vertex AI in a single workflow showcased my ability to leverage Google Cloud’s ecosystem.
- ADK Mastery: Orchestrating three ADK agents (Chaos Injector, Monitor, Reporter) demonstrated multi-agent automation, a core hackathon goal.
- User-Friendly UI: The Vercel-hosted Next.js dashboard is intuitive, making chaos engineering accessible to non-experts.
What I learned
- Chaos Engineering: I learned how to simulate failures (pod crashes, latency) to improve system resilience, and why it’s critical for cloud apps.
- Google Cloud: Deepened my skills in GKE (from prior use) and mastered BigQuery, Cloud Storage, and Vertex AI for data and visualization.
- ADK: Gained hands-on experience with multi-agent systems, async Python, and the ADK Starter Pack.
- Frontend Deployment: Discovered how easy Vercel makes deploying Next.js, which streamlined frontend hosting.
- Teamwork: Collaborating under hackathon pressure taught me to prioritize tasks and communicate effectively.
What's next for Voltaros
- More Chaos Experiments: Add disk failure or resource exhaustion triggers to test broader resilience scenarios.
- Vertex AI Enhancements: Train an anomaly detection model on BigQuery metrics to flag unusual patterns post-chaos.
- Enterprise Features: Introduce user authentication, experiment scheduling, and multi-cluster support for production use.
- Open Source: Release Voltaros on GitHub to empower the community to contribute chaos engineering tools.
- Commercialization: Explore integrating Voltaros with Google Cloud Marketplace as a DevOps solution.
Built With
- gke
- google-bigquery
- google-cloud
- next.js
- node.js
- python
- tailwind
- vercel
- vertex-ai