Voltaros

Inspiration

Voltaros was inspired by the critical need for reliable cloud systems in today’s digital world, where even a minute of downtime can cost businesses millions. As a beginner in chaos engineering, I was fascinated by the idea of proactively breaking systems to make them stronger, like training a Pokémon to withstand battles. Reading about high-profile outages and about Google Cloud’s Agent Development Kit (ADK) sparked the vision for Voltaros: a tool that automates resilience testing with AI-driven agents and makes chaos engineering accessible to all. The #adkhackathon’s focus on Google Cloud and multi-agent systems was the perfect opportunity to bring this idea to life!

What it does

Voltaros is an AI-powered chaos engineering orchestrator that stress-tests cloud applications to expose reliability weaknesses before they cause outages. Users interact with a Next.js dashboard to trigger pod crashes and network latency on a GKE-hosted microservice, while ADK agents autonomously execute the experiments, collect metrics from Cloud Monitoring into BigQuery, and render result plots in Vertex AI Workbench that are stored in Cloud Storage. Built for DevOps engineers, Voltaros automates resilience testing, turning complex chaos experiments into a few clicks.

How I built it

I built Voltaros using a modern, cloud-native stack, leveraging Google Cloud and the ADK Starter Pack:

Frontend: A Next.js app with Tailwind CSS, hosted on Vercel, provides a user-friendly dashboard. Users click buttons to trigger chaos, collect metrics, or view visualizations, with API routes (pages/api/chaos.ts) proxying requests to the backend.
Backend: A Python FastAPI app, containerized and deployed on Cloud Run, uses the ADK to orchestrate three agents:
- Chaos Injector Agent (agents/chaos_injector.py): Fetches experiment files (experiment.json, latency_experiment.json) from Cloud Storage and applies pod crashes or 200ms latency to a GKE app using Chaos Toolkit (chaosgcp).
- Monitor Agent (agents/monitor.py): Queries Cloud Monitoring for CPU and latency metrics, storing them in BigQuery (voltaros_dataset.metrics).
- Reporter Agent (agents/reporter.py): Queries BigQuery, generates line plots using Vertex AI Workbench (Matplotlib), and saves images to Cloud Storage (voltaros-reports).
Target App: A sample microservice (voltaros-app) runs on GKE, with pods labeled app=voltaros-app, serving as the chaos testing target.
Google Cloud Services: GKE hosts the target app, BigQuery stores metrics and logs, Cloud Storage holds experiment files and plots, Cloud Monitoring provides real-time data, and Vertex AI visualizes results.
Development: I used Visual Studio for local testing, GitHub for version control, and Docker for containerization. The ADK’s AgentOrchestrator enabled seamless agent communication, and Chaos Toolkit was integrated through tools/chaos_toolkit.yaml.
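
To make the Chaos Injector Agent's input more concrete, here is a minimal sketch of what a pod-crash experiment file such as experiment.json might contain, built as a Python dict and serialized to JSON. The layout (title, description, method, and a python provider) follows the Chaos Toolkit experiment format. The provider module and function shown come from the chaostoolkit-kubernetes extension, which is an assumption on my part: the write-up only names chaosgcp. The helper name build_pod_crash_experiment, the activity name, and the argument values are hypothetical.

```python
import json


def build_pod_crash_experiment(label_selector: str, namespace: str = "default") -> dict:
    """Return a Chaos Toolkit experiment that terminates one matching pod."""
    return {
        "title": "Pod crash on the target app",
        "description": "Terminate one pod matching the label selector.",
        "method": [
            {
                "type": "action",
                "name": "terminate-one-pod",  # hypothetical activity name
                "provider": {
                    "type": "python",
                    # Assumption: chaostoolkit-kubernetes is installed; the
                    # write-up itself only mentions chaosgcp.
                    "module": "chaosk8s.pod.actions",
                    "func": "terminate_pods",
                    "arguments": {
                        "label_selector": label_selector,
                        "ns": namespace,
                        "qty": 1,      # terminate a single pod
                        "rand": True,  # pick it at random among the matches
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    # The label matches the one given for the target app above.
    experiment = build_pod_crash_experiment("app=voltaros-app")
    print(json.dumps(experiment, indent=2))
```

Generating the file from a function rather than hand-editing JSON keeps the label selector and namespace in one place, which may help when the same experiment is pointed at a different target.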

Challenges I ran into

As a chaos engineering novice, I faced several hurdles:

Learning Chaos Engineering: Grasping concepts like pod crashes and latency injection was daunting. Studying Chaos Toolkit and chaosgcp documentation helped, but configuring experiments for GKE took trial and error.
ADK Integration: Setting up the ADK Starter Pack and orchestrating multiple agents was complex. I struggled with async messaging in main.py but resolved it by debugging with Firebase Studio.
Vertex AI Visualization: Replacing Looker Studio (unfamiliar to me) with Vertex AI Workbench required learning to generate plots programmatically. Managing Matplotlib in a serverless context was tricky, but saving images to Cloud Storage simplified delivery.
GKE Permissions: Ensuring the backend’s service account had roles/container.admin for GKE and roles/aiplatform.user for Vertex AI involved multiple IAM tweaks.
Time Constraints: Balancing frontend polish, backend logic, and demo prep in a hackathon timeframe was intense. I prioritized the pod crash trigger for the MVP, adding latency as a stretch goal.
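
The async hand-off between agents that caused trouble in main.py can be pictured with a small asyncio sketch. This is not the ADK API and not the project's actual code: the three coroutines are hypothetical stand-ins for the Chaos Injector, Monitor, and Reporter agents, and the metric values are placeholders rather than measured data. It only illustrates the ordering, where each stage awaits the previous one before it runs.

```python
import asyncio


async def inject_chaos(experiment_name: str) -> dict:
    """Stand-in for the Chaos Injector Agent; a real one would run the experiment."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"experiment": experiment_name, "status": "completed"}


async def collect_metrics(run: dict) -> list[dict]:
    """Stand-in for the Monitor Agent; returns placeholder samples, not real data."""
    await asyncio.sleep(0)
    return [
        {"experiment": run["experiment"], "metric": "cpu", "value": 0.42},
        {"experiment": run["experiment"], "metric": "latency_ms", "value": 210.0},
    ]


async def build_report(rows: list[dict]) -> str:
    """Stand-in for the Reporter Agent; summarizes rows instead of plotting them."""
    await asyncio.sleep(0)
    metrics = ", ".join(sorted(row["metric"] for row in rows))
    return f"{rows[0]['experiment']}: {len(rows)} samples ({metrics})"


async def run_pipeline(experiment_name: str) -> str:
    """Run the three stages in order, awaiting each before starting the next."""
    run = await inject_chaos(experiment_name)
    rows = await collect_metrics(run)
    return await build_report(rows)


if __name__ == "__main__":
    print(asyncio.run(run_pipeline("pod-crash")))
```

Awaiting each stage explicitly keeps the data dependency visible: metrics are only collected once the experiment has finished, and the report only sees rows that were actually gathered.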

Accomplishments that I am proud of

Functional MVP: I built a working chaos engineering tool that triggers pod crashes and (optionally) latency, collects metrics, and visualizes results—all in a few days!
Google Cloud Integration: Seamlessly combining GKE, BigQuery, Cloud Storage, Cloud Monitoring, and Vertex AI showcased my ability to leverage Google Cloud’s ecosystem.
ADK Mastery: Orchestrating three ADK agents (Chaos Injector, Monitor, Reporter) demonstrated multi-agent automation, a core hackathon goal.
User-Friendly UI: The Vercel-hosted Next.js dashboard is intuitive, making chaos engineering accessible to non-experts.

What I learned

Chaos Engineering: I learned how to simulate failures (pod crashes, latency) to improve system resilience, and why it’s critical for cloud apps.
Google Cloud: Deepened my skills in GKE (from prior use) and mastered BigQuery, Cloud Storage, and Vertex AI for data and visualization.
ADK: Gained hands-on experience with multi-agent systems, async Python, and the ADK Starter Pack.
Frontend Deployment: Discovered how easy Vercel makes Next.js deployment, which streamlined my frontend hosting.
Teamwork: Collaborating under hackathon pressure taught me to prioritize tasks and communicate effectively.

What's next for Voltaros

More Chaos Experiments: Add disk failure or resource exhaustion triggers to test broader resilience scenarios.
Vertex AI Enhancements: Train an anomaly detection model on BigQuery metrics to flag unusual patterns post-chaos.
Enterprise Features: Introduce user authentication, experiment scheduling, and multi-cluster support for production use.
Open Source: Release Voltaros on GitHub to empower the community to contribute chaos engineering tools.
Commercialization: Explore integrating Voltaros with Google Cloud Marketplace as a DevOps solution.
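
As a starting point for the anomaly detection item above, a simple statistical baseline could come before any trained model. The sketch below swaps in a plain z-score check, not a Vertex AI model: it flags post-chaos samples that sit more than a chosen number of standard deviations from a pre-chaos baseline. The function name, the threshold, and the sample values are all hypothetical, and in practice the inputs would be read from the BigQuery metrics table.

```python
from statistics import mean, pstdev


def flag_anomalies(baseline: list[float], samples: list[float],
                   threshold: float = 3.0) -> list[int]:
    """Return indexes of samples whose z-score against the baseline exceeds threshold."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A flat baseline has no spread, so any deviation counts as unusual.
        return [i for i, value in enumerate(samples) if value != mu]
    return [i for i, value in enumerate(samples)
            if abs(value - mu) / sigma > threshold]


if __name__ == "__main__":
    # Placeholder latency values in milliseconds, not measured data.
    pre_chaos = [100.0, 102.0, 98.0, 101.0, 99.0]
    post_chaos = [100.0, 103.0, 250.0, 99.0]
    print(flag_anomalies(pre_chaos, post_chaos))
```

A check like this gives a reference point for judging whether a trained model later adds anything beyond what a fixed threshold already catches.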