Vulkan

🧠 Inspiration

Modern DevOps teams drown in alerts and dashboards, reacting to issues long after they start. We wanted to flip that model by creating agents that monitor, think, and act autonomously. Inspired by self-healing systems and AI copilots, we envisioned infrastructure that fixes itself before engineers even wake up.

⚙️ What it does

Vulkan is a network of AI-driven DevOps agents that continuously monitor CPU, memory, and network metrics. When an anomaly occurs, an agent detects it, triggers an automated fix (such as a restart or scaling), and opens a detailed Jira ticket with a post-incident summary. The agents learn from past incidents over time, reducing alert noise and improving incident response.
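
As a rough illustration of the detect-then-act step (a sketch, not the project's actual agent code), the Python below flags a metric sample whose z-score against recent history exceeds a cutoff and maps it to a remediation. The `Incident` type, the `detect_anomaly` function, the 3.0 cutoff, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional, Sequence


@dataclass
class Incident:
    metric: str
    value: float
    baseline: float
    action: str  # hypothetical action name, e.g. "restart" or "scale_up"


def detect_anomaly(
    metric: str,
    history: Sequence[float],
    latest: float,
    z_threshold: float = 3.0,  # assumed cutoff, not taken from the project
) -> Optional[Incident]:
    """Return an Incident if `latest` sits far above recent `history`, else None."""
    if len(history) < 2:
        return None  # not enough samples to estimate a baseline
    baseline = mean(history)
    # Guard against a perfectly flat history, where the deviation is zero.
    spread = pstdev(history) or 1e-9
    z_score = (latest - baseline) / spread
    if z_score < z_threshold:
        return None
    # Assumed policy: scale out on CPU pressure, restart for anything else.
    action = "scale_up" if metric == "cpu" else "restart"
    return Incident(metric, latest, baseline, action)


incident = detect_anomaly("cpu", [40, 42, 41, 39, 43], 95)
if incident is not None:
    print(f"{incident.metric} anomaly, suggested action: {incident.action}")
```

In a setup like the one described here, the fixed rule at the end would presumably be replaced by a call to the reasoning model, with the `Incident` fields feeding the Jira summary.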

🧩 How we built it

We used Stack AI to orchestrate agents, Snowflake for real-time metric storage, and Anthropic Claude models for reasoning and decision-making. Alerts trigger automated workflows through GitHub, Jira, and Gmail, while dashboards visualize live CPU data. The system includes a feedback loop that helps agents adjust thresholds and improve autonomously.
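
One simple way a feedback loop like this could adjust a threshold is sketched below. It assumes each alert is later labeled as a real incident or a false alarm; the function name, step sizes, and bounds are illustrative assumptions, not values from the project.

```python
def adjust_threshold(
    threshold: float,
    was_real_incident: bool,
    raise_step: float = 2.0,   # assumed step after a false alarm
    lower_step: float = 1.0,   # assumed step after a confirmed incident
    floor: float = 50.0,       # assumed lower bound, in percent CPU
    ceiling: float = 95.0,     # assumed upper bound, in percent CPU
) -> float:
    """Nudge a CPU alert threshold (in percent) after one labeled alert."""
    if was_real_incident:
        # Confirmed incident: become slightly more sensitive.
        threshold -= lower_step
    else:
        # False alarm: become less sensitive to cut alert noise.
        threshold += raise_step
    # Clamp so repeated feedback can neither silence alerting nor flood it.
    return min(ceiling, max(floor, threshold))


threshold = 80.0
for label in (False, False, True):  # two false alarms, then a real incident
    threshold = adjust_threshold(threshold, label)
```

The asymmetric steps are a design choice: false alarms push the threshold up faster than real incidents pull it down, which favors noise reduction while the clamp keeps the agent from drifting out of a sane range.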

🧱 Challenges we ran into

Integrating multiple APIs (Snowflake, Stack AI, Anthropic) into a cohesive real-time loop was tough. Managing JSON schemas, quoted identifiers in Snowflake, and ensuring stable data flow across tools required careful debugging. Building realistic log simulations to trigger alerts was another challenge.
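
Two of these pain points lend themselves to short sketches. In Snowflake, a double-quoted identifier is case-sensitive and an embedded double quote is escaped by doubling it, so a small helper avoids hand-written quoting mistakes. For alert testing, a seeded generator can produce a calm CPU series with one injected spike. Both function names and all default values below are hypothetical.

```python
import random
from typing import List


def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def simulate_cpu(
    n: int,
    spike_at: int,
    spike_len: int = 5,
    base: float = 35.0,    # assumed idle CPU level, in percent
    noise: float = 5.0,    # assumed jitter around the idle level
    spike: float = 95.0,   # assumed CPU level during the incident
    seed: int = 0,
) -> List[float]:
    """Generate `n` CPU percentages with a spike starting at index `spike_at`."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    samples = []
    for i in range(n):
        if spike_at <= i < spike_at + spike_len:
            value = spike + rng.uniform(-2.0, 2.0)
        else:
            value = base + rng.uniform(-noise, noise)
        # Keep values inside the valid 0 to 100 percent range.
        samples.append(round(min(100.0, max(0.0, value)), 1))
    return samples


samples = simulate_cpu(60, spike_at=30)
column = quote_identifier("cpu load")  # '"cpu load"'
```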

🏆 Accomplishments that we're proud of

We built a fully functional self-healing prototype that detects CPU spikes, triggers actions automatically, and learns from the results. Watching our agent autonomously shut down a high-load service and document the incident end to end was a huge milestone.

📚 What we learned

We learned how to combine LLM reasoning with structured data pipelines. AI can do more than chat: it can act. We also deepened our understanding of real-time observability, automation pipelines, and Snowflake’s JSON/VARIANT handling for storing the agents' decisions.
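
To make the VARIANT point concrete, here is a minimal sketch of storing one agent decision as JSON. Snowflake does not accept `PARSE_JSON` inside a `VALUES` clause, so the usual pattern is `INSERT ... SELECT`. The table `agent_decisions`, its VARIANT column `decision`, and the helper name are assumptions for illustration, and the `%s` placeholder assumes the pyformat binding style of the Snowflake Python connector.

```python
import json
from typing import Any, Dict, Tuple


def build_decision_insert(
    table: str, decision: Dict[str, Any]
) -> Tuple[str, Tuple[str]]:
    """Return (sql, params) that insert one decision into a VARIANT column.

    `table` is interpolated into the SQL text, so it must come from trusted
    code and never from user input. The JSON payload is passed as a bound
    parameter instead.
    """
    payload = json.dumps(decision, sort_keys=True)
    # INSERT ... SELECT is used because PARSE_JSON is rejected in VALUES.
    sql = f"INSERT INTO {table} (decision) SELECT PARSE_JSON(%s)"
    return sql, (payload,)


sql, params = build_decision_insert(
    "agent_decisions",  # hypothetical table name
    {"metric": "cpu", "value": 97.5, "action": "restart"},
)
# With an open connector cursor this would run as: cursor.execute(sql, params)
```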

🚀 What’s next for Vulkan

We plan to expand Vulkan’s scope to full-stack observability by integrating with Datadog, Grafana, and Kubernetes APIs. Future iterations will use reinforcement learning to optimize responses and will orchestrate multiple agents collaborating in real time. The long-term vision: fully autonomous, self-optimizing infrastructure.