A professional-grade log diagnosis agent that leverages Google's Generative AI to autonomously investigate production incidents, analyze logs, search codebases, and formulate root causes.
Our vision is to build an Agentic AI system tailored specifically for the Cloud Ops domain. While deploying a fully integrated system on Google Cloud or Azure is the ultimate goal—and where the real business value lies—we are simulating the core mechanics locally using VMs and Docker for this hackathon.
Whenever an issue occurs, IT and Ops teams currently have to manually dig through terminals and scroll through endless logs to troubleshoot. It’s tedious, time-consuming, and repetitive.
We are building a system that leverages AI agents and Retrieval-Augmented Generation (RAG) to automate troubleshooting and help Ops teams resolve common issues instantly.
- Smart Log Extraction: Instead of dumping everything into an LLM, the Agent uses timestamps and specific error tags to run a log-locator function (an approximate algorithm). This pulls only the most relevant log snippets to use as context.
- AI Diagnostics: This precise context is sent to an LLM (like Google Gemini). The AI analyzes the error characteristics and generates a suggested solution along with a diagnostic report.
- Human-in-the-Loop UI: We provide a clean UI for IT administrators to review the AI's diagnostic report—similar to viewing and editing a README.md on GitHub. Crucially, any high-risk scripts or commands suggested by the AI are explicitly flagged for safety.
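As a rough illustration of the log-locator idea (hypothetical names and a simplified heuristic, not the project's actual implementation), filtering by a time window around the incident timestamp plus error tags might look like:

```python
from datetime import datetime, timedelta

def locate_log_snippets(log_lines, incident_time, window_minutes=5,
                        tags=("ERROR", "CRITICAL")):
    """Return only lines near the incident time that carry an error tag.

    log_lines is a list of (iso_timestamp, text) pairs;
    incident_time is a datetime. Both names are illustrative.
    """
    window = timedelta(minutes=window_minutes)
    snippets = []
    for ts, text in log_lines:
        stamp = datetime.fromisoformat(ts)
        # Keep the line only if it is inside the window AND tagged as an error.
        if abs(stamp - incident_time) <= window and any(tag in text for tag in tags):
            snippets.append((ts, text))
    return snippets
```

Only these few snippets, rather than the full log file, become the LLM's context.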
Once the Ops team reviews and approves the AI's solution and it successfully fixes the issue, the entire ticket is saved into a Vector Database. Over time this builds a highly accurate, company-specific corpus.
The next time a similar issue triggers an alert:
- Saving Compute: We don't need to waste AI inference tokens re-analyzing the problem from scratch.
- Semantic Search: The system uses vector/semantic search to find the exact reference case in the database.
- Automated or Guided Resolution: The AI Agent can either execute the validated step-by-step instructions strictly as written, or serve the solution up as a quick reference guide for the Ops staff to implement manually.
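The semantic-search step can be sketched as a plain cosine-similarity lookup (a toy stand-in for a real vector database; `find_reference_case` and the threshold value are illustrative assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find_reference_case(query_embedding, ticket_db, threshold=0.85):
    """Return the closest resolved ticket, or None if nothing is similar enough.

    ticket_db entries are dicts with an 'embedding' key (toy schema).
    """
    best = max(ticket_db, key=lambda t: cosine(query_embedding, t["embedding"]),
               default=None)
    if best and cosine(query_embedding, best["embedding"]) >= threshold:
        return best
    return None
```

If a match clears the threshold, its validated fix is replayed or surfaced to the Ops staff; otherwise the incident goes to the full AI diagnosis path.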
- Project Vision: Agentic AI for Cloud Ops
- Features
- Architecture
- Setup and Installation
- Configuration
- API Endpoints
- Usage
- Testing
- Autonomous Incident Triage: Employs a Senior Production Incident Engineer persona (ReAct pattern) to investigate and mitigate production failures.
- Dynamic Log Analysis: Uses `fetch_dynamic_log_snippet` to perform targeted temporal searches around incident timestamps, reducing token overhead while increasing context accuracy.
- Codebase Search: Employs a `SelectiveCodeRetriever` to correlate log patterns with specific codebase logic.
- Root Cause Formulation: Delivers evidence-backed hypotheses with quantified confidence levels and actionable remediation steps.
- API Interface: FastAPI-powered endpoints for seamless integration with monitoring webhooks (e.g., Uptime Kuma).
- In-Memory Store: Thread-safe in-memory database for efficient job management and reporting.
The project is structured around several key components:
- `main.py`: The FastAPI application entry point, utilizing modern `lifespan` handlers for managed startup and shutdown of the background worker.
- `core/worker.py`: The `AgentWorker` that orchestrates the `ReasoningAgent` investigation loop.
- `agent/core.py`: Implements the `ReasoningAgent`, a professional-grade AI agent using Google's Generative AI and native function calling for deep investigation.
- `memory/store.py`: An `InMemoryStore` singleton for tracking jobs, updates, and final reports.
- `tools/agent_tools.py`: A suite of professional engineering tools:
  - `fetch_dynamic_log_snippet`: Retrieves logs within specific time windows.
  - `search_code`: Performs semantic and keyword searches across the codebase.
  - `read_incident_context`: Extracts metadata and initial state from the incident job.
  - `update_investigation_report`: Finalizes the root cause analysis with structured findings.
- `tools/retriever_logic.py`: The logic for surgical code extraction.
- `config.py`: Pydantic-based configuration management with support for `.env` overrides.
- `schemas.py`: Strict validation models for incident payloads and responses.
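To illustrate the pattern behind the job store, a thread-safe singleton could be sketched like this (a simplified illustration, not the actual `memory/store.py` code; method names are assumptions):

```python
import threading

class InMemoryStore:
    """Thread-safe singleton sketch for tracking jobs and their status."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked creation under a class-level lock keeps one instance.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._jobs = {}
                cls._instance._jobs_lock = threading.Lock()
        return cls._instance

    def create_job(self, job_id, payload):
        with self._jobs_lock:
            self._jobs[job_id] = {"status": "queued", "payload": payload,
                                  "updates": []}

    def update_status(self, job_id, status):
        with self._jobs_lock:
            self._jobs[job_id]["status"] = status

    def get_job(self, job_id):
        with self._jobs_lock:
            return self._jobs.get(job_id)
```

Because both the API handlers and the background worker touch the same jobs, every access goes through a lock rather than relying on dict operations alone.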
- Python 3.11 or higher.
This project uses uv for dependency management.
1. Install `uv`: If you don't have `uv` installed, you can install it using pip:

   ```shell
   pip install uv
   ```

   Or refer to the uv documentation for other installation methods.

2. Install project dependencies:

   ```shell
   uv sync
   ```
Create a `.env` file in the project root directory and populate it with the necessary environment variables:

```
GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
```
Replace `"YOUR_GEMINI_API_KEY"` with your actual Google Gemini API key. You can obtain one from Google AI Studio.
You can also create `.env.local` to override values from `.env` on your machine (for example when rotating API keys).
Order of precedence in this app is:

1. Real environment variables
2. `.env.local`
3. `.env`
After changing either file, restart the API process so the worker picks up the new credentials.
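That precedence can be sketched in plain Python (an illustration of the lookup order only, not the project's actual Pydantic-based loader; `resolve_setting` and `load_env_file` are hypothetical helpers):

```python
import os

def load_env_file(path):
    """Parse simple KEY=VALUE lines from a dotenv-style file (sketch)."""
    values = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, val = line.partition("=")
                    values[key.strip()] = val.strip().strip('"')
    except FileNotFoundError:
        pass  # missing files simply contribute nothing
    return values

def resolve_setting(name, default=None):
    """Real environment variables win, then .env.local, then .env."""
    if name in os.environ:
        return os.environ[name]
    for path in (".env.local", ".env"):
        values = load_env_file(path)
        if name in values:
            return values[name]
    return default
```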
The `config.py` file defines the application's settings. These settings can be configured via environment variables or a `.env` file. Key configurable settings include:
- `GEMINI_API_KEY`: Your API key for Google Gemini.
- `GEMINI_MODEL`: The Gemini model to use (default: `gemini-2.0-flash`).
- `MAX_CONTEXT_FILES`: Maximum number of context files to retrieve.
- `MAX_CONTEXT_EXCERPT_CHARS`: Maximum characters for code excerpts.
- `ALLOWED_READ_ROOTS`: Comma-separated list of directories the agent is allowed to read (e.g., `src,app,config,etc,services,scripts,infra,deploy,opt`).
- `LOG_DIRECTORY`: The directory where log files are stored (default: `logs`).
The API is built using FastAPI and provides the following endpoints:
- `POST /api/v1/jobs`
  - Description: Submits a new incident investigation job.
  - Request Body: `UptimeKumaJobCreate` schema (e.g., from an Uptime Kuma webhook).
  - Response: `JobCreatedResponse` containing the `job_id` and `status` (`"queued"`).
- `GET /api/v1/jobs/{job_id}`
  - Description: Retrieves the current status and details of a specific investigation job.
  - Response: Job details from the in-memory store.
- `GET /api/v1/jobs/{job_id}/result`
  - Description: Retrieves the final investigation report for a completed job.
  - Response: Investigation report details.
- `GET /health`
  - Description: Health check endpoint.
  - Response: `{"status": "alive", "storage": "in-memory"}`
These aliases exist for compatibility with the HackCanada frontend while keeping the current in-memory architecture and worker flow.
- `POST /api/v1/analysis/jobs`
  - Behavior: Alias of `POST /api/v1/jobs`.
- `GET /api/v1/analysis/jobs/{job_id}`
  - Behavior: Alias of `GET /api/v1/jobs/{job_id}`.
- `GET /api/v1/analysis/jobs/{job_id}/result`
  - Behavior: Alias of `GET /api/v1/jobs/{job_id}/result`.
- `GET /api/v1/analysis/jobs/{job_id}/summary`
  - Response: `{ "incident_id": str, "summary_text": str, "summary_markdown": str, "confidence": float }` for completed reports.
- `GET /api/v1/analysis/jobs/{job_id}/download`
  - Response: Downloadable JSON report attachment (`analysis-report-{job_id}.json`).
- `GET /api/v1/analysis/incidents`
  - Response shape (frontend list): `id`, `service`, `serviceType`, `status`, `logs`, `confidence`, `proposedFix`. `proposedFix` is `{ description, steps, markdown, destructiveActions, targetNode }` when a report exists, otherwise `null`.
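A minimal stdlib-only Python client for the job endpoints might look like the following sketch (it assumes a locally running server; the non-terminal status values `"queued"` and `"running"` are assumptions beyond the `"queued"` value documented above):

```python
import json
import time
import urllib.request

BASE = "http://localhost:8000"

def job_endpoints(base, job_id):
    """Build the status/result URLs a client polls for a given job."""
    return {
        "status": f"{base}/api/v1/jobs/{job_id}",
        "result": f"{base}/api/v1/jobs/{job_id}/result",
    }

def submit_job(payload, base=BASE):
    """POST an incident payload and return the new job_id."""
    req = urllib.request.Request(
        f"{base}/api/v1/jobs",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]

def wait_for_report(job_id, base=BASE, poll_seconds=5):
    """Poll the job status until it leaves the queue, then fetch the report."""
    urls = job_endpoints(base, job_id)
    while True:
        with urllib.request.urlopen(urls["status"]) as resp:
            job = json.load(resp)
        if job.get("status") not in ("queued", "running"):
            break
        time.sleep(poll_seconds)
    with urllib.request.urlopen(urls["result"]) as resp:
        return json.load(resp)
```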
To run the FastAPI application, ensure you have set up your environment variables and installed dependencies.
```shell
uvicorn src.diagnosis_agent.main:app --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000.
1. Cloud Services Monitoring Dashboard
Real-time visibility into the health and connectivity of nodes within the infrastructure, providing the primary signal for incident detection.

2. Automated Incident Detection
The system identifies a service failure (e.g., Plex Media Server "Offline") and captures the diagnostic stream for investigation.

3. AI-Powered Diagnostic Report
A comprehensive analysis generated by the Gemini-powered agent, featuring evidence-backed root causes, confidence scores, and surgical remediation steps.

You can use curl or any API client to submit an incident job:
```shell
curl -X POST "http://localhost:8000/api/v1/jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "monitor": "my-service-monitor",
    "status": "down",
    "msg": "Service is down, critical errors in logs.",
    "url": "http://my-service.com",
    "time": "2023-10-27T10:00:00Z",
    "log_snippets": [
      {
        "timestamp": "2023-10-27T09:59:00Z",
        "source": "backend-service",
        "line": "ERROR: Database connection failed: Connection refused"
      },
      {
        "timestamp": "2023-10-27T09:59:05Z",
        "source": "backend-service",
        "line": "CRITICAL: Unable to process request, shutting down."
      }
    ],
    "metadata": {"team": "devops", "severity": "P1"}
  }'
```

The project uses pytest for testing. The test suite includes unit and integration tests for the FastAPI endpoints, using mocks for external dependencies like the Gemini AI worker.
1. Install pytest (if not already installed):

   ```shell
   pip install pytest
   ```

2. Create a local test env file (optional, never commit it):

   ```shell
   cp .env.test.example .env.test
   ```

3. Run the API tests:

   ```shell
   PYTHONPATH=src pytest tests/test_api.py
   ```
The tests verify:
- API Health status.
- Job creation and queuing.
- Job retrieval and status tracking.
- Integration with sample input files (e.g., `src/sources/sample_input.json`).
- Error handling for missing resources.


