SherLock

Inspiration

Modern distributed systems are too complex for human intuition alone. When a production outage hits, SREs often spend 80% of their time just finding the root cause, jumping between logs, traces, and metrics in silos. We built SherLock to turn that "investigation rabbit hole" into a 60-second autonomous process.

What it does

SherLock is an autonomous SRE platform that automates incident investigation and remediation.

It ingests telemetry (AWS) and maps it into a knowledge graph. An AI agent (OpenAI + Tavily) analyzes the graph to identify the root cause and blast radius. A human-readable dashboard (Next.js) then presents 1-click remediation steps, each gated by MFA.
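To make "blast radius" concrete, here is a minimal sketch of the idea in plain Python. It is not SherLock's actual code: the real system stores topology in Neo4j, while this example uses an in-memory dictionary, and the service names, the `CALLS` record shape, and both function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical flat telemetry: each record says one service called another.
CALLS = [
    {"caller": "web", "callee": "checkout"},
    {"caller": "checkout", "callee": "payments"},
    {"caller": "checkout", "callee": "inventory"},
    {"caller": "payments", "callee": "postgres"},
    {"caller": "search", "callee": "elasticsearch"},
]


def build_dependents(calls):
    """Map each service to the set of services that call it directly."""
    dependents = defaultdict(set)
    for record in calls:
        dependents[record["callee"]].add(record["caller"])
    return dependents


def blast_radius(dependents, failed):
    """Return every service that transitively depends on `failed`."""
    affected = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        # .get() avoids creating empty entries in the defaultdict.
        for caller in dependents.get(service, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


dependents = build_dependents(CALLS)
print(sorted(blast_radius(dependents, "postgres")))
# prints ['checkout', 'payments', 'web']
```

The traversal walks the dependency edges in reverse, from the failed service to its callers, so a database failure surfaces every upstream service that can be affected by it.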

How we built it

Intelligence: OpenAI GPT-4o with LangGraph for multi-step reasoning.
Topology: Neo4j Aura to store service dependencies and propagate failure signals.
External Intel: Tavily.ai for automated vendor status and CLI-based investigation.
Backend: FastAPI with Pydantic v2 and Prometheus metrics.
Frontend: Next.js 14 with Tailwind CSS and a custom dark theme.
Infrastructure: AWS MSK (Kafka) for ingestion, S3 for audit logs, and Terraform for infrastructure as code (IaC).

Challenges we ran into

Graph Mapping: Translating flat telemetry signals into meaningful Neo4j relationships was complex.
AI Reliability: Ensuring the agent produced deterministic, falsifiable hypotheses instead of just creative prose.
Security Gates: Designing an autonomous system that feels safe to use in production led us to implement MFA gates and confidence-based execution thresholds.

Accomplishments that we're proud of

End-to-End Flow: We built a pipeline that goes from a raw alert to a full RCA and remediation plan in under a minute.
Design: Creating a "premium" dashboard that makes complex SRE data feel intuitive and actionable.
100% Test Pass: Achieving a 100% pass rate across 29 pytest tests covering the entire API layer.
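The MFA gates and confidence-based execution thresholds described under Security Gates could look roughly like the sketch below. This is an illustration under assumptions, not the project's implementation: the 0.8 threshold, the `RemediationStep` class, the `gate` function, and the three outcome strings are all hypothetical, since the writeup does not give the real values or names.

```python
from dataclasses import dataclass

# Assumed threshold; the writeup does not state the value SherLock uses.
CONFIDENCE_THRESHOLD = 0.8


@dataclass(frozen=True)
class RemediationStep:
    description: str
    confidence: float  # agent's confidence in the diagnosis, 0.0 to 1.0


def gate(step, mfa_verified, threshold=CONFIDENCE_THRESHOLD):
    """Decide what may happen to a proposed remediation step."""
    if step.confidence < threshold:
        return "needs_review"  # too uncertain to offer 1-click execution
    if not mfa_verified:
        return "needs_mfa"     # confident enough, but operator not yet verified
    return "execute"


restart = RemediationStep("Restart the payments deployment", confidence=0.92)
print(gate(restart, mfa_verified=False))
print(gate(restart, mfa_verified=True))
```

Checking confidence before MFA means a low-confidence step never reaches the 1-click path, regardless of who is logged in.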

What we learned

The power of knowledge graphs (Neo4j) in providing the "system context" that LLMs need to stop hallucinating.
How to orchestrate complex AI workflows using stateful graphs (LangGraph) rather than simple linear chains.
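The difference between a stateful graph and a linear chain is easiest to see in a toy example. The sketch below deliberately uses plain Python rather than LangGraph's API, and every name in it (`hypothesize`, `verify`, `route_after_verify`, `run`, the state keys) is hypothetical. The point it illustrates is the loop: a failed verification routes back to an earlier node, which a fixed linear chain cannot do.

```python
def hypothesize(state):
    """Pick the next untested candidate cause."""
    state["hypothesis"] = state["candidates"][state["attempts"]]
    state["attempts"] += 1
    return state


def verify(state):
    """Confirm the hypothesis only if the evidence supports it."""
    state["confirmed"] = state["hypothesis"] in state["evidence"]
    return state


def route_after_verify(state):
    """Conditional edge: stop on success or exhaustion, otherwise loop back."""
    if state["confirmed"] or state["attempts"] >= len(state["candidates"]):
        return "done"
    return "hypothesize"


NODES = {"hypothesize": hypothesize, "verify": verify}
EDGES = {"hypothesize": lambda state: "verify", "verify": route_after_verify}


def run(state, start="hypothesize"):
    """Run nodes until an edge routes to the terminal 'done' marker."""
    node = start
    while node != "done":
        state = NODES[node](state)
        node = EDGES[node](state)
    return state


result = run({
    "candidates": ["dns", "db_pool_exhausted", "bad_deploy"],
    "evidence": {"db_pool_exhausted"},
    "attempts": 0,
})
print(result["hypothesis"], result["attempts"])
# prints: db_pool_exhausted 2
```

Because the shared state carries the attempt count and the last hypothesis between nodes, the workflow can reject a guess and try again instead of committing to its first answer.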

What's next for SherLock

Part 1 Integration: Moving beyond mocks to connect full-scale production telemetry.
Proactive Investigation: Moving from "post-incident" to "pre-incident" by analyzing pre-alert signals.
Auto-Tuning: Implementing a feedback loop where engineer ratings automatically retrain the confidence model.