AegisDevOps: Your System's Self-Healing Guardian
What is this project, in plain English?
Imagine you run a business that depends on software: an online store, a payment platform, a hospital records system. Software fails. Servers crash. Databases time out. When they do, someone gets paged at 3 AM, diagnoses the problem, applies a fix, and writes it down somewhere.
AegisDevOps automates that entire process.
It watches your system around the clock, recognises when something goes wrong, decides what kind of problem it is, and sends the right specialist to fix it — automatically, in seconds, while your team sleeps.
The problem it solves
| Before AegisDevOps | After AegisDevOps |
|---|---|
| Engineer gets paged at 3 AM | System heals itself while the engineer sleeps |
| 20–40 minutes to diagnose and fix | Under 30 seconds |
| Fix knowledge lives in someone's head or a forgotten wiki | Fix knowledge is stored in a searchable, mathematical memory |
| Same error causes the same incident twice | System learns from every fix and gets faster each time |
| Engineers spend weekends on repeat incidents | Engineers spend time building new things |
How does it work? (The short version)
Think of AegisDevOps like a hospital emergency room with three roles:
The Triage Nurse (Orchestrator): When an alarm comes in, the triage nurse reads it and decides: "Is this a hospital infrastructure problem: do we need more beds, more staff, more equipment? Or is this a medical procedure problem: does the treatment plan need to be rewritten?"
The Infrastructure Team (AWS Agent): If the problem is infrastructure (the hospital is overloaded, a wing has gone offline, or a machine has broken down), the infrastructure team handles it. They scale up capacity, restart equipment, and restore normal operations. No one needs to rewrite any medical procedures.
The Medical Team (GitHub Agent): If the problem is the treatment plan itself (a medication dosage is wrong, or a procedure has a bug in it), the medical team steps in. They read the plan, correct it, and re-issue it. The hospital automatically picks up the updated version and starts using it.
The Medical Record (Moorcheh Memory): Every incident and its resolution gets recorded in a permanent, searchable memory. The next time a similar alarm comes in, the system already knows what worked and responds faster.
The full flow, step by step
- An alarm fires — your production system detects a problem and sends an alert.
- The Orchestrator reads it — it checks the memory for similar past incidents, then decides: is this an infrastructure problem or a code problem?
- The right agent gets called:
  - Infrastructure problem → AWS Agent connects to your cloud (AWS) and fixes it directly: scales up servers, restarts containers, adjusts capacity.
  - Code problem → GitHub Agent reads your codebase, writes the fix, and commits it. This automatically triggers a rebuild and redeployment, so the broken version is replaced with a working one.
- The resolution is saved — the fix gets added to memory so the system handles it even faster next time.
- You see it all live — a real-time dashboard shows every step as it happens, with confidence scores explaining why each decision was made.
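The classify-then-dispatch part of this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: `Category`, `classify`, `handle_alert`, the two placeholder agent functions, and the keyword hints are all hypothetical, and the keyword check stands in for the Orchestrator's real classification.

```python
from enum import Enum
from typing import Callable, Dict


class Category(Enum):
    INFRASTRUCTURE = "infrastructure"
    CODE = "code"


# Hypothetical keyword hints; a stand-in for the Orchestrator's real classification.
CODE_HINTS = ("module not found", "importerror", "syntaxerror", "traceback")


def classify(alert_text: str) -> Category:
    """Toy classifier: treat anything that looks like a code fault as CODE."""
    text = alert_text.lower()
    if any(hint in text for hint in CODE_HINTS):
        return Category.CODE
    return Category.INFRASTRUCTURE


def aws_agent(alert_text: str) -> str:
    # Placeholder: a real agent would scale or restart cloud resources here.
    return f"aws-agent handled: {alert_text}"


def github_agent(alert_text: str) -> str:
    # Placeholder: a real agent would commit a fix and trigger a redeploy here.
    return f"github-agent handled: {alert_text}"


# One specialist per category, so routing is a dictionary lookup.
AGENTS: Dict[Category, Callable[[str], str]] = {
    Category.INFRASTRUCTURE: aws_agent,
    Category.CODE: github_agent,
}


def handle_alert(alert_text: str) -> str:
    """Classify the alert, then dispatch it to the matching specialist."""
    return AGENTS[classify(alert_text)](alert_text)
```

Keeping the routing table separate from the classifier means a new specialist could be added without touching the classification logic.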
The "confidence check" explained simply
Before acting, the system always asks: "How sure am I that this fix is right?"
It uses a scoring system called Information-Theoretic Scoring (ITS) — think of it as a percentage of certainty based on meaning, not just keywords.
- 60% or higher → "I've seen this before and I'm confident. Acting now."
- 30% to just under 60% → "This looks familiar but I'm not certain. Flagging for human review."
- Below 30% → "I don't recognise this at all. I will not guess. Escalating to the on-call team."
This means the system never takes a wild guess on something unfamiliar. It would rather admit it doesn't know than make things worse — a critical property for any system running in a real business.
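The three bands above map naturally onto a small gate function. The sketch below is a minimal illustration, assuming the score arrives as a number between 0 and 1. The function name `confidence_gate` and the returned labels are hypothetical. The thresholds come from the text.

```python
ACT_THRESHOLD = 0.60     # from the text: 60% or higher -> act
REVIEW_THRESHOLD = 0.30  # from the text: below 30% -> escalate


def confidence_gate(score: float) -> str:
    """Map a confidence score in [0, 1] to one of three decisions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= ACT_THRESHOLD:
        return "act"
    if score >= REVIEW_THRESHOLD:
        return "human_review"
    return "escalate"
```

For example, `confidence_gate(0.18)` falls in the lowest band and returns `"escalate"`, which matches the 18% case described under the safety use case.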
Real-world use cases
1. E-commerce — traffic spike on sale day
"Every Black Friday, our checkout page goes down for 10 minutes because the servers get overwhelmed."
AegisDevOps detects the spike, classifies it as an infrastructure problem, and tells AWS to spin up more servers automatically. The checkout page stays up. Black Friday runs without a human touching anything.
2. Hospital — disk space running out
"Our patient records server occasionally crashes because log files accumulate and fill the disk. Nurses can't access records when it happens."
AegisDevOps detects the disk error, classifies it as infrastructure, and automatically clears old log files to free up space — all before any nurse notices an issue.
3. Startup — broken code deployment
"A developer pushed a change that broke a dependency. Our app is crashing with 'module not found' errors."
AegisDevOps detects the error, classifies it as a code problem, and sends the GitHub Agent to read the file, fix the dependency, and commit the change. The commit triggers the GitHub Actions pipeline, which rebuilds and redeploys the app. The whole thing happens without waking anyone up.
4. Payments platform — API rate limits
"Our payments service occasionally hits rate limits from a third-party API, causing failed transactions."
AegisDevOps detects the 429 error, classifies it as infrastructure (the service needs to slow down its requests), and tells AWS to adjust the service configuration. The fix is also saved to memory, so the next occurrence is matched to a known resolution and handled faster.
5. A brand new error — the safety case
"Something we've never seen before breaks."
The confidence check kicks in. The system says: "I don't recognise this. My confidence is 18%. I will not act. Escalating to on-call." A human investigates, applies the fix, and the agent saves that new fix to memory. The next time this exact error appears, it's handled automatically.
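The lookup and write-back loop in this use case can be illustrated with a toy in-memory store. This sketch does not use the Moorcheh API: `IncidentMemory`, `best_match`, `write_back`, and `similarity` are hypothetical names, and the word-overlap (Jaccard) score is a deliberately crude stand-in for the meaning-based scoring the real system relies on.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude stand-in for semantic scoring."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


@dataclass
class IncidentMemory:
    records: List[Tuple[str, str]] = field(default_factory=list)  # (error, fix)

    def best_match(self, error: str) -> Tuple[Optional[str], float]:
        """Return the stored fix most similar to `error`, with its score."""
        best_fix, best_score = None, 0.0
        for known_error, fix in self.records:
            score = similarity(error, known_error)
            if score > best_score:
                best_fix, best_score = fix, score
        return best_fix, best_score

    def write_back(self, error: str, fix: str) -> None:
        """Save a resolved incident so future lookups can find it."""
        self.records.append((error, fix))
```

With an empty store, `best_match` returns no fix and a score of 0.0, which a confidence gate would escalate. After a human resolves the incident and `write_back` stores it, the same error text scores 1.0 on the next lookup.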
Why this beats existing solutions
Most automated monitoring tools work on rules someone writes in advance: "If log contains X, run script Y." These are brittle — they only catch problems you already predicted.
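To make the brittleness concrete, here is a minimal sketch of the "if log contains X, run script Y" approach. The rule table, script names, and `match_rule` function are hypothetical examples, not taken from any particular monitoring tool.

```python
from typing import Optional

# Hypothetical rule table: substring to look for -> script to run.
RULES = {
    "disk full": "clear_old_logs.sh",
    "connection timed out": "restart_db_proxy.sh",
}


def match_rule(log_line: str) -> Optional[str]:
    """Return the script for the first rule whose substring appears in the log line."""
    line = log_line.lower()
    for needle, script in RULES.items():
        if needle in line:
            return script
    return None
```

A line such as `ERROR: Disk full on /dev/sda1` matches the first rule, but `ERROR: No space left on device` describes the same condition in different words and matches nothing, so no script runs.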
AegisDevOps is different in three ways:
- It understands meaning, not just words. Two error messages that say different things but mean the same problem will both be caught.
- It knows who to call. Rather than one generic response, it routes to a specialist — the AWS team for infrastructure, the GitHub team for code.
- It gets smarter over time. Every incident it handles makes it more capable for the next one. No one needs to update the rules.
What we built, technically
| Component | Plain English |
|---|---|
| Moorcheh | The long-term memory — stores every past incident and its resolution |
| Railtracks | The coordination layer — manages the agents and the flow of information between them |
| Orchestrator Agent | The triage nurse — classifies the problem and decides who handles it |
| AWS Agent | The infrastructure team — connected directly to your cloud, fixes server-level problems |
| GitHub Agent | The code team — reads your repo, writes fixes, and triggers redeployment |
| Claude Sonnet 4.6 | The intelligence behind each agent — understands natural language and reasons about complex situations |
| Confidence gate | The safety check — the system refuses to act unless it's certain enough |
| Write-back | The learning mechanism — every fix gets saved so the system improves continuously |
| AWS SNS + SQS | The alarm system — how production errors reach AegisDevOps in real time |
| GitHub Actions + ECR + ECS | The deployment pipeline — automatically rebuilds and redeploys fixed code to the cloud |
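As an illustration of the alarm path, the sketch below unwraps an SNS notification as it typically appears in an SQS message body when raw message delivery is off: a JSON envelope whose `Message` field holds the payload as a string. It assumes, without confirmation from the project, that the payload is either plain text or a CloudWatch-alarm-style JSON object with `AlarmName` and `NewStateReason` fields. The function name `parse_sqs_body` and the output keys are hypothetical.

```python
import json
from typing import Any, Dict


def parse_sqs_body(body: str) -> Dict[str, Any]:
    """Unwrap an SNS notification delivered to SQS (raw message delivery off)."""
    envelope = json.loads(body)
    message = envelope.get("Message", "")
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        # Plain-text message: keep it as the alert text.
        return {"topic": envelope.get("TopicArn"), "text": message}
    # Assumed CloudWatch-alarm-style payload; fall back to the raw message text.
    return {
        "topic": envelope.get("TopicArn"),
        "text": payload.get("NewStateReason", message),
        "alarm": payload.get("AlarmName"),
    }
```

Only the standard library is used here. Fetching messages from the queue would be a separate step and is left out of the sketch.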
What makes this "production-ready"?
Production-ready means it's not just a demo — it's safe to run on real systems that real people depend on. Here's what we built in:
- It won't guess. The confidence gate means the agent explicitly refuses to act when uncertain — and tells you why.
- It knows what it doesn't know. New error types it's never seen get escalated to a human rather than attempted blindly.
- It routes intelligently. Infrastructure problems and code problems get sent to different specialists, not handled with a one-size-fits-all script.
- It's fully transparent. Every decision is logged with its confidence score. You can always audit what happened and why.
- It scales. The memory system (Moorcheh) is 96% more efficient than traditional databases — it stays fast even as the knowledge base grows to thousands of incidents.
- It's visible. A live dashboard shows every step in real time — what error came in, what the agent decided, what action was taken, and what was saved to memory.
- It learns continuously. Every fix adds to the knowledge base. The longer it runs, the fewer incidents require human attention.
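The transparency point in the list above (every decision logged with its confidence score) could look something like the following. This is a sketch under assumptions: the `DecisionRecord` fields, `make_record`, and `to_log_line` are hypothetical, and the project's actual log format is not described in this write-up.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    alert: str
    category: str      # e.g. "infrastructure" or "code"
    confidence: float  # score in [0, 1]
    action: str        # e.g. "act", "human_review", "escalate"
    timestamp: str     # ISO 8601, UTC


def make_record(alert: str, category: str, confidence: float, action: str) -> DecisionRecord:
    """Build a record stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return DecisionRecord(alert, category, confidence, action, now)


def to_log_line(record: DecisionRecord) -> str:
    """Serialise a record as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON object per line keeps the log easy to append to and easy to filter later, for example to find every decision made below a given confidence.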
Who is this for?
- Engineering teams tired of repeat 3 AM incidents
- SRE / DevOps teams who want intelligent automation without writing fragile rule scripts
- CTOs and technical leaders looking to reduce incident response time and on-call burden
- Any organisation where software downtime directly costs money or harms people
Built With
- agent
- database
- mcp
- moorcheh
- python
- railtracks