Submitter Type: Individual
Category: Agentic AI

Inspiration

Every network engineer has lived through the 2am incident: an alert fires, Slack blows up, someone scrambles to SSH into a server, runs traceroutes, checks dashboards, and tries to figure out what broke and why. By the time a fix is deployed, 45 minutes have passed and thousands of users have already churned. We built NetHealer because that entire process (detect, diagnose, fix) is something AI should be doing autonomously, in seconds, not humans doing it manually in the middle of the night.

What it does

NetHealer is an autonomous network self-healing system. It continuously monitors real infrastructure telemetry from multiple sources (local machine metrics, ThousandEyes network tests, Prometheus, and SNMP) and runs a four-stage AI pipeline powered by Amazon Nova via AWS Bedrock:

  1. Detects anomalies by correlating latency, packet loss, CPU pressure, and node health in real time.
  2. Diagnoses the root cause using Amazon Nova, distinguishing between a DDoS attack, a BGP route flap, a node failure, or a security breach based on the pattern of signals.
  3. Plans a specific remediation action with confidence scoring and severity classification.
  4. Executes the fix automatically against real AWS infrastructure: Route53 DNS failover, SSM service restarts, Security Group IP blocking, and SNS escalation alerts.

The entire cycle runs in under 10 seconds. A real-time NOC dashboard shows the live network topology with animated data flow, telemetry streams, AI analysis results, and a natural-language AI Operator chat for querying infrastructure state.

How We Built It

We designed NetHealer to feel less like a monitoring tool and more like an autonomous operations system. From the beginning, the goal was to build something that could observe infrastructure, reason about failures, and respond in real time — the same way a human NOC team would, but faster and continuously. To make that possible, we split the architecture into two clearly defined layers: a Python backend that handles AI reasoning and infrastructure automation, and a Next.js frontend that functions as a live Network Operations Center dashboard. The two communicate through persistent WebSocket streams so that every telemetry signal, AI decision, and remediation step appears instantly on the screen. The result is a system where you can literally watch the network think and heal itself in real time.

The AI Pipeline

At the heart of NetHealer is a four-stage multi-agent reasoning pipeline, powered by Amazon Nova Lite through AWS Bedrock. Whenever new telemetry arrives, it enters the pipeline and flows through four reasoning stages:

Telemetry Analysis → Root Cause Diagnosis → Remediation Planning → Automated Execution

Each stage acts like a specialized AI operator. Instead of using a single prompt, we structured the pipeline so that each agent receives a clear role and a full snapshot of the network state. The agent then sends a structured prompt to Nova, receives a structured response, and passes that output to the next stage. This creates a deterministic reasoning chain, where every AI decision directly influences the next step in the process.

To make the system reliable enough to control infrastructure, Nova runs with a low temperature setting, ensuring consistent and structured outputs. When the result of an AI response could trigger a real infrastructure action, like rerouting DNS traffic or restarting services, predictability matters more than creativity.

Under the hood, the orchestrator runs each Nova inference in a background thread. This allows the FastAPI event loop to stay free and continue broadcasting updates to the dashboard between pipeline stages. The effect is surprisingly powerful: the moment telemetry arrives, the dashboard begins updating as the AI analyzes the problem, identifies a root cause, and generates a remediation plan, all within seconds. You're not just seeing the result. You're watching the reasoning process happen live.

Telemetry

A system that heals infrastructure has to understand it first. NetHealer pulls telemetry from two real data sources to give the AI a full picture of what's happening.

The first source is the local machine collector, built with the psutil library. Every three seconds it gathers host-level metrics including CPU utilization, memory usage, disk usage, network interface throughput, active TCP connections, and battery state. These signals provide insight into the health of the machine running the system. At the same time, NetHealer pulls real network performance data from ThousandEyes using their v7 API. These tests measure latency, packet loss, and jitter across real internet paths, including probes targeting Google DNS, AWS US-East, and the Bedrock API endpoint itself.

Combining these two telemetry streams gives Nova a powerful advantage. Instead of seeing only infrastructure metrics or only network metrics, the AI sees both layers at once. This allows it to reason about complex failures, distinguishing, for example, between a server overload and a network path degradation. The telemetry streams are merged into a single unified snapshot before entering the AI pipeline, ensuring every decision is based on a consistent view of the system.
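A minimal sketch of that merge step, assuming illustrative field names rather than the actual NetHealer schema:

```python
import time

def build_snapshot(local: dict, network: dict) -> dict:
    """Merge host metrics (e.g. gathered via psutil) with path metrics
    (e.g. fetched from the ThousandEyes v7 API) into one unified view.
    Field names here are assumptions for the sketch."""
    return {
        "collected_at": time.time(),
        "host": {
            "cpu_pct": local["cpu_pct"],
            "mem_pct": local["mem_pct"],
            "tcp_connections": local["tcp_connections"],
        },
        "network": {
            "latency_ms": network["latency_ms"],
            "loss_pct": network["loss_pct"],
            "jitter_ms": network["jitter_ms"],
        },
    }
```

Because both layers land in one timestamped snapshot, every downstream AI decision sees the same consistent state.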

Execution

Once Nova generates a remediation plan, NetHealer turns those decisions into real infrastructure actions. Every action produced by the AI is automatically categorized into three groups: ThousandEyes actions, safe AWS actions, and destructive AWS actions.

ThousandEyes actions, such as pausing monitoring tests, swapping agent pools, or adjusting alert thresholds, execute immediately. These actions improve monitoring visibility without affecting production traffic. Safe AWS actions also run automatically. These include operations like rerouting traffic using Route53 DNS weighting or restarting services on EC2 instances through AWS Systems Manager.

For more aggressive responses, such as blocking IP addresses or isolating infrastructure nodes, NetHealer adds a human checkpoint. These actions appear on the dashboard and can be approved with a single click. This approach keeps the system fully autonomous for safe recovery tasks, while still maintaining responsible oversight for actions that could disrupt live systems.
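The classification logic can be sketched roughly like this; the action names and category mapping are illustrative assumptions, not the real action catalog:

```python
from enum import Enum

class ActionClass(Enum):
    THOUSANDEYES = "thousandeyes"     # monitoring-only, execute immediately
    SAFE_AWS = "safe_aws"             # e.g. Route53 weighting, SSM restart
    DESTRUCTIVE_AWS = "destructive"   # e.g. IP block, node isolation

# Hypothetical action catalog for the sketch.
ACTION_CLASSES = {
    "pause_te_test": ActionClass.THOUSANDEYES,
    "adjust_dns_weight": ActionClass.SAFE_AWS,
    "restart_service": ActionClass.SAFE_AWS,
    "block_ip": ActionClass.DESTRUCTIVE_AWS,
    "isolate_node": ActionClass.DESTRUCTIVE_AWS,
}

def dispatch(action: str) -> str:
    cls = ACTION_CLASSES.get(action)
    if cls is None:
        return "rejected"          # unknown actions are never executed
    if cls is ActionClass.DESTRUCTIVE_AWS:
        return "pending_approval"  # surfaced on the dashboard for one-click approval
    return "executed"
```

Anything the model proposes that is not in the catalog is simply rejected, which keeps the model's output space bounded.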

Frontend

The NetHealer dashboard is designed to feel like a modern Network Operations Center. It's built as a Next.js application, with every component implemented from scratch. We intentionally avoided external UI libraries so we could tailor the interface exactly to the system's behavior.

The centerpiece of the interface is the live network topology map, rendered entirely with HTML Canvas. The map runs at 60 frames per second using requestAnimationFrame, enabling fluid animation and real-time updates. Nodes represent infrastructure components, and connections between them show the active data paths across the network. Each connection contains animated light streaks that travel continuously between nodes, simulating real network traffic. These visual signals aren't static; they respond directly to telemetry from the backend. When latency rises, pulses slow down. When packet loss occurs, colors shift from blue to yellow to red. As the AI remediates issues, the network visibly returns to a healthy state.

The entire interface updates instantly through a WebSocket connection to the backend, meaning every anomaly detection, root cause analysis, and remediation action appears on screen as it happens. Instead of reading logs after the fact, operators can watch the system diagnose and repair the network live.

In the end, NetHealer behaves less like a monitoring dashboard and more like an autonomous infrastructure control system: one that continuously observes, reasons, and repairs the network before small issues become real outages.

Challenges We Ran Into

Building NetHealer wasn’t just about wiring APIs together — the hardest problems were making the system reliable, fast, and trustworthy enough to automate infrastructure decisions.

Multi-Signal Reasoning with AI

One of the hardest challenges was teaching the AI to reason across multiple simultaneous telemetry signals instead of reacting to single metrics. In real infrastructure, a single spike means almost nothing. CPU at 80% might be normal load, a batch job, or the start of a DDoS. Packet loss might be a transient routing change rather than a failure. Early versions of the pipeline frequently misclassified normal fluctuations as incidents.

We solved this by redesigning how telemetry is presented to Nova. Instead of sending raw metrics, we pass a structured network snapshot that includes correlated signals — latency trends, packet loss ratios, CPU deltas, connection counts, and node health states — all within the same context window. Prompt instructions explicitly tell Nova to evaluate patterns across signals rather than individual thresholds, which dramatically improved accuracy.
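A hedged sketch of what such a structured handoff to Nova might look like; the prompt wording and field names are illustrative, not the actual NetHealer prompts:

```python
import json

def build_diagnosis_prompt(snapshot: dict, baseline: dict) -> str:
    """Present correlated signals as structured context, with deltas from
    baseline computed up front, instead of dumping raw metrics."""
    context = {
        "current": snapshot,
        "delta_from_baseline": {
            k: round(snapshot[k] - baseline[k], 2)
            for k in snapshot if k in baseline
        },
    }
    return (
        "You are a network root-cause analyst. Evaluate the correlated "
        "signals below as a pattern, not as individual thresholds. "
        'Respond with JSON: {"root_cause": ..., "confidence": ...}.\n'
        + json.dumps(context, indent=2)
    )
```

Precomputing the deltas means the model reasons over "latency is 150 ms above baseline" rather than having to infer that from two raw numbers.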

Making AI Decisions Safe Enough to Touch Infrastructure

Allowing AI to generate infrastructure actions introduces a serious problem: trust. Nova can propose remediation steps, but blindly executing them could be dangerous if the model produces an incorrect action or misinterprets a situation. We had to design safeguards to prevent the AI from accidentally isolating nodes or blocking legitimate traffic. The solution was to introduce a structured action schema and a safety classification system. Every action returned by Nova must match a predefined action type, which maps to a verified handler in the automation layer. Any action capable of taking infrastructure offline is automatically flagged as destructive and routed through a human approval gate. This ensures the system remains autonomous for safe recovery tasks while preventing catastrophic decisions.

Latency in a Multi-Agent AI Pipeline

Another major challenge was pipeline latency. Each stage of the AI pipeline calls the Bedrock API. Running four Nova inference calls sequentially quickly adds up, and the early version of the system felt slow when telemetry spikes triggered full analysis cycles. We redesigned the orchestration model so telemetry ingestion, anomaly scoring, and pipeline scheduling run concurrently. The system now performs lightweight anomaly scoring first and only triggers the full multi-agent reasoning chain when an actual incident threshold is crossed. Agent calls are executed in background threads to prevent blocking the event loop, allowing the dashboard to update while the AI pipeline is still processing. This reduced response latency dramatically and made the system feel responsive instead of batch-oriented.
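The two-tier gating described above could be sketched as follows, with made-up weights and thresholds:

```python
def anomaly_score(snapshot: dict) -> float:
    """Cheap, local heuristic scoring; no model calls.
    Weights and thresholds are illustrative, not NetHealer's real values."""
    score = 0.0
    if snapshot["latency_ms"] > 150:
        score += 0.4
    if snapshot["loss_pct"] > 1.0:
        score += 0.4
    if snapshot["cpu_pct"] > 90:
        score += 0.2
    return score

def should_run_pipeline(snapshot: dict, threshold: float = 0.5) -> bool:
    """Only an above-threshold score triggers the four Bedrock calls."""
    return anomaly_score(snapshot) >= threshold
```

Normal fluctuations stay below the threshold and cost nothing; only genuine incidents pay the latency of the full reasoning chain.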

Real-Time Visualization Performance

The topology visualization turned out to be far more difficult than expected. The first two implementations used SVG and D3, but both struggled when animating dozens of simultaneous data pulses across edges. Frame rates dropped, and pulse animations flickered when node states changed. We eventually rebuilt the renderer entirely using HTML Canvas with manual draw loops, which allowed us to control rendering at the pixel level. Pulse effects are rendered using layered strokes and gradient falloffs to create smooth light streaks without artifacts. The renderer also needed to account for device pixel ratios so the visualization remains sharp on high-resolution displays while maintaining 60fps performance.

Reconciling Telemetry from Multiple Sources

Another challenge was combining telemetry from different systems that don’t always agree. Local machine metrics and ThousandEyes network probes sometimes report contradictory signals. For example, the host might appear healthy while network probes report high latency due to an upstream routing issue. To avoid false alerts, we built a telemetry normalization layer that merges signals into a single node health model. External network telemetry is given priority for connectivity status, while local system metrics drive host-level health indicators. This merging logic ensures the AI pipeline receives a coherent view of the system rather than conflicting signals.
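A minimal sketch of that priority rule (thresholds and field names are illustrative):

```python
def node_health(local: dict, probe: dict) -> dict:
    """Illustrative priority merge: ThousandEyes-style probe data decides
    connectivity, local host metrics decide host health."""
    connectivity = ("degraded"
                    if probe["loss_pct"] > 1.0 or probe["latency_ms"] > 200
                    else "healthy")
    host = ("degraded"
            if local["cpu_pct"] > 90 or local["mem_pct"] > 90
            else "healthy")
    overall = "degraded" if "degraded" in (connectivity, host) else "healthy"
    return {"connectivity": connectivity, "host": host, "overall": overall}
```

With this split, a healthy host behind a bad upstream route is reported as a connectivity problem rather than as conflicting signals.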

Handling Imperfect Monitoring Data

External monitoring APIs don’t always behave perfectly. ThousandEyes tests sometimes return incomplete or delayed results, and probes may fail for reasons unrelated to the infrastructure being monitored. Without safeguards, these gaps could cause the system to interpret missing data as failures. To address this, we implemented graceful degradation logic in the telemetry collectors. When test results are missing or stale, the system marks the node state as unknown rather than degraded and avoids triggering remediation until corroborating signals appear. This prevents the system from reacting to monitoring artifacts instead of real incidents.

These challenges forced us to treat NetHealer not just as an AI demo, but as a production-style infrastructure control system where reliability, safety, and real-time performance matter just as much as the intelligence of the AI itself.
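The staleness handling described in this section might look roughly like this; the window length and field names are assumptions:

```python
import time
from typing import Optional

STALE_AFTER_S = 30  # assumed staleness window for the sketch

def probe_state(result: Optional[dict], now: Optional[float] = None) -> str:
    """Missing or stale probe data maps to 'unknown', never 'degraded',
    so monitoring artifacts cannot trigger remediation on their own."""
    if now is None:
        now = time.time()
    if result is None or now - result.get("ts", 0) > STALE_AFTER_S:
        return "unknown"
    return "degraded" if result["loss_pct"] > 1.0 else "healthy"
```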

Accomplishments that we're proud of

  1. A genuinely working end-to-end autonomous remediation pipeline: not a demo with mocked responses, but real Amazon Nova reasoning on real telemetry data making real AWS API calls, with sub-10-second detection-to-remediation cycles on actual network anomalies.

  2. The network topology visualization: it looks and feels like a real NOC tool, with smooth animated data flow that reacts to live metrics.

  3. The ThousandEyes integration: real latency measurements from a global network intelligence platform feeding directly into the AI pipeline.

  4. The human-in-the-loop approval gate for destructive actions: the system is autonomous but not reckless.

What we learned

Amazon Nova is genuinely capable of multi-signal reasoning when given well-structured context. The key insight was treating each agent's prompt as a structured data handoff rather than a freeform question: giving Nova a JSON snapshot of the current network state, the delta from baseline, and the historical incident context produces dramatically better root cause analysis than asking it to reason from raw log text.

We also learned that building agentic systems requires thinking carefully about failure modes at every stage. What happens when Nova returns an unexpected format? What happens when an AWS API call fails mid-remediation? The orchestrator needs to handle partial failures gracefully without leaving infrastructure in an inconsistent state.
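A sketch of the defensive parsing that handles an unexpected Nova response format; the expected schema here is an assumption:

```python
import json
from typing import Optional

def parse_nova_response(raw: str) -> Optional[dict]:
    """Return a usable action dict, or None to skip this cycle.
    Never raises: a malformed model response must not crash the
    orchestrator or leave a remediation half-applied."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "action" not in data:
        return None
    return data
```

Returning None instead of raising lets the orchestrator treat a bad response like a missed cycle and simply wait for the next telemetry snapshot.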

What's next for NetHealer

  1. Predictive healing — using historical incident patterns to predict failures before they happen and pre-emptively reroute traffic or scale resources
  2. Multi-cloud support — extending the execution layer to Azure and GCP so the system can orchestrate remediation across hybrid cloud environments
  3. Runbook learning — having Nova learn from every incident it handles, building a knowledge base of what worked and what didn't to improve future remediation plans
  4. Slack and PagerDuty integration — so the AI can communicate with the on-call team in natural language, explain what it did and why, and escalate only when it genuinely needs human judgment
  5. Amazon Nova Pro upgrade — moving the root cause and remediation agents to Nova Pro for more complex multi-hop reasoning on large-scale network topologies with hundreds of nodes