💡 Inspiration

Penetration testing costs $15,000–$40,000 per engagement and takes weeks, because every step requires a skilled human: choosing the right tools, running them, parsing dense terminal output, separating real findings from noise, and writing a coherent report. Meanwhile, most software teams ship code that has never been properly security-tested.

We wanted to find out: can an LLM-orchestrated agentic pipeline close that gap? Not a wrapper around a single tool, but a genuine end-to-end pipeline that plans its own approach, runs 12 industry-standard security tools, interprets the results, constructs a causal exploit graph, and delivers a professional vulnerability report, all without a human in the chair.

The constraint we set: the entire system must run offline (including the LLMs) on ARM architecture. When you scan a target, you are mapping its attack surface, and that data can't go to a cloud API. Pulse defaults to Ollama with llama3.1:8b, meaning zero bytes leave your machine.

With Pulse, revive your application before it becomes a headline!


⚙️ What it does

Pulse accepts a URL or a local repository path and runs a fully automated penetration testing pipeline end-to-end:

  • Dynamic planning: An LLM reads a structural fingerprint of the target and decides which security agents to invoke. For repositories, this fingerprint is built from file tree/extensions/dependency manifests. For URLs, Pulse first runs a quick pre-scan fingerprint (httpx + whatweb) and then selects relevant web agents. Under weak/uncertain URL signal, guardrails keep a conservative baseline (recon + SQLi + XSS) to avoid under-testing.
  • Parallel tool execution per agent: Each agent node runs its toolset, drops the LLM call if tools produce no output, and accumulates structured findings into shared state. Operational tool errors — missing binaries, timeouts — are logged but never promoted into findings.
  • Live LLM streaming: Every LLM call — planner reasoning, tool interpretation, attack chain construction, and report writing — streams individual tokens to the UI in real time via Server-Sent Events.
  • Attack chain synthesis: After all agents complete, a dedicated node reasons over the full findings set using a MITRE ATT&CK-aligned prompt to produce a causal exploit graph with nodes, directed edges, justifications, and a rendered Mermaid diagram.
  • Structured vulnerability report: A final agent synthesises everything into a Markdown report: executive summary, findings table, detailed per-finding breakdown with evidence and remediation, attack chain narrative, and a risk score out of 10.
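To make the planning step concrete, here is a minimal sketch of how a planner's JSON output might be validated against an allowlist before routing, as described above. The agent names and field shapes are illustrative assumptions, not Pulse's actual schema; the conservative fallback mirrors the recon + SQLi + XSS baseline mentioned in the planning bullet.

```python
# Hypothetical sketch of the planner output contract (names illustrative):
# the LLM emits a JSON plan, which is validated against an allowlist before
# any routing happens. Invalid or empty plans fall back to a safe baseline.
import json

ALLOWED_AGENTS = {
    "recon", "sqli", "xss", "static_c", "static_py",
    "deps_py", "deps_node", "secrets",
}

FALLBACK_PLAN = ["recon", "sqli", "xss"]  # conservative baseline under weak signal

def parse_plan(raw: str) -> list[str]:
    """Parse the planner's JSON and drop any agent not in the allowlist."""
    try:
        plan = json.loads(raw).get("agents_plan", [])
    except (json.JSONDecodeError, AttributeError):
        return FALLBACK_PLAN
    validated = [a for a in plan if a in ALLOWED_AGENTS]
    return validated or FALLBACK_PLAN

print(parse_plan('{"agents_plan": ["recon", "sqli", "made_up_agent"]}'))
# the hallucinated agent is dropped; the valid ones survive
```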

✨ Features

  • LangGraph state machine with 12 registered agent nodes
  • LLM-driven dynamic agent selection for both repository and URL targets
  • Real-time LLM token streaming to the UI via SSE
  • MITRE ATT&CK-aligned multi-step attack chain graph with causal edge validation
  • Live execution pipeline panel with per-agent running/done/queued status
  • Findings sorted by severity with expandable evidence snippets
  • Findings-by-component bar chart and interactive attack chain visualisation
  • Exportable Markdown vulnerability report
  • Swappable LLM backend: Ollama (local/offline), OpenAI, or Anthropic Claude
  • Fully Dockerised backend — no security tools installed on the host machine
  • Hot-reload of backend Python code via Docker volume mounts
  • OWASP Juice Shop bundled as a ready-to-scan test target
  • Dual target mode: web URLs and source code repositories, with distinct planning pipelines for each
  • Full evidence drawer: click any finding to open a slide-in panel with untruncated raw tool output

  • Port map visualisation: nmap results rendered as a scannable table with risk-tiered colour coding

🧰 Tools our agents use

🌐 Recon
httpx probes live HTTP endpoints for status codes, titles, redirect chains, and detected technologies. nmap performs a real TCP port scan across 10,000 ports with service and version detection. whatweb fingerprints server software, frameworks, and CMS versions.

💉 SQL Injection
sqlmap fires real injection payloads against discovered forms and endpoints in batch detection mode. If a parameter is injectable, sqlmap finds it — these are confirmed exploitable injections, not theoretical warnings.

🎯 XSS
dalfox injects real reflected XSS payloads using its built-in payload library against every discovered input surface. Findings are confirmed hits against the live target.

🔬 Static Analysis
semgrep runs its full auto-detect ruleset across the repository source. bandit applies Python-specific security linting. cppcheck checks C/C++ code for memory safety violations and undefined behaviour.

📦 Dependency Auditing
pip-audit queries the OSV database for CVEs in Python packages. npm audit queries the npm advisory registry for Node.js packages. Only agents relevant to the detected stack are selected — a pure-Python repo never runs npm audit.

🔐 Secrets Scanning
trufflehog scans files and commit history for high-entropy strings and known credential patterns. detect-secrets runs a second independent pass with its own pattern matcher. Two tools, one pass, no single point of failure.

⛓️ Attack Chain Synthesis
After all agents complete, a dedicated reasoning node receives the full confirmed findings set and constructs a causal exploit graph aligned to MITRE ATT&CK. A "no phantom edges" rule is enforced — an edge from A → B is only drawn if exploiting A is a necessary prerequisite for B. Output includes typed nodes, justified directed edges, a plain-English narrative, and a rendered Mermaid diagram.
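A tiny sketch of how a "no phantom edges" check could work. The field names (`src`, `dst`, `why`) and function are our illustrative inventions, not Pulse's internal API: an edge survives only if both endpoints are real findings and it carries a justification.

```python
# Illustrative "no phantom edges" filter (names are assumptions): drop any
# edge whose source or destination is not a confirmed finding, or that has
# no justification attached.
def validate_chain(nodes: list[str], edges: list[dict]) -> list[dict]:
    known = set(nodes)
    return [
        e for e in edges
        if e.get("src") in known and e.get("dst") in known and e.get("why")
    ]

nodes = ["sqli_login", "db_dump"]
edges = [
    {"src": "sqli_login", "dst": "db_dump", "why": "SQLi grants read access"},
    {"src": "xss_search", "dst": "db_dump", "why": "unsupported"},  # phantom source
]
print(validate_chain(nodes, edges))  # keeps only the grounded, justified edge
```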

📄 Report Generation
A final agent synthesises everything into a structured Markdown report: executive summary, findings table sorted by severity, per-finding breakdown with evidence and remediation, the full attack chain narrative, and a risk score out of 10.


Every finding Pulse reports was produced by an industry-standard offensive tool executing against the real target. sqlmap found the injection. dalfox confirmed the reflection. trufflehog found the secret. The LLM interprets and synthesises — the ground truth comes from the tools.

🛠️ How we built it

Pulse is split into a Python backend and a Next.js frontend, connected by a FastAPI REST + SSE API.

Orchestration layer — LangGraph
The core of Pulse is a compiled StateGraph. All nodes share a single GraphState (Pydantic model) that accumulates findings, the attack chain, and the report as execution progresses. Routing between nodes is fully deterministic — driven by the agents_plan list the planner emits — using conditional edges that advance through the plan without any further LLM routing decisions.
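The deterministic routing idea can be sketched in plain Python standing in for LangGraph's conditional edges (the state fields and node names here are illustrative, not Pulse's exact code): once agents_plan is fixed, the next node is a pure function of state.

```python
# Plain-Python stand-in for LangGraph conditional-edge routing: the plan is
# decided once by the planner, then execution order is fully deterministic.
from dataclasses import dataclass, field

@dataclass
class GraphState:
    agents_plan: list[str] = field(default_factory=list)
    cursor: int = 0
    findings: list[dict] = field(default_factory=list)

def route_next(state: GraphState) -> str:
    """Conditional-edge logic: advance through the plan, then synthesise."""
    if state.cursor < len(state.agents_plan):
        return state.agents_plan[state.cursor]
    return "attack_chain"  # terminal synthesis node

state = GraphState(agents_plan=["recon", "sqli", "xss"])
order = []
while (nxt := route_next(state)) != "attack_chain":
    order.append(nxt)
    state.cursor += 1
print(order)  # the plan executes in order, with no LLM call needed to route
```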

Planner
For repository targets the planner walks the file tree (up to 1,000 files), builds a compact fingerprint of root files and per-directory extension counts, and passes it to the LLM with a strict JSON schema prompt. For URL targets it first performs a fast pre-scan fingerprint (httpx + whatweb) and then asks the LLM to pick relevant web agents, with conservative guardrails when signal quality is weak. Planner responses are parsed with brace-depth JSON extraction, validated against allowlists, and retried once with a JSON-repair prompt before a safe fallback is used.

Tool layer
Each tool is a @tool-decorated function wrapping a subprocess.run call. Tools return a typed dict with results, total, and error. Every agent node calls _has_real_output() before passing anything to the LLM — operational errors never become findings.
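A hedged sketch of that gating check (the real _has_real_output() may differ in detail): an LLM interpretation call only happens when a tool actually ran and produced results, so errors like a missing binary never reach the interpreter.

```python
# Sketch of the tool-output gate: operational failures short-circuit before
# any LLM call, so they can never be promoted into findings.
def has_real_output(tool_result: dict) -> bool:
    """True only if the tool ran cleanly and produced at least one result."""
    if tool_result.get("error"):          # missing binary, timeout, crash
        return False
    return tool_result.get("total", 0) > 0

ok = {"results": [{"port": 80, "service": "http"}], "total": 1, "error": None}
bad = {"results": [], "total": 0, "error": "nmap: command not found"}
print(has_real_output(ok), has_real_output(bad))  # True False
```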

Tools in use: httpx, nmap, whatweb, sqlmap, dalfox, cppcheck, semgrep, bandit, pip-audit, npm audit, trufflehog, detect-secrets.

LLM interpretation
Each agent node calls a shared _llm_interpret() helper with the combined raw tool output. The LLM returns a JSON array of findings (severity, title, description, evidence, remediation, component). A markdown-fence stripper and brace-depth JSON extractor handle minor formatting deviations before parsing.
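The fence stripper might look something like this (an assumed shape, not Pulse's exact helper): small local models often wrap their JSON in a fenced code block, which must be peeled off before parsing.

```python
# Sketch of a markdown-fence stripper. The triple backtick is built up from
# single characters only to keep this example embeddable in markdown.
import re

FENCE = "`" * 3  # literal ``` delimiter

def strip_fences(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text."""
    pattern = FENCE + r"(?:\w+)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

raw = FENCE + 'json\n[{"severity": "high", "title": "SQLi in /login"}]\n' + FENCE
print(strip_fences(raw))
```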

Streaming
ScanStreamCallback (a LangChain callback) intercepts tokens on every LLM call and appends them — prefixed with null-byte sentinels to delimit agent blocks — to scan.llm_log. The frontend consumes this via an EventSource on the SSE endpoint, re-parsing the sentinel structure to render per-agent reasoning blocks with a live cursor animation.
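An illustrative reconstruction of the sentinel scheme (the exact framing is an assumption): each agent's block opens with a null-byte-delimited agent name, so the log can be split back into (agent, text) pairs on the other side of the SSE stream.

```python
# Assumed sentinel framing: \x00agent\x00tokens... per block. Splitting on
# the sentinel yields alternating agent names and token runs.
SENT = "\x00"

def append_block(log: str, agent: str, tokens: str) -> str:
    """Open a sentinel-framed block for one agent's streamed tokens."""
    return log + f"{SENT}{agent}{SENT}{tokens}"

def parse_blocks(log: str) -> list[tuple[str, str]]:
    """Reconstruct (agent, text) pairs from the sentinel-framed log."""
    parts = log.split(SENT)[1:]           # drop the empty prefix
    return list(zip(parts[0::2], parts[1::2]))

log = ""
log = append_block(log, "planner", "Reading fingerprint...")
log = append_block(log, "recon", "Probing endpoints...")
print(parse_blocks(log))
```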

Frontend
Next.js 15 app router, Tailwind CSS, shadcn/ui components, Motion for animations, ReactFlow for the attack chain graph, Recharts for the findings bar chart, and ReactMarkdown with remark-gfm for the vulnerability report.


🚧 Challenges we ran into

  • LLM output reliability: Smaller local models (llama3.1:8b on Ollama) struggle to emit clean JSON on the first pass, especially for the planner. We layered three recovery strategies: markdown-fence stripping, brace-depth JSON extraction, and a second-pass JSON-repair LLM call, before falling back to a broad safe plan.
  • Tool error vs. real findings: Security tools produce a lot of noise — installation errors, timeouts, warnings on stderr — that an LLM can mistake for vulnerabilities. We built _has_real_output() to gate every LLM call and added explicit false-positive suppression rules in the interpreter system prompt.
  • Shared accumulating state across nodes: LangGraph's state passing meant we had to be deliberate about how findings grow. Each node returns {"findings": state.findings + new_findings} to extend, not overwrite, the list as the graph progresses.
  • Docker networking for local repo scanning: The backend runs inside Linux containers but needs to scan folders on the macOS host. The dev compose mounts /tmp and /Users read-only into the container, and the planner validates path accessibility before attempting a scan.
  • Streaming across a thread boundary: Piping per-token callbacks through SSE to the frontend while the graph blocks inside asyncio.to_thread() required careful sentinel design so the frontend can reconstruct which agent emitted which tokens without a race condition.
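The accumulate-don't-overwrite pattern from the state-sharing challenge above can be shown in a few lines (the state shape is illustrative): each node returns an extended findings list so the graph merges, rather than clobbers, prior work.

```python
# Tiny sketch of accumulating shared state across nodes: every node update
# extends the findings list instead of replacing it.
class State:
    def __init__(self):
        self.findings: list[dict] = []

def node_update(state: State, new_findings: list[dict]) -> dict:
    # Extend, don't overwrite: prior nodes' findings must survive.
    return {"findings": state.findings + new_findings}

state = State()
state.findings = node_update(state, [{"title": "Open port 22"}])["findings"]
state.findings = node_update(state, [{"title": "Reflected XSS"}])["findings"]
print(len(state.findings))  # 2 — both nodes' findings survive
```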

🏆 Accomplishments we're proud of

  • Entirely local by default: Pulse runs the full pipeline — LLM, all security tools, and the frontend — without sending a single byte to the cloud, using Ollama as the default backend.
  • Genuinely dynamic agent selection: The planner doesn't run every tool blindly. A pure-C repo will never waste time on pip-audit. A JavaScript project won't get cppcheck. The LLM reads file evidence and the routing plan reflects it, with hallucination grounding to prevent the model from claiming languages it didn't actually see.
  • Attack chain with real causal reasoning: The attack chain node applies MITRE ATT&CK categories, enforces an explicit "no phantom edges" rule (an edge from A → B is only valid if exploiting A is a necessary prerequisite for B), and produces a plain-English narrative alongside a rendered graph.
  • Zero-latency-feel streaming: Watching the LLM reason through tool output token-by-token in the Agent Reasoning Console while the scan is still in progress makes the system feel genuinely live in a way that polling-based approaches can't match.

📚 What we learned

  • Prompt grounding is load-bearing. An LLM told to summarise a repo's architecture will confidently invent languages it didn't see. Adding explicit grounding rules — cross-checking LLM output against the actual file-tree signals — cut hallucinated architecture summaries to near zero.
  • The gap between "agentic" and "multi-agent" is real. Pulse is one orchestrator with specialist stages, not independent agents with their own memory and goals. That constraint makes it more predictable and auditable, but true multi-agent delegation would unlock significantly deeper exploitation reasoning.
  • Tool-use reliability requires defensive engineering at every layer. FileNotFoundError, subprocess timeout, malformed JSON, LLM false positives, HTML-entity escaping in cppcheck XML — each required its own guard. Security tooling is not clean.
  • LangGraph is well-suited to sequential agentic pipelines. Compiled state machines with conditional edges give you reproducible execution order, clean state passing, and out-of-the-box traceability — all things that matter when a scan takes several minutes and must not silently skip a step.

🚀 Future work

  • Parallel agent execution: The current pipeline is sequential. Independent agents (e.g., static_c and deps_py on a repo) have no data dependency and could run concurrently using LangGraph's parallel branches, cutting total scan time substantially.
  • Iterative exploitation loops: After the attack chain is built, dispatch targeted follow-up agents to probe the most critical link. For example, if SQLi is identified, automatically attempt schema extraction and report what was accessible.
  • Authenticated scan support: Add session cookie / API key injection so agents can scan authenticated routes and internal API surfaces, not just public-facing endpoints.
  • CVE enrichment: Cross-reference dependency findings against the NVD and GitHub advisories API to pull real CVSS scores, PoC links, and patch versions directly into findings.
  • Persistent scan history: Replace the in-memory scan store with a database so scans survive backend restarts and users can compare reports across time.
  • CI/CD integration: Expose a scan-trigger API and publish a GitHub Action so Pulse can run on every pull request and block merges when critical-severity findings are introduced.

🧱 Stack

  • Frontend: Next.js 15, Tailwind CSS, shadcn/ui, Motion, ReactFlow, Recharts
  • Backend API: FastAPI, Python 3.13, uv
  • Orchestration: LangGraph, LangChain
  • LLM (default): Ollama — llama3.1:8b (local, offline)
  • LLM (optional): OpenAI GPT-4o, Anthropic Claude Sonnet
  • Security tools: httpx, nmap, whatweb, sqlmap, dalfox, cppcheck, semgrep, bandit, pip-audit, npm audit, trufflehog, detect-secrets
  • Streaming: Server-Sent Events via sse-starlette
  • Containerisation: Docker, Docker Compose
  • Test target: OWASP Juice Shop
