💡 Inspiration
Penetration testing costs $15,000–$40,000 per engagement and takes weeks, because every step requires a skilled human: choosing the right tools, running them, parsing dense terminal output, separating real findings from noise, and writing a coherent report. Meanwhile, most software teams ship code that has never been properly tested.
We wanted to find out: can an LLM-orchestrated agentic pipeline close that gap? Not a wrapper around a single tool, but a genuine end-to-end pipeline that plans its own approach, runs 12 industry-standard security tools, interprets the results, constructs a causal exploit graph, and delivers a professional vulnerability report, all without a human in the chair.
The constraint we set: the entire system must run offline (including the LLMs) on ARM architecture. When you're scanning a target, you're mapping its attack surface, and that data can't go to a cloud API. Pulse defaults to Ollama with llama3.1:8b, meaning zero bytes leave your machine.
With Pulse, revive your application before it becomes a headline!
⚙️ What it does
Pulse accepts a URL or a local repository path and runs a fully automated penetration testing pipeline end-to-end:
- Dynamic planning: An LLM reads a structural fingerprint of the target and decides which security agents to invoke. For repositories, this fingerprint is built from file tree/extensions/dependency manifests. For URLs, Pulse first runs a quick pre-scan fingerprint (httpx + whatweb) and then selects relevant web agents. Under weak/uncertain URL signal, guardrails keep a conservative baseline (recon + SQLi + XSS) to avoid under-testing.
- Parallel tool execution per agent: Each agent node runs its toolset, drops the LLM call if tools produce no output, and accumulates structured findings into shared state. Operational tool errors — missing binaries, timeouts — are logged but never promoted into findings.
- Live LLM streaming: Every LLM call — planner reasoning, tool interpretation, attack chain construction, and report writing — streams individual tokens to the UI in real time via Server-Sent Events.
- Attack chain synthesis: After all agents complete, a dedicated node reasons over the full findings set using a MITRE ATT&CK-aligned prompt to produce a causal exploit graph with nodes, directed edges, justifications, and a rendered Mermaid diagram.
- Structured vulnerability report: A final agent synthesises everything into a Markdown report: executive summary, findings table, detailed per-finding breakdown with evidence and remediation, attack chain narrative, and a risk score out of 10.
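The repository fingerprint the planner reads can be sketched as a small walk over the file tree; this is a minimal illustration, and the actual field names and cap used in Pulse are assumptions here:

```python
from collections import Counter
from pathlib import Path

def build_fingerprint(repo: str, max_files: int = 1000) -> dict:
    """Compact structural fingerprint: root-level files plus
    per-extension file counts, capped at max_files."""
    root = Path(repo)
    root_files = sorted(p.name for p in root.iterdir() if p.is_file())
    ext_counts: Counter = Counter()
    seen = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        ext_counts[path.suffix or "<none>"] += 1  # group extensionless files
        seen += 1
        if seen >= max_files:
            break
    return {"root_files": root_files, "extensions": dict(ext_counts)}
```

A fingerprint like `{"root_files": ["package.json"], "extensions": {".ts": 240}}` gives the planner enough signal to select JavaScript-stack agents without reading any file contents.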
✨ Features
- LangGraph state machine with 12 registered agent nodes
- LLM-driven dynamic agent selection for both repository and URL targets
- Real-time LLM token streaming to the UI via SSE
- MITRE ATT&CK-aligned multi-step attack chain graph with causal edge validation
- Live execution pipeline panel with per-agent running/done/queued status
- Findings sorted by severity with expandable evidence snippets
- Findings-by-component bar chart and interactive attack chain visualisation
- Exportable Markdown vulnerability report
- Swappable LLM backend: Ollama (local/offline), OpenAI, or Anthropic Claude
- Fully Dockerised backend — no security tools installed on the host machine
- Hot-reload of backend Python code via Docker volume mounts
- OWASP Juice Shop bundled as a ready-to-scan test target
- Dual target mode: web URLs and source code repositories, with distinct planning pipelines for each
- Full evidence drawer: click any finding to open a slide-in panel with untruncated raw tool output
- Port map visualisation: nmap results rendered as a scannable table with risk-tiered colour coding
Tools our Agents Use
🌐 Recon
httpx probes live HTTP endpoints for status codes, titles, redirect chains, and detected technologies. nmap performs a real TCP port scan across 10,000 ports with service and version detection. whatweb fingerprints server software, frameworks, and CMS versions.
💉 SQL Injection
sqlmap fires real injection payloads against discovered forms and endpoints in batch detection mode. If a parameter is injectable, sqlmap finds it — these are confirmed exploitable injections, not theoretical warnings.
🎯 XSS
dalfox injects real reflected XSS payloads using its built-in payload library against every discovered input surface. Findings are confirmed hits against the live target.
🔬 Static Analysis
semgrep runs its full auto-detect ruleset across the repository source. bandit applies Python-specific security linting. cppcheck checks C/C++ code for memory safety violations and undefined behaviour.
📦 Dependency Auditing
pip-audit queries the OSV database for CVEs in Python packages. npm audit queries the npm advisory registry for Node.js packages. Only agents relevant to the detected stack are selected — a pure-Python repo never runs npm audit.
🔐 Secrets Scanning
trufflehog scans files and commit history for high-entropy strings and known credential patterns. detect-secrets runs a second, independent pass with its own pattern matcher. Two tools, two passes, no single point of failure.
⛓️ Attack Chain Synthesis
After all agents complete, a dedicated reasoning node receives the full confirmed findings set and constructs a causal exploit graph aligned to MITRE ATT&CK. A "no phantom edges" rule is enforced — an edge from A → B is only drawn if exploiting A is a necessary prerequisite for B. Output includes typed nodes, justified directed edges, a plain-English narrative, and a rendered Mermaid diagram.
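The "no phantom edges" rule can also be enforced mechanically before rendering; a minimal sketch, assuming a simple node/edge dict schema (the real field names in Pulse may differ):

```python
def validate_edges(nodes: list[dict], edges: list[dict]) -> list[dict]:
    """Drop phantom edges: keep an A -> B edge only when both endpoints
    are real finding nodes and the edge carries a non-empty justification
    for the prerequisite claim."""
    known = {n["id"] for n in nodes}
    return [
        e for e in edges
        if e["source"] in known
        and e["target"] in known
        and e.get("justification", "").strip()
    ]
```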
📄 Report Generation
A final agent synthesises everything into a structured Markdown report: executive summary, findings table sorted by severity, per-finding breakdown with evidence and remediation, the full attack chain narrative, and a risk score out of 10.
Every finding Pulse reports was produced by an industry-standard offensive tool executing against the real target. sqlmap found the injection. dalfox confirmed the reflection. trufflehog found the secret. The LLM interprets and synthesises — the ground truth comes from the tools.
🛠️ How we built it
Pulse is split into a Python backend and a Next.js frontend, connected by a FastAPI REST + SSE API.
Orchestration layer — LangGraph
The core of Pulse is a compiled StateGraph. All nodes share a single GraphState (Pydantic model) that accumulates findings, the attack chain, and the report as execution progresses. Routing between nodes is fully deterministic — driven by the agents_plan list the planner emits — using conditional edges that advance through the plan without any further LLM routing decisions.
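The deterministic routing described above can be sketched without the LangGraph API as a pure function over shared state; names here are illustrative, not Pulse's actual identifiers:

```python
END = "END"

def route_next(state: dict) -> str:
    """Conditional-edge router: advance deterministically through the
    planner's agents_plan; no LLM is consulted after planning."""
    completed = set(state.get("completed", []))
    for agent in state.get("agents_plan", []):
        if agent not in completed:
            return agent
    return END
```

In LangGraph terms, a function like this would be attached via `add_conditional_edges`, so execution order is fully reproducible from the plan alone.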
Planner
For repository targets the planner walks the file tree (up to 1,000 files), builds a compact fingerprint of root files and per-directory extension counts, and passes it to the LLM with a strict JSON schema prompt. For URL targets it first performs a fast pre-scan fingerprint (httpx + whatweb) and then asks the LLM to pick relevant web agents, with conservative guardrails when signal quality is weak. Planner responses are parsed with brace-depth JSON extraction, validated against allowlists, and retried once with a JSON-repair prompt before a safe fallback is used.
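Brace-depth extraction pulls the first balanced JSON object out of chatty model output. A minimal sketch that also tracks string literals, so braces inside quoted values don't confuse the depth counter:

```python
import json

def extract_json(text: str):
    """Return the first balanced {...} object in text, parsed,
    or None if nothing parseable is found."""
    start = text.find("{")
    while start != -1:
        depth, in_str, esc = 0, False, False
        for i in range(start, len(text)):
            ch = text[i]
            if in_str:
                if esc:
                    esc = False
                elif ch == "\\":
                    esc = True
                elif ch == '"':
                    in_str = False
            elif ch == '"':
                in_str = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next "{"
        start = text.find("{", start + 1)
    return None
```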
Tool layer
Each tool is a `@tool`-decorated function wrapping a `subprocess.run` call. Tools return a typed dict with `results`, `total`, and `error`. Every agent node calls `_has_real_output()` before passing anything to the LLM — operational errors never become findings.
Tools in use: httpx, nmap, whatweb, sqlmap, dalfox, cppcheck, semgrep, bandit, pip-audit, npm audit, trufflehog, detect-secrets.
LLM interpretation
Each agent node calls a shared _llm_interpret() helper with the combined raw tool output. The LLM returns a JSON array of findings (severity, title, description, evidence, remediation, component). A markdown-fence stripper and brace-depth JSON extractor handle minor formatting deviations before parsing.
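The markdown-fence stripper is a small pre-parse step; a sketch of it plus the findings parse (helper names here are illustrative):

```python
import json
import re

def strip_fences(text: str) -> str:
    """Remove a surrounding ```json ... ``` (or bare ```) fence, if present."""
    m = re.search(r"```(?:json)?\s*(.*?)\s*```", text, flags=re.DOTALL)
    return m.group(1) if m else text.strip()

def parse_findings(raw: str) -> list:
    """Parse the LLM's JSON array of findings, tolerating a fenced reply."""
    data = json.loads(strip_fences(raw))
    return data if isinstance(data, list) else []
```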
Streaming
ScanStreamCallback (a LangChain callback) intercepts tokens on every LLM call and appends them — prefixed with null-byte sentinels to delimit agent blocks — to scan.llm_log. The frontend consumes this via an EventSource on the SSE endpoint, re-parsing the sentinel structure to render per-agent reasoning blocks with a live cursor animation.
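The frontend's reconstruction of per-agent blocks can be sketched as a split on the sentinel; the exact sentinel format (a null byte, then the agent name, then a newline) is an assumption for illustration:

```python
SENTINEL = "\x00"

def parse_llm_log(log: str) -> list:
    """Split the accumulated token log into (agent, text) blocks,
    assuming each block starts with '\\x00<agent>\\n'."""
    blocks = []
    for chunk in log.split(SENTINEL):
        if not chunk:
            continue  # leading empty segment before the first sentinel
        header, _, body = chunk.partition("\n")
        blocks.append((header, body))
    return blocks
```

Because the sentinel never appears in model output, re-parsing the whole log on every SSE event is race-free even while tokens are still arriving.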
Frontend
Next.js 15 app router, Tailwind CSS, shadcn/ui components, Motion for animations, ReactFlow for the attack chain graph, Recharts for the findings bar chart, and ReactMarkdown with remark-gfm for the vulnerability report.
🚧 Challenges we ran into
- LLM output reliability: Smaller local models (llama3.1:8b on Ollama) struggle to emit clean JSON on the first pass, especially for the planner. We layered three recovery strategies: markdown-fence stripping, brace-depth JSON extraction, and a second-pass JSON-repair LLM call, before falling back to a broad safe plan.
- Tool error vs. real findings: Security tools produce a lot of noise — installation errors, timeouts, warnings on stderr — that an LLM can mistake for vulnerabilities. We built `_has_real_output()` to gate every LLM call and added explicit false-positive suppression rules in the interpreter system prompt.
- Shared accumulating state across nodes: LangGraph's state passing meant we had to be deliberate about how findings grow. Each node returns `{"findings": state.findings + new_findings}` to extend, not overwrite, the list as the graph progresses.
- Docker networking for local repo scanning: The backend runs inside Linux containers but needs to scan folders on the macOS host. The dev compose mounts `/tmp` and `/Users` read-only into the container, and the planner validates path accessibility before attempting a scan.
- Streaming across a thread boundary: Piping per-token callbacks through SSE to the frontend while the graph blocks inside `asyncio.to_thread()` required careful sentinel design so the frontend can reconstruct which agent emitted which tokens without a race condition.
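The extend-not-overwrite pattern for shared state can be made concrete with a small sketch (a plain dataclass stands in for the Pydantic GraphState; the node body is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class GraphState:
    findings: list = field(default_factory=list)

def sqli_node(state: GraphState) -> dict:
    """Return a partial state update that EXTENDS the shared findings
    list; returning only the new findings would silently drop the
    results of every earlier agent."""
    new_findings = [{"severity": "critical", "title": "SQL injection"}]
    return {"findings": state.findings + new_findings}
```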
🏆 Accomplishments we're proud of
- Entirely local by default: Pulse runs the full pipeline — LLM, all security tools, and the frontend — without sending a single byte to the cloud, using Ollama as the default backend.
- Genuinely dynamic agent selection: The planner doesn't run every tool blindly. A pure-C repo will never waste time on `pip-audit`. A JavaScript project won't get `cppcheck`. The LLM reads file evidence and the routing plan reflects it, with hallucination grounding to prevent the model from claiming languages it didn't actually see.
- Attack chain with real causal reasoning: The attack chain node applies MITRE ATT&CK categories, enforces an explicit "no phantom edges" rule (an edge from A → B is only valid if exploiting A is a necessary prerequisite for B), and produces a plain-English narrative alongside a rendered graph.
- Zero-latency-feel streaming: Watching the LLM reason through tool output token-by-token in the Agent Reasoning Console while the scan is still in progress makes the system feel genuinely live in a way that polling-based approaches can't match.
📚 What we learned
- Prompt grounding is load-bearing. An LLM told to summarise a repo's architecture will confidently invent languages it didn't see. Adding explicit grounding rules — cross-checking LLM output against the actual file-tree signals — cut hallucinated architecture summaries to near zero.
- The gap between "agentic" and "multi-agent" is real. Pulse is one orchestrator with specialist stages, not independent agents with their own memory and goals. That constraint makes it more predictable and auditable, but true multi-agent delegation would unlock significantly deeper exploitation reasoning.
- Tool-use reliability requires defensive engineering at every layer. FileNotFoundError, subprocess timeout, malformed JSON, LLM false positives, HTML-entity escaping in cppcheck XML — each required its own guard. Security tooling is not clean.
- LangGraph is well-suited to sequential agentic pipelines. Compiled state machines with conditional edges give you reproducible execution order, clean state passing, and out-of-the-box traceability — all things that matter when a scan takes several minutes and must not silently skip a step.
🚀 Future work
- Parallel agent execution: The current pipeline is sequential. Independent agents (e.g., `static_c` and `deps_py` on a repo) have no data dependency and could run concurrently using LangGraph's parallel branches, halving scan time.
- Iterative exploitation loops: After the attack chain is built, dispatch targeted follow-up agents to probe the most critical link. For example, if SQLi is identified, automatically attempt schema extraction and report what was accessible.
- Authenticated scan support: Add session cookie / API key injection so agents can scan authenticated routes and internal API surfaces, not just public-facing endpoints.
- CVE enrichment: Cross-reference dependency findings against the NVD and GitHub advisories API to pull real CVSS scores, PoC links, and patch versions directly into findings.
- Persistent scan history: Replace the in-memory scan store with a database so scans survive backend restarts and users can compare reports across time.
- CI/CD integration: Expose a scan-trigger API and publish a GitHub Action so Pulse can run on every pull request and block merges when critical-severity findings are introduced.
🧱 Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 15, Tailwind CSS, shadcn/ui, Motion, ReactFlow, Recharts |
| Backend API | FastAPI, Python 3.13, uv |
| Orchestration | LangGraph, LangChain |
| LLM (default) | Ollama — llama3.1:8b (local, offline) |
| LLM (optional) | OpenAI GPT-4o, Anthropic Claude Sonnet |
| Security tools | httpx, nmap, whatweb, sqlmap, dalfox, cppcheck, semgrep, bandit, pip-audit, npm audit, trufflehog, detect-secrets |
| Streaming | Server-Sent Events via sse-starlette |
| Containerisation | Docker, Docker Compose |
| Test target | OWASP Juice Shop |

