Feature: Web Application Penetration Testing Skill — Reconnaissance, Exploitation, and Proof-Based Reporting (inspired by Shannon) #400

@teknium1

Overview

Shannon is an autonomous AI penetration testing agent by Keygraph that achieved 96.15% on the XBOW benchmark. Its defining philosophy is "No Exploit, No Report" — it only reports vulnerabilities it can actually exploit and reproduce against a running application, eliminating false positives entirely. It uses a 13-agent, 5-phase pipeline (pre-recon → recon → vulnerability analysis → exploitation → reporting) orchestrated via Temporal.io, with Claude Code as the AI engine and Playwright for browser automation.

The existing RAPTOR-inspired cybersecurity issues (#382 Code Security Audit, #383 Binary Security Analysis, #384 OSS Security Forensics) cover offline analysis — scanning source code, analyzing binaries, and investigating supply chain attacks. None of them address the critical capability of testing a running web application for exploitable vulnerabilities. Shannon fills this gap with active reconnaissance, live exploitation, and proof-based reporting.

This issue proposes a Web Application Penetration Testing skill that adapts Shannon's methodology and pipeline into Hermes Agent, using existing tools (terminal for nmap/whatweb/curl, Hermes browser tool or Playwright MCP for web interaction, delegate_task for parallel vulnerability agents) to perform structured penetration testing against web applications in sandboxed environments.

License note: Shannon is AGPL-3.0 — we cannot use its code directly. This proposal adapts Shannon's concepts, methodology, and pipeline structure as a fresh implementation. All code would be written from scratch.


Research Findings

How Shannon's Pentesting Pipeline Works

Phase 1: Pre-Recon (Source Code Analysis)
A single agent with source code access maps the application architecture:

  • Framework identification and routing patterns
  • Authentication flow mapping (session, JWT, OAuth)
  • Dangerous sink inventory (SQL queries, OS commands, template renders, file operations)
  • Input validation and sanitization audit
  • Trust boundary identification
  • Output: code_analysis_deliverable.md consumed by all downstream agents

Phase 2: Recon (Attack Surface Mapping)
A single agent interacts with the live application:

  • Network scanning via nmap (ports, services)
  • Subdomain enumeration via subfinder
  • Technology fingerprinting via whatweb
  • API endpoint discovery via Playwright browser interaction
  • Correlation between source code analysis and live behavior
  • Authentication testing (login flows, session management)
  • Output: recon_deliverable.md

Phase 3: Vulnerability Analysis (5 parallel agents)
Each agent specializes in one vulnerability class:

| Agent | Scope | Methodology |
| --- | --- | --- |
| injection-vuln | SQLi, Command Injection, Path Traversal, SSTI, LFI/RFI, Deserialization | Slot-type classification (SQL-val, CMD-argument, etc.) with required-defense mapping |
| xss-vuln | Reflected, Stored, DOM-based XSS | Render-context analysis (HTML_BODY, JAVASCRIPT_STRING, etc.) with encoding requirements |
| auth-vuln | Login bypass, JWT confusion, token replay, OAuth, password reset, brute force | 9-point authentication checklist |
| authz-vuln | Horizontal/vertical privilege escalation, IDOR, business logic | Role-based access matrix analysis |
| ssrf-vuln | Internal service access, cloud metadata, protocol smuggling | 4 SSRF type classification with escalation paths |

Each agent produces an exploitation_queue.json — a structured list of findings with: ID, vulnerability type, source location, path, sink, sanitization status, verdict, witness payload, and confidence level.
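
To make the handoff format concrete, here is a minimal sketch of one queue entry and its JSON round trip. The field names are a direct transliteration of the list above, but the exact keys, the top-level list shape, and the example values are assumptions for illustration, not Shannon's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class QueueEntry:
    """One finding handed from an analysis agent to its exploitation agent."""
    id: str
    vuln_type: str
    source_location: str
    path: str
    sink: str
    sanitization_status: str
    verdict: str
    witness_payload: str
    confidence: str


def dump_queue(entries: List[QueueEntry]) -> str:
    """Serialize entries as a JSON list (assumed layout for exploitation_queue.json)."""
    return json.dumps([asdict(e) for e in entries], indent=2)


def load_queue(text: str) -> List[QueueEntry]:
    """Parse the JSON list back into entries; unknown keys raise TypeError."""
    return [QueueEntry(**item) for item in json.loads(text)]


# Hypothetical example values, not taken from a real assessment.
example = QueueEntry(
    id="INJ-001",
    vuln_type="SQLi",
    source_location="routes/search.js:42",
    path="/rest/products/search?q=",
    sink="db.query",
    sanitization_status="none",
    verdict="vulnerable",
    witness_payload="' OR 1=1--",
    confidence="high",
)
```

Keeping the entry a flat record of strings means any downstream agent can read it without sharing code with the agent that wrote it.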

Phase 4: Exploitation (5 parallel agents, conditional)
Only runs for vulnerability classes where the analysis phase found actionable findings:

  • Attempts actual exploitation against the live application
  • Follows a 4-level proof system:
    • Level 1: Identified (pattern found)
    • Level 2: Partial (data flow confirmed)
    • Level 3: Confirmed (payload delivered successfully)
    • Level 4: Critical Impact (data extracted, code executed, access gained)
  • Uses a Bypass Exhaustion Protocol: must attempt multiple distinct bypass techniques before classifying as false positive
  • Output: exploitation_evidence.md with full PoC payloads and reproduction steps
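
The proof levels and the bypass exhaustion rule can be sketched as a small decision function. The verdict names follow the classification used in this issue (EXPLOITED, POTENTIAL, FALSE POSITIVE). The threshold of three distinct bypass techniques and the `INCOMPLETE` fallback are assumptions, since the issue only says "multiple distinct" attempts are required.

```python
from enum import IntEnum
from typing import Iterable


class ProofLevel(IntEnum):
    IDENTIFIED = 1       # pattern found
    PARTIAL = 2          # data flow confirmed
    CONFIRMED = 3        # payload delivered successfully
    CRITICAL_IMPACT = 4  # data extracted, code executed, access gained


# Assumption: the issue does not give a number for "multiple distinct" bypasses.
MIN_DISTINCT_BYPASSES = 3


def bypasses_exhausted(attempted: Iterable[str],
                       minimum: int = MIN_DISTINCT_BYPASSES) -> bool:
    """True once enough distinct technique names have been tried."""
    distinct = {name.strip().lower() for name in attempted if name.strip()}
    return len(distinct) >= minimum


def classify(level: ProofLevel, blocked_externally: bool,
             attempted: Iterable[str]) -> str:
    """Map evidence to a verdict; INCOMPLETE is a hypothetical 'keep testing' state."""
    if level >= ProofLevel.CONFIRMED:
        return "EXPLOITED"
    if blocked_externally:  # e.g. WAF or rate limit, not a code-level control
        return "POTENTIAL"
    if bypasses_exhausted(attempted):
        return "FALSE POSITIVE"
    return "INCOMPLETE"
```

The ordering matters: a finding is only dismissed as a false positive after both the exploited and externally-blocked cases have been ruled out and the bypass budget has been spent.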

Phase 5: Reporting (1 agent)
Assembles all evidence into a comprehensive security assessment with executive summary, proven exploits only, CVSS scoring, and remediation recommendations.
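
The five phases hand deliverables forward, and each phase reads everything produced before it. A minimal sketch of that chain, using the deliverable file names mentioned above, might look like this. The phase identifiers and the `inputs_for` helper are hypothetical, and the reporting phase has no outputs listed because the issue does not name its file.

```python
from typing import List, Tuple

# (phase name, deliverables it produces), in pipeline order.
PIPELINE: List[Tuple[str, List[str]]] = [
    ("pre-recon", ["code_analysis_deliverable.md"]),
    ("recon", ["recon_deliverable.md"]),
    ("vulnerability-analysis", ["exploitation_queue.json"]),
    ("exploitation", ["exploitation_evidence.md"]),
    ("reporting", []),  # output file name is not given in the issue
]


def inputs_for(phase: str) -> List[str]:
    """Return every deliverable produced by the phases before `phase`."""
    inputs: List[str] = []
    for name, outputs in PIPELINE:
        if name == phase:
            return inputs
        inputs.extend(outputs)
    raise ValueError(f"unknown phase: {phase}")
```

For example, `inputs_for("exploitation")` yields the code analysis, recon, and queue deliverables, which is the accumulated context an exploitation agent would be given before it sends a single payload.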

Key Prompting Patterns

1. Progressive Context Chain: Each agent reads deliverables from ALL prior phases. Intelligence accumulates — the exploitation agent knows the architecture (pre-recon), attack surface (recon), and specific vulnerabilities (analysis) before attempting exploitation.

2. Backward Taint Analysis: Start at dangerous sinks and trace backward to sources. Terminate a path early once proper sanitization is found on it. This is more efficient than forward analysis in web application contexts.

3. Slot-Type Taxonomy: Each injection point has a specific "slot type" (SQL-val, SQL-ident, CMD-argument, PATH-segment, TEMPLATE-string, DESERIALIZE) with known required defenses. A mismatch between slot type and defense = vulnerability.

4. Todo-Driven Completeness: Every endpoint/input vector gets a todo item. Analysis is INCOMPLETE if any todos remain. Prevents early termination.

5. Scope Enforcement: Every prompt has explicit in-scope/out-of-scope definitions. External attacker perspective only. Network-reachable targets only.

6. Classification Rigor:

  • EXPLOITED = proven with evidence (Level 3-4)
  • POTENTIAL = blocked by external factors (WAF, rate limit), not by code-level security
  • FALSE POSITIVE = security controls verified to work after bypass exhaustion
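
Patterns 2 and 3 combine naturally: walk backward from a sink, and stop a path only when it passes through the defense that the sink's slot type requires. The sketch below assumes a toy data-flow graph given as a dict. The slot-type names come from the list above, but the defense names in `REQUIRED_DEFENSE` and the node names in the example are illustrative assumptions, not Shannon's actual taxonomy.

```python
from typing import Dict, List, Optional

# Assumed mapping; the issue names the slot types but not their defenses.
REQUIRED_DEFENSE: Dict[str, str] = {
    "SQL-val": "parameterized-query",
    "SQL-ident": "identifier-allowlist",
    "CMD-argument": "argument-array-exec",
    "PATH-segment": "path-canonicalization",
    "TEMPLATE-string": "template-autoescape",
    "DESERIALIZE": "type-allowlist",
}


def find_tainted_path(graph: Dict[str, dict], sink: str,
                      slot_type: str) -> Optional[List[str]]:
    """Trace backward from `sink`; return a source-to-sink path lacking the
    required defense, or None if every path is properly sanitized."""
    required = REQUIRED_DEFENSE[slot_type]
    stack: List[List[str]] = [[sink]]
    while stack:
        path = stack.pop()
        node = graph[path[-1]]
        if node.get("defense") == required:
            continue  # early termination: matching defense found on this path
        if node["kind"] == "source":
            return list(reversed(path))
        for pred in node.get("inputs", []):
            if pred not in path:  # skip cycles
                stack.append(path + [pred])
    return None


# Hypothetical graph: HTML encoding is applied, but the sink is a SQL value slot.
graph = {
    "req.query.id": {"kind": "source"},
    "escape_html": {"kind": "transform", "defense": "html-entity-encoding",
                    "inputs": ["req.query.id"]},
    "bind_param": {"kind": "transform", "defense": "parameterized-query",
                   "inputs": ["req.query.id"]},
    "db.query": {"kind": "sink", "inputs": ["escape_html"]},
    "db.safe_query": {"kind": "sink", "inputs": ["bind_param"]},
}
```

Here `find_tainted_path(graph, "db.query", "SQL-val")` returns the path through `escape_html`, because a defense is present but does not match the slot type, while the same call for `db.safe_query` returns `None`.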

Shannon's Results

Demonstrated against real vulnerable applications:

  • OWASP Juice Shop: 20+ critical flaws, full database exfiltration, admin access
  • Checkmarx c{api}tal API: Root-level injection, legacy API bypass, 15 high-severity issues
  • OWASP crAPI: JWT Algorithm Confusion, SQLi credential exfiltration, SSRF chains

Comparison with Existing Issues

Capability #382 Code Audit #383 Binary #384 Forensics This Issue
Source code analysis ✅ Semgrep/CodeQL ✅ LLM-based (for architecture mapping)
Live app testing ✅ Core capability
Exploit verification ✅ Core capability
Binary analysis
Supply chain forensics
Browser automation ✅ Playwright/browser
Network reconnaissance ✅ nmap/subfinder/whatweb
Professional reporting Partial Partial ✅ Pentest-grade

Overlap with #382: Shannon's pre-recon (LLM-based code analysis) overlaps conceptually with #382's source scanning, but the methodology is different — Shannon reads code to understand architecture for exploitation, while #382 scans code for vulnerability patterns. Shannon's transferable patterns (backward taint, slot-type taxonomy, proof system, bypass exhaustion) have been proposed as enhancements to #382.

No overlap with #383 or #384. Binary analysis and forensics are completely separate domains.


Current State in Hermes Agent

What we have:

What we don't have:

  • No web application penetration testing workflow
  • No live exploitation capability
  • No network reconnaissance skill (nmap, subfinder, whatweb)
  • No browser-automated security testing
  • No pentest reporting framework
  • No proof-based vulnerability verification

Implementation Plan

Skill vs. Tool Classification

This should be a skill because:

  • All security tools (nmap, whatweb, subfinder, curl) are CLIs callable via terminal
  • Browser interaction uses existing Hermes tools (browser tool, MCP, or Playwright via terminal)
  • The pentesting methodology is LLM reasoning guided by skill instructions
  • Parallel vulnerability agents use existing delegate_task
  • No custom Python integration needed in the agent harness
  • No API key management for the harness (user provides target URL, the agent attacks it)

Bundled vs. Skills Hub: Recommend Skills Hub. Penetration testing is specialized (security professionals, developers auditing their own apps in sandboxed environments) and requires multiple external tools. Additionally, active exploitation requires clear safety guardrails and ethical use guidance.

Category: security (completing the cybersecurity skill suite alongside #382, #383, #384)

Safety Guardrails (CRITICAL)

Unlike #382-384, which analyze offline artifacts, this skill actively attacks running applications. Mandatory safeguards:

  1. Explicit scope confirmation — Agent must confirm target URL and get user approval before any active testing
  2. Sandboxed environments only — Skill instructions must emphasize: ONLY test applications you own/control in isolated environments (Docker, VMs, dedicated test instances)
  3. No production systems — Explicit prohibition against testing production systems without written authorization
  4. Dangerous command approval — Hermes' existing approval.py system will catch destructive commands, but the skill should add pentest-specific warnings
  5. Ethical use header — Every pentest report should include authorization confirmation
  6. Rate limiting — Built-in delays between requests to avoid overwhelming targets

What We'd Need

  1. SKILL.md — Pentesting workflow with trigger conditions, safety guardrails, phased pipeline instructions
  2. references/recon-guide.md — How to use nmap, whatweb, subfinder effectively for web app reconnaissance
  3. references/vulnerability-taxonomy.md — Slot-type and render-context classification systems adapted from Shannon
  4. references/exploitation-techniques.md — Per-vulnerability-type exploitation guidance (SQLi payloads, XSS contexts, SSRF chains, auth bypass patterns)
  5. references/bypass-techniques.md — Common WAF/filter bypass techniques for each vulnerability class
  6. templates/pentest-report.md — Professional penetration test report template
  7. scripts/recon-scan.sh — Wrapper that runs nmap + whatweb + subfinder and outputs structured JSON

Phased Rollout

Phase 1: Reconnaissance + Manual-Guided Testing

  • Network reconnaissance: nmap port scanning, whatweb fingerprinting, subfinder subdomain discovery
  • Application mapping: crawl endpoints, identify input vectors, map authentication
  • Agent-guided vulnerability identification: agent reads recon results, suggests test vectors
  • User-directed exploitation: agent suggests payloads, user confirms before execution
  • Basic finding documentation (Markdown report)
  • Trigger: "pentest this app", "scan this web application", "test this URL for vulnerabilities"

Phase 2: Semi-Autonomous Vulnerability Analysis

  • Parallel vulnerability-specific analysis via delegate_task:
    • Injection agent (SQLi, command injection, path traversal, SSTI)
    • XSS agent (reflected, stored, DOM-based with render-context analysis)
    • Authentication agent (login bypass, JWT attacks, session management)
    • Authorization agent (IDOR, privilege escalation, role verification)
    • SSRF agent (internal access, cloud metadata, protocol smuggling)
  • Structured handoff via exploitation_queue JSON files
  • 4-level proof system for finding classification
  • Bypass exhaustion protocol before false-positive dismissal
  • Integration with #382 (static analysis feeds into vuln analysis if source code available)
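
The fan-out and conditional handoff in this phase can be sketched as follows. The signature of `delegate_task` is not shown in this issue, so the example uses `concurrent.futures` as a stand-in and takes the per-agent analysis as a plain callable. The agent names match the Phase 3 table above, and treating any non-empty queue as "actionable" is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

AGENTS: List[str] = ["injection-vuln", "xss-vuln", "auth-vuln",
                     "authz-vuln", "ssrf-vuln"]


def run_analysis(analyze: Callable[[str], List[dict]]) -> Dict[str, List[dict]]:
    """Run one analysis per vulnerability class in parallel.

    `analyze` stands in for a delegated sub-agent: it receives the agent
    name and returns that agent's exploitation queue as a list of findings.
    """
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        # map() preserves input order, so results line up with AGENTS.
        return dict(zip(AGENTS, pool.map(analyze, AGENTS)))


def classes_to_exploit(queues: Dict[str, List[dict]]) -> List[str]:
    """Exploitation is conditional: keep only classes with a non-empty queue."""
    return [name for name, findings in queues.items() if findings]
```

Classes whose queue comes back empty are simply skipped, which is also one lever for the cost concern raised under Open Questions.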

Phase 3: Full Autonomous Pipeline

  • End-to-end autonomous pentesting (recon → analyze → exploit → report)
  • Browser automation via Playwright MCP or Hermes browser tool for dynamic testing
  • Automated exploitation with PoC generation and reproduction steps
  • Professional pentest report generation with CVSS scoring
  • Todo-driven completeness (all endpoints tested, all findings classified)
  • Integration with CI/CD: run as scheduled security assessment on staging environments
  • Comparative reporting: track findings across runs, show remediation progress
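
Comparative reporting reduces to a set difference once findings have a stable identity. The sketch below assumes a finding is identified by its vulnerability type, path, and sink. That key choice and the dictionary field names are assumptions, since the issue does not define how findings are matched across runs.

```python
from typing import Dict, Iterable, List, Tuple

FindingKey = Tuple[str, str, str]


def finding_key(finding: dict) -> FindingKey:
    """Identity of a finding across runs (assumed: type + path + sink)."""
    return (finding["vuln_type"], finding["path"], finding["sink"])


def compare_runs(previous: Iterable[dict],
                 current: Iterable[dict]) -> Dict[str, List[FindingKey]]:
    """Split findings into fixed, new, and persisting between two runs."""
    prev = {finding_key(f) for f in previous}
    curr = {finding_key(f) for f in current}
    return {
        "fixed": sorted(prev - curr),
        "new": sorted(curr - prev),
        "persisting": sorted(prev & curr),
    }
```

Payloads and confidence values are deliberately left out of the key so that a finding re-proven with a different payload still counts as the same issue.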

Pros & Cons

Pros

  • Completes the security suite — With #382-384, Hermes would cover static analysis, binary analysis, forensics, AND active pentesting
  • Proof-based approach — "No Exploit, No Report" eliminates false positive noise
  • Uses existing Hermes tools — terminal (nmap, curl), browser/MCP (Playwright), delegate_task (parallel agents)
  • Progressive trust — Phase 1 is human-guided, Phase 3 is autonomous. Users can adopt at their comfort level.
  • Practical value — Every web developer should test their apps; this makes it accessible
  • Shannon-validated methodology — 96.15% XBOW benchmark proves the approach works
  • AGPL-safe — Fresh implementation using only concepts, no code borrowed

Cons / Risks

  • Safety — Active exploitation can cause data modification/deletion. Sandboxed environments mandatory.
  • Legal liability — Unauthorized pentesting is illegal. Skill must include strong ethical use guidance.
  • Tool dependencies — Full capability requires nmap, whatweb, subfinder, Playwright (many installs)
  • Cost — If using powerful LLMs for multi-agent pentesting, token costs can be significant
  • Scope creep — Pentesting is a vast field; must stay focused on web applications
  • LLM reliability — Exploitation requires precise payloads; LLM-generated payloads may need iteration
  • Not a replacement for professionals — The skill should be clear that it complements, not replaces, professional pentesting
  • Network/infrastructure pentesting not included — Deliberately scoped to web apps only

Open Questions

  1. Should Phase 1 require explicit user confirmation before EVERY test request (safest), or just before the first scan (more practical)?
  2. How should the skill handle authentication? Shannon requires pre-configured credentials. Should we support automated login flows or require the user to provide session tokens?
  3. Should we support source-code-aware testing (white-box, like Shannon) or black-box only? White-box is more powerful but requires code access.
  4. Should the skill integrate with the browser tool (if available) or require Playwright MCP setup? The browser tool is simpler but less capable.
  5. How should we handle the parallel vulnerability agents' cost? Should there be a "quick scan" mode that runs sequentially with a smaller model?
  6. Should findings from #382 (static analysis) automatically feed into this skill's exploitation phase, or keep them independent?
