Feature: Web Application Penetration Testing Skill — Reconnaissance, Exploitation, and Proof-Based Reporting (inspired by Shannon)

## Overview

[Shannon](https://github.com/KeygraphHQ/shannon) is an autonomous AI penetration testing agent by Keygraph that achieved 96.15% on the XBOW benchmark. Its defining philosophy is **"No Exploit, No Report"**: it reports only vulnerabilities it can actually exploit and reproduce against a running application, which is intended to keep false positives out of the final report. It uses a 13-agent, 5-phase pipeline (pre-recon → recon → vulnerability analysis → exploitation → reporting) orchestrated via Temporal.io, with Claude Code as the AI engine and Playwright for browser automation.

The existing RAPTOR-inspired cybersecurity issues (#382 Code Security Audit, #383 Binary Security Analysis, #384 OSS Security Forensics) cover **offline analysis** — scanning source code, analyzing binaries, and investigating supply chain attacks. None of them address the critical capability of testing a **running web application** for exploitable vulnerabilities. Shannon fills this gap with active reconnaissance, live exploitation, and proof-based reporting.

This issue proposes a **Web Application Penetration Testing** skill that adapts Shannon's methodology and pipeline into Hermes Agent, using existing tools (terminal for nmap/whatweb/curl, Hermes browser tool or Playwright MCP for web interaction, delegate_task for parallel vulnerability agents) to perform structured penetration testing against web applications in sandboxed environments.

> **License note:** Shannon is AGPL-3.0 — we cannot use its code directly. This proposal adapts Shannon's *concepts, methodology, and pipeline structure* as a fresh implementation. All code would be written from scratch.

---

## Research Findings

### How Shannon's Pentesting Pipeline Works

**Phase 1: Pre-Recon (Source Code Analysis)**
A single agent with source code access maps the application architecture:
- Framework identification and routing patterns
- Authentication flow mapping (session, JWT, OAuth)
- Dangerous sink inventory (SQL queries, OS commands, template renders, file operations)
- Input validation and sanitization audit
- Trust boundary identification
- Output: `code_analysis_deliverable.md` consumed by all downstream agents

**Phase 2: Recon (Attack Surface Mapping)**
A single agent interacts with the live application:
- Network scanning via nmap (ports, services)
- Subdomain enumeration via subfinder
- Technology fingerprinting via whatweb
- API endpoint discovery via Playwright browser interaction
- Correlation between source code analysis and live behavior
- Authentication testing (login flows, session management)
- Output: `recon_deliverable.md`

**Phase 3: Vulnerability Analysis (5 parallel agents)**
Each agent specializes in one vulnerability class:

| Agent | Scope | Methodology |
|-------|-------|-------------|
| injection-vuln | SQLi, Command Injection, Path Traversal, SSTI, LFI/RFI, Deserialization | Slot-type classification (SQL-val, CMD-argument, etc.) with required-defense mapping |
| xss-vuln | Reflected, Stored, DOM-based XSS | Render-context analysis (HTML_BODY, JAVASCRIPT_STRING, etc.) with encoding requirements |
| auth-vuln | Login bypass, JWT confusion, token replay, OAuth, password reset, brute force | 9-point authentication checklist |
| authz-vuln | Horizontal/vertical privilege escalation, IDOR, business logic | Role-based access matrix analysis |
| ssrf-vuln | Internal service access, cloud metadata, protocol smuggling | Four-type SSRF classification with escalation paths |

Each agent produces an `exploitation_queue.json` — a structured list of findings with: ID, vulnerability type, source location, path, sink, sanitization status, verdict, witness payload, and confidence level.
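
A minimal sketch of what one queue entry could look like, built from the fields listed above. The `QueueEntry` class, the field names, the top-level `findings` key, and the example values are all assumptions for illustration; Shannon's actual schema may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QueueEntry:
    """One finding in a hypothetical exploitation_queue.json (field names are assumed)."""
    id: str
    vuln_type: str
    source_location: str   # where untrusted input enters the application
    path: str              # data-flow path from source to sink
    sink: str              # the dangerous operation that is reached
    sanitization: str      # sanitization observed along the path, if any
    verdict: str           # e.g. "vulnerable" or "safe"
    witness_payload: str   # input that demonstrates the issue
    confidence: str        # e.g. "high", "medium", "low"


def dump_queue(entries: list[QueueEntry]) -> str:
    """Serialize entries to the JSON text that would be written to the queue file."""
    return json.dumps({"findings": [asdict(e) for e in entries]}, indent=2)


# Illustrative entry; every value here is made up.
entry = QueueEntry(
    id="INJ-001",
    vuln_type="sqli",
    source_location="routes/search.py:42",
    path="request.args['q'] -> build_query() -> cursor.execute()",
    sink="cursor.execute",
    sanitization="none",
    verdict="vulnerable",
    witness_payload="' OR '1'='1",
    confidence="high",
)
queue_json = dump_queue([entry])
```

A flat, typed record like this keeps the handoff between analysis and exploitation agents machine-checkable, so a downstream agent can reject a queue with missing fields before spending tokens on it.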

**Phase 4: Exploitation (5 parallel agents, conditional)**
Only runs for vulnerability classes where the analysis phase found actionable findings:
- Attempts actual exploitation against the live application
- Follows a **4-level proof system**:
  - Level 1: Identified (pattern found)
  - Level 2: Partial (data flow confirmed)
  - Level 3: Confirmed (payload delivered successfully)
  - Level 4: Critical Impact (data extracted, code executed, access gained)
- Uses a **Bypass Exhaustion Protocol**: must attempt multiple distinct bypass techniques before classifying as false positive
- Output: `exploitation_evidence.md` with full PoC payloads and reproduction steps
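
The proof levels and the bypass exhaustion rule could be encoded roughly as follows. This is a sketch under assumptions: the names `ProofLevel` and `may_dismiss_as_false_positive` are hypothetical, and the threshold of three distinct techniques is an assumed value, since the description above only says "multiple".

```python
from enum import IntEnum


class ProofLevel(IntEnum):
    IDENTIFIED = 1       # pattern found
    PARTIAL = 2          # data flow confirmed
    CONFIRMED = 3        # payload delivered successfully
    CRITICAL_IMPACT = 4  # data extracted, code executed, access gained


# Assumed threshold; the source only requires "multiple distinct bypass techniques".
MIN_BYPASS_ATTEMPTS = 3


def may_dismiss_as_false_positive(level: ProofLevel, bypass_techniques_tried: set[str]) -> bool:
    """Allow dismissal only for unconfirmed findings with enough distinct bypasses attempted."""
    if level >= ProofLevel.CONFIRMED:
        return False  # a confirmed exploit can never be a false positive
    return len(bypass_techniques_tried) >= MIN_BYPASS_ATTEMPTS
```

Using a set for the attempted techniques means that retrying the same bypass several times does not count toward exhaustion.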

**Phase 5: Reporting (1 agent)**
Assembles all evidence into a comprehensive security assessment with executive summary, proven exploits only, CVSS scoring, and remediation recommendations.

### Key Prompting Patterns

**1. Progressive Context Chain:** Each agent reads deliverables from ALL prior phases. Intelligence accumulates — the exploitation agent knows the architecture (pre-recon), attack surface (recon), and specific vulnerabilities (analysis) before attempting exploitation.
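
One way to sketch the context chain is a lookup that returns every earlier deliverable for a given phase. The deliverable filenames come from the pipeline description above; the `PIPELINE` table, the phase keys, and `inputs_for` are hypothetical names, and the real pipeline produces one queue per vulnerability agent rather than the single file shown here.

```python
# Phase name -> deliverable it writes (None for the final reporting phase).
PIPELINE = [
    ("pre-recon", "code_analysis_deliverable.md"),
    ("recon", "recon_deliverable.md"),
    ("vulnerability-analysis", "exploitation_queue.json"),
    ("exploitation", "exploitation_evidence.md"),
    ("reporting", None),
]


def inputs_for(phase: str) -> list[str]:
    """Return the deliverables of ALL prior phases, in pipeline order."""
    names = [name for name, _ in PIPELINE]
    idx = names.index(phase)  # raises ValueError for an unknown phase
    return [out for _, out in PIPELINE[:idx] if out is not None]


exploit_inputs = inputs_for("exploitation")
```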

**2. Backward Taint Analysis:** Start at dangerous sinks and trace backward to sources. Terminate a branch early when proper sanitization is found. This is more efficient than forward analysis in web application contexts.
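
The idea can be illustrated on a toy data-flow graph. This is only a sketch: the graph representation (a dict from each node to its predecessors), the node names, and `unsanitized_paths` are assumptions for illustration, not how Shannon implements the analysis.

```python
def unsanitized_paths(sink, preds, sources, sanitizers):
    """Walk backward from a sink; return each path that reaches a source without a sanitizer."""
    found = []
    stack = [[sink]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in sanitizers:
            continue  # early termination: this branch is defended
        if node in sources:
            found.append(list(reversed(path)))  # report in source -> sink order
            continue
        for prev in preds.get(node, []):
            if prev not in path:  # guard against cycles
                stack.append(path + [prev])
    return found


# Toy graph: each node maps to the nodes whose data flows into it.
preds = {
    "db.execute": ["build_query"],
    "build_query": ["escape_sql", "request.args.q"],
    "escape_sql": ["request.form.name"],
}
paths = unsanitized_paths(
    "db.execute",
    preds,
    sources={"request.args.q", "request.form.name"},
    sanitizers={"escape_sql"},
)
```

In this example the branch through `escape_sql` is abandoned as soon as the sanitizer is seen, so `request.form.name` is never visited; only the path from `request.args.q` is reported.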

**3. Slot-Type Taxonomy:** Each injection point has a specific "slot type" (SQL-val, SQL-ident, CMD-argument, PATH-segment, TEMPLATE-string, DESERIALIZE) with known required defenses. A mismatch between slot type and defense = vulnerability.
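
A sketch of the mismatch check follows. The slot names are taken from the taxonomy above; the defense labels in `REQUIRED_DEFENSE` and the function name `is_vulnerable` are illustrative assumptions, not Shannon's actual mapping.

```python
# Required defense per slot type (labels are assumed for illustration).
REQUIRED_DEFENSE = {
    "SQL-val": "parameter-binding",
    "SQL-ident": "identifier-allowlist",
    "CMD-argument": "argv-array-without-shell",
    "PATH-segment": "canonicalize-and-confine",
    "TEMPLATE-string": "no-user-controlled-template",
    "DESERIALIZE": "safe-data-format",
}


def is_vulnerable(slot_type: str, observed_defenses: set[str]) -> bool:
    """Flag a mismatch: the slot's required defense is absent from what the code applies."""
    return REQUIRED_DEFENSE[slot_type] not in observed_defenses
```

For example, HTML escaping applied to a `SQL-val` slot is a defense for the wrong context, so `is_vulnerable("SQL-val", {"html-escape"})` is `True`.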

**4. Todo-Driven Completeness:** Every endpoint/input vector gets a todo item. Analysis is INCOMPLETE if any todos remain. Prevents early termination.
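
A small tracker is enough to express this rule. The class and method names below are hypothetical; the sketch only shows the invariant that analysis is complete when no todo remains open.

```python
class TodoTracker:
    """Tracks one todo per (endpoint, input vector) pair."""

    def __init__(self):
        self._done = {}  # (endpoint, vector) -> bool

    def add(self, endpoint: str, vector: str) -> None:
        # setdefault keeps an already-completed item completed if it is re-added
        self._done.setdefault((endpoint, vector), False)

    def complete(self, endpoint: str, vector: str) -> None:
        self._done[(endpoint, vector)] = True

    def remaining(self) -> list[tuple[str, str]]:
        return sorted(key for key, done in self._done.items() if not done)

    def is_complete(self) -> bool:
        return not self.remaining()


todos = TodoTracker()
todos.add("/login", "username")
todos.add("/search", "q")
todos.complete("/login", "username")
```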

**5. Scope Enforcement:** Every prompt has explicit in-scope/out-of-scope definitions. External attacker perspective only. Network-reachable targets only.
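
Scope rules stated in a prompt can also be backed by a programmatic check before each request. The sketch below assumes scope is expressed as a set of exact hostnames; `in_scope` and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url: str, allowed_hosts: set[str]) -> bool:
    """Accept only http(s) URLs whose hostname exactly matches an allowed host."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    return host in allowed_hosts


allowed = {"localhost", "juice-shop.test"}
```

Exact hostname matching is deliberate: a suffix or substring match would let a host such as `localhost.evil.example` slip into scope.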

**6. Classification Rigor:**
- EXPLOITED = proven with evidence (Level 3-4)
- POTENTIAL = blocked by external factors (WAF, rate limit), not by code-level security
- FALSE POSITIVE = security controls verified to work after bypass exhaustion
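
The three labels can be expressed as a small decision function. This is a sketch: `classify` and its parameters are hypothetical, proof levels are passed as plain integers 1 to 4, and the fourth return value is an assumption added to cover findings that fit none of the three definitions yet.

```python
def classify(proof_level: int, blocked_externally: bool, bypasses_exhausted: bool) -> str:
    """Map evidence about a finding to one of the report classifications."""
    if proof_level >= 3:
        return "EXPLOITED"       # proven with evidence (Level 3-4)
    if blocked_externally:
        return "POTENTIAL"       # stopped by a WAF or rate limit, not by the code
    if bypasses_exhausted:
        return "FALSE POSITIVE"  # controls held after bypass exhaustion
    return "NEEDS MORE TESTING"  # assumed extra state, not part of the source scheme
```

The order of the checks matters: proof of exploitation always wins, and an external block is checked before dismissal so that a WAF cannot turn a real flaw into a false positive.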

### Shannon's Results

Demonstrated against real vulnerable applications:
- **OWASP Juice Shop**: 20+ critical flaws, full database exfiltration, admin access
- **Checkmarx c{api}tal API**: Root-level injection, legacy API bypass, 15 high-severity issues
- **OWASP crAPI**: JWT Algorithm Confusion, SQLi credential exfiltration, SSRF chains

### Comparison with Existing Issues

| Capability | #382 Code Audit | #383 Binary | #384 Forensics | This Issue |
|-----------|----------------|-------------|---------------|------------|
| Source code analysis | ✅ Semgrep/CodeQL | ❌ | ❌ | ✅ LLM-based (for architecture mapping) |
| Live app testing | ❌ | ❌ | ❌ | ✅ Core capability |
| Exploit verification | ❌ | ❌ | ❌ | ✅ Core capability |
| Binary analysis | ❌ | ✅ | ❌ | ❌ |
| Supply chain forensics | ❌ | ❌ | ✅ | ❌ |
| Browser automation | ❌ | ❌ | ❌ | ✅ Playwright/browser |
| Network reconnaissance | ❌ | ❌ | ❌ | ✅ nmap/subfinder/whatweb |
| Professional reporting | Partial | Partial | ✅ | ✅ Pentest-grade |

**Overlap with #382:** Shannon's pre-recon (LLM-based code analysis) overlaps conceptually with #382's source scanning, but the methodology is different — Shannon reads code to understand architecture for exploitation, while #382 scans code for vulnerability patterns. Shannon's transferable patterns (backward taint, slot-type taxonomy, proof system, bypass exhaustion) have been [proposed as enhancements to #382](https://github.com/NousResearch/hermes-agent/issues/382#issuecomment-4000481969).

**No overlap with #383 or #384.** Binary analysis and forensics are completely separate domains.

---

## Current State in Hermes Agent

**What we have:**
- `terminal` tool — can run nmap, whatweb, curl, subfinder if installed
- Browser tool (if available) or Playwright MCP — can automate web interaction
- `web_extract` — can fetch and parse web pages
- `delegate_task` — can spawn parallel sub-agents (maps to Shannon's 5 parallel vuln agents)
- `execute_code` — can run Python scripts for payload generation, response parsing
- #382 Code Security Audit — static analysis (complementary, not overlapping)

**What we don't have:**
- No web application penetration testing workflow
- No live exploitation capability
- No network reconnaissance skill (nmap, subfinder, whatweb)
- No browser-automated security testing
- No pentest reporting framework
- No proof-based vulnerability verification

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **skill** because:
- All security tools (nmap, whatweb, subfinder, curl) are CLIs callable via `terminal`
- Browser interaction uses existing Hermes tools (browser tool, MCP, or Playwright via terminal)
- The pentesting methodology is LLM reasoning guided by skill instructions
- Parallel vulnerability agents use existing `delegate_task`
- No custom Python integration needed in the agent harness
- No API key management for the harness (user provides target URL, the agent attacks it)

**Bundled vs. Skills Hub:** Recommend **Skills Hub**. Penetration testing is specialized (security professionals, developers auditing their own apps in sandboxed environments) and requires multiple external tools. Additionally, active exploitation requires clear safety guardrails and ethical use guidance.

**Category:** `security` (completing the cybersecurity skill suite alongside #382, #383, #384)

### Safety Guardrails (CRITICAL)

Unlike #382-384, which analyze offline artifacts, this skill **actively attacks running applications**. Mandatory safeguards:

1. **Explicit scope confirmation** — Agent must confirm target URL and get user approval before any active testing
2. **Sandboxed environments only** — Skill instructions must emphasize: ONLY test applications you own/control in isolated environments (Docker, VMs, dedicated test instances)
3. **No production systems** — Explicit prohibition against testing production systems without written authorization
4. **Dangerous command approval** — Hermes' existing `approval.py` system will catch destructive commands, but the skill should add pentest-specific warnings
5. **Ethical use header** — Every pentest report should include authorization confirmation
6. **Rate limiting** — Built-in delays between requests to avoid overwhelming targets
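
The rate-limiting guardrail could be implemented as a small throttle that every outgoing request passes through. This is a sketch, not part of Hermes: `RequestThrottle` is a hypothetical name, and the clock and sleep functions are injectable only so the behavior can be tested without real delays.

```python
import time


class RequestThrottle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval_s: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would create one throttle per target, for example `RequestThrottle(0.5)`, and call `wait()` immediately before each request.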

### What We'd Need

1. **SKILL.md** — Pentesting workflow with trigger conditions, safety guardrails, phased pipeline instructions
2. **references/recon-guide.md** — How to use nmap, whatweb, subfinder effectively for web app reconnaissance
3. **references/vulnerability-taxonomy.md** — Slot-type and render-context classification systems adapted from Shannon
4. **references/exploitation-techniques.md** — Per-vulnerability-type exploitation guidance (SQLi payloads, XSS contexts, SSRF chains, auth bypass patterns)
5. **references/bypass-techniques.md** — Common WAF/filter bypass techniques for each vulnerability class
6. **templates/pentest-report.md** — Professional penetration test report template
7. **scripts/recon-scan.sh** — Wrapper that runs nmap + whatweb + subfinder and outputs structured JSON
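
The "structured JSON" step of the recon wrapper could look like the following sketch, which assumes nmap was run with XML output (`-oX`) and converts the open ports into JSON. The function name, the output keys, and the sample XML are assumptions for illustration; the proposed wrapper itself is a shell script and would also need to merge whatweb and subfinder results.

```python
import json
import xml.etree.ElementTree as ET


def open_ports_from_nmap_xml(xml_text: str) -> list[dict]:
    """Extract open ports from nmap XML output into plain dictionaries."""
    root = ET.fromstring(xml_text)
    results = []
    for host in root.iter("host"):
        addr_el = host.find("address")
        addr = addr_el.get("addr") if addr_el is not None else None
        for port in host.iter("port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue  # skip closed and filtered ports
            service = port.find("service")
            results.append({
                "host": addr,
                "port": int(port.get("portid")),
                "protocol": port.get("protocol"),
                "service": service.get("name") if service is not None else None,
            })
    return results


# Hand-written sample in the shape of nmap's XML report (trimmed to the relevant elements).
SAMPLE = """<nmaprun>
  <host>
    <address addr="127.0.0.1" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="3000"><state state="open"/><service name="http"/></port>
      <port protocol="tcp" portid="5432"><state state="closed"/></port>
    </ports>
  </host>
</nmaprun>"""

recon_json = json.dumps({"open_ports": open_ports_from_nmap_xml(SAMPLE)}, indent=2)
```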

### Phased Rollout

**Phase 1: Reconnaissance + Manual-Guided Testing**
- Network reconnaissance: nmap port scanning, whatweb fingerprinting, subfinder subdomain discovery
- Application mapping: crawl endpoints, identify input vectors, map authentication
- Agent-guided vulnerability identification: agent reads recon results, suggests test vectors
- User-directed exploitation: agent suggests payloads, user confirms before execution
- Basic finding documentation (Markdown report)
- Trigger: "pentest this app", "scan this web application", "test this URL for vulnerabilities"

**Phase 2: Semi-Autonomous Vulnerability Analysis**
- Parallel vulnerability-specific analysis via `delegate_task`:
  - Injection agent (SQLi, command injection, path traversal, SSTI)
  - XSS agent (reflected, stored, DOM-based with render-context analysis)
  - Authentication agent (login bypass, JWT attacks, session management)
  - Authorization agent (IDOR, privilege escalation, role verification)
  - SSRF agent (internal access, cloud metadata, protocol smuggling)
- Structured handoff via exploitation_queue JSON files
- 4-level proof system for finding classification
- Bypass exhaustion protocol before false-positive dismissal
- Integration with #382 (static analysis feeds into vuln analysis if source code available)
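
The fan-out and the conditional exploitation step can be sketched as below. `ThreadPoolExecutor` is used purely as a stand-in for `delegate_task`, whose actual interface is not shown in this issue, and `analyze` is a hypothetical placeholder for a delegated analysis agent.

```python
from concurrent.futures import ThreadPoolExecutor

VULN_CLASSES = ["injection", "xss", "auth", "authz", "ssrf"]


def analyze(vuln_class: str, recon: dict) -> list[dict]:
    """Hypothetical stand-in for one delegated analysis agent; returns its queue."""
    return [c for c in recon.get("candidates", []) if c["class"] == vuln_class]


def run_analysis(recon: dict) -> dict[str, list[dict]]:
    """Run one analyzer per vulnerability class in parallel and collect the queues."""
    with ThreadPoolExecutor(max_workers=len(VULN_CLASSES)) as pool:
        queues = pool.map(lambda c: analyze(c, recon), VULN_CLASSES)
        return dict(zip(VULN_CLASSES, queues))


def classes_to_exploit(queues: dict[str, list[dict]]) -> list[str]:
    """Exploitation only runs for classes whose queue is non-empty."""
    return [c for c, q in queues.items() if q]


recon = {"candidates": [{"class": "xss", "endpoint": "/search", "param": "q"}]}
queues = run_analysis(recon)
```

Keeping one queue per class means an empty queue directly skips that class's exploitation agent, which is where most of the token cost would otherwise go.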

**Phase 3: Full Autonomous Pipeline**
- End-to-end autonomous pentesting (recon → analyze → exploit → report)
- Browser automation via Playwright MCP or Hermes browser tool for dynamic testing
- Automated exploitation with PoC generation and reproduction steps
- Professional pentest report generation with CVSS scoring
- Todo-driven completeness (all endpoints tested, all findings classified)
- Integration with CI/CD: run as scheduled security assessment on staging environments
- Comparative reporting: track findings across runs, show remediation progress
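
Comparative reporting reduces to a set difference once each finding has a stable fingerprint. The sketch below assumes a fingerprint of `(vulnerability type, endpoint, parameter)`; that choice and the name `compare_runs` are assumptions, not an agreed design.

```python
def compare_runs(previous, current):
    """Split findings into fixed, new, and persisting between two runs."""
    prev, curr = set(previous), set(current)
    return {
        "fixed": sorted(prev - curr),       # present before, gone now
        "new": sorted(curr - prev),         # first seen in this run
        "persisting": sorted(prev & curr),  # still exploitable
    }


run_1 = [("sqli", "/search", "q"), ("xss", "/profile", "bio")]
run_2 = [("xss", "/profile", "bio"), ("ssrf", "/fetch", "url")]
progress = compare_runs(run_1, run_2)
```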

---

## Pros & Cons

### Pros
- **Completes the security suite** — With #382-384, Hermes would cover static analysis, binary analysis, forensics, AND active pentesting
- **Proof-based approach** — "No Exploit, No Report" eliminates false positive noise
- **Uses existing Hermes tools** — terminal (nmap, curl), browser/MCP (Playwright), delegate_task (parallel agents)
- **Progressive trust** — Phase 1 is human-guided, Phase 3 is autonomous. Users can adopt at their comfort level.
- **Practical value** — Every web developer should test their apps; this makes it accessible
- **Shannon-validated methodology** — Shannon's reported 96.15% on the XBOW benchmark suggests the approach is effective
- **AGPL-safe** — Fresh implementation using only concepts, no code borrowed

### Cons / Risks
- **Safety** — Active exploitation can cause data modification/deletion. Sandboxed environments mandatory.
- **Legal liability** — Unauthorized pentesting is illegal. Skill must include strong ethical use guidance.
- **Tool dependencies** — Full capability requires nmap, whatweb, subfinder, Playwright (many installs)
- **Cost** — If using powerful LLMs for multi-agent pentesting, token costs can be significant
- **Scope creep** — Pentesting is a vast field; must stay focused on web applications
- **LLM reliability** — Exploitation requires precise payloads; LLM-generated payloads may need iteration
- **Not a replacement for professionals** — The skill should be clear that it complements, not replaces, professional pentesting
- **Network/infrastructure pentesting not included** — Deliberately scoped to web apps only

---

## Open Questions

1. Should Phase 1 require explicit user confirmation before EVERY test request (safest), or just before the first scan (more practical)?
2. How should the skill handle authentication? Shannon requires pre-configured credentials. Should we support automated login flows or require the user to provide session tokens?
3. Should we support source-code-aware testing (white-box, like Shannon) or black-box only? White-box is more powerful but requires code access.
4. Should the skill integrate with the browser tool (if available) or require Playwright MCP setup? The browser tool is simpler but less capable.
5. How should we handle the parallel vulnerability agents' cost? Should there be a "quick scan" mode that runs sequentially with a smaller model?
6. Should findings from #382 (static analysis) automatically feed into this skill's exploitation phase, or keep them independent?

---

## References

- [Shannon](https://github.com/KeygraphHQ/shannon) — Source repo (AGPL-3.0, concepts only)
- [Shannon XBOW benchmark](https://arxiv.org/abs/2503.10430) — 96.15% success rate
- [Dark Reading: Shannon coverage](https://www.darkreading.com/application-security) — Industry reception
- [HelpNetSecurity CISO review](https://www.helpnetsecurity.com/) — Comparative analysis with BugTrace, CAI, PentestGPT
- [OWASP Testing Guide](https://owasp.org/www-project-web-security-testing-guide/) — Standard web pentest methodology
- [OWASP Top 10](https://owasp.org/www-project-top-ten/) — Vulnerability classification standard
- Hermes Agent #382 — Code Security Audit (complementary static analysis, [Shannon patterns proposed as enhancements](https://github.com/NousResearch/hermes-agent/issues/382#issuecomment-4000481969))
- Hermes Agent #383 — Binary Security Analysis (complementary, no overlap)
- Hermes Agent #384 — OSS Security Forensics (complementary, no overlap)

