Overview
RAPTOR includes a sophisticated binary security analysis module (packages/exploit_feasibility/, ~2500 lines in api.py alone) that performs comprehensive exploit feasibility assessment of compiled binaries. It analyzes memory protections, kernel mitigations, glibc defenses, ROP gadget availability, payload constraints, and input handler characteristics to determine whether a vulnerability in a binary is actually exploitable — and if so, what techniques would work.
Hermes Agent has no capability for analyzing compiled binaries, assessing exploit feasibility, triaging crashes, or understanding binary protections. This is a significant gap for users working with C/C++ codebases, embedded systems, CTF challenges, or security research.
This issue proposes a Binary Security Analysis skill that wraps standard binary analysis tools (checksec, readelf, objdump, nm, GDB/LLDB, ROPgadget) with LLM-powered interpretation, adapting RAPTOR's analysis patterns and expert personas into a Hermes Agent skill. The skill would also include crash analysis and triage capabilities from RAPTOR's packages/binary_analysis/ module (1325 lines).
Research Findings
How RAPTOR's Binary Analysis Works
Exploit Feasibility Module (packages/exploit_feasibility/)
The module performs layered analysis through analyze_binary():
Protection Analysis:
- Binary protections via checksec: RELRO (Partial/Full), PIE (position-independent), NX/DEP (non-executable stack), Stack Canary, FORTIFY_SOURCE
- glibc mitigation analysis: pointer mangling, tcache hardening, safe linking, __free_hook/__malloc_hook removal status, %n format string verification
- Kernel mitigation analysis: ASLR level (0/1/2), mmap_min_addr, ptrace_scope
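As an illustration of one protection check, here is a minimal sketch of RELRO classification. It is not RAPTOR's code: RAPTOR is described above as using checksec, whereas this sketch assumes the caller has already captured the text output of `readelf -lW` and `readelf -dW`. The function name `classify_relro` and the return labels are hypothetical.

```python
import re


def classify_relro(program_headers, dynamic_section):
    """Classify RELRO from readelf text output (assumed inputs).

    program_headers: text of `readelf -lW <binary>`
    dynamic_section: text of `readelf -dW <binary>`
    Returns "none", "partial", or "full" (labels chosen for this sketch).
    """
    # Without a GNU_RELRO segment there is no RELRO at all.
    if "GNU_RELRO" not in program_headers:
        return "none"
    # Full RELRO also needs eager binding, which shows up either as a
    # BIND_NOW entry/flag or as NOW inside the FLAGS_1 entry.
    for line in dynamic_section.splitlines():
        if "BIND_NOW" in line:
            return "full"
        if "(FLAGS_1)" in line and re.search(r"\bNOW\b", line):
            return "full"
    return "partial"
```

In a skill, the two text inputs would come from running readelf through the terminal tool; the function only interprets them.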
ROP Gadget Analysis:
- Scans for useful gadgets: pop rdi; ret, pop rsi; ret, syscall; ret, leave; ret, etc.
- Bad byte analysis per target address (null bytes, newlines in payload)
- One-gadget analysis with partial overwrite viability assessment
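The per-address bad byte check can be sketched as follows. This is a minimal illustration, assuming little-endian 64-bit addresses and a default bad byte set of null and newline (the two examples named above). The name `bad_byte_offsets` is hypothetical and not taken from RAPTOR.

```python
DEFAULT_BAD_BYTES = b"\x00\n"  # assumption: null byte and newline


def bad_byte_offsets(address, bad_bytes=DEFAULT_BAD_BYTES, width=8):
    """Return the byte offsets (little-endian) at which a bad byte occurs.

    An empty list means the packed address can pass through an input
    handler that rejects or truncates on any byte in bad_bytes.
    """
    packed = address.to_bytes(width, "little")
    return [offset for offset, value in enumerate(packed) if value in bad_bytes]


# A typical non-PIE code address: the upper bytes are all zero.
print(bad_byte_offsets(0x401136))   # [3, 4, 5, 6, 7]
# The low byte 0x0a is a newline, so offset 0 is flagged as well.
print(bad_byte_offsets(0x40110a))   # [0, 3, 4, 5, 6, 7]
```

Because the offsets are reported rather than a plain yes/no, the caller can decide whether trailing null bytes matter for the input function in question.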
Exploit Primitive Enumeration:
- Arbitrary read/write detection
- Control flow hijack (RIP/RSP control)
- Heap control primitives
- Format string capabilities (call count, single-shot detection)
Input Handler Analysis:
- Detects input functions: strcpy, gets, fgets, read, recv, scanf
- Payload constraint analysis: bad bytes, maximum length, charset restrictions
Output:
Rich verdict with classification (exploitable / likely_exploitable / difficult / unlikely / blocked), concrete targets, viable techniques, and actionable guidance.
Crash Analysis Module (packages/binary_analysis/)
CrashAnalyser class (crash_analyser.py, 1325 lines):
10-step crash analysis pipeline:
- Get binary info (file, readelf)
- Detect ASan instrumentation
- Run ASan analysis if available
- Run debugger analysis (GDB on Linux, LLDB on macOS — auto-detected)
- Get disassembly at crash site (objdump)
- Analyze memory layout/protections (ASLR, stack canaries, NX/DEP)
- Detect environmental crashes (debugger artifacts, sanitizer artifacts)
- Analyze memory regions around crash address
- Resolve function names (addr2line, symbol table, link register)
- Compute stack hash for deduplication
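The final step, a stack hash for deduplication, might look like the sketch below. The source does not say how RAPTOR computes its hash, so the choices here are assumptions: hash only the top few resolved frame names, use SHA-256, and truncate to 16 hex characters. The name `stack_hash` is hypothetical.

```python
import hashlib


def stack_hash(frames, depth=5):
    """Hash the top `depth` frame names so similar crashes share one bucket.

    frames: function names ordered from the crash site outward.
    Deeper frames are ignored because they tend to vary between inputs
    that trigger the same underlying bug.
    """
    top = [name.strip() for name in frames[:depth]]
    digest = hashlib.sha256("\n".join(top).encode("utf-8")).hexdigest()
    return digest[:16]
```

Two crashes whose top five frames match get the same hash even if their callers differ, which is the property deduplication needs.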
Crash type classification (signal-based + function-based + stack-trace-based):
- heap_overflow, stack_overflow, null_pointer_dereference, use_after_free, double_free
- format_string_vulnerability, integer_overflow, buffer_overflow, segmentation_fault
- division_by_zero, illegal_instruction, bus_error
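A simplified sketch of layered classification follows. It covers only two layers: a signal-name fallback and, as an assumption of this sketch, refinement from AddressSanitizer report markers. The function-based and stack-trace-based layers are omitted. The mapping tables, the `classify_crash` name, and the "unknown" fallback label are illustrative and not taken from RAPTOR.

```python
# Fallback layer: map the terminating signal to a coarse crash type.
SIGNAL_TYPES = {
    "SIGSEGV": "segmentation_fault",
    "SIGFPE": "division_by_zero",
    "SIGILL": "illegal_instruction",
    "SIGBUS": "bus_error",
}

# Refinement layer: error names that AddressSanitizer prints in its reports.
ASAN_MARKERS = (
    ("heap-use-after-free", "use_after_free"),
    ("double-free", "double_free"),
    ("heap-buffer-overflow", "heap_overflow"),
    ("stack-buffer-overflow", "stack_overflow"),
)


def classify_crash(signal_name, asan_report=""):
    """Prefer the specific ASan finding; fall back to the signal name."""
    for marker, crash_type in ASAN_MARKERS:
        if marker in asan_report:
            return crash_type
    return SIGNAL_TYPES.get(signal_name, "unknown")
```

Checking the more specific evidence first means a SIGSEGV with an ASan report is labeled by its root cause instead of as a generic segmentation fault.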
Crash Analysis Skills (.claude/skills/crash-analysis/)
4 specialized sub-skills:
- rr Debugger: Deterministic record-replay debugging with reverse execution. Includes crash_trace.py script for automated trace extraction (supports both regular and ASAN crashes).
- Line Execution Checker: C++17 tool that checks if specific source lines were executed using gcov data.
- gcov Coverage: Add gcov instrumentation to C/C++ projects for coverage-guided analysis.
- Function Tracing: Uses -finstrument-functions hooks with per-thread logs and Perfetto visualization output.
Expert Personas
RAPTOR loads specialized expert personas progressively:
- Crash Analyst (Charlie Miller/Halvar Flake persona, 284 lines): Systematic framework — crash type ID → register analysis → exploit primitives → mitigations → attack scenario → feasibility classification (Trivial/Moderate/Complex/Infeasible)
- Offensive Security Researcher (200 lines): Decision trees for format string, stack overflow, and heap exploitation. "6 Byte Rule" for x86_64 + strcpy. "Full RELRO Trap" explanation.
- Exploit Developer (Mark Dowd persona, 337 lines): 7 "Prime Directives" requiring working code, complete executability, safe testing, realistic constraints, honest assessment. Templates for every vulnerability type.
Anti-Hallucination Patterns
The crash analysis system uses a hypothesis/rebuttal loop:
- Crash analyzer writes hypothesis with mandatory evidence (>=3 actual debugger outputs, >=5 distinct memory addresses)
- Checker agent validates mechanically (grep for red flags: "expected output", "should show", "likely", "probably")
- If rejected, analyzer retries with feedback (max 3 iterations)
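The loop above can be sketched mechanically. The thresholds and red-flag phrases come from the description above; everything else is an assumption of this sketch, including the function names, counting addresses in the debugger outputs rather than in the hypothesis text, and the `analyze` callback that stands in for the analyzer agent.

```python
import re

RED_FLAGS = ("expected output", "should show", "likely", "probably")
RED_FLAG_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in RED_FLAGS) + r")\b", re.IGNORECASE
)
ADDRESS_RE = re.compile(r"\b0x[0-9a-fA-F]{4,16}\b")


def check_hypothesis(hypothesis, debugger_outputs, min_outputs=3, min_addresses=5):
    """Return a list of rejection reasons; an empty list means accepted."""
    problems = []
    for match in RED_FLAG_RE.finditer(hypothesis):
        problems.append("speculative phrase: " + match.group(0).lower())
    if len(debugger_outputs) < min_outputs:
        problems.append("need at least %d debugger outputs" % min_outputs)
    addresses = {a.lower() for a in ADDRESS_RE.findall("\n".join(debugger_outputs))}
    if len(addresses) < min_addresses:
        problems.append("need at least %d distinct addresses" % min_addresses)
    return problems


def run_loop(analyze, max_iterations=3):
    """Call analyze(feedback) until the checker accepts or attempts run out.

    analyze is a hypothetical callback returning (hypothesis, debugger_outputs).
    Returns (hypothesis, attempt_number), or (None, max_iterations) on failure.
    """
    feedback = []
    for attempt in range(1, max_iterations + 1):
        hypothesis, outputs = analyze(feedback)
        feedback = check_hypothesis(hypothesis, outputs)
        if not feedback:
            return hypothesis, attempt
    return None, max_iterations
```

The word-boundary match keeps "unlikely" from tripping the "likely" flag, which matters because "unlikely" is one of the verdict labels used elsewhere in this proposal.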
Key Design Decisions
- Profile-based analysis: _get_profile_for_vuln_type() auto-selects analysis strategy; web vulnerabilities skip memory mitigation checks entirely
- Same-tier LLM fallback: When analyzing, LLM fallback stays within cloud or local tier, never crosses (prevents inconsistent analysis quality)
- Mandatory gates: The /exploit command forces feasibility analysis BEFORE any exploit work. Lists specific things NOT to suggest when mitigations are present (e.g., "If Full RELRO, do NOT suggest GOT overwrites")
- Context persistence: save_exploit_context() persists analysis to JSON files that survive context window compaction
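The mandatory-gate idea can be illustrated with a small filter that removes techniques a detected mitigation rules out. Only the Full RELRO rule comes from the description above. The NX rule is added here for illustration, and the table layout, identifiers, and the `allowed_techniques` name are all hypothetical.

```python
# Mitigation -> techniques that must not be suggested when it is active.
BLOCKED_TECHNIQUES = {
    "full_relro": {"got_overwrite"},   # from the RAPTOR example quoted above
    "nx": {"stack_shellcode"},         # illustrative addition for this sketch
}


def allowed_techniques(candidates, active_mitigations):
    """Drop every candidate technique that an active mitigation blocks."""
    blocked = set()
    for mitigation in active_mitigations:
        blocked |= BLOCKED_TECHNIQUES.get(mitigation, set())
    return [technique for technique in candidates if technique not in blocked]


candidates = ["got_overwrite", "rop_chain", "stack_shellcode"]
print(allowed_techniques(candidates, {"full_relro", "nx"}))  # ['rop_chain']
```

Running the filter before any technique is proposed mirrors the ordering the gate enforces: feasibility first, suggestions second.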
Current State in Hermes Agent
What we have:
- No binary analysis capabilities whatsoever
- terminal tool can run checksec, readelf, objdump, gdb etc. if installed
- execute_code can run Python scripts for analysis
- delegate_task can spawn sub-agents for parallel analysis
What we don't have:
- No skill for binary security assessment
- No crash triage workflow
- No exploit feasibility analysis
- No integration with debugging tools (GDB, LLDB, rr)
- No knowledge of binary protections or exploitation techniques
Relevant existing issues:
Implementation Plan
Skill vs. Tool Classification
This should be a skill because:
- All analysis tools (checksec, readelf, objdump, nm, ROPgadget, GDB) are CLI tools callable via terminal
- The analysis is LLM-driven interpretation of tool outputs — perfectly suited to skill instructions
- No custom Python integration needed in the agent harness
- No streaming, real-time events, or binary data handling by the agent
- Expert personas are prompting patterns, not code
Bundled vs. Skills Hub: Recommend Skills Hub. Binary analysis is highly specialized (security researchers, CTF players, systems programmers). Required tools (checksec, ROPgadget, GDB) are not commonly installed on developer machines.
Category: security (same category as Code Security Audit skill)
What We'd Need
- SKILL.md — Workflow instructions covering binary protection analysis, crash triage, and exploit feasibility assessment. Includes adapted expert persona prompts.
- references/protections-guide.md — Agent reference explaining each protection (RELRO, PIE, NX, canary, ASLR, FORTIFY) and what they prevent
- references/exploitation-techniques.md — Decision trees for common exploitation paths (adapted from RAPTOR's offensive security researcher persona)
- references/crash-types.md — Classification guide for crash types with investigation steps
- scripts/binary-audit.sh — Helper script that runs checksec + readelf + basic analysis and outputs structured JSON
Phased Rollout
Phase 1: Binary Protection Analysis + Crash Triage
- Detect and use available tools (checksec, readelf, objdump, file, strings, nm)
- Run comprehensive protection analysis on a binary
- Analyze crash files/core dumps with GDB (Linux) or LLDB (macOS)
- Classify crash type (heap overflow, UAF, format string, etc.)
- Present findings with human-readable explanations
- Assess basic exploitability based on protections
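The first Phase 1 step, detecting which tools are available, could be done with a sketch like this, which an agent might run through execute_code. The tool list is taken from the bullet above; the function name `detect_tools` is hypothetical.

```python
import shutil

PHASE1_TOOLS = ("checksec", "readelf", "objdump", "file", "strings", "nm")


def detect_tools(names=PHASE1_TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in detect_tools().items():
        print(f"{name:10s} {path or 'missing'}")
```

Reporting missing tools up front lets the skill degrade gracefully, for example by falling back to readelf-based checks when checksec is not installed.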
Phase 2: Deep Exploit Feasibility
- ROP gadget analysis (via ROPgadget tool)
- Bad byte analysis for payload constraints
- Input handler detection and constraint mapping
- glibc mitigation analysis (version-aware)
- Full exploit feasibility verdict with technique recommendations
- Adapted expert persona prompts (crash analyst, exploit developer)
- Context persistence for multi-turn exploit development
Phase 3: Fuzzing Integration + Advanced Analysis
Pros & Cons
Pros
- Unique capability — No other AI agent framework offers integrated binary security analysis
- High-value for security researchers — Automates tedious manual analysis steps
- Expert-level prompting — RAPTOR's personas encode decades of reverse engineering expertise
- Platform-aware — GDB on Linux, LLDB on macOS (mirrors RAPTOR's approach)
- Progressive complexity: Phase 1 is useful with just file and readelf; deeper tools add power
- MIT-licensed source — RAPTOR's analysis patterns and code are freely adaptable
Cons / Risks
- Highly specialized audience — Most developers won't need binary exploitation analysis
- Tool dependencies — Full analysis requires checksec, ROPgadget, GDB, optionally rr and AFL++
- Platform limitations — rr only works on Linux x86_64; some tools Linux-only
- Safety concerns — Exploit generation capabilities need clear ethical usage guidelines
- LLM accuracy — Binary analysis requires precise reasoning; LLMs may hallucinate about register values or memory layouts. RAPTOR's anti-hallucination patterns (mandatory debugger output, mechanical checks) are essential.
- Scope — Could easily expand into a full exploit development framework; must stay focused on analysis/triage
Open Questions
- Should the skill include AFL++ fuzzing in Phase 1, or defer to Phase 3 as proposed?
- How much of RAPTOR's exploit_feasibility Python code (2500 lines) should we adapt vs. reimplementing as skill instructions + shell commands?
- Should exploit PoC generation be included, or just analysis/triage? (Ethical considerations)
- Should the skill work with remote binaries (download, analyze) or only local files?
- How should we handle the hypothesis/rebuttal validation loop: via delegate_task sub-agents or iterative self-checking?
References