
Feature: Binary Security Analysis Skill — Exploit Feasibility, Crash Triage, and Protection Analysis (inspired by RAPTOR) #383

@teknium1

Overview

RAPTOR includes a sophisticated binary security analysis module (packages/exploit_feasibility/, ~2500 lines in api.py alone) that performs comprehensive exploit feasibility assessment of compiled binaries. It analyzes memory protections, kernel mitigations, glibc defenses, ROP gadget availability, payload constraints, and input handler characteristics to determine whether a vulnerability in a binary is actually exploitable — and if so, what techniques would work.

Hermes Agent has no capability for analyzing compiled binaries, assessing exploit feasibility, triaging crashes, or understanding binary protections. This is a significant gap for users working with C/C++ codebases, embedded systems, CTF challenges, or security research.

This issue proposes a Binary Security Analysis skill that wraps standard binary analysis tools (checksec, readelf, objdump, nm, GDB/LLDB, ROPgadget) with LLM-powered interpretation, adapting RAPTOR's analysis patterns and expert personas into a Hermes Agent skill. The skill would also include crash analysis and triage capabilities from RAPTOR's packages/binary_analysis/ module (1325 lines).


Research Findings

How RAPTOR's Binary Analysis Works

Exploit Feasibility Module (packages/exploit_feasibility/)

The module performs layered analysis through analyze_binary():

Protection Analysis:

  • Binary protections via checksec: RELRO (Partial/Full), PIE (position-independent executable), NX/DEP (non-executable stack), Stack Canary, FORTIFY_SOURCE
  • glibc mitigation analysis: pointer mangling, tcache hardening, safe linking, __free_hook/__malloc_hook removal status, %n format string verification
  • Kernel mitigation analysis: ASLR level (randomize_va_space: 0/1/2), mmap_min_addr, ptrace_scope
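
As a rough illustration of the binary protection checks above, the sketch below derives a checksec-style summary from text that `readelf -W -h -l -d` and `nm -D` would print. The function name `summarize_protections` and the result keys are hypothetical and are not taken from RAPTOR. The parsing assumes the wide (`-W`) program header layout, where each segment is on one line.

```python
import re


def summarize_protections(elf_header: str, program_headers: str,
                          dynamic_section: str, symbols: str) -> dict:
    """Derive a checksec-style summary from readelf/nm text (hypothetical helper)."""
    # RELRO: a GNU_RELRO segment alone is partial; with eager binding it is full.
    has_relro = "GNU_RELRO" in program_headers
    bind_now = bool(
        re.search(r"\bBIND_NOW\b|\(FLAGS_1\).*\bNOW\b", dynamic_section)
    )
    if has_relro and bind_now:
        relro = "full"
    elif has_relro:
        relro = "partial"
    else:
        relro = "none"

    # NX: the GNU_STACK segment must not carry the E (execute) flag.
    # Assumption: a missing GNU_STACK header is treated as an executable stack.
    nx = False
    for line in program_headers.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "GNU_STACK":
            # Wide layout: name, five hex fields, flag characters, alignment.
            flags = "".join(tokens[6:-1])
            nx = "E" not in flags

    return {
        "relro": relro,
        "nx": nx,
        # ET_DYN is also used by shared libraries, so this is only a hint.
        "pie": bool(re.search(r"Type:\s+DYN\b", elf_header)),
        "canary": "__stack_chk_fail" in symbols,
        # FORTIFY_SOURCE shows up as calls to the *_chk variants.
        "fortify": bool(re.search(r"__\w+_chk\b", symbols)),
    }
```

The skill could feed real tool output into a helper like this and hand the resulting dictionary to the LLM, so that interpretation starts from parsed facts instead of raw text.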

ROP Gadget Analysis:

  • Scans for useful gadgets: pop rdi; ret, pop rsi; ret, syscall; ret, leave; ret, etc.
  • Bad byte analysis per target address (null bytes, newlines in payload)
  • One-gadget analysis with partial overwrite viability assessment
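
The bad byte check per target address can be sketched as follows. This is a minimal illustration, not RAPTOR's code: the helper names are hypothetical, and the default bad byte set (null and newline) is taken from the bullet above. Addresses are assumed to be written into the payload in little-endian order, as on x86_64.

```python
def bad_byte_offsets(address: int, bad: bytes = b"\x00\n", width: int = 8) -> list[int]:
    """Return the byte offsets (little-endian order) that hold a bad byte."""
    packed = address.to_bytes(width, "little")
    return [i for i, byte in enumerate(packed) if byte in bad]


def usable_prefix_length(address: int, bad: bytes = b"\x00\n", width: int = 8) -> int:
    """Number of leading payload bytes of the address that survive the input handler."""
    offsets = bad_byte_offsets(address, bad, width)
    return offsets[0] if offsets else width
```

For a typical x86_64 user-space address such as 0x00007ffff7a5c123, the two high bytes are zero, so a string-copying input function can deliver at most the low six bytes.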

Exploit Primitive Enumeration:

  • Arbitrary read/write detection
  • Control flow hijack (RIP/RSP control)
  • Heap control primitives
  • Format string capabilities (call count, single-shot detection)

Input Handler Analysis:

  • Detects input functions: strcpy, gets, fgets, read, recv, scanf
  • Payload constraint analysis: bad bytes, maximum length, charset restrictions
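
One way to detect the input functions listed above is to scan the undefined dynamic symbols printed by `nm -D`. The sketch below is an assumption about how this could be done, not RAPTOR's implementation. The name `find_input_functions` is hypothetical, and the handling of the `__isoc99_`/`__isoc23_` prefixes reflects how glibc may export scanf under a versioned name.

```python
INPUT_FUNCTIONS = {"strcpy", "gets", "fgets", "read", "recv", "scanf"}


def find_input_functions(nm_output: str) -> set[str]:
    """Return the input functions a binary imports, based on `nm -D` text."""
    found = set()
    for line in nm_output.splitlines():
        tokens = line.split()
        # Imported symbols are listed with type "U" (undefined).
        if len(tokens) < 2 or tokens[-2] != "U":
            continue
        # Drop the symbol version suffix, e.g. "gets@GLIBC_2.2.5".
        name = tokens[-1].split("@", 1)[0]
        for prefix in ("__isoc99_", "__isoc23_"):
            if name.startswith(prefix):
                name = name[len(prefix):]
        if name in INPUT_FUNCTIONS:
            found.add(name)
    return found
```

Fortified variants such as `__strcpy_chk` are deliberately not matched here; a fuller version would probably map them back to their base names and record that FORTIFY_SOURCE is active.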

Output:
Rich verdict with classification (exploitable / likely_exploitable / difficult / unlikely / blocked), concrete targets, viable techniques, and actionable guidance.

Crash Analysis Module (packages/binary_analysis/)

CrashAnalyser class (crash_analyser.py, 1325 lines):

10-step crash analysis pipeline:

  1. Get binary info (file, readelf)
  2. Detect ASan instrumentation
  3. Run ASan analysis if available
  4. Run debugger analysis (GDB on Linux, LLDB on macOS — auto-detected)
  5. Get disassembly at crash site (objdump)
  6. Analyze memory layout/protections (ASLR, stack canaries, NX/DEP)
  7. Detect environmental crashes (debugger artifacts, sanitizer artifacts)
  8. Analyze memory regions around crash address
  9. Resolve function names (addr2line, symbol table, link register)
  10. Compute stack hash for deduplication
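
Step 10 can be illustrated with a small stack hash. This is a sketch under stated assumptions: frames are given as `function+offset` strings, only the top five frames are hashed, and offsets are dropped so that the same bug reached at slightly different instruction offsets collapses into one bucket. RAPTOR's actual hashing scheme is not described in this issue, and `stack_hash` is a hypothetical name.

```python
import hashlib


def stack_hash(frames: list[str], depth: int = 5) -> str:
    """Hash the top `depth` frames by function name for crash deduplication."""
    # Keep only the function name: "memcpy+0x1a" becomes "memcpy".
    names = [frame.split("+", 1)[0].strip() for frame in frames[:depth]]
    digest = hashlib.sha256("\n".join(names).encode("utf-8"))
    return digest.hexdigest()[:16]
```

Two crashes with the same hash can then be triaged once, which matters when a fuzzer produces hundreds of crashing inputs for one underlying bug.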

Crash type classification (signal-based + function-based + stack-trace-based):

  • heap_overflow, stack_overflow, null_pointer_dereference, use_after_free, double_free
  • format_string_vulnerability, integer_overflow, buffer_overflow, segmentation_fault
  • division_by_zero, illegal_instruction, bus_error

Crash Analysis Skills (.claude/skills/crash-analysis/)

4 specialized sub-skills:

  • rr Debugger: Deterministic record-replay debugging with reverse execution. Includes crash_trace.py script for automated trace extraction (supports both regular and ASan crashes).
  • Line Execution Checker: C++17 tool that checks if specific source lines were executed using gcov data.
  • gcov Coverage: Add gcov instrumentation to C/C++ projects for coverage-guided analysis.
  • Function Tracing: Uses -finstrument-functions hooks with per-thread logs and Perfetto visualization output.

Expert Personas

RAPTOR loads specialized expert personas progressively:

  • Crash Analyst (Charlie Miller/Halvar Flake persona, 284 lines): Systematic framework — crash type ID → register analysis → exploit primitives → mitigations → attack scenario → feasibility classification (Trivial/Moderate/Complex/Infeasible)
  • Offensive Security Researcher (200 lines): Decision trees for format string, stack overflow, and heap exploitation. "6 Byte Rule" for x86_64 + strcpy. "Full RELRO Trap" explanation.
  • Exploit Developer (Mark Dowd persona, 337 lines): 7 "Prime Directives" requiring working code, complete executability, safe testing, realistic constraints, honest assessment. Templates for every vulnerability type.

Anti-Hallucination Patterns

The crash analysis system uses a hypothesis/rebuttal loop:

  • Crash analyzer writes hypothesis with mandatory evidence (>=3 actual debugger outputs, >=5 distinct memory addresses)
  • Checker agent validates mechanically (grep for red flags: "expected output", "should show", "likely", "probably")
  • If rejected, analyzer retries with feedback (max 3 iterations)
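
The loop above could be sketched as a mechanical checker plus a bounded retry. The red-flag phrases and thresholds are the ones listed in the bullets; everything else is an assumption, including the function names, the choice to count addresses across both the hypothesis and the debugger outputs, and the shape of the `analyze` callback (a hypothetical function that takes feedback and returns a hypothesis with its evidence).

```python
import re

RED_FLAGS = ("expected output", "should show", "likely", "probably")
ADDRESS_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")


def check_hypothesis(hypothesis: str, debugger_outputs: list[str],
                     min_outputs: int = 3, min_addresses: int = 5) -> list[str]:
    """Return a list of problems; an empty list means the hypothesis passes."""
    problems = []
    lowered = hypothesis.lower()
    for phrase in RED_FLAGS:
        # Whole-word match so that "unlikely" does not trigger "likely".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            problems.append(f"speculative phrase: {phrase!r}")
    if len(debugger_outputs) < min_outputs:
        problems.append(f"need at least {min_outputs} debugger outputs")
    evidence = "\n".join([hypothesis, *debugger_outputs])
    addresses = {match.lower() for match in ADDRESS_RE.findall(evidence)}
    if len(addresses) < min_addresses:
        problems.append(f"need at least {min_addresses} distinct addresses")
    return problems


def analyze_with_rebuttal(analyze, max_iterations: int = 3):
    """Call `analyze(feedback)` until the checker accepts or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_iterations):
        hypothesis, outputs = analyze(feedback)
        feedback = check_hypothesis(hypothesis, outputs)
        if not feedback:
            return hypothesis
    return None  # rejected on every attempt
```

Because the checker is plain string matching, it could run either inside a delegate_task sub-agent or as a script the analyzing agent calls on its own draft.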

Key Design Decisions

  1. Profile-based analysis: _get_profile_for_vuln_type() auto-selects analysis strategy — web vulnerabilities skip memory mitigation checks entirely
  2. Same-tier LLM fallback: When analyzing, LLM fallback stays within cloud or local tier, never crosses (prevents inconsistent analysis quality)
  3. Mandatory gates: The /exploit command forces feasibility analysis BEFORE any exploit work. Lists specific things NOT to suggest when mitigations are present (e.g., "If Full RELRO, do NOT suggest GOT overwrites")
  4. Context persistence: save_exploit_context() persists analysis to JSON files that survive context window compaction
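
The mandatory gate in item 3 can be read as a filter over candidate techniques. In the sketch below, only the Full RELRO rule comes from the text above; the NX rule, the technique identifiers, and the name `gate_techniques` are hypothetical additions for illustration.

```python
def gate_techniques(candidates: list[str], protections: dict) -> tuple[list[str], dict]:
    """Split candidate techniques into allowed ones and blocked ones with reasons."""
    blocked_reasons = {}
    if protections.get("relro") == "full":
        # From the issue text: with Full RELRO, do not suggest GOT overwrites.
        blocked_reasons["got_overwrite"] = "Full RELRO maps the GOT read-only"
    if protections.get("nx"):
        # Assumed extra rule: NX rules out running shellcode from the stack.
        blocked_reasons["stack_shellcode"] = "NX marks the stack non-executable"

    allowed = [name for name in candidates if name not in blocked_reasons]
    blocked = {name: blocked_reasons[name]
               for name in candidates if name in blocked_reasons}
    return allowed, blocked
```

Returning the reasons alongside the blocked names lets the skill tell the user why a technique was excluded instead of silently omitting it.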

Current State in Hermes Agent

What we have:

  • No dedicated binary analysis capabilities; only the general-purpose tools below
  • terminal tool can run checksec, readelf, objdump, gdb etc. if installed
  • execute_code can run Python scripts for analysis
  • delegate_task can spawn sub-agents for parallel analysis

What we don't have:

  • No skill for binary security assessment
  • No crash triage workflow
  • No exploit feasibility analysis
  • No integration with debugging tools (GDB, LLDB, rr)
  • No knowledge of binary protections or exploitation techniques

Relevant existing issues:


Implementation Plan

Skill vs. Tool Classification

This should be a skill because:

  • All analysis tools (checksec, readelf, objdump, nm, ROPgadget, GDB) are CLI tools callable via terminal
  • The analysis is LLM-driven interpretation of tool outputs — perfectly suited to skill instructions
  • No custom Python integration needed in the agent harness
  • No streaming, real-time events, or binary data handling by the agent
  • Expert personas are prompting patterns, not code

Bundled vs. Skills Hub: Recommend Skills Hub. Binary analysis is highly specialized (security researchers, CTF players, systems programmers). Required tools (checksec, ROPgadget, GDB) are not commonly installed on developer machines.

Category: security (same category as Code Security Audit skill)

What We'd Need

  1. SKILL.md — Workflow instructions covering binary protection analysis, crash triage, and exploit feasibility assessment. Includes adapted expert persona prompts.
  2. references/protections-guide.md — Agent reference explaining each protection (RELRO, PIE, NX, canary, ASLR, FORTIFY) and what they prevent
  3. references/exploitation-techniques.md — Decision trees for common exploitation paths (adapted from RAPTOR's offensive security researcher persona)
  4. references/crash-types.md — Classification guide for crash types with investigation steps
  5. scripts/binary-audit.sh — Helper script that runs checksec + readelf + basic analysis and outputs structured JSON

Phased Rollout

Phase 1: Binary Protection Analysis + Crash Triage

  • Detect and use available tools (checksec, readelf, objdump, file, strings, nm)
  • Run comprehensive protection analysis on a binary
  • Analyze crash files/core dumps with GDB (Linux) or LLDB (macOS)
  • Classify crash type (heap overflow, UAF, format string, etc.)
  • Present findings with human-readable explanations
  • Assess basic exploitability based on protections

Phase 2: Deep Exploit Feasibility

  • ROP gadget analysis (via ROPgadget tool)
  • Bad byte analysis for payload constraints
  • Input handler detection and constraint mapping
  • glibc mitigation analysis (version-aware)
  • Full exploit feasibility verdict with technique recommendations
  • Adapted expert persona prompts (crash analyst, exploit developer)
  • Context persistence for multi-turn exploit development
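
Context persistence for multi-turn work could be as simple as writing the analysis state to a JSON file and reloading it later. The sketch below is not RAPTOR's save_exploit_context(); the names `save_context` and `load_context`, the file layout, and the write-then-rename step are assumptions.

```python
import json
import os
from pathlib import Path


def save_context(path: Path, context: dict) -> None:
    """Write the analysis context to JSON, replacing any previous file."""
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(context, indent=2, sort_keys=True),
                        encoding="utf-8")
    # Rename last so a reader never sees a half-written file.
    os.replace(tmp_path, path)


def load_context(path: Path) -> dict:
    """Load a previously saved context, or return an empty one."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

Keeping the state on disk means the agent can re-read protections, verdict, and blocked techniques after the conversation context has been compacted.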

Phase 3: Fuzzing Integration + Advanced Analysis


Pros & Cons

Pros

  • Unique capability — To our knowledge, no other AI agent framework offers integrated binary security analysis
  • High-value for security researchers — Automates tedious manual analysis steps
  • Expert-level prompting — RAPTOR's personas encode decades of reverse engineering expertise
  • Platform-aware — GDB on Linux, LLDB on macOS (mirrors RAPTOR's approach)
  • Progressive complexity — Phase 1 is useful with just file and readelf; deeper tools add power
  • MIT-licensed source — RAPTOR's analysis patterns and code are freely adaptable

Cons / Risks

  • Highly specialized audience — Most developers won't need binary exploitation analysis
  • Tool dependencies — Full analysis requires checksec, ROPgadget, GDB, optionally rr and AFL++
  • Platform limitations — rr runs only on Linux and on a limited set of CPUs (primarily x86_64); some other tools are also Linux-only
  • Safety concerns — Exploit generation capabilities need clear ethical usage guidelines
  • LLM accuracy — Binary analysis requires precise reasoning; LLMs may hallucinate about register values or memory layouts. RAPTOR's anti-hallucination patterns (mandatory debugger output, mechanical checks) are essential.
  • Scope — Could easily expand into a full exploit development framework; must stay focused on analysis/triage

Open Questions

  1. Should the skill include AFL++ fuzzing in Phase 1, or defer to Phase 3 as proposed?
  2. How much of RAPTOR's exploit_feasibility Python code (~2500 lines in api.py alone) should we adapt, and how much should we reimplement as skill instructions + shell commands?
  3. Should exploit PoC generation be included, or just analysis/triage? (Ethical considerations)
  4. Should the skill work with remote binaries (download, analyze) or only local files?
  5. How should we handle the hypothesis/rebuttal validation loop — via delegate_task sub-agents or iterative self-checking?
