DELTA Cyber Reasoning System

Delta CRS: Diff-Evaluated Localization, Triage, and Attack Cyber Reasoning System
Delta CRS Duo Flow: Analyze potential vulnerability, exploit vulnerability through fuzzing, submit proofs, patch, and report
Delta Analysis Agent: Finds potential vulnerabilities and passes them to the Fuzzing Agent
Delta Fuzzing Agent: Creates a harness specific to the MR, runs the fuzzer, and reproduces and confirms any vulnerability found
Delta Patch & Report Agent: Creates a patch and commit for developers' review, and submits a report compliant with project requirements
Problem Statement
Introducing Fuzzing!
The 3 steps of Delta CRS using a Duo flow with 3 agents

Inspiration

Security vulnerabilities remain the most dangerous and expensive bug class in production software: memory-safety bugs such as buffer overflows, integer overflows that lead to out-of-bounds access, and use-after-free account for roughly 70% of the serious security vulnerabilities reported in large C/C++ codebases, according to figures published by Microsoft and the Chromium project. Fuzz testing is one of the most effective techniques for finding these bugs automatically, but it requires specialized expertise: writing harnesses, generating seed corpora, configuring sanitizers, triaging crashes, and writing patches. Most development teams skip fuzzing entirely on merge requests because of this complexity.

Cyber Reasoning Systems have demonstrated that LLMs can bridge this gap, autonomously understanding code structure, generating targeted test harnesses, and reasoning about crash root causes. We built DELTA CRS as a GitLab Duo flow powered by Claude to bring that capability to every merge request, making security fuzzing as routine as running a linter. No configuration, no expertise required — assign a reviewer and get patches.

What it does

DELTA CRS is a GitLab Duo Agent Platform flow that automatically scans merge requests for memory safety vulnerabilities. When a developer assigns @ai-delta-crs-flow-gitlab-ai-hackathon as a reviewer, the Duo flow triggers and executes three chained AgentComponent steps, each powered by Claude Sonnet via the GitLab AI Gateway:

Analyze (10 Duo tools): uses get_merge_request and list_merge_request_diffs to read the MR, then read_file, find_files, grep, and gitlab_blob_search to load full source context including headers. Claude classifies vulnerabilities by CWE, plans fuzz targets, and scopes findings to code changed in the MR.
Fuzz (5 Duo tools): uses create_file_with_contents to write a Claude-generated libFuzzer harness and seed corpus, then run_command to compile with clang -fsanitize=address,fuzzer and execute the fuzzing campaign. If compilation fails, the step feeds the clang errors back to Claude for correction (up to 5 retries). It then reproduces each crash to capture a full AddressSanitizer trace.
Report (7 Duo tools): uses get_merge_request to read the MR's source branch, get_repository_file to read vulnerable files, then Claude generates minimal patches and create_commit pushes them directly to the MR branch. Posts a structured security report via create_merge_request_note with severity badges, CWE IDs, and collapsible stack traces. Submits a structured code review via post_duo_code_review. Creates vulnerability issues via create_vulnerability_issue for tracking in GitLab's security dashboard.

Each step receives the previous step's output through context:{component}.final_answer input mapping, which is the Duo platform's native inter-component data passing. The flow also runs as a CI/CD job on every MR with a deterministic fallback; even without an API key, it compiles existing harnesses, fuzzes, and triages crashes.

In our demo, DELTA CRS finds a CRITICAL heap-buffer-overflow (CWE-122) in a vulnerable HTTP parser in under 80 seconds of fuzzing: a Content-Length integer overflow at atoi() leads to an undersized malloc, then memcpy writes past the allocation.

How we built it

DELTA CRS is a 3-step GitLab Duo flow (flows/delta-crs-flow.yml) backed by a Duo agent (agents/delta-crs.yml, 22 tools) and 12 Python modules (~3,500 LOC) orchestrating Claude Sonnet and libFuzzer inside a Docker container with clang and sanitizer instrumentation.

GitLab Duo Agent Platform integration: the flow definition uses the Duo v1 schema with ambient environment. Each of the three AgentComponent steps has scoped toolsets — the analyze step gets 10 read-only tools (get_merge_request, list_merge_request_diffs, read_file, find_files, grep, list_dir, get_repository_file, list_repository_tree, gitlab_blob_search, read_files), the fuzz step gets 5 execution tools (read_file, find_files, list_dir, create_file_with_contents, run_command), and the report step gets 7 write tools (get_merge_request, get_repository_file, read_file, create_merge_request_note, post_duo_code_review, create_commit, create_vulnerability_issue). Data flows between components via context:analyze.final_answer and context:fuzz.final_answer input mappings. The standalone Duo agent declares all 22 tools for interactive use in GitLab Duo Chat.

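Claude Sonnet drives six distinct AI tasks (diff review, harness scoring, dictionary token extraction, harness synthesis, seed script generation, and root-cause patching) through three Anthropic interaction patterns: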

generate_json for structured analysis: Claude reviews diffs, scores harnesses 1-10, and extracts dictionary tokens as typed JSON
generate_code: Claude synthesizes complete C fuzz harnesses and Python seed generation scripts, extracted from markdown code fences
generate_with_tools — the most sophisticated integration: Claude runs a multi-turn ReAct loop, calling view_file, search_symbol, and search_string to navigate the codebase and localize root causes before generating unified diffs. This is a genuine agentic tool-use loop, not a single-shot prompt.
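
The tool-use loop in the last item can be pictured roughly as follows. This is a minimal sketch rather than the project's code: the three tool names come from the text above, but their signatures, the in-memory REPO, the Action shape, and the scripted_model stand-in for Claude are all assumptions made so the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory "repository" so the sketch runs without a checkout.
REPO = {
    "http_parser.c": (
        "int parse(const char *buf) {\n"
        "  int n = atoi(buf);\n"
        "  char *p = malloc(n + 1);\n"
        "  return 0;\n"
        "}\n"
    ),
}

def view_file(path: str) -> str:
    return REPO.get(path, f"error: {path} not found")

def _grep(pattern: re.Pattern) -> str:
    hits = [
        f"{path}:{lineno}: {line.strip()}"
        for path, text in REPO.items()
        for lineno, line in enumerate(text.splitlines(), 1)
        if pattern.search(line)
    ]
    return "\n".join(hits) or "no matches"

def search_string(needle: str) -> str:
    return _grep(re.compile(re.escape(needle)))

def search_symbol(name: str) -> str:
    # Whole-word match, so "malloc" does not hit "xmalloc_init".
    return _grep(re.compile(rf"\b{re.escape(name)}\b"))

TOOLS = {"view_file": view_file, "search_symbol": search_symbol,
         "search_string": search_string}

@dataclass
class Action:
    tool: str  # a key of TOOLS, or "final" to end the loop (assumed convention)
    arg: str

def react_loop(model: Callable[[list], Action], max_turns: int = 8) -> str:
    """Alternate model decisions and tool observations until a final answer."""
    transcript = []
    for _ in range(max_turns):
        action = model(transcript)
        if action.tool == "final":
            return action.arg
        tool = TOOLS.get(action.tool)
        observation = tool(action.arg) if tool else f"error: unknown tool {action.tool}"
        transcript.append(f"{action.tool}({action.arg!r}) -> {observation}")
    return "gave up: turn budget exhausted"

def scripted_model(transcript: list) -> Action:
    """Stand-in for the LLM: replays a fixed plan, one step per turn."""
    plan = [
        Action("search_symbol", "malloc"),
        Action("view_file", "http_parser.c"),
        Action("final", "bound-check n before malloc in http_parser.c"),
    ]
    return plan[min(len(transcript), len(plan) - 1)]
```

Calling react_loop(scripted_model) walks the plan and returns the final string. In a real integration the model callable would wrap an LLM call that sees the transcript and chooses the next tool.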

Compile-fix loop: LLM-generated harnesses often don't compile on the first try. We built an inner agent loop: write harness to disk, compile with clang, feed stderr back to Claude with the actual header file content, get corrected code, retry. This progressively fixes type errors, wrong struct member names, and C/C++ linkage issues, typically succeeding within 1-2 iterations.
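
The loop described above might look like the sketch below. It assumes a single-file harness and a caller-supplied fix callback (in the real system that would be a Claude call; here it is just a function parameter). The function names, the harness.c file name, and the default of 5 retries (taken from the fuzz step's description) are illustrative, not the project's actual module layout.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple

def clang_compile(source: Path) -> Tuple[bool, str]:
    """Compile a harness with ASan + libFuzzer and return (ok, stderr).

    Simplification: a real build would also pass the target's sources
    and include paths on the command line.
    """
    proc = subprocess.run(
        ["clang", "-g", "-fsanitize=address,fuzzer",
         str(source), "-o", str(source.with_suffix(".bin"))],
        capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stderr

def compile_fix_loop(
    harness: str,
    header_text: str,
    fix: Callable[[str, str, str], str],   # (source, stderr, header) -> new source
    compile_fn: Callable[[Path], Tuple[bool, str]] = clang_compile,
    max_retries: int = 5,
    workdir: Optional[Path] = None,
) -> Optional[str]:
    """Return harness source that compiles, or None if retries run out."""
    workdir = workdir or Path(tempfile.mkdtemp())
    path = workdir / "harness.c"
    for attempt in range(max_retries + 1):
        path.write_text(harness)
        ok, stderr = compile_fn(path)
        if ok:
            return harness
        if attempt < max_retries:
            # Hand the compiler output and the real header back to the fixer.
            harness = fix(harness, stderr, header_text)
    return None
```

Injecting compile_fn keeps the loop testable without clang installed: a fake compiler that rejects req.uri and a fixer that rewrites it to req.path exercises the same path as the struct-member errors mentioned above.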

3-strategy dictionary generation: three parallel Claude extraction strategies per source function: comparison operand extraction (magic values, string literals), adversarial bug-trigger generation (boundary values, format strings), and protocol-aware formatted string synthesis. Results are deduplicated and merged into a libFuzzer .dict file.
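
The merge step can be sketched as below, assuming each strategy has already produced a list of byte tokens; the extraction itself and its parallelism are out of scope here. The function names and sample tokens are made up for illustration. The output uses the quoted-string dictionary syntax that libFuzzer shares with AFL, where backslash, double quote, and non-printable bytes are escaped.

```python
def escape_token(token: bytes) -> str:
    """Escape a token for a libFuzzer/AFL dictionary entry."""
    out = []
    for byte in token:
        ch = chr(byte)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif 0x20 <= byte < 0x7F:
            out.append(ch)
        else:
            out.append(f"\\x{byte:02X}")  # non-printable bytes as \xNN
    return "".join(out)

def merge_dictionary(*strategies: list) -> str:
    """Deduplicate tokens across strategies, keeping first-seen order."""
    seen = set()
    lines = []
    for tokens in strategies:
        for token in tokens:
            if token and token not in seen:
                seen.add(token)
                lines.append(f'"{escape_token(token)}"')
    return "\n".join(lines) + "\n"

# Hypothetical outputs of the three strategies for an HTTP parser target.
comparison = [b"Content-Length", b"GET"]
adversarial = [b"%n%n%n", b"\xff\xff\xff\xff", b"GET"]
protocol = [b"HTTP/1.1\r\n"]

DICT_TEXT = merge_dictionary(comparison, adversarial, protocol)
```

DICT_TEXT can be written to a file and passed to the fuzzer with libFuzzer's -dict= option.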

Cross-run memory: A JSON persistence system tracks crash signatures (SHA-256 of normalized top-5 stack frames), patched functions, and lessons learned. The second scan skips known crashes and already-patched code.
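
A minimal sketch of such a store is shown below. The ScanMemory class, its JSON keys, and the memory.json file name are assumptions for illustration; the source only states that signatures, patched functions, and lessons are persisted as JSON.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

class ScanMemory:
    """JSON-backed record of what earlier scans already found and fixed."""

    def __init__(self, path: Path):
        self.path = path
        self.state = {"crash_signatures": [], "patched_functions": [], "lessons": []}
        if path.exists():
            self.state.update(json.loads(path.read_text()))

    def is_known_crash(self, signature: str) -> bool:
        return signature in self.state["crash_signatures"]

    def is_patched(self, function: str) -> bool:
        return function in self.state["patched_functions"]

    def record(self, signature: str, patched_function: Optional[str] = None,
               lesson: Optional[str] = None) -> None:
        entries = (("crash_signatures", signature),
                   ("patched_functions", patched_function),
                   ("lessons", lesson))
        for key, value in entries:
            if value and value not in self.state[key]:
                self.state[key].append(value)

    def save(self) -> None:
        self.path.write_text(json.dumps(self.state, indent=2, sort_keys=True))

def demo() -> tuple:
    """First scan records a crash; a second scan sees it as already known."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "memory.json"
        first = ScanMemory(store)
        seen_before = first.is_known_crash("abc123")
        first.record("abc123", patched_function="parse_request")
        first.save()
        second = ScanMemory(store)
        return seen_before, second.is_known_crash("abc123")
```

In CI the file would need to survive between pipeline runs, for example through a cache or an artifact. How the project does this is not stated.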

CI/CD pipeline: A 4-stage .gitlab-ci.yml pipeline (validate → build → test → scan) ensures every MR gets fuzzing. The scan job has dual-mode behavior: Claude-powered when ANTHROPIC_API_KEY is set as a CI/CD variable, deterministic fallback (compile existing harness + fuzz + triage) when no key is available.
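
The dual-mode decision amounts to a check on one environment variable. The sketch below illustrates it; the plan_scan name and the step labels are assumptions, and only the ANTHROPIC_API_KEY variable and the contents of the fallback path come from the description above.

```python
import os
from typing import List, Mapping, Optional, Tuple

CLAUDE_STEPS = ["analyze diff", "generate harness", "fuzz",
                "triage crashes", "patch", "report"]
FALLBACK_STEPS = ["compile existing harness", "fuzz", "triage crashes"]

def plan_scan(env: Optional[Mapping[str, str]] = None) -> Tuple[str, List[str]]:
    """Pick the scan mode from the CI environment."""
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY", "").strip():
        return "claude", CLAUDE_STEPS
    # No key (or an empty one): stay fully deterministic and offline.
    return "deterministic", FALLBACK_STEPS
```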

Challenges we ran into

LLM-generated code that doesn't compile: Claude hallucinated struct member names (req.uri instead of req.path), invented function signatures, and forgot that malloc returns void* in C. Our first approach — a single-shot prompt — had a <50% compile rate. The compile-fix loop, combined with always providing the actual header file content as context, brought this to >95% reliability in 1-2 iterations.

Crash deduplication across runs: the same underlying bug produces different crash inputs and slightly different stack traces depending on fuzzer randomness. We built deterministic signatures (SHA-256 of the top 5 stack frames, stripped of addresses and line numbers) that stay stable across runs and compiler versions.
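
One way to build such a signature is sketched below, assuming the usual AddressSanitizer frame layout (#N 0xADDR in function location). The exact normalization the project applies is not given, so this version keeps only the function name and the source file's base name for the first five frames of the first stack in the report.

```python
import hashlib
import re
from pathlib import PurePosixPath

FRAME_RE = re.compile(r"^\s*#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)(?:\s+(\S+))?")

def top_frames(trace: str, limit: int = 5) -> list:
    """Return normalized 'function@file' entries for the crashing stack."""
    frames = []
    last_index = -1
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        index = int(match.group(1))
        if index <= last_index:
            break  # numbering restarted: a second stack (e.g. the allocation site)
        last_index = index
        function, location = match.group(2), match.group(3) or ""
        if location.startswith("("):
            location = ""  # "(module+0xoffset)" form carries no source file
        location = re.sub(r"(:\d+)+$", "", location)  # drop :line:column
        filename = PurePosixPath(location).name if location else ""
        frames.append(f"{function}@{filename}")
        if len(frames) == limit:
            break
    return frames

def crash_signature(trace: str) -> str:
    """SHA-256 over the normalized top frames; empty traces hash the empty string."""
    return hashlib.sha256("\n".join(top_frames(trace)).encode()).hexdigest()
```

Two reports of the same bug that differ only in addresses and line or column numbers hash to the same value, while a crash in a different function does not.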

Scoping analysis to MR changes: when the agent reads full source files for context (struct definitions, macros, function prototypes), Claude tends to report every vulnerability it finds — including pre-existing bugs in unchanged code. We added explicit scoping guardrails to every prompt: "Focus ONLY on code added or modified in the MR diff" and require each finding to be labeled as in-diff or pre-existing.
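
The in-diff versus pre-existing label can also be checked mechanically after the model responds. The sketch below assumes findings carry a file path and a line number in the new version of the file, and that the MR diff is available in unified format. The function names and sample diff are illustrative, not the project's code.

```python
import re
from collections import defaultdict

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def changed_lines(diff: str) -> dict:
    """Map each file path to the set of new-file line numbers the diff adds."""
    changed = defaultdict(set)
    path, new_line, old_left, new_left = None, 0, 0, 0
    for line in diff.splitlines():
        if old_left > 0 or new_left > 0:      # inside a hunk body
            if line.startswith("+"):
                changed[path].add(new_line)
                new_line += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):    # skip "\ No newline at end of file"
                new_line += 1                  # context line exists on both sides
                old_left -= 1
                new_left -= 1
        elif line.startswith("+++ "):
            target = line[4:].strip()
            path = target[2:] if target.startswith("b/") else target
        elif (match := HUNK_RE.match(line)):
            old_left = int(match.group(1) or 1)
            new_line = int(match.group(2))
            new_left = int(match.group(3) or 1)
    return dict(changed)

def label_finding(changed: dict, path: str, line: int) -> str:
    return "in-diff" if line in changed.get(path, set()) else "pre-existing"

SAMPLE_DIFF = "\n".join([
    "--- a/src/http_parser.c",
    "+++ b/src/http_parser.c",
    "@@ -10,3 +10,4 @@",
    " int parse(const char *buf) {",
    "-    int n = 0;",
    "+    int n = atoi(buf);",
    "+    char *p = malloc(n + 1);",
    "     return n;",
])
```

With SAMPLE_DIFF, a finding on line 12 of src/http_parser.c is labeled in-diff, while a finding on line 40 is labeled pre-existing.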

Multi-step Duo flow reliability: our initial 5-step flow suffered from WebSocket closures and context loss between AgentComponent steps. We consolidated to 3 steps with richer per-step prompts and used the Duo platform's context:{component}.final_answer input mapping for reliable data passing, which proved more stable while maintaining the full pipeline capability.

Accomplishments that we're proud of

DELTA CRS finds a real CRITICAL heap-buffer-overflow in the demo HTTP parser in under 80 seconds with zero human intervention: a Content-Length integer overflow leads to an undersized malloc
The compile-fix loop reliably produces working harnesses by feeding clang errors back to Claude, fixing type errors, wrong function names, and linkage issues in 1-2 iterations
The system works end-to-end from MR reviewer assignment to committed patches: the Duo flow triggers → Claude analyzes the diff → generates a harness → fuzzes → triages crashes → commits the fix via create_commit → posts the report via create_merge_request_note
Deterministic fallback mode in the .gitlab-ci.yml pipeline means every MR gets fuzzing, even without an LLM API key — zero cost, zero config
Claude's ReAct patcher navigates source code using tools (view_file, search_symbol, search_string) to localize root causes before generating patches, not just pattern-matching on stack traces
Cross-run memory means the second scan is smarter than the first — known crashes are skipped, already-patched functions aren't re-analyzed, and effective harness patterns are recorded
Full Duo platform integration: auto-committed patches via create_commit, structured code review via post_duo_code_review, vulnerability tracking via create_vulnerability_issue — the developer just reviews and merges

What we learned

The hardest part of autonomous fuzzing isn't actually fuzzing; it's harness generation. Knowing which functions to target, how to set up state, and how to interpret fuzzed bytes as structured input is where Claude adds the most value over traditional tooling. The compile-fix loop was the breakthrough that made this reliable.

Claude is surprisingly good at understanding input formats. Given a fuzz harness and source context, Claude reliably infers the expected input structure and generates diverse, protocol-aware seed corpora — something that previously required manual engineering per target.

Deterministic triage matters more than LLM triage. We initially used Claude to classify crash severity. Switching to keyword-based classification (heap-buffer-overflow + WRITE → CRITICAL, heap-buffer-overflow + READ → HIGH) made results reproducible across runs and eliminated LLM cost for the most latency-sensitive part of the pipeline.
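
A keyword-based classifier of this kind can be very small. The sketch below encodes only the two rules named above and returns UNKNOWN for everything else. The fallback label, the function name, and the regular expressions (written against the standard AddressSanitizer report header) are assumptions.

```python
import re

# Only the two mappings stated in the text; other bug types are left unclassified.
SEVERITY_RULES = {
    ("heap-buffer-overflow", "WRITE"): "CRITICAL",
    ("heap-buffer-overflow", "READ"): "HIGH",
}

BUG_RE = re.compile(r"AddressSanitizer: ([a-z][a-z0-9-]*)")
ACCESS_RE = re.compile(r"^(READ|WRITE) of size \d+", re.MULTILINE)

def classify_severity(report: str) -> str:
    """Map an ASan report to a severity using fixed keyword rules."""
    bug = BUG_RE.search(report)
    access = ACCESS_RE.search(report)
    if not bug or not access:
        return "UNKNOWN"
    return SEVERITY_RULES.get((bug.group(1), access.group(1)), "UNKNOWN")
```

Because the mapping is a plain lookup, the same report always yields the same severity, which is the reproducibility property the paragraph above describes.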

Multi-turn tool use beats single-shot for patching. Our first patcher asked Claude to generate a patch from the stack trace alone. Accuracy improved significantly when we gave it tools to browse the codebase and localize the fault before generating a fix — a genuine agentic loop leveraging Claude's tool-use capabilities rather than a prompt-and-pray approach.

The Duo Agent Platform's scoped toolsets are a natural fit for security pipelines. Giving the analyze step read-only tools and the report step write tools (including create_commit and create_vulnerability_issue) enforces a principle-of-least-privilege architecture that mirrors how human security teams operate — analysts read, engineers patch, reporters file issues.

What's next for DELTA Cyber Reasoning System

Deeper GitLab Duo integration: use post_duo_code_review for inline MR annotations on specific diff lines, integrate with GitLab's security dashboard widgets, and add merge-blocking approval gates for CRITICAL findings
AFL++ integration: adding a second fuzzing engine for coverage-guided diversity and ensemble fuzzing alongside libFuzzer
Advanced Claude tool use: expand the ReAct patcher with additional tools for running targeted test cases against patches and verifying fixes don't introduce regressions
Broader language support: extending Claude-driven harness generation beyond C/C++ to Rust (unsafe blocks), Go (cgo), and other languages with memory safety concerns
Corpus distillation: using coverage feedback from libFuzzer to prune and evolve the seed corpus across runs, keeping only inputs that reach new code paths
Multi-project memory: sharing crash signatures and effective harness patterns across repositories within an organization, so a pattern learned on one project helps all others
Duo flow conditional routing: when the platform supports it, skip the fuzz step if no security-relevant changes are detected, reducing cost for non-security MRs

Built With

claude
duo-agent
duo-flow
gitlab
libfuzzer
python