Delta CRS: Diff-Evaluated Localization, Triage, and Attack Cyber Reasoning System
- Delta CRS Duo Flow: analyze potential vulnerabilities, exploit them through fuzzing, submit proofs, patch, and report
- Delta Analysis Agent: finds potential vulnerabilities and passes them to the Fuzzing Agent
- Delta Fuzzing Agent: creates a harness specific to the MR, runs the fuzzer, then exploits and confirms any vulnerability found
- Delta Patch & Report Agent: creates a patch and commit for developer review, and submits a report compliant with project requirements
- Problem Statement
- Introducing Fuzzing!
- The 3 steps of Delta CRS using a Duo flow with 3 agents
Inspiration
Security vulnerabilities remain the most dangerous and expensive bug class in production software: memory-safety bugs such as buffer overflows and use-after-free, often triggered by integer overflows, account for roughly 70% of serious security bugs in large C/C++ codebases, by Microsoft's and the Chromium project's own counts. Fuzz testing is one of the most effective techniques for finding these bugs automatically, but it requires specialized expertise: writing harnesses, generating seed corpora, configuring sanitizers, triaging crashes, and writing patches. Most development teams skip fuzzing entirely on merge requests because of this complexity.
Cyber Reasoning Systems have demonstrated that LLMs can bridge this gap, autonomously understanding code structure, generating targeted test harnesses, and reasoning about crash root causes. We built DELTA CRS as a GitLab Duo flow powered by Claude to bring that capability to every merge request, making security fuzzing as routine as running a linter. No configuration, no expertise required — assign a reviewer and get patches.
What it does
DELTA CRS is a GitLab Duo Agent Platform flow that automatically scans merge requests for memory safety vulnerabilities. When a developer assigns @ai-delta-crs-flow-gitlab-ai-hackathon as a reviewer, the Duo flow triggers and executes three chained AgentComponent steps, each powered by Claude Sonnet via the GitLab AI Gateway:
- Analyze (10 Duo tools): uses get_merge_request and list_merge_request_diffs to read the MR, then read_file, find_files, grep, and gitlab_blob_search to load full source context including headers. Claude classifies vulnerabilities by CWE, plans fuzz targets, and scopes findings to code changed in the MR.
- Fuzz (5 Duo tools): uses create_file_with_contents to write a Claude-generated libFuzzer harness and seed corpus, then run_command to compile with clang -fsanitize=address,fuzzer and execute the fuzzing campaign. If compilation fails, feeds clang errors back to Claude for correction (up to 5 retries). Reproduces each crash for full AddressSanitizer traces.
- Report (7 Duo tools): uses get_merge_request to read the MR's source branch, get_repository_file to read vulnerable files, then Claude generates minimal patches and create_commit pushes them directly to the MR branch. Posts a structured security report via create_merge_request_note with severity badges, CWE IDs, and collapsible stack traces. Submits a structured code review via post_duo_code_review. Creates vulnerability issues via create_vulnerability_issue for tracking in GitLab's security dashboard.
Each step receives the previous step's output through context:{component}.final_answer input mapping, which is the Duo platform's native inter-component data passing. The flow also runs as a CI/CD job on every MR with a deterministic fallback; even without an API key, it compiles existing harnesses, fuzzes, and triages crashes.
In our demo, DELTA CRS finds a CRITICAL heap-buffer-overflow (CWE-122) in a vulnerable HTTP parser in under 80 seconds of fuzzing: a Content-Length integer overflow at atoi() leads to an undersized malloc, then memcpy writes past the allocation.
How we built it
DELTA CRS is a 3-step GitLab Duo flow (flows/delta-crs-flow.yml) backed by a Duo agent (agents/delta-crs.yml, 22 tools) and 12 Python modules (~3,500 LOC) orchestrating Claude Sonnet and libFuzzer inside a Docker container with clang and sanitizer instrumentation.
GitLab Duo Agent Platform integration: the flow definition uses the Duo v1 schema with ambient environment. Each of the three AgentComponent steps has scoped toolsets — the analyze step gets 10 read-only tools (get_merge_request, list_merge_request_diffs, read_file, find_files, grep, list_dir, get_repository_file, list_repository_tree, gitlab_blob_search, read_files), the fuzz step gets 5 execution tools (read_file, find_files, list_dir, create_file_with_contents, run_command), and the report step gets 7 write tools (get_merge_request, get_repository_file, read_file, create_merge_request_note, post_duo_code_review, create_commit, create_vulnerability_issue). Data flows between components via context:analyze.final_answer and context:fuzz.final_answer input mappings. The standalone Duo agent declares all 22 tools for interactive use in GitLab Duo Chat.
Claude Sonnet drives six distinct AI tasks through three Anthropic interaction patterns:
- generate_json for structured analysis: Claude reviews diffs, scores harnesses 1-10, and extracts dictionary tokens as typed JSON
- generate_code: Claude synthesizes complete C fuzz harnesses and Python seed generation scripts, extracted from markdown code fences
- generate_with_tools, the most sophisticated integration: Claude runs a multi-turn ReAct loop, calling view_file, search_symbol, and search_string to navigate the codebase and localize root causes before generating unified diffs. This is a genuine agentic tool-use loop, not a single-shot prompt.
Compile-fix loop: LLM-generated harnesses often don't compile on the first try. We built an inner agent loop: write harness to disk, compile with clang, feed stderr back to Claude with the actual header file content, get corrected code, retry. This progressively fixes type errors, wrong struct member names, and C/C++ linkage issues, typically succeeding within 1-2 iterations.
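The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual module: compile_fix_loop, clang_compile, and the fix_fn callback (standing in for the Claude call) are hypothetical names, and the retry budget of 5 mirrors the figure quoted for the Fuzz step.

```python
import os
import subprocess
import tempfile
from typing import Callable, Optional, Tuple

# Hypothetical callback types for this sketch.
CompileFn = Callable[[str], Tuple[bool, str]]   # source -> (ok, stderr)
FixFn = Callable[[str, str, str], str]          # (source, stderr, header) -> new source


def clang_compile(source: str) -> Tuple[bool, str]:
    """Compile a C harness with ASan + libFuzzer; returns (ok, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "harness.c")
        with open(src, "w") as fh:
            fh.write(source)
        proc = subprocess.run(
            ["clang", "-fsanitize=address,fuzzer", src,
             "-o", os.path.join(tmp, "harness")],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stderr


def compile_fix_loop(source: str, header: str, compile_fn: CompileFn,
                     fix_fn: FixFn, max_retries: int = 5) -> Optional[str]:
    """Return a harness that compiles, or None once the retries run out."""
    for attempt in range(max_retries + 1):      # first try plus retries
        ok, stderr = compile_fn(source)
        if ok:
            return source
        if attempt < max_retries:
            # The model sees the compiler errors and the real header content.
            source = fix_fn(source, stderr, header)
    return None
```

Passing the compiler and the model in as callbacks keeps the loop testable without clang or an API key; in a real run, compile_fn would be clang_compile and fix_fn would wrap the LLM request.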
3-strategy dictionary generation: three parallel Claude extraction strategies per source function: comparison operand extraction (magic values, string literals), adversarial bug-trigger generation (boundary values, format strings), and protocol-aware formatted string synthesis. Results are deduplicated and merged into a libFuzzer .dict file.
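As an illustration of the merge step, here is a small sketch that deduplicates tokens from several strategies and renders them in libFuzzer's dictionary syntax (one quoted entry per line, with \\, \" and \xNN escapes). The function names and sample tokens are assumptions made for this example; the extraction itself is done by Claude and is not shown.

```python
from typing import Iterable, List


def escape_token(token: bytes) -> str:
    """Escape a raw token for a libFuzzer dictionary entry."""
    out: List[str] = []
    for b in token:
        ch = chr(b)
        if ch in ('"', "\\"):
            out.append("\\" + ch)           # quote and backslash need escaping
        elif 0x20 <= b < 0x7F:
            out.append(ch)                  # printable ASCII stays as-is
        else:
            out.append(f"\\x{b:02X}")       # everything else becomes \xNN
    return "".join(out)


def merge_dictionaries(*strategies: Iterable[bytes]) -> str:
    """Merge token lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    lines: List[str] = []
    for tokens in strategies:
        for token in tokens:
            if token and token not in seen:
                seen.add(token)
                lines.append(f'"{escape_token(token)}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    comparison = [b"Content-Length: ", b"GET"]      # hypothetical strategy output
    adversarial = [b"GET", b"\xff\xff\xff\xff"]
    protocol = [b"HTTP/1.1\r\n"]
    print(merge_dictionaries(comparison, adversarial, protocol), end="")
```

The resulting text can be written to a .dict file and passed to the fuzzer with libFuzzer's -dict= option.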
Cross-run memory: A JSON persistence system tracks crash signatures (SHA-256 of normalized top-5 stack frames), patched functions, and lessons learned. The second scan skips known crashes and already-patched code.
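A signature of this kind could be computed along these lines. The sketch assumes AddressSanitizer-style frame lines ("#0 0x... in function file:line") and keeps only the function names of the first stack in the report; the regex and the crash_signature name are illustrative, and C++ symbols containing spaces would need a more careful parser.

```python
import hashlib
import re

# Matches ASan frame lines such as:
#     #1 0x4f6000 in parse_body /src/http.c:42:5
FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\S+)")


def crash_signature(asan_trace: str, depth: int = 5) -> str:
    """SHA-256 over the top `depth` function names of the crashing stack."""
    functions = []
    for line in asan_trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            functions.append(match.group(1))
        elif functions:
            break   # first stack ended; ignore allocation/free stacks below
    normalized = "\n".join(functions[:depth])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Because addresses, file paths, and line numbers are discarded before hashing, two crashes that pass through the same functions map to the same signature even when the crashing input differs.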
CI/CD pipeline: A 4-stage .gitlab-ci.yml pipeline (validate → build → test → scan) ensures every MR gets fuzzing. The scan job has dual-mode behavior: Claude-powered when ANTHROPIC_API_KEY is set as a CI/CD variable, deterministic fallback (compile existing harness + fuzz + triage) when no key is available.
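The dual-mode switch amounts to a check on the environment. A minimal sketch, assuming the scan entry point is a Python script; select_scan_mode and the two mode labels are hypothetical names, while ANTHROPIC_API_KEY is the variable named above.

```python
import os
from typing import Mapping


def select_scan_mode(env: Mapping[str, str] = os.environ) -> str:
    """Pick the Claude-powered path only when an API key is actually set."""
    if env.get("ANTHROPIC_API_KEY", "").strip():
        return "claude"
    # No key: compile the existing harness, fuzz, and triage without an LLM.
    return "deterministic"
```

Treating an empty or whitespace-only value as missing avoids a failed API call when the CI/CD variable is defined but left blank.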
Challenges we ran into
LLM-generated code that doesn't compile: Claude hallucinated struct member names (req.uri instead of req.path), invented function signatures, and forgot that malloc returns void* in C. Our first approach — a single-shot prompt — had a <50% compile rate. The compile-fix loop, combined with always providing the actual header file content as context, brought this to >95% reliability in 1-2 iterations.
Crash deduplication across runs: the same underlying bug produces different crash inputs and slightly different stack traces depending on fuzzer randomness. We built deterministic signatures (SHA-256 of the top 5 stack frames, stripped of addresses and line numbers) that stay stable across runs and compiler versions.
Scoping analysis to MR changes: when the agent reads full source files for context (struct definitions, macros, function prototypes), Claude tends to report every vulnerability it finds — including pre-existing bugs in unchanged code. We added explicit scoping guardrails to every prompt: "Focus ONLY on code added or modified in the MR diff" and require each finding to be labeled as in-diff or pre-existing.
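The in-diff versus pre-existing label can also be checked mechanically after the model answers. The sketch below is an assumption about how such a check might look, not the project's code: it collects the new-file line numbers of added lines from a unified diff and labels a finding by file and line. It is simplified and would misread added or removed lines whose content itself begins with "++ " or "-- ".

```python
import re
from typing import Dict, Set

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(unified_diff: str) -> Dict[str, Set[int]]:
    """Map each file path to the new-file line numbers added by the diff."""
    changed: Dict[str, Set[int]] = {}
    path = None
    new_line = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++ "):
            path = line[4:]
            if path.startswith("b/"):
                path = path[2:]
            changed.setdefault(path, set())
            continue
        if line.startswith("--- ") or line.startswith("\\"):
            continue                        # old-file header or "\ No newline" marker
        hunk = HUNK_RE.match(line)
        if hunk:
            new_line = int(hunk.group(1))   # first new-file line of this hunk
        elif path is None:
            continue
        elif line.startswith("+"):
            changed[path].add(new_line)
            new_line += 1
        elif not line.startswith("-"):
            new_line += 1                   # context line advances the new file


    return changed


def label_finding(path: str, line: int, changed: Dict[str, Set[int]]) -> str:
    """Label a finding as 'in-diff' or 'pre-existing'."""
    return "in-diff" if line in changed.get(path, set()) else "pre-existing"
```

A check like this gives a deterministic backstop for the prompt guardrail: a finding the model labels in-diff can be downgraded if its line was not actually touched by the MR.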
Multi-step Duo flow reliability: our initial 5-step flow suffered from WebSocket closures and context loss between AgentComponent steps. We consolidated to 3 steps with richer per-step prompts and used the Duo platform's context:{component}.final_answer input mapping for reliable data passing, which proved more stable while maintaining the full pipeline capability.
Accomplishments that we're proud of
- DELTA CRS finds a real CRITICAL heap-buffer-overflow in the demo HTTP parser in under 80 seconds, a Content-Length integer overflow leading to an undersized malloc, with zero human intervention
- The compile-fix loop reliably produces working harnesses by feeding clang errors back to Claude, fixing type errors, wrong function names, and linkage issues in 1-2 iterations
- The system works end-to-end from MR reviewer assignment to committed patches: the Duo flow triggers → Claude analyzes diff → generates harness → fuzz → triage crashes → create_commit fix → create_merge_request_note report
- Deterministic fallback mode in the .gitlab-ci.yml pipeline means every MR gets fuzzing, even without an LLM API key: zero cost, zero config
- Claude's ReAct patcher navigates source code using tools (view_file, search_symbol, search_string) to localize root causes before generating patches, not just pattern-matching on stack traces
- Cross-run memory means the second scan is smarter than the first: known crashes are skipped, already-patched functions aren't re-analyzed, and effective harness patterns are recorded
- Full Duo platform integration: auto-committed patches via create_commit, structured code review via post_duo_code_review, and vulnerability tracking via create_vulnerability_issue, so the developer just reviews and merges
What we learned
The hardest part of autonomous fuzzing isn't actually fuzzing; it's harness generation. Knowing which functions to target, how to set up state, and how to interpret fuzzed bytes as structured input is where Claude adds the most value over traditional tooling. The compile-fix loop was the breakthrough that made this reliable.
Claude is surprisingly good at understanding input formats. Given a fuzz harness and source context, Claude reliably infers the expected input structure and generates diverse, protocol-aware seed corpora — something that previously required manual engineering per target.
Deterministic triage matters more than LLM triage. We initially used Claude to classify crash severity. Switching to keyword-based classification (heap-buffer-overflow + WRITE → CRITICAL, heap-buffer-overflow + READ → HIGH) made results reproducible across runs and eliminated LLM cost for the most latency-sensitive part of the pipeline.
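The two rules quoted above translate directly into a keyword lookup over the sanitizer report. In this sketch only those two mappings come from the text; the classify_severity name and the "UNKNOWN" placeholder for every other crash type are assumptions, and it relies on AddressSanitizer printing "WRITE of size" or "READ of size" for the faulting access.

```python
def classify_severity(asan_report: str) -> str:
    """Keyword-based severity for an AddressSanitizer report."""
    if "heap-buffer-overflow" in asan_report:
        if "WRITE of size" in asan_report:
            return "CRITICAL"   # out-of-bounds heap write
        if "READ of size" in asan_report:
            return "HIGH"       # out-of-bounds heap read
    # Placeholder: rules for other crash types are not given in the write-up.
    return "UNKNOWN"
```

The same report always yields the same label, which is the reproducibility property the paragraph above is after.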
Multi-turn tool use beats single-shot for patching. Our first patcher asked Claude to generate a patch from the stack trace alone. Accuracy improved significantly when we gave it tools to browse the codebase and localize the fault before generating a fix — a genuine agentic loop leveraging Claude's tool-use capabilities rather than a prompt-and-pray approach.
The Duo Agent Platform's scoped toolsets are a natural fit for security pipelines. Giving the analyze step read-only tools and the report step write tools (including create_commit and create_vulnerability_issue) enforces a principle-of-least-privilege architecture that mirrors how human security teams operate — analysts read, engineers patch, reporters file issues.
What's next for DELTA Cyber Reasoning System
- Deeper GitLab Duo integration: use post_duo_code_review for inline MR annotations on specific diff lines, integrate with GitLab's security dashboard widgets, and add merge-blocking approval gates for CRITICAL findings
- AFL++ integration: adding a second fuzzing engine for coverage-guided diversity and ensemble fuzzing alongside libFuzzer
- Advanced Claude tool use: expand the ReAct patcher with additional tools for running targeted test cases against patches and verifying fixes don't introduce regressions
- Broader language support: extending Claude-driven harness generation beyond C/C++ to Rust (unsafe blocks), Go (cgo), and other languages with memory safety concerns
- Corpus distillation: using coverage feedback from libFuzzer to prune and evolve the seed corpus across runs, keeping only inputs that reach new code paths
- Multi-project memory: sharing crash signatures and effective harness patterns across repositories within an organization, so a pattern learned on one project helps all others
- Duo flow conditional routing: when the platform supports it, skip the fuzz step if no security-relevant changes are detected, reducing cost for non-security MRs
Built With
- claude
- duo-agent
- duo-flow
- gitlab
- libfuzzer
- python