Inspiration
While driving back from South Florida up to the University of Florida, we started thinking about how most LLM-based vulnerability detectors treat code as flat text. They pattern-match on surface features but never reason about what the code actually does. LLMs famously fail the viral "The car wash is only 100m away from my house, should I walk or drive?" question. We think that failure is an attention problem: Patel et al.'s "Repeat the Prompt Twice" showed that simply repeating the prompt dramatically improves accuracy for non-reasoning LLMs.
Half-jokingly, Jack suggested "just make the LLM be a virtual machine!" Let the model iterate and approximate a Turing machine itself. Give the LLM a computer that it can "feel."
Two ideas guided our experimentation and model development:
First, Alexia Jolicoeur-Martineau showed in Less is More that a tiny 7M-parameter recursive network can beat models thousands of times larger on reasoning tasks. The key: you don't need billions of parameters if you can iterate. A small network applied repeatedly builds up complex computation step by step.
Second, Percepta AI demonstrated in Can LLMs Be Computers? that transformers can be trained to literally execute C programs for millions of steps.
That made us wonder: what if we took a small recursive network, trained it to simulate a virtual machine that tracks memory state, and attached it to a large language model that understands code semantics? The model wouldn't just pattern-match vulnerabilities. It would better internalize them.
The final push came from wanting this to be useful beyond benchmarks. Developers already ignore static analyzers because of false positives. We wanted something that plugs directly into GitLab CI/CD and catches real vulnerabilities in merge requests before they ever hit production.
What it does
ExecFormer is a neural vulnerability detection system for C/C++ memory safety that integrates directly into GitLab CI/CD as a SAST scanner.
When a developer opens a merge request with C/C++ changes, our pipeline automatically scans the code and produces a native GitLab SAST security report, flagging vulnerabilities like double-frees, use-after-frees, out-of-bounds reads/writes, and memory leaks with confidence scores and CWE classifications.
The architecture:
Code input \(\rightarrow\) Gemma-3-27B (4-bit quantized) \(\rightarrow\) Token Gate (top-256) \(\rightarrow\) Projection (5376 \(\rightarrow\) 2048) \(\rightarrow\) LoopedTransformerBlock (2 layers \(\times\) 16 iterations, shared weights) \(\rightarrow\) AttentionPool \(\rightarrow\) Verdict + CWE heads
Key results on 306 real-world CVEs:
| Model | F1 | Precision | Recall |
|---|---|---|---|
| ExecFormer (ours) | 0.800 | 0.793 | 0.728 |
| R2Vul 1.5B | 0.780 | 0.762 | 0.798 |
| LineVul | 0.610 | - | - |
| Devign | 0.520 | - | - |
Our project also includes:
- A full web application with an interactive code scanner showing per-iteration model convergence in real time
- API documentation with live key generation
- A research blog explaining the architecture
- DeepPass: a separate research contribution on zero-cost LLM layer duplication
How we built it
Phase 1: Can neural networks execute programs?
We built a custom abstract virtual machine with opcodes (MALLOC, FREE, WRITE, READ, CHECK, PUSH, POP, ADD, SUB, BRANCH) and generated 500,000 synthetic programs with perfect ground-truth labels. A tiny looped transformer (231K parameters) trained on these reached:
- 98.8% accuracy on hard VM traces with branching, loops, and pointer operations
- 100% accuracy and 100% adversarial robustness on abstract interpretation programs
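The opcode list above is from our VM; the sketch below shows roughly how such an abstract VM can execute a trace and emit a ground-truth label. The semantics here are a simplification (the real generator also emits ADD/SUB/BRANCH/CHECK, loops, and pointer arithmetic), so `run_trace` and its labeling rules are illustrative:

```python
# Minimal sketch of the abstract VM used to generate labeled traces.
# Opcodes come from the writeup; exact semantics here are illustrative.

def run_trace(program):
    """Execute (opcode, arg) pairs; return a 'vulnerable'/'safe' label."""
    heap, stack, freed = {}, [], set()
    next_addr = 0
    for op, arg in program:
        if op == "MALLOC":
            heap[next_addr] = [0] * arg      # allocate `arg` cells
            stack.append(next_addr)
            next_addr += 1
        elif op == "FREE":
            if arg in freed or arg not in heap:
                return "vulnerable"          # double-free / invalid free
            freed.add(arg)
        elif op in ("WRITE", "READ"):
            addr, idx = arg
            if addr in freed:
                return "vulnerable"          # use-after-free
            cells = heap.get(addr)
            if cells is None or not (0 <= idx < len(cells)):
                return "vulnerable"          # out-of-bounds access
            if op == "WRITE":
                cells[idx] = 1
        elif op == "PUSH":
            stack.append(arg)
        elif op == "POP":
            stack.pop()
    if any(a not in freed for a in heap):
        return "vulnerable"                  # memory leak: never freed
    return "safe"
```

Because the labels come from the interpreter itself, every synthetic program has a perfect ground truth, e.g. `run_trace([("MALLOC", 2), ("WRITE", (0, 1)), ("FREE", 0)])` is safe while appending a second `("FREE", 0)` makes it a double-free.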
Linear probing confirmed the model was genuinely tracking execution state internally:
- Program counter tracking: \(R^2 = 0.991\)
- Stack depth tracking: \(R^2 = 0.925\)
- Cohen's d (class separation across iterations): 12.5 \(\rightarrow\) 17.1 \(\rightarrow\) 17.4 \(\rightarrow\) 20.4 \(\rightarrow\) 30.1
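Linear probing here means fitting a linear readout from the looped block's hidden states to a ground-truth quantity (program counter, stack depth) and reporting R². A minimal one-dimensional version (closed-form least squares; the real probes regress from the full hidden vector) looks like:

```python
def linear_probe_r2(hidden, target):
    """Fit target ~ a * hidden + b by least squares and return R^2.

    `hidden` stands in for one feature of the looped block's hidden
    state, `target` for the ground-truth signal (e.g. program counter).
    Illustrative: the real probes are multivariate.
    """
    n = len(hidden)
    mx, my = sum(hidden) / n, sum(target) / n
    sxx = sum((x - mx) ** 2 for x in hidden)
    sxy = sum((x - mx) * (y - my) for x, y in zip(hidden, target))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(hidden, target))
    ss_tot = sum((y - my) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot
```

An R² near 1 means the probed quantity is linearly decodable from the hidden state, which is the evidence behind the 0.991 program-counter result.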
Phase 2: Transfer to real code
We extracted token embeddings from real C code using Gemma-3-27B on the R2Vul dataset of real-world CVEs. A learned token gate selects the 256 most informative tokens; their embeddings are projected from 5,376 dimensions down to 2,048 and fed through our pre-trained looped transformer (2 layers \(\times\) 16 shared-weight iterations). Verdict and CWE classification heads sit on top.
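As a toy sketch of that pipeline (dimensions shrunk, all weights random stand-ins, and the nonlinearity/pooling choices here assumed rather than taken from our checkpoints):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real pipeline is 5376 -> 2048, top-256 tokens,
# 16 iterations of a shared-weight block.
d_in, d_model, n_tokens, top_k, n_iters = 32, 16, 100, 8, 16

gate_w = rng.normal(size=d_in)                              # token-gate scorer
proj = rng.normal(size=(d_in, d_model)) / np.sqrt(d_in)     # down-projection
block_w = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)  # shared block

def forward(token_embs):
    # 1. Token gate: keep the top-k highest-scoring tokens.
    scores = token_embs @ gate_w
    keep = np.argsort(scores)[-top_k:]
    # 2. Project the kept embeddings to the loop's width.
    h = token_embs[keep] @ proj
    # 3. Apply the SAME block repeatedly (shared weights, residual).
    for _ in range(n_iters):
        h = np.tanh(h @ block_w) + h
    # 4. Attention pool: score-weighted average over kept tokens.
    attn = np.exp(h @ np.ones(d_model) / d_model)
    return (attn[:, None] * h).sum(0) / attn.sum()

pooled = forward(rng.normal(size=(n_tokens, d_in)))  # one vector for the heads
```

The point of the structure is that step 3 reuses one small set of weights 16 times, which is where the iterative "execution" behavior from Phase 1 lives.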
Phase 3: End-to-end fine-tuning
LoRA adapters on the Gemma backbone, joint optimization with focal loss (\(\alpha = 0.75\), \(\gamma = 2.0\)), seed sweeps, and threshold tuning at 0.30 brought us to F1 = 0.800.
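Focal loss with those hyperparameters down-weights easy, confidently-classified examples so training focuses on the hard ones. For a single binary example it is:

```python
import math

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary focal loss for one example.

    p: predicted probability of the positive (vulnerable) class,
    y: true label (1 = vulnerable). alpha up-weights the positive
    class; gamma suppresses loss on confidently-correct examples.
    """
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)

# An easy positive (p=0.9) contributes far less than a hard one (p=0.3):
easy, hard = focal_loss(0.9, 1), focal_loss(0.3, 1)
```

At inference we then call an example vulnerable when p crosses the tuned 0.30 threshold rather than the default 0.5, trading some precision for recall.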
Deployment
We quantized Gemma-3-27B to 4-bit GGUF (17GB) using Google's QAT quantization, built a FastAPI backend that loads the model via llama-cpp-python, and created a Next.js 14 frontend. The GitLab CI integration uses a Python script that diffs changed C files, calls our API, and outputs a gl-sast-report.json in GitLab's schema 15.0.0 format.
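The heart of the CI script is converting our API's findings into that report format. A trimmed sketch (the `findings` shape and scanner metadata are our own; a full schema-15 report also carries analyzer/scanner details under `scan`):

```python
import json
import uuid

def to_sast_report(findings):
    """Convert API findings into a GitLab SAST report (schema 15.0.0).

    `findings` is illustrative: dicts with file, line, cwe, message,
    severity, as returned by our backend.
    """
    return {
        "version": "15.0.0",
        "vulnerabilities": [
            {
                "id": str(uuid.uuid4()),
                "category": "sast",
                "name": f["message"],
                "description": f["message"],
                "severity": f["severity"],  # e.g. "High"
                "identifiers": [
                    {"type": "cwe", "name": f"CWE-{f['cwe']}",
                     "value": str(f["cwe"])}
                ],
                "location": {"file": f["file"],
                             "start_line": f["line"],
                             "end_line": f["line"]},
            }
            for f in findings
        ],
        "scan": {"type": "sast", "status": "success"},
    }

report = to_sast_report([{"file": "src/alloc.c", "line": 42, "cwe": 415,
                          "message": "Possible double-free",
                          "severity": "High"}])
with open("gl-sast-report.json", "w") as fh:
    json.dump(report, fh, indent=2)
```

Because the artifact is declared as `reports: sast` in `.gitlab-ci.yml`, GitLab renders these findings natively in the merge-request security widget.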
DeepPass: Zero-cost layer duplication (bonus research)
While trying to recover accuracy lost to quantization, we came across David Ng's Repeat Yourself technique and extended it. We developed:
- Spectral screening (SBUID) to cheaply identify which layers benefit from duplication:
$$\text{SBUID} = \text{BLOOD}_{\text{impact}} - \lambda \cdot \rho$$
- Greedy iterative stacking to find complementary blocks that work together
- Per-layer alpha blending to control duplication strength:
$$h_{\text{out}} = h_1 + \alpha \cdot (h_2 - h_1)$$
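The blending formula above is a straight interpolation between the single-pass and double-pass hidden states. A hedged sketch (`layer` is a stand-in callable, not a real transformer block):

```python
def duplicated_forward(layer, h, alpha):
    """Run `layer` twice and blend: h_out = h1 + alpha * (h2 - h1).

    alpha = 0 recovers the original single pass; alpha = 1 is full
    duplication. `layer` stands in for one transformer block; since
    the same weights are reused, duplication costs no extra VRAM.
    """
    h1 = layer(h)   # normal pass
    h2 = layer(h1)  # second pass over the (slightly perturbed) output
    return [a + alpha * (b - a) for a, b in zip(h1, h2)]

# With a toy "layer" that doubles its input, the blend interpolates:
scale = lambda v: [2 * x for x in v]
out = duplicated_forward(scale, [1.0, 2.0], alpha=0.5)  # h1=[2,4], h2=[4,8]
```

Here `out` lands halfway between the two passes, at `[3.0, 6.0]`; greedy stacking then searches over which (layer, alpha) pairs to duplicate together.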
Results across architectures:
| Model | Baseline | Best Config | Gain |
|---|---|---|---|
| Qwen2-72B | 70.52 | Per-layer alpha triple | +13.55 |
| Gemma-3-27B | 80.54 | Triple (0,2)+(12,13)+(47,48) | +7.27 |
| Qwen3.5-27B | 42.86 | Triple | +37.19 |
Zero training. Zero extra VRAM. 46x fewer evaluations than brute force.
The mechanistic explanation: attention benefits from repetition (re-reading helps), but FFN layers store facts as associative memories that can be disrupted by the slightly perturbed input of a second pass. Correlation between gate instability and FFN harm: Pearson r = -0.89. This was inspired by 3Blue1Brown's explanation of how LLMs store facts in MLP layers.
Challenges we ran into
Fitting 27B parameters into real hardware. The full Gemma-3-27B is 102GB. It does not fit on a single L4 GPU (24GB VRAM) or the GCloud free trial's available machines. We went through several failed attempts: bitsandbytes 4-bit loading ran out of memory, mlx-lm had numpy incompatibilities, and loading fp16 tried to allocate 51GB on a 64GB machine. We eventually found Google's pre-quantized QAT GGUF at 17GB, loaded via llama-cpp-python.
Keeping the demo authentic. We committed early to never faking results. Every number on the website, every scan result, every per-iteration convergence chart comes from real model outputs. This meant building a caching pipeline on GPU infrastructure, transferring results to the web backend, and wiring up two separate data paths.
GitLab CI integration. The python:3.11-slim Docker image doesn't include git (needed for diffing changed files). Cloudflare tunnel URLs are temporary. Small things that add up when you're trying to ship.
Accomplishments that we're proud of
- F1 = 0.800 on 306 real-world CVEs, beating the previous SOTA with a fundamentally different architecture
- Per-iteration probing that lets you watch the model "think" across 16 loop iterations, converging on its verdict. No other vulnerability detector offers this.
- DeepPass started as a side investigation and became a standalone research contribution: +13.55 on a 72B model with zero training and zero VRAM overhead
- Native GitLab SAST integration that produces standard security reports, not just a research demo but a tool that fits into real developer workflows
- The probing results (\(R^2 = 0.991\) on program counter tracking) suggesting that shared-weight iteration genuinely learns abstract interpretation
What we learned
- Constraint management is the real challenge. Having access to powerful LLMs means the bottleneck shifts from "can we build it" to "can we make it fit." 4-bit quantization, GGUF formats, and creative deployment matter as much as model architecture.
- The gap between benchmark and tool is a CI/CD problem. GitLab's SAST report format made that gap surprisingly crossable.
- Shared-weight iteration is underexplored. A 231K parameter network that iterates can match or beat models thousands of times larger on structured reasoning tasks.
- The best research comes from constraints. DeepPass exists because we couldn't afford a bigger GPU. The FFN re-retrieval hypothesis came from asking "why does duplication help attention but hurt factual recall?"
- Sometimes it starts with a car ride. A somewhat facetious conversation gave us plenty to chew on, and it all came from thinking about real-world compute and financial constraints.
What's next for ExecFormer
- Adaptive gating for DeepPass: learning per-input whether to duplicate layers, so the model thinks harder on difficult code and skips easy cases
- Sublayer-selective duplication: repeating only the attention mechanism while skipping the FFN, reducing compute overhead by 35-65%
- Expanding CWE coverage beyond the current 5 memory safety classes to format strings, integer overflows, and race conditions
- Deeper GitLab integration: exploring GitLab Duo AI for inline fix suggestions and automated remediation
- Publishing: we plan to submit both ExecFormer and DeepPass as papers, with the interactive website as a living companion to the research
- Applying greedy stacking to ExecFormer's own backbone: duplicating optimal layers in Gemma-3-27B before extracting embeddings, potentially pushing F1 higher with zero additional training
- Expanding to more languages: supporting languages beyond C/C++ gives our architecture a shot at beating R2Vul, the current SOTA, across its full language set, and covering more of the sample space should surface additional vulnerabilities.
Built With
- c
- docker
- fastapi
- gcloud
- gitlab
- google-cloud
- llama-cpp-python
- lora
- nextjs
- optuna
- pydantic
- sqlite