Inspiration
While driving back from South Florida up to the University of Florida, we started thinking about how most LLM-based vulnerability detectors treat code as flat text. They pattern-match on surface features but never reason about what the code actually does. LLMs famously fail the viral "The car wash is only 100m away from my house, should I walk or drive?" question. We think that failure is an attention problem: Patel et al.'s "Repeat the Prompt Twice" showed that simply repeating the prompt dramatically improves accuracy for non-reasoning LLMs.
Half-jokingly, Jack suggested "just make the LLM be a virtual machine!" Let the model iterate and approximate a Turing machine itself. Give the LLM a computer that it can "feel."
Two ideas guided our experimentation and model development:
First, Alexia Jolicoeur-Martineau showed in Less is More that a tiny 7M-parameter recursive network can beat models thousands of times larger on reasoning tasks. The key: you don't need billions of parameters if you can iterate. A small network applied repeatedly builds up complex computation step by step.
Second, Percepta AI demonstrated in Can LLMs Be Computers? that transformers can be trained to literally execute C programs for millions of steps.
That made us wonder: what if we took a small recursive network, trained it to simulate a virtual machine that tracks memory state, and attached it to a large language model that understands code semantics? The model wouldn't just pattern-match vulnerabilities. It would better internalize them.
The final push came from wanting this to be useful beyond benchmarks. Developers already ignore static analyzers because of false positives. We wanted something that plugs directly into GitLab CI/CD and catches real vulnerabilities in merge requests before they ever hit production.
What it does
ExecFormer is a neural vulnerability detection system for C/C++ memory safety that integrates directly into GitLab CI/CD as a SAST scanner.
When a developer opens a merge request with C/C++ changes, our pipeline automatically scans the code and produces a native GitLab SAST security report, flagging vulnerabilities like double-frees, use-after-frees, out-of-bounds reads/writes, and memory leaks with confidence scores and CWE classifications.
The architecture:
Code input \(\rightarrow\) Gemma-3-27B (4-bit quantized) \(\rightarrow\) Token Gate (top-256) \(\rightarrow\) Projection (5376 \(\rightarrow\) 2048) \(\rightarrow\) LoopedTransformerBlock (2 layers \(\times\) 16 iterations, shared weights) \(\rightarrow\) AttentionPool \(\rightarrow\) Verdict + CWE heads
Key results on 306 real-world CVEs:
| Model | F1 | Precision | Recall |
|---|---|---|---|
| ExecFormer (ours) | 0.800 | 0.793 | 0.728 |
| R2Vul 1.5B | 0.780 | 0.762 | 0.798 |
| LineVul | 0.610 | - | - |
| Devign | 0.520 | - | - |
Our project also includes:
- A full web application with an interactive code scanner showing per-iteration model convergence in real time
- API documentation with live key generation
- A research blog explaining the architecture
- DeepPass: a separate research contribution on zero-cost LLM layer duplication
How we built it
Phase 1: Can neural networks execute programs?
We built a custom abstract virtual machine with opcodes (MALLOC, FREE, WRITE, READ, CHECK, PUSH, POP, ADD, SUB, BRANCH) and generated 500,000 synthetic programs with perfect ground-truth labels. A tiny looped transformer (231K parameters) trained on these reached:
- 98.8% accuracy on hard VM traces with branching, loops, and pointer operations
- 100% accuracy and 100% adversarial robustness on abstract interpretation programs
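The opcode list above is from our VM; the sketch below shows roughly how such an abstract VM can execute a trace and emit a ground-truth label. The semantics here are a simplification (the real generator also emits ADD/SUB/BRANCH/CHECK, loops, and pointer arithmetic), so `run_trace` and its labeling rules are illustrative:

```python
# Minimal sketch of the abstract VM used to generate labeled traces.
# Opcodes come from the writeup; exact semantics here are illustrative.

def run_trace(program):
    """Execute (opcode, arg) pairs; return a 'vulnerable'/'safe' label."""
    heap, stack, freed = {}, [], set()
    next_addr = 0
    for op, arg in program:
        if op == "MALLOC":
            heap[next_addr] = [0] * arg      # allocate `arg` cells
            stack.append(next_addr)
            next_addr += 1
        elif op == "FREE":
            if arg in freed or arg not in heap:
                return "vulnerable"          # double-free / invalid free
            freed.add(arg)
        elif op in ("WRITE", "READ"):
            addr, idx = arg
            if addr in freed:
                return "vulnerable"          # use-after-free
            cells = heap.get(addr)
            if cells is None or not (0 <= idx < len(cells)):
                return "vulnerable"          # out-of-bounds access
            if op == "WRITE":
                cells[idx] = 1
        elif op == "PUSH":
            stack.append(arg)
        elif op == "POP":
            stack.pop()
    if any(a not in freed for a in heap):
        return "vulnerable"                  # memory leak: never freed
    return "safe"
```

Because the labels come from the interpreter itself, every synthetic program has a perfect ground truth, e.g. `run_trace([("MALLOC", 2), ("WRITE", (0, 1)), ("FREE", 0)])` is safe while appending a second `("FREE", 0)` makes it a double-free.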
Linear probing confirmed the model was genuinely tracking execution state internally:
- Program counter tracking: \(R^2 = 0.991\)
- Stack depth tracking: \(R^2 = 0.925\)
- Cohen's d (class separation across iterations): 12.5 \(\rightarrow\) 17.1 \(\rightarrow\) 17.4 \(\rightarrow\) 20.4 \(\rightarrow\) 30.1
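Linear probing here means fitting a linear readout from the looped block's hidden states to a ground-truth quantity (program counter, stack depth) and reporting R². A minimal one-dimensional version (closed-form least squares; the real probes regress from the full hidden vector) looks like:

```python
def linear_probe_r2(hidden, target):
    """Fit target ~ a * hidden + b by least squares and return R^2.

    `hidden` stands in for one feature of the looped block's hidden
    state, `target` for the ground-truth signal (e.g. program counter).
    Illustrative: the real probes are multivariate.
    """
    n = len(hidden)
    mx, my = sum(hidden) / n, sum(target) / n
    sxx = sum((x - mx) ** 2 for x in hidden)
    sxy = sum((x - mx) * (y - my) for x, y in zip(hidden, target))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(hidden, target))
    ss_tot = sum((y - my) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot
```

An R² near 1 means the probed quantity is linearly decodable from the hidden state, which is the evidence behind the 0.991 program-counter result.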
Phase 2: Transfer to real code
We extracted token embeddings from real C code using Gemma-3-27B on the R2Vul dataset of real-world CVEs. A learned token gate selects the 256 most informative tokens; their embeddings are projected from 5,376 dimensions down to 2,048 and fed through our pre-trained looped transformer (2 layers \(\times\) 16 shared-weight iterations). Verdict and CWE classification heads sit on top.
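As a toy sketch of that pipeline (dimensions shrunk, all weights random stand-ins, and the nonlinearity/pooling choices here assumed rather than taken from our checkpoints):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real pipeline is 5376 -> 2048, top-256 tokens,
# 16 iterations of a shared-weight block.
d_in, d_model, n_tokens, top_k, n_iters = 32, 16, 100, 8, 16

gate_w = rng.normal(size=d_in)                              # token-gate scorer
proj = rng.normal(size=(d_in, d_model)) / np.sqrt(d_in)     # down-projection
block_w = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)  # shared block

def forward(token_embs):
    # 1. Token gate: keep the top-k highest-scoring tokens.
    scores = token_embs @ gate_w
    keep = np.argsort(scores)[-top_k:]
    # 2. Project the kept embeddings to the loop's width.
    h = token_embs[keep] @ proj
    # 3. Apply the SAME block repeatedly (shared weights, residual).
    for _ in range(n_iters):
        h = np.tanh(h @ block_w) + h
    # 4. Attention pool: score-weighted average over kept tokens.
    attn = np.exp(h @ np.ones(d_model) / d_model)
    return (attn[:, None] * h).sum(0) / attn.sum()

pooled = forward(rng.normal(size=(n_tokens, d_in)))  # one vector for the heads
```

The point of the structure is that step 3 reuses one small set of weights 16 times, which is where the iterative "execution" behavior from Phase 1 lives.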
Phase 3: End-to-end fine-tuning
LoRA adapters on the Gemma backbone, joint optimization with focal loss (\(\alpha = 0.75\), \(\gamma = 2.0\)), seed sweeps, and threshold tuning at 0.30 brought us to F1 = 0.800.
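Focal loss with those hyperparameters down-weights easy, confidently-classified examples so training focuses on the hard ones. For a single binary example it is:

```python
import math

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary focal loss for one example.

    p: predicted probability of the positive (vulnerable) class,
    y: true label (1 = vulnerable). alpha up-weights the positive
    class; gamma suppresses loss on confidently-correct examples.
    """
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)

# An easy positive (p=0.9) contributes far less than a hard one (p=0.3):
easy, hard = focal_loss(0.9, 1), focal_loss(0.3, 1)
```

At inference we then call an example vulnerable when p crosses the tuned 0.30 threshold rather than the default 0.5, trading some precision for recall.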
Deployment
We quantized Gemma-3-27B to 4-bit GGUF (17GB) using Google's QAT quantization, built a FastAPI backend that loads the model via llama-cpp-python, and created a Next.js 14 frontend. The GitLab CI integration uses a Python script that diffs changed C files, calls our API, and outputs a gl-sast-report.json in GitLab's schema 15.0.0 format.
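The heart of the CI script is converting our API's findings into that report format. A trimmed sketch (the `findings` shape and scanner metadata are our own; a full schema-15 report also carries analyzer/scanner details under `scan`):

```python
import json
import uuid

def to_sast_report(findings):
    """Convert API findings into a GitLab SAST report (schema 15.0.0).

    `findings` is illustrative: dicts with file, line, cwe, message,
    severity, as returned by our backend.
    """
    return {
        "version": "15.0.0",
        "vulnerabilities": [
            {
                "id": str(uuid.uuid4()),
                "category": "sast",
                "name": f["message"],
                "description": f["message"],
                "severity": f["severity"],  # e.g. "High"
                "identifiers": [
                    {"type": "cwe", "name": f"CWE-{f['cwe']}",
                     "value": str(f["cwe"])}
                ],
                "location": {"file": f["file"],
                             "start_line": f["line"],
                             "end_line": f["line"]},
            }
            for f in findings
        ],
        "scan": {"type": "sast", "status": "success"},
    }

report = to_sast_report([{"file": "src/alloc.c", "line": 42, "cwe": 415,
                          "message": "Possible double-free",
                          "severity": "High"}])
with open("gl-sast-report.json", "w") as fh:
    json.dump(report, fh, indent=2)
```

Because the artifact is declared as `reports: sast` in `.gitlab-ci.yml`, GitLab renders these findings natively in the merge-request security widget.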
DeepPass: Zero-cost layer duplication (bonus research)
While trying to recover accuracy lost to quantization, we came across David Ng's Repeat Yourself technique and extended it. We developed:
- Spectral screening (SBUID) to cheaply identify which layers benefit from duplication:
$$\text{SBUID} = \text{BLOOD}_{\text{impact}} - \lambda \cdot \rho$$
- Greedy iterative stacking to find complementary blocks that work together
- Per-layer alpha blending to control duplication strength:
$$h_{\text{out}} = h_1 + \alpha \cdot (h_2 - h_1)$$
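The blending formula above is a straight interpolation between the single-pass and double-pass hidden states. A hedged sketch (`layer` is a stand-in callable, not a real transformer block):

```python
def duplicated_forward(layer, h, alpha):
    """Run `layer` twice and blend: h_out = h1 + alpha * (h2 - h1).

    alpha = 0 recovers the original single pass; alpha = 1 is full
    duplication. `layer` stands in for one transformer block; since
    the same weights are reused, duplication costs no extra VRAM.
    """
    h1 = layer(h)   # normal pass
    h2 = layer(h1)  # second pass over the (slightly perturbed) output
    return [a + alpha * (b - a) for a, b in zip(h1, h2)]

# With a toy "layer" that doubles its input, the blend interpolates:
scale = lambda v: [2 * x for x in v]
out = duplicated_forward(scale, [1.0, 2.0], alpha=0.5)  # h1=[2,4], h2=[4,8]
```

Here `out` lands halfway between the two passes, at `[3.0, 6.0]`; greedy stacking then searches over which (layer, alpha) pairs to duplicate together.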
Results across architectures:
| Model | Baseline | Best Config | Gain |
|---|---|---|---|
| Qwen2-72B | 70.52 | Per-layer alpha triple | +13.55 |
| Gemma-3-27B | 80.54 | Triple (0,2)+(12,13)+(47,48) | +7.27 |
| Qwen3.5-27B | 42.86 | Triple | +37.19 |
Zero training. Zero extra VRAM. 46x fewer evaluations than brute force.
The mechanistic explanation: attention benefits from repetition (re-reading helps), but FFN layers store facts as associative memories that can be disrupted by the slightly perturbed input of a second pass. Correlation between gate instability and FFN harm: Pearson r = -0.89. This was inspired by 3Blue1Brown's explanation of how LLMs store facts in MLP layers.
Challenges we ran into
Fitting 27B parameters into real hardware. The full Gemma-3-27B is 102GB. It does not fit on a single L4 GPU (24GB VRAM) or the GCloud free trial's available machines. We went through several failed attempts: bitsandbytes 4-bit loading ran out of memory, mlx-lm had numpy incompatibilities, and loading fp16 tried to allocate 51GB on a 64GB machine. We eventually found Google's pre-quantized QAT GGUF at 17GB, loaded via llama-cpp-python.
Keeping the demo authentic. We committed early to never faking results. Every number on the website, every scan result, every per-iteration convergence chart comes from real model outputs. This meant building a caching pipeline on GPU infrastructure, transferring results to the web backend, and wiring up two separate data paths.
GitLab CI integration. The python:3.11-slim Docker image doesn't include git (needed for diffing changed files). Cloudflare tunnel URLs are temporary. Small things that add up when you're trying to ship.
Accomplishments that we're proud of
- F1 = 0.800 on 306 real-world CVEs, beating the previous SOTA with a fundamentally different architecture
- Per-iteration probing that lets you watch the model "think" across 16 loop iterations, converging on its verdict. No other vulnerability detector offers this.
- DeepPass started as a side investigation and became a standalone research contribution: +13.55 on a 72B model with zero training and zero VRAM overhead
- Native GitLab SAST integration that produces standard security reports, not just a research demo but a tool that fits into real developer workflows
- The probing results (\(R^2 = 0.991\) on program counter tracking) suggesting that shared-weight iteration genuinely learns abstract interpretation
What we learned
- Constraint management is the real challenge. Having access to powerful LLMs means the bottleneck shifts from "can we build it" to "can we make it fit." 4-bit quantization, GGUF formats, and creative deployment matter as much as model architecture.
- The gap between benchmark and tool is a CI/CD problem. GitLab's SAST report format made that gap surprisingly crossable.
- Shared-weight iteration is underexplored. A 231K parameter network that iterates can match or beat models thousands of times larger on structured reasoning tasks.
- The best research comes from constraints. DeepPass exists because we couldn't afford a bigger GPU. The FFN re-retrieval hypothesis came from asking "why does duplication help attention but hurt factual recall?"
- Sometimes it starts with a car ride. A somewhat facetious conversation gave us plenty to chew on, and it all came from thinking about real-world compute and financial constraints.
What's next for ExecFormer
- Adaptive gating for DeepPass: learning per-input whether to duplicate layers, so the model thinks harder on difficult code and skips easy cases
- Sublayer-selective duplication: repeating only the attention mechanism while skipping the FFN, reducing compute overhead by 35-65%
- Expanding CWE coverage beyond the current 5 memory safety classes to format strings, integer overflows, and race conditions
- Deeper GitLab integration: exploring GitLab Duo AI for inline fix suggestions and automated remediation
- Publishing: we plan to submit both ExecFormer and DeepPass as papers, with the interactive website as a living companion to the research
- Applying greedy stacking to ExecFormer's own backbone: duplicating optimal layers in Gemma-3-27B before extracting embeddings, potentially pushing F1 higher with zero additional training
- Expanding to more languages: supporting languages beyond C/C++ gives our architecture a shot at beating R2Vul, the current SOTA, across its full language set, and covering more of the sample space should surface additional vulnerabilities.
Built With
- c
- docker
- fastapi
- gcloud
- gitlab
- google-cloud
- llama-cpp-python
- lora
- nextjs
- optuna
- pydantic
- sqlite