Website: https://vnn-comp.github.io/#vnncomp2026
VNN-LIB Website and Details: https://www.vnnlib.org/
2026 References:
- Rules discussion repository: https://github.com/VNN-COMP/vnncomp2026/issues
Prior Benchmarks / Discussions / Results from 2025 for reference:
- Rules discussion repository: https://github.com/VNN-COMP/vnncomp2025/issues
- Benchmark repository: https://github.com/VNN-COMP/vnncomp2025_benchmarks
- Results repository: https://github.com/VNN-COMP/vnncomp2025_results
- Report: TBD
Changes and History:
- 2026: Initial draft copied from 2025, updated for SAIV 2026 @ FLoC 2026, July 24-25, 2026, Lisbon, Portugal
- Competition Contribution short papers (4-6 pages) will be included in the LNCS SAIV proceedings
- 2025: See 2025 rules
- 2024: Introduced two tracks (regular and extended), eliminated meta-solver definitions
VNN-COMP serves to evaluate neural network verification methods on benchmarks, including (1) how many properties can be verified or falsified across a set of benchmarks, and (2) how efficiently this is done within a given amount of computational resources. Scoring mechanisms are detailed later in the rules, but in essence points are awarded for proving or falsifying as many properties as possible within the given compute budget. Together these evaluate both precision and scalability, and identify challenges for the community.
The terminology mirrors that of SAT/SMT competitions, as the VNN-LIB language is based in part on SMT-LIB.
Instance: an input, a property, and a neural network model, together with a per-instance timeout
- Example: an MNIST classifier with 1 input image and a given local robustness threshold ε
Benchmark: a set of instances
- Example: a specific MNIST classifier with 100 input images and a given robustness threshold ε
Possible results of running a tool on each instance:
- unsat: all the constraints in the .vnnlib file cannot be met at the same time. The .vnnlib file typically encodes a counterexample (i.e., the negation of the desired property), so if no counterexample exists, the desired property holds and unsat should be produced. Intuition: an "unsafe" output is not possible.
- sat: all the constraints specified in the .vnnlib file can be met at the same time. A witness to a sat result is an assignment to the variables in the .vnnlib file such that all the constraints are true. Intuition: an "unsafe" output is possible from an allowed input.
- timeout: execution runtime exceeds per-instance timeout
- error: tool execution on instance results in error / crash
- unknown: other
| GT \ Returned result | unsat | sat | timeout | error | unknown |
|---|---|---|---|---|---|
| Unsat (holds) | Correct | Incorrect | 0 points | 0 points | 0 points |
| Sat (violated) | Incorrect | Correct | 0 points | 0 points | 0 points |
- Pre-Competition Phase: Tool authors prepare three scripts for their tool. See the README in the prior year repo for information on what each script does: https://github.com/stanleybak/vnncomp2021.
- Rules Discussion and Benchmark Solicitation Phase: We discuss and finalize rules with the participants. Tool authors and other participants will propose benchmarks and validate on a website that they are in the correct format. Benchmark instances should have .onnx networks and .vnnlib spec files, and each benchmark should contain a script to generate a .csv file of instances based on a random seed.
- Measurement Phase (unofficial): Tool authors will run their tool on all benchmarks by interacting with a website, using a known seed. This can be done multiple times to fix any errors that come up. The website is: https://vnncomp.christopher-brix.de/
- Measurement Phase (official): Organizers will run all tools on all benchmarks by interacting with a website, using a fresh random seed to select the benchmark instances.
- Reporting Phase: Following the presentation of results from the official measurement phase, participants will be asked to contribute text to a summary report (describing their tool or benchmark), which will be authored by the organizers and posted on arXiv. Discrepancies in results can be reported to the organizers and corrected in the arXiv report.
Input format: .onnx and .vnnlib
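For illustration, a minimal .vnnlib specification for a hypothetical network with two inputs and two outputs might look as follows. Note that, as described under the possible results, the file encodes the counterexample condition, i.e., the negation of the desired property (all variable names and bounds here are made up):

```
; hypothetical local robustness query: inputs confined to a small box
(declare-const X_0 Real)
(declare-const X_1 Real)
(declare-const Y_0 Real)
(declare-const Y_1 Real)

; input constraints
(assert (>= X_0 0.0))
(assert (<= X_0 0.05))
(assert (>= X_1 0.95))
(assert (<= X_1 1.0))

; "unsafe" output condition: the negation of the desired property Y_0 > Y_1
(assert (<= Y_0 Y_1))
```

If these constraints are jointly satisfiable, the tool reports sat (a counterexample exists); if they cannot all hold, it reports unsat (the property holds).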
Output format from execution run: a single plaintext output file containing the result string ("sat", "unsat", "timeout", "error", "unknown") on the first line. In the case of "sat", the remainder of the file contains an assignment to each variable (the witness). If "sat" is produced without a witness, the tool will be assessed a penalty. There is no penalty as long as executing the network with onnxruntime matches the witness up to a small numerical precision (mismatch is less than 1e-3 relative error) and the constraints are met up to 1e-4 absolute error. An example output file with the witness format is below:
sat
((X_0 0.02500000074505806)
(X_1 0.97500000000000000)
(Y_0 -0.03500000023705806)
(Y_1 0.32500000072225301)
(Y_2 0.02500000094505020))
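A witness in this format could be checked along the following lines. This is an illustrative sketch, not the official checker: the function names are made up, and the real check additionally runs the network with onnxruntime to obtain the computed outputs that the claimed outputs are compared against.

```python
import re


def parse_witness(text):
    """Parse a witness like ((X_0 0.025) (Y_0 -0.035) ...) into input/output lists."""
    pairs = re.findall(r"\(([XY])_(\d+)\s+([-+0-9.eE]+)\)", text)
    xs = sorted((int(i), float(v)) for var, i, v in pairs if var == "X")
    ys = sorted((int(i), float(v)) for var, i, v in pairs if var == "Y")
    return [v for _, v in xs], [v for _, v in ys]


def outputs_match(claimed, computed, rel_tol=1e-3):
    """True if each claimed output agrees with the computed one up to rel_tol relative error."""
    return all(abs(c - o) <= rel_tol * abs(o) for c, o in zip(claimed, computed))
```

For example, the parsed Y values from the witness would be compared against the outputs obtained by evaluating the .onnx network on the parsed X values.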
Verification instance execution process: Verification instances will be executed one by one, alternating between an instance preparation script and an instance execution script that actually checks the spec.
- Execution is done on AWS.
- Note: this section is tentative: we are seeing if we can acquire the instances proposed here and other instance types can be proposed in the rules discussion phase.
- We will have three cloud evaluation platforms: one CPU-focused, one GPU-focused, and one "balanced" instance. These will have roughly the same cost per hour. Each tool's authors will choose one platform to participate on for all benchmarks. The instances are (tentative and subject to availability; if you'd like to request others, please comment in the github issues):
- CPU: m5.16xlarge, $3.072/hour, 64 vCPUs, 256 GB memory
- GPU: p3.2xlarge, $3.06/hour, 8 vCPUs, 61 GB memory, 1x V100 GPU
- Balanced: g5.8xlarge, $2.44/hour, 32 vCPUs, 128 GB memory
- Reasoning: we have gone with roughly cost equivalent instances as the comparison mechanism, given some tools use CPU only and some use GPU. There are pros/cons of this mechanism.
- Per instance: any verification instance will time out after at most X minutes, as determined by the benchmark proposer. Timeouts can differ between instances.
- Per benchmark: the runtime sum of all verification instances in a benchmark will take at most 6 hours.
For example, a benchmark proposal could have six instances with a one-hour timeout, or 100 instances each with a 3.6 minute timeout.
For reporting and/or awards, benchmarks may be further divided into groups by the organizers (for example, image classification, control, novel vs. existing benchmarks, etc.), but these will not influence scoring.
Each research group may propose multiple tools, but each tool must differ substantially from the others; for example, the same tool with different parameter or heuristic settings is not permitted, with edge cases decided by the organizers in discussion with the participants. Reasoning: otherwise, participants could submit many variants of the same tool with different heuristic parameters.
Meta-solvers: Definitions were eliminated in 2024, as the situation had not arisen in practice and defining meta-solvers based on some percentage of code difference is likely infeasible. If tool participants reuse other libraries and build upon others' research, they should exhibit standard academic/scholarly conduct and courtesy by acknowledging the dependencies in the report and tool description, as well as in any presentation overview. We will collect dependencies in the benchmark voting submission.
Two tracks will exist for scoring: a regular track and an extended track. The regular track will consist of benchmarks for which >=50% of tool participants vote for the benchmark to be scored. The extended track will consist of benchmarks with at least one vote to be scored that are not in the regular track. Benchmarks without any votes will be unscored, but will be described in the report.
Reasoning: tool participants should be rewarded for supporting general architectures and incentivized to handle as many benchmarks as possible (hence a single vote suffices for inclusion in the extended track). At the same time, if only one or a few tools support a particular benchmark, the event is more of a challenge than a competition, and does not necessarily evaluate which tool performs best across a set of benchmarks relative to other tools that can also analyze that benchmark.
Anyone may propose benchmarks and there are no limits on the numbers of proposed benchmarks. Any collaborations between benchmark proposers and tool participants should be disclosed, and will be collected in the benchmark voting process.
Non-tool participants (such as industry groups interested in using verification tools) can also propose benchmarks.
For classification benchmarks from standard datasets (for example, MNIST or CIFAR), proposed benchmarks should be generic enough to run on any image from the dataset, rather than tailored to a specific subset of images. For the competition measurements in these cases, the organizers will use a random seed to choose the subset of images considered. For non-image benchmarks, the benchmark proposer should also provide a way to randomly select / mutate the benchmark based on a seed. All benchmarks will contain a script to generate the .csv listing each benchmark instance based on a passed-in random seed. You can see how the seed selection was done in a previous year in this file.
To propose a new benchmark, please create a public git repository with all the necessary code and post in the github issues with the benchmark details and links.
The repository must be structured as follows:
- It must contain a generate_properties.py file which accepts the seed as the only command-line parameter.
- There must be a folder with all .vnnlib files, which may be identical to the folder containing the generate_properties.py file.
- There must be a folder with all .onnx files, which may be identical to the folder containing the generate_properties.py file.
- The generate_properties.py file will be run using Python 3.8 on a t2.large AWS instance.
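As a sketch, a generate_properties.py entry point might look like this. All specifics here are illustrative assumptions, not requirements: the file names, the instance count, the CSV name, and the assumption that each CSV row lists an onnx path, a vnnlib path, and a per-instance timeout in seconds.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a generate_properties.py (not an official template)."""
import csv
import random
import sys


def generate_instances(seed, num_instances=10, timeout=300):
    """Deterministically pick instances from a seed (hypothetical image benchmark)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_instances):
        image_idx = rng.randrange(10_000)  # e.g., index into a test set
        rows.append(("onnx/model.onnx", f"vnnlib/prop_{image_idx}.vnnlib", timeout))
    return rows


# Guarded so the file can also be imported without command-line arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    seed = int(sys.argv[1])  # the seed is the only command-line parameter
    with open("instances.csv", "w", newline="") as f:
        csv.writer(f).writerows(generate_instances(seed))
```

The key property is determinism: the same seed must always produce the same instance list, so that the organizers can reproduce the official selection.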
Each instance is scored as follows:
- Correct unsat: 10 points
- Correct sat: 10 points
- Incorrect result or penalty: -150 points
- Timeout / Error / Unknown: 0 points
Time bonus: There is no time bonus. Based on analysis from prior years, this did not make a difference in the overall score. Participants still have incentive to optimize their code to ensure analysis completes before the instance timeout is reached.
The benchmark score for each category is a percent, computed as 100 times the sum of a tool's individual instance scores for that benchmark category, divided by the maximum such sum achieved by any tool for that category (so the tool with the highest sum of instance scores for a category gets 100%).
The total overall score for a tool is the sum of benchmark scores (sum of percents), so that each benchmark receives an equal weight.
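The scoring scheme above can be sketched as follows. This is illustrative only, not the official scorer, and it simplifies edge cases (for example, benchmarks where no tool scores positive points are treated as 0% for everyone):

```python
# Points per instance, as defined in the rules above.
INSTANCE_POINTS = {
    "correct_unsat": 10,
    "correct_sat": 10,
    "incorrect": -150,
    "timeout": 0,
    "error": 0,
    "unknown": 0,
}


def benchmark_percents(points):
    """points: {tool: {benchmark: summed instance points}} -> percents per benchmark.

    Each tool's sum is normalized by the best sum any tool achieved on that benchmark.
    """
    benchmarks = {b for per_tool in points.values() for b in per_tool}
    percents = {tool: {} for tool in points}
    for b in benchmarks:
        best = max(per_tool.get(b, 0) for per_tool in points.values())
        for tool, per_tool in points.items():
            pts = per_tool.get(b, 0)
            percents[tool][b] = 100.0 * pts / best if best > 0 else 0.0
    return percents


def overall_scores(points):
    """Total = sum of benchmark percents, so every benchmark carries equal weight."""
    return {tool: sum(bs.values()) for tool, bs in benchmark_percents(points).items()}
```

Note how the per-benchmark normalization means a small benchmark counts just as much toward the total as a large one.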
Prior to the finalization of benchmarks, tools that are participating in the competition may propose changes to the rules in this document. This is done on the "rules discussion" github issue. A team should create a post with a "motion" that outlines a specific proposed rule change (for example, delete the sentence "xyz" and replace it with sentence "ijk"). If the motion is seconded by another team (by replying on the github issue), then the organizers shall create a voting mechanism such as a google form where each tool participating in the competition gets one vote. Results will be made public. More than half of the participating tools must actively vote in favor of the motion for the rule change to be accepted. The organizers will resolve any rule disputes if consensus cannot be reached through discussion or voting.
Subject to sponsorship availability and legal/tax constraints, a cash prize will be provided for the tool with the top overall score (i.e., the tool with the largest sum of benchmark scores). Award certificates will be provided for the top tool on each benchmark and/or category. Ties will be broken by summing total runtime over all solved instances. Tools with any developers or authors who are VNN-COMP organizers or financial sponsors cannot receive cash awards; they may still win categories or the overall competition and receive award certificates, but not cash prizes. A "best falsifier" award will be considered by the organizers if a tool stands out in this category but not others. Other awards may be given at the discretion of the organizers.
Although we will take some measures to detect cheating, we ultimately hope the participants stay true to the friendly nature of the competition and do not try to get an unfair advantage. For example, tools should not output "unsat" on instances unless the tool actually proves the property holds. Violations to the honor code will be resolved by the organizers on a case-by-case basis.
An updated timeline is on this year's website: https://vnn-comp.github.io/#vnncomp2026
Organizers may modify rules to ensure a smooth competition, overcome unforeseen problems, encourage participation and to resolve any ambiguities in the spirit of the competition. This year's organizers are: Taylor T. Johnson, Stanley Bak, Christopher Brix, Tobias Ladner, Lukas Koller, Konstantin Kaulen, Edoardo Manino, Thomas Flinkow, Haoze Wu, Hai Duong, and ThanhVu H. Nguyen. Organizers may also submit tools and participate in the competition, but neither they nor their team can receive cash prizes.
To facilitate automatic evaluation on the cloud, we need your tool in a standard format along with some bash scripts to set up and run it. Note that your scripts may download resources from the web, such as by cloning git repositories. For future-proofing, cloning a specific commit or tag is the preferred way to download the tool.
Submission to VNN-COMP is done through the website: https://vnncomp.christopher-brix.de/
Networks will be provided in an .onnx file format that conforms to the VNN-LIB standard. Properties are provided in a .vnnlib file. The .onnx and .vnnlib files will be downloaded to the home folder by cloning a repo with benchmarks. For example, the test network / property will be in ~/vnncomp_repo/benchmarks/test/test.onnx and ~/vnncomp_repo/benchmarks/test/test.vnnlib.
Each tool author will be given (indirect) access to an Amazon cloud instance running a choice of Amazon AMI (operating system image), where they can also download any required licenses for their tool. Contact the evaluation chairs if you want to add a different AMI than what is available on the website.
The tool's scripts will then be copied to the home folder. The evaluation platform will then run an 'install_tool.sh' script to install the tool, and then, for each benchmark instance, scripts will call 'prepare_instance.sh' followed by 'run_instance.sh', which should produce a results file.
If you want to skip a benchmark category entirely, you can have prepare_instance.sh return a nonzero value (the category is passed in as a command-line argument).
Three scripts should be provided for each tool.
- install_tool.sh: takes a single argument, "v1", a version string. This script is executed once to, for example, download any dependencies for your tool, compile any files, or set up any required licenses (if this can be automated). Note that some licenses cannot be retrieved automatically, so tool authors will be responsible for a manual step to obtain them prior to running any scripts.
- prepare_instance.sh: takes four arguments: first "v1", second a benchmark identifier string such as "acasxu", third the path to the .onnx file, and fourth the path to the .vnnlib file. This script prepares the benchmark instance for evaluation (for example, converting the .onnx file to PyTorch, or reading the .vnnlib file to generate C++ source code for the specific property and compiling it with gcc). You can also use this script to ensure that the system is in a good state before measuring the next instance (for example, that no zombie processes from previous runs are executing and the GPU is available for use). This script should not do any analysis. A reasonable timeout (10 minutes) will be enforced for this script; if exceeded, the tool's result will be considered "unknown" for the corresponding instance. The benchmark name is provided because per-benchmark tuning/settings are permitted (per-instance settings are not, so do NOT use the .onnx or .vnnlib filename to customize the verification tool settings).
- run_instance.sh: takes six arguments: first "v1", second a benchmark identifier string such as "acasxu", third the path to the .onnx file, fourth the path to the .vnnlib file, fifth the path to the results file, and sixth a timeout in seconds. The script will be killed if it exceeds the timeout by too much, but quitting gracefully is sometimes better if you want to release resources such as GPUs cleanly. The results file should exist after the script runs and is a simple text file whose first line contains the result string: "sat", "unsat", "timeout", "error", or "unknown" (followed by the witness in the "sat" case, as described in the output format above).
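As an illustrative sketch, a run_instance.sh could be structured as follows. The "./my_verifier" command and its flags are hypothetical placeholders for the actual tool invocation; the real script would simply end with run_instance "$@".

```shell
#!/bin/bash
# Illustrative run_instance.sh sketch; "./my_verifier" and its flags are placeholders.
# Arguments: v1 <benchmark> <onnx_file> <vnnlib_file> <results_file> <timeout_secs>

run_instance() {
    local version="$1" benchmark="$2" onnx="$3" vnnlib="$4" results="$5" tmo="$6"

    [ -n "$results" ] || return 1          # refuse to run without a results path
    if [ "$version" != "v1" ]; then
        echo "unknown" > "$results"
        return 0
    fi

    # Give the tool a slightly shorter internal deadline so it can exit
    # gracefully (e.g., release the GPU) before the hard kill at $tmo seconds.
    timeout "$tmo" ./my_verifier --net "$onnx" --spec "$vnnlib" \
        --timeout "$((tmo - 5))" --result "$results"

    # If the tool was killed or wrote nothing, report a timeout.
    [ -s "$results" ] || echo "timeout" > "$results"
}

# In the real script: run_instance "$@"
```

The fallback on the last line of the function ensures a results file always exists, so a hard-killed run is scored as "timeout" rather than as a missing result.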
Please post any issues on the corresponding github issue in this repo: https://github.com/VNN-COMP/vnncomp2026/issues