CONCOCTION
CONCOCTION is an automated machine learning-based vulnerability detection framework that combines static source code information and dynamic program execution traces.
Deep learning based vulnerability detection model
Introduction
CONCOCTION is an automated machine learning-based vulnerability detection framework. It is the first DL system to learn program representations by combining static source code information and dynamic program execution traces.
Check our paper for detailed information.
Installation
Concoction builds upon:
- Python v3.6
- KLEE v3.0
- LLVM v13.0
The system was tested on the following operating systems:
- Ubuntu 18.04
See INSTALL.md for further details.
Usage
See usage.md for a step-by-step demo of Concoction.
Data
This section describes the training and validation datasets used in the paper. You can download the dataset from here.
(We manually verified the original dataset and removed low-quality samples, which speeds up training without compromising the model's performance.)
Open datasets used in training and evaluation
The datasets used in our paper are organized into two folders:
- github: C functions from open-source C projects.
- sard: C functions from the SARD standard vulnerability dataset.
Data structure
All data is stored in .zip files. After decompression, you will find .txt files, each of which is the feature file for one C function.
Each feature file (e.g., 2ok_jpg.c-ok_jpg_convert_data_unit_grayscale.c.txt) includes static features (AST, CFG, DFG, and seven other edge types) and dynamic features (input variable values and execution traces).
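As a minimal sketch of working with the archives directly, the snippet below reads every .txt feature file from a .zip without unpacking it to disk. The function name `iter_feature_files` and the archive name in the usage note are hypothetical, chosen for illustration; only the ".zip containing .txt files" layout comes from the description above.

```python
import zipfile


def iter_feature_files(zip_path):
    """Yield (member_name, text) for every .txt feature file in the archive.

    Hypothetical helper: it assumes only that the archive contains .txt
    members, as described above. Other members are skipped.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                # Decode leniently so one odd byte does not abort the scan.
                yield name, archive.read(name).decode("utf-8", errors="replace")
```

For example, `for name, text in iter_feature_files("sard.zip"): ...` would walk one archive, where `sard.zip` is a placeholder for whichever file you downloaded.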
Description of an example feature file
| Items | Labels | Values |
|---|---|---|
| Vulnerability or not | -----label----- | 0/1 |
| Source code | -----code----- | static void ok_jpg_convert_d... |
| Code relationship flow edges | -----children----- | 1,2 1,3 ... 1,4 |
| Code relationship flow edges | -----nextToken----- | 2,4,7,9,10,13,15, |
| Code relationship flow edges | -----computeFrom----- | 42,43 42,44 69,70 ... |
| Code relationship flow edges | -----guardedBy----- | 90,92 101,102 101,103 ... |
| Code relationship flow edges | -----guardedByNegation----- | 124,125 125,126 125,127 ... |
| Code relationship flow edges | -----lastLexicalUse----- | 42,44 43,44 47,48 ... |
| Code relationship flow edges | -----jump----- | 21,22 21,23 23,24 ... |
| Node tokens | -----ast_node----- | const uint8_t *y const uint8_t uint8_t ... |
| ... | ... | ... |
| Input variable values | =======testcase======== | y_inc:0x00000000 x_inc:0x00000000 ... |
| Execution traces | =========trace========= | for(int x = 0;x < max_width;x++) out[0] = y[x]; out[1] = y[x]; ... |
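The layout in the table can be read with a small parser. The sketch below is not part of Concoction; it assumes that each label (for example `-----label-----` or `=======testcase========`) sits on its own line and is followed by its values on the next lines. The names `parse_feature_text` and `parse_edges` are hypothetical.

```python
import re

# Assumption: a marker is a full line made of a run of '-' or '=',
# a section name, and another run of '-' or '='.
MARKER = re.compile(r"^\s*(?:-{3,}|={3,})\s*(\w+)\s*(?:-{3,}|={3,})\s*$")


def parse_feature_text(text):
    """Split a feature file into {section_name: [non-empty content lines]}.

    If a section name occurs more than once (e.g. several test cases),
    its lines are appended to the same list.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        match = MARKER.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def parse_edges(lines):
    """Turn 'src,dst src,dst ...' lines into a list of (int, int) pairs."""
    edges = []
    for line in lines:
        for pair in line.split():
            src, dst = pair.split(",")
            edges.append((int(src), int(dst)))
    return edges
```

With these helpers, `parse_edges(parse_feature_text(text)["children"])` would give the edge list for the `children` section. `parse_edges` only fits sections written as space-separated pairs; `nextToken`, which the table shows as a single comma-separated sequence, would need its own handling.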
Main Results
A full list of code vulnerabilities discovered by Concoction can be found here.
Contributing
We welcome contributions to Concoction. If you are interested in contributing, please see this document.
Citation
If you use CONCOCTION in any of your work, please cite our paper:
@inproceedings{Concoction,
title={Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction},
author={Huanting Wang and Zhanyong Tang and Shin Hwei Tan and Jie Wang and Yuzhe Liu and Hejun Fang and Chunwei Xia and Zheng Wang},
booktitle={The IEEE/ACM 46th International Conference on Software Engineering (ICSE)},
year={2024},
}