Research in neural network reliability and maintainability requires diverse benchmark datasets to ensure effective evaluation and testing. However, publicly available NN datasets are scarce, domain-specific, and lack the architectural diversity needed for comprehensive evaluation. We leveraged GPT-5 to automatically generate a diverse dataset of neural networks to support verification, refactoring, and migration research.

The dataset contains 608 neural networks implemented in PyTorch, each defined by explicit design choices across four key dimensions: architecture type, task category, input data characteristics, and model complexity. All networks are validated through static analysis and symbolic tracing to ensure they are correct and executable. The complete dataset is publicly available on GitHub.

This blog post provides an overview of the dataset. For complete technical details, please refer to our paper co-authored by Nadia Daoudi and Jordi Cabot.

Dataset Generation Workflow

We used GPT-5 to automatically generate neural networks in three steps, depicted in Figure 1.

Figure 1: Overview of the LLM-driven NN dataset generation

For each neural network, the LLM is given a prompt that specifies design choices across four key dimensions:

  • Architecture type: MLP, CNN-1D, CNN-2D, CNN-3D, RNN-Simple, RNN-LSTM, and RNN-GRU
  • Task category: Binary-classification, Multiclass-classification, Regression, and Representation-learning
  • Input data type and scale: Tabular, Time Series, Text, and Image, each paired with a corresponding input scale
  • Model complexity: Simple, Wide, Deep, and Wide-Deep

The prompts guide the LLM to produce complete PyTorch implementations that satisfy the instructions on the design choices. By systematically varying these dimensions, we generated 608 neural networks spanning multiple architectures and configurations.
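To illustrate the systematic variation, the sketch below enumerates one prompt per combination of the four dimensions. The prompt wording and the `build_prompt` helper are hypothetical (the actual prompts are given in the paper), and input scales are omitted for brevity; varying them is what brings the total from the 448 base combinations up to 608 networks.

```python
from itertools import product

# The four design dimensions listed above (input scales omitted).
ARCHITECTURES = ["MLP", "CNN-1D", "CNN-2D", "CNN-3D",
                 "RNN-Simple", "RNN-LSTM", "RNN-GRU"]
TASKS = ["Binary-classification", "Multiclass-classification",
         "Regression", "Representation-learning"]
INPUT_TYPES = ["Tabular", "Time Series", "Text", "Image"]
COMPLEXITIES = ["Simple", "Wide", "Deep", "Wide-Deep"]

def build_prompt(arch, task, input_type, complexity):
    # Hypothetical prompt template for illustration only.
    return (f"Write a complete PyTorch model class implementing a "
            f"{complexity} {arch} for {task} on {input_type} input.")

prompts = [build_prompt(*combo)
           for combo in product(ARCHITECTURES, TASKS, INPUT_TYPES, COMPLEXITIES)]
print(len(prompts))  # 448 base combinations before input-scale variants
```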

Correctness Validation

To ensure dataset quality, we developed a validation tool that verifies the correctness and compliance of generated neural networks with the specified design choices. The tool uses static analysis to parse the generated code and symbolic tracing to verify structural integrity and execution. It checks compliance with the four design dimensions and ensures all networks are syntactically correct and executable. The validation tool is available in the dataset repository.

Dataset Overview

Diversity is a key strength of the dataset. In total, it contains 6842 layers and tensor operations across 38 unique types, confirming diverse structural components. Network depth ranges from 2 to 35 layers, with complexity patterns aligning with the design specifications. Figure 2 shows the distribution of network depth across complexity categories; the characterising layers shown in the figure are the layers specific to each architecture type.

Figure 2: Distribution of NN depth across complexity categories

Accessing the Dataset

Each neural network in the dataset is stored as a Python file containing the model class definition. The prompt used to generate the NN code is included at the top of each file. File names follow a structured convention encoding the four design dimensions: architecture_task_input-type-scale_complexity.py
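Given that convention, a small helper can recover the four design dimensions from a file name. The example file name below is hypothetical, constructed to match the convention described above.

```python
def parse_filename(name: str) -> dict:
    """Split a dataset file name of the form
    architecture_task_input-type-scale_complexity.py
    into its four design dimensions."""
    stem = name.removesuffix(".py")
    arch, task, input_spec, complexity = stem.split("_")
    return {
        "architecture": arch,
        "task": task,
        "input": input_spec,       # input type and scale, hyphen-joined
        "complexity": complexity,
    }

# Hypothetical file name following the convention:
print(parse_filename("CNN-2D_Multiclass-classification_Image-small_Simple.py"))
```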

The complete dataset is publicly available on GitHub: https://github.com/BESSER-PEARL/LLM-Generated-NN-Dataset 
