The data directory contains .jsonl files with train/dev/test splits of the
Exploration, Hard, and Mixed datasets.
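A minimal sketch for reading one of these split files, assuming each line holds one standalone JSON object (the usual JSON Lines convention). The helper name load_jsonl is hypothetical and is not part of this repository, and the fields of each record are not specified here.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects.

    Hypothetical helper, not part of the repository's scripts.
    Assumes one JSON value per non-empty line.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Call it with the path to any .jsonl file in the data directory; the result is a list with one entry per data instance.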
This repository also contains scripts for generating data and computing attributes of data instances. The Python scripts in the root directory are as follows:
- exps2.py provides definitions of the regex type, as well as functions for enumerating regexes (for use in sampling) and computing properties.
- dfa2.py provides definitions of the dfa (Deterministic Finite Automaton) type, as well as methods for computing properties of regular languages and converting from regex to dfa.
- sample_v2.py generates a cache of regexes and the Exploration set.
- make_hard_set.py provides a script to generate our Hard dataset from the cached regexes not used in the Exploration training set.
- compute_properties.py provides additional helper functions for computing attributes.
- cache_loader.py provides utilities for loading cached regexes generated by sample_v2.py.
Generate the Exploration dataset:
python sample_v2.py \
-d <output-folder> \
[--depths <maximum number of compositions>]
Generate the Hard dataset:
python make_hard_set.py