Data used to generate the graphs/tables in my book The New C Standard: An Economic and Cultural Commentary, along with the scripts and source code used to create the data.
Blog post discussing the background.
The original data is contained in the diagrams directory. The mapping from a graph/table in the book to the corresponding data filename is somewhat involved (these generated files were automatically inserted into the text of the book, which was then processed using various tools to generate a pdf file).
So you've seen a plot and you want the numbers used to create the plot. The process to locate the appropriate data file is as follows:
- Note the first few words of the plot's caption, e.g., the caption for Figure 825.1 starts: "Number of integer constants",
- Open the file graph-table.txt and search for these words,
- The name appearing above the caption (freqcons in this example) is the name of the .gra file containing the commands used to create the plot (in the grap language),
- Go to the diagrams directory and open the .gra file (e.g., freqcons.gra),
- Locate the copy commands (e.g., copy "intlitvals.d" thru {), and note the name(s) of the data file(s) (in this example intlitvals.d). The data is in these files.
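The lookup steps above can be partly automated. The following is a minimal sketch, not part of the distribution: the function names find_caption and list_data_files are hypothetical, the layout of graph-table.txt is assumed (the .gra name within two lines above the caption), and the sed pattern assumes each copy command names its file in double quotes on a single line, as in the example above.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the distribution.

# Print the lines of graph-table.txt that match a caption, with two lines of
# leading context (assumption: the .gra name appears just above the caption).
# Note: grep -B is a GNU/BSD extension, not POSIX.
find_caption() {
    grep -n -B 2 -F -e "$1" graph-table.txt
}

# Print the file names used by copy commands in a .gra file, one per line.
# Assumes commands of the form:  copy "name.d" thru {
list_data_files() {
    sed -n 's/^[[:space:]]*copy[[:space:]][[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}
```

Typical use would be find_caption "Number of integer constants" from the directory holding graph-table.txt, followed by list_data_files diagrams/freqcons.gra. Adjust the -B value if the name sits further above the caption.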
The process of extracting data from a collection of C projects is as follows:
- Put the source, or links to the source, in the programs directory. This currently contains tar/zip files of the projects measured for the book.
- The data is extracted from .c and .h files by the program ngram. Various scripts are available to process the output of ngram to produce graphs (mkallgra.sh) and tables (mkalltab.sh).
Some of the figures and tables use data generated using commercially available tools which are not included in this package.
The graphs are created using grap. This tool is part of the groff/pic/tbl family, but is not packaged with many Linux distributions. You can download it from: https://www.lunabase.org/~faber/Vault/software/grap or https://github.com/snorerot13/grap
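Because grap is often not installed, it can help to check for it before starting a long run. This is a small sketch under the assumption that grap is installed under the command name grap; require_tool is a hypothetical helper, not something shipped in this package.

```shell
#!/bin/sh
# Hypothetical helper; not part of the distribution.
# Succeeds if the named command is on PATH; otherwise prints a message.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "required tool not found on PATH: $1" >&2
    return 1
}

# Example: warn early when grap is missing.
require_tool grap || echo "install grap before running mkps.sh" >&2
```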
The file config.files sets various shell variables. The variable USE_HOME needs to be set to the directory that contains this file.
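A quick way to confirm that USE_HOME points at the right place is to test for config.files inside it. The check below is only a sketch: check_use_home is a hypothetical name, and it assumes USE_HOME is visible in the calling shell (for example, after config.files has been read into it).

```shell
#!/bin/sh
# Hypothetical sanity check; not part of the distribution.
# Returns 0 only if USE_HOME is set and contains config.files.
check_use_home() {
    if [ -z "${USE_HOME:-}" ]; then
        echo "USE_HOME is not set" >&2
        return 1
    fi
    if [ ! -f "$USE_HOME/config.files" ]; then
        echo "USE_HOME ($USE_HOME) does not contain config.files" >&2
        return 1
    fi
    return 0
}
```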
The top level script mkallget.sh invokes the scripts that execute ngram with various arguments. This generates various raw data files from the source code. Some of this raw data may need to be processed again (generating other data files) before it is in a suitable final form. Setting GEN_DATA (in config.files) to 0 stops mkallgra.sh and mkalltab.sh from generating these data files, but they continue to do everything else they do (this saves time when tuning the formatting of the graphs and tables).
There are data extraction scripts in various directories (these are invariably called getxxxx.sh, where xxxx is a name relating to the data they extract). The mkgra.sh script in each directory extracts the necessary data (creating .d files) and expands the .g files (to .gra files). The mktab.sh script in each directory extracts the necessary data and writes it to a text file (one file per table).
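To see which directories contain these scripts, the naming convention can be searched for directly. The sketch below relies only on the names given above (get*.sh, mkgra.sh, mktab.sh); list_pipeline_scripts is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper; not part of the distribution.
# List the extraction and build scripts found under a root directory
# (defaults to the current directory), one path per line.
list_pipeline_scripts() {
    find "${1:-.}" -type f \
        \( -name 'get*.sh' -o -name 'mkgra.sh' -o -name 'mktab.sh' \) | sort
}
```

Running list_pipeline_scripts from the top of the distribution shows, per directory, what a mkallgra.sh or mkalltab.sh run is likely to touch.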
Starting with a new set of source files available in the directory program, the commands to execute are:
- Compile two programs and copy them to a few subdirectories using:
      bldall.sh
- Extract the data:
      c_use.sh > c.cnt
      h_use.sh > h.cnt
      mkallget.sh
      mkallgra.sh
      mkalltab.sh
- To generate postscript and pdf files for the contents of the diagrams directory (placed in the ps directory), assuming you have installed grap, use:
      cd ../diagrams
      mkps.sh
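The command sequence above can be wrapped so that a failure in one step stops the run instead of feeding bad data to the next step. This is a sketch only: run_step and run_all are hypothetical names, and it assumes the scripts are on PATH and that run_all is started from the directory the commands above are meant to be run in (a sibling of diagrams).

```shell
#!/bin/sh
# Hypothetical wrapper; not part of the distribution.

# Run one command, echoing it first and reporting if it fails.
run_step() {
    echo "==> $*" >&2
    "$@" || { echo "step failed: $*" >&2; return 1; }
}

# Run the full sequence in order, stopping at the first failing step.
run_all() {
    run_step bldall.sh &&
    run_step sh -c 'c_use.sh > c.cnt' &&
    run_step sh -c 'h_use.sh > h.cnt' &&
    run_step mkallget.sh &&
    run_step mkallgra.sh &&
    run_step mkalltab.sh &&
    ( cd ../diagrams && run_step mkps.sh )
}
```

Nothing runs until run_all is called, so the file can be sourced and individual steps retried by hand with run_step.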
The distributed directories are:
- program: The default directory used for finding .c and .h files. The top level directories in program are assumed to denote various applications, or groups of programs. This directory can simply consist of links to where the source is actually held.
- idents: Identifier related data and scripts. By default the various Levenshtein distance measurements are not generated. It takes 4+ hours, on a 1.5GHz Pentium, to generate the measurements for gcc (it is an N^2 algorithm). Edit getall.sh to change this default.
- prepro: Preprocessor related data and scripts.
- statements: Statement related data and scripts. It takes 3+ hours, on a 1.5GHz Pentium, to generate all the raw data for gcc.
- decls: Declaration related data and scripts.
- tables: Scripts for extracting data from various ngram generated files. The numbers are intended to appear in tables.
- duplicates: Duplicate lines are detected using simian. A full evaluation version of this tool can be downloaded from www.redhillconsulting.com.au/products/simian
- bldgra: Build .gra and .d files from the raw data generated by c_use.sh and h_use.sh. The output is written to the directory diagrams.
- bldtab: Create the table information from the raw data generated by c_use.sh and h_use.sh. The output is written to the directory tables.
- scripts: Various general utility scripts.
- diagrams: Holds the .gra and .d files generated by the various mkgra.sh scripts.
- tables: The subdirectory tab_data holds the information generated by the various mktab.sh scripts. updTABLE.sh combines the table information and a stripped-down version of the book's text. The tools to generate a pdf file from this output are not part of the distribution.
- thirdparty: Various programs and scripts that might not be part of an installation.
- config.files: Sets environment variables used by many shell scripts.