The full dataset of shell aliases is available at https://doi.org/10.5281/zenodo.3778824.
We used Jupyter notebooks to carry out most of the data analysis. You can find these in the notebooks directory of this repository. Our environment was Python 3.7.6, with the dependencies listed in notebooks/requirements.txt.
If you just want to re-run the analysis or explore the dataset, you can skip ahead to step 4 (Analysis) below.
Collection
We used the github-searcher tool to collect all Shell language files under 29 KB containing the string "alias" from GitHub. This data collection was carried out over a period of two-and-a-half weeks, from December 20th 2019 to January 8th 2020. The file sampling.csv contains the number of files obtained and the estimated population (as reported by GitHub) of each sampled file size stratum. Additionally, notebooks/sampling.ipynb contains some helper code used during the data collection process.

The resulting SQLite database containing all collected files is available as results_unparsed.db in the Zenodo archive. The database schema is given in the paper.
Pruning
The notebook at notebooks/dataset.ipynb contains the data pruning steps. If you want to repeat these exactly, copy results_unparsed.db to notebooks/results.db and execute the notebook.
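
The copy step can also be scripted. The following is a minimal sketch and not part of the repository: the helper name prepare_database and its refusal to overwrite an existing file are assumptions made for illustration; only the two default paths come from the text above.

```python
import shutil
from pathlib import Path


def prepare_database(src="results_unparsed.db", dst="notebooks/results.db"):
    """Copy the unparsed database to the location the notebook expects.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; download it from the Zenodo archive")
    if dst.exists():
        # Assumption: never silently overwrite an existing database.
        raise FileExistsError(f"{dst} already exists; move or delete it first")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return dst
```

Calling prepare_database() from the repository root, with results_unparsed.db downloaded there, puts the copy in place; the notebook can then be executed in Jupyter as usual.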
Parsing
We used the Haskell script in the parser directory to parse the raw file contents to find actual alias definitions and decompose them into their constituent parts. The parser works on the database produced by github-searcher and extends it by adding tables for aliases, commands, and arguments. Parsing the pruned database takes about 1 hour on a 2019 iMac.

After parsing, we removed files that did not contain any aliases from the database. (The SQL commands for this are in notebooks/dataset.ipynb.)
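
To illustrate the idea behind this clean-up step, here is a minimal sketch on a toy in-memory database. The table and column names (files.id, aliases.file_id) are hypothetical stand-ins chosen for the example; the actual schema is given in the paper, and the actual SQL commands are in notebooks/dataset.ipynb.

```python
import sqlite3


def remove_files_without_aliases(conn):
    """Delete files that no alias refers to; return the number of files removed.

    Assumption: `files(id)` and `aliases(file_id)` are hypothetical names,
    not the schema used in the real dataset.
    """
    cur = conn.execute(
        "DELETE FROM files "
        "WHERE NOT EXISTS (SELECT 1 FROM aliases WHERE aliases.file_id = files.id)"
    )
    conn.commit()
    return cur.rowcount


# Toy database: three files, one of which defines no aliases.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE aliases (id INTEGER PRIMARY KEY, file_id INTEGER, name TEXT);
    INSERT INTO files VALUES (1, '.bashrc'), (2, 'install.sh'), (3, '.zshrc');
    INSERT INTO aliases VALUES (1, 1, 'll'), (2, 1, 'gs'), (3, 3, 'la');
    """
)
removed = remove_files_without_aliases(conn)
remaining = [row[0] for row in conn.execute("SELECT path FROM files ORDER BY id")]
print(removed, remaining)  # 1 ['.bashrc', '.zshrc']
```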
Analysis
All data analysis was performed on the parsed and pruned dataset, available as results.db in the Zenodo archive. You can copy it to notebooks/results.db to follow along. In addition to the notebooks, there are some ancillary files, such as the results of our coding process, all of which you can also find in the notebooks directory.

If you want to re-run the analysis or explore the dataset, notebooks/summary.ipynb is a good starting point.
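
For a quick look at the database outside of Jupyter, the sketch below lists every table with its column names and row count. It relies only on SQLite's own catalogue (sqlite_master and PRAGMA table_info), so it assumes nothing about the schema; the function name summarize_tables is our own and does not appear in the repository.

```python
import sqlite3


def summarize_tables(db_path):
    """Return {table name: (column names, row count)} for a SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        summary = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
            columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]
            (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
            summary[name] = (columns, count)
        return summary
    finally:
        conn.close()
```

For example, summarize_tables("notebooks/results.db") returns a dictionary that can be printed or compared against the schema described in the paper. Note that sqlite3.connect creates an empty database file if the path does not exist, so check the path first.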