scripts.md

Script Information

Whenever I'm processing the raw CSV files, I have the following scripts (and their applicable dependencies) in the root directory of the data repository. You can find all of these scripts in the scripts directory of this repository.

| Script | Purpose | Dependencies |
| --- | --- | --- |
| `clean_raw_csvs.py` | A Python script that takes one term folder and combines all raw data into one file, `enrollment.csv`, that can then be broken up. This script also validates each row in the raw CSV files and removes any invalid rows. | `fix_inconsistent_csv.py` |
| `enroll_data_cleaner.py` | A Python script that takes one term folder and "cleans" the data by breaking the `enrollment.csv` file into class or section CSV files and putting those CSV files in the `overall` or `section` folders. | |
| `plot.py` | A Python script that takes one term folder and, using the cleaned data in that folder, creates graphs for each CSV file in the `overall` and `section` folders. | `pandas`, `matplotlib`, `seaborn` |
| `list_files.py` | A Python script that writes all the courses and sections that I have data for to a text file. | |
| `generate_toc.py` | A Python script that generates a Markdown file containing links to all CSV and graph files. This works around GitHub's limit of displaying only 1000 files. | |
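
To make the combine-and-validate step concrete, here is a minimal sketch of the idea. It is not the actual `clean_raw_csvs.py`: the raw CSV schema is not described in this document, so the expected column count, the validity rule, and the function names `is_valid_row` and `combine_raw_csvs` are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# Hypothetical: the real raw CSV schema is not documented here.
EXPECTED_COLUMNS = 4


def is_valid_row(row: list[str]) -> bool:
    """Assumed validity rule: correct number of fields, none of them blank."""
    return len(row) == EXPECTED_COLUMNS and all(field.strip() for field in row)


def combine_raw_csvs(raw_dir: Path, out_file: Path) -> int:
    """Merge every CSV in raw_dir into out_file, skipping invalid rows.

    out_file should live outside raw_dir so it is not picked up as input.
    Returns the number of data rows written.
    """
    written = 0
    header_written = False
    with out_file.open("w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(raw_dir.glob("*.csv")):
            with path.open(newline="") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # empty file, nothing to copy
                if not header_written:
                    writer.writerow(header)  # keep only the first header
                    header_written = True
                for row in reader:
                    if is_valid_row(row):
                        writer.writerow(row)
                        written += 1
    return written
```

Calling `combine_raw_csvs(Path("raw"), Path("enrollment.csv"))` would, under these assumptions, produce a single file with one header line followed by every valid row from the input files, in filename order.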

There are also several general scripts designed to streamline this process.

| Script | Purpose |
| --- | --- |
| `run.sh` | A Bash script that runs both `clean_raw_csvs.py` and `enroll_data_cleaner.py`. This is no longer maintained and is now located in the `misc/old_scripts` folder. |
| `multplot.sh` | A Bash script that runs all applicable Python scripts above for each active term. This is no longer maintained and is now located in the `misc/old_scripts` folder. |
| `run.ps1` | A PowerShell script that runs all applicable Python scripts above for each active term. It also automatically clones or pulls the repository and pushes the changes to GitHub. |
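
The per-term loop that these wrapper scripts perform can be sketched in Python as follows. This is an illustration only, not a port of `run.ps1`: it assumes each Python script accepts the term folder as its single command-line argument, it leaves out the Git clone, pull, and push steps, and the names `PIPELINE`, `build_commands`, and `run_all` are hypothetical.

```python
import subprocess
import sys

# Order follows the data flow described above: combine, then split, then plot.
# Passing the term folder as the only argument is an assumption about each
# script's command-line interface.
PIPELINE = ["clean_raw_csvs.py", "enroll_data_cleaner.py", "plot.py"]


def build_commands(terms: list[str]) -> list[list[str]]:
    """Return one command per (term, script) pair, grouped by term."""
    return [[sys.executable, script, term] for term in terms for script in PIPELINE]


def run_all(terms: list[str], dry_run: bool = True) -> None:
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in build_commands(terms):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing script
```

Defaulting to a dry run makes it easy to check the command order before anything touches the data.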

To get an idea of how the cleaning works, consider the following diagram:

It should be noted that these scripts are not committed to the data repositories.

One thing to note: when my scraper generates CSV files, it puts all graduate and undergraduate course data in the same CSV file. However, the data repositories are structured so that undergraduate and graduate courses are separated, so each raw CSV file needs to be split into two raw CSV files. For this, there is a Rust program in the `misc/rust` directory, aptly named `separate_grad_courses`. If you want a compiled executable for Windows, you can get it from `misc/executables`.
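
The split itself is simple to picture. The sketch below is written in Python for consistency with the other scripts and is not the Rust program. It makes two assumptions that are not confirmed by this document: that the course code sits in a column named `course` (a hypothetical name), and that a course counts as graduate when the first number in its code is 200 or higher.

```python
import csv
import re
from pathlib import Path

GRAD_THRESHOLD = 200      # assumed cutoff between undergraduate and graduate
COURSE_COLUMN = "course"  # hypothetical column name


def is_grad_course(course: str) -> bool:
    """True if the first run of digits in the code is >= GRAD_THRESHOLD."""
    match = re.search(r"\d+", course)
    return match is not None and int(match.group()) >= GRAD_THRESHOLD


def separate_courses(raw_file: Path, undergrad_file: Path, grad_file: Path) -> tuple[int, int]:
    """Split raw_file into two CSVs; returns (undergrad_rows, grad_rows)."""
    undergrad_count = 0
    grad_count = 0
    with raw_file.open(newline="") as src, \
            undergrad_file.open("w", newline="") as ug, \
            grad_file.open("w", newline="") as gr:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []  # empty input yields empty outputs
        ug_writer = csv.DictWriter(ug, fieldnames=fieldnames)
        gr_writer = csv.DictWriter(gr, fieldnames=fieldnames)
        ug_writer.writeheader()
        gr_writer.writeheader()
        for row in reader:
            if is_grad_course(row[COURSE_COLUMN]):
                gr_writer.writerow(row)
                grad_count += 1
            else:
                ug_writer.writerow(row)
                undergrad_count += 1
    return undergrad_count, grad_count
```

Both output files keep the original header, so each can be fed to the cleaning scripts in the same way as an unsplit raw file.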
