← Go Back

Script Information

When processing the raw CSV files, I keep the following scripts (and their dependencies) in the root directory of the data repository. All of these scripts are available in the scripts directory of this repository.

| Script | Purpose | Dependencies |
| --- | --- | --- |
| clean_raw_csvs.py | A Python script that takes one term folder and combines all raw data into one file, enrollment.csv, that can then be broken up. It also checks that each row in the raw CSV files is valid and removes any invalid rows. | fix_inconsistent_csv.py |
| enroll_data_cleaner.py | A Python script that takes one term folder and "cleans" the data by breaking the enrollment.csv file into class or section CSV files and putting those CSV files in the overall or section folders. | |
| plot.py | A Python script that takes one term folder and, using the cleaned data in that folder, creates graphs for each CSV file in the overall and section folders. | pandas, matplotlib, seaborn |
| list_files.py | A Python script that writes all the courses and sections that I have data for to a text file. | |
| generate_toc.py | A Python script that generates a Markdown file containing links to all CSV and graph files. This works around GitHub's limit of displaying only 1000 files. | |
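
As a rough illustration of the combine-and-validate step that the table attributes to clean_raw_csvs.py, here is a minimal sketch. It is not the actual script. The names `combine_raw_csvs` and `is_valid_row`, the four-column layout, and the rule that a valid row has no empty fields are all assumptions made for the example, and it assumes every raw file shares the same header row.

```python
import csv
from pathlib import Path

# Hypothetical column count; the real raw CSV schema is not described here.
EXPECTED_COLUMNS = 4


def is_valid_row(row: list[str]) -> bool:
    """Assumed rule: a row is valid if it has the expected number of non-empty fields."""
    return len(row) == EXPECTED_COLUMNS and all(field.strip() for field in row)


def combine_raw_csvs(raw_dir: Path, out_file: Path) -> int:
    """Merge every CSV in raw_dir into out_file, dropping invalid rows.

    Returns the number of data rows written. out_file should live outside
    raw_dir so that it is not picked up as an input.
    """
    written = 0
    header_written = False
    with out_file.open("w", newline="") as out:
        writer = csv.writer(out)
        for raw_file in sorted(raw_dir.glob("*.csv")):
            with raw_file.open(newline="") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    # Keep only the first header; later ones are assumed identical.
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    if is_valid_row(row):
                        writer.writerow(row)
                        written += 1
    return written
```

Calling `combine_raw_csvs(Path("raw"), Path("enrollment.csv"))` would then produce a single file with one header row followed by every row that passed the check. The directory name `raw` is also only a placeholder.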

There are also several general scripts designed to streamline this process.

| Script | Purpose |
| --- | --- |
| run.sh | A Bash script that runs both clean_raw_csvs.py and enroll_data_cleaner.py. It is no longer maintained and now lives in the misc/old_scripts folder. |
| multplot.sh | A Bash script that runs all applicable Python scripts above for each active term. It is no longer maintained and now lives in the misc/old_scripts folder. |
| run.ps1 | A PowerShell script that runs all applicable Python scripts above for each active term. It also automatically pulls or clones the repository and pushes the changes to GitHub. |

To get an idea of how the cleaning works, consider the following diagram:

Rundown of the process.
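
To make the "breaking up" step more concrete, the sketch below splits a combined enrollment.csv into one CSV file per course. It is only an illustration of the idea, not the code in enroll_data_cleaner.py. The function name `split_by_course` and the `course` column name are assumptions, and it assumes course names are safe to use as file names.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical column name; the real enrollment.csv header may differ.
COURSE_COLUMN = "course"


def split_by_course(enrollment_csv: Path, out_dir: Path) -> list[str]:
    """Write one CSV per course into out_dir and return the sorted course names."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group rows by course while preserving their original order.
    rows_by_course = defaultdict(list)
    with enrollment_csv.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        for row in reader:
            rows_by_course[row[COURSE_COLUMN]].append(row)

    # Each course gets its own file with the same header as the input.
    for course, rows in rows_by_course.items():
        with (out_dir / f"{course}.csv").open("w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    return sorted(rows_by_course)
```

Grouping in memory keeps the sketch short. A script handling very large terms might instead stream rows and keep one open writer per course.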

Note that these scripts are not committed to the data repositories.


One thing to note: my scraper writes all graduate and undergraduate course data to the same CSV file, but the data repositories keep undergraduate and graduate courses separate. Each raw CSV file therefore needs to be split into two raw CSV files, one for undergraduate courses and one for graduate courses. The Rust program separate_grad_courses, found in the misc/rust directory, does this. If you want a compiled executable for Windows, you can get it from misc/executables.
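
The separation itself can be sketched in a few lines of Python. This is not a port of separate_grad_courses, whose logic is not described here. The rule that a course numbered 200 or above counts as graduate, the names `is_graduate` and `separate_rows`, and the position of the course name in each row are all assumptions made for illustration.

```python
import re

# Assumption: courses numbered 200 or above are graduate courses.
GRAD_THRESHOLD = 200


def is_graduate(course: str) -> bool:
    """Classify a course such as 'CSE 291' by the first number in its name."""
    match = re.search(r"\d+", course)
    return match is not None and int(match.group()) >= GRAD_THRESHOLD


def separate_rows(rows, course_index=0):
    """Split rows into (undergraduate, graduate) lists based on the course field."""
    undergrad, grad = [], []
    for row in rows:
        (grad if is_graduate(row[course_index]) else undergrad).append(row)
    return undergrad, grad
```

Each of the two returned lists could then be written to its own raw CSV file before the cleaning scripts run.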