This is an evolving repo optimized for machine-learning projects aimed at designing a new algorithm. Such projects require sweeping over different hyperparameters, comparing to baselines, and iteratively refining the algorithm. Based on cookiecutter-data-science.
- `src`: main code for modeling (e.g. model architecture)
- `experiments`: code for running experiments (e.g. loading data, training models, evaluating models)
- `scripts`: scripts for hyperparameter sweeps (python scripts that launch jobs in the `experiments` folder with different hyperparams)
- `notebooks`: jupyter notebooks for analyzing results and making figures
- `tests`: unit tests
- Setup using uv (requires installing uv, then running a script via `uv run <script>`)
- This installs a package named `src` for importing
- See `pyproject.toml` for dependencies (not all are required)
- Example run: run `uv run scripts/01_train_basic_models.py` (which calls `experiments/01_train_model.py`), then view the results in `notebooks/01_model_results.ipynb`
- Keep tests updated and run them using `uv run pytest`
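A minimal sketch of what a unit test in `tests` might look like; the `accuracy` helper below is a toy stand-in (real tests would import code from the installed `src` package):

```python
# tests/test_metrics.py -- minimal pytest sketch; accuracy() is a hypothetical
# toy helper, standing in for code imported from the `src` package
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def test_accuracy_perfect():
    assert accuracy([1, 0, 1], [1, 0, 1]) == 1.0


def test_accuracy_half():
    assert accuracy([1, 0], [1, 1]) == 0.5
```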
- scripts sweep over hyperparameters using easy-to-specify python code
- experiments automatically cache runs that have already completed
- caching uses the (non-default) arguments in the argparse namespace
- notebooks can easily evaluate results aggregated over multiple experiments using pandas
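A sketch of how a sweep script in `scripts` can specify hyperparameters as plain python and launch the experiment script once per combination; the grid values and flag names here are illustrative, not the repo's actual ones:

```python
# scripts sketch: expand an easy-to-specify grid into one command line per run
# (grid values and flag names are illustrative)
import itertools
import sys

PARAMS = {
    "lr": [0.1, 0.01],
    "seed": [0, 1, 2],
}


def build_commands(params, script="experiments/01_train_model.py"):
    """One command per point in the grid; note all args are passed as strings."""
    keys = list(params)
    commands = []
    for values in itertools.product(*(params[k] for k in keys)):
        cmd = [sys.executable, script]
        for k, v in zip(keys, values):
            cmd += [f"--{k}", str(v)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    import subprocess

    for cmd in build_commands(PARAMS):
        subprocess.run(cmd, check=True)  # experiments skip runs already cached
```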
- See some useful packages here
- Avoid notebooks whenever possible (ideally, use them only for analyzing results and making figures)
- Paths should be specified relative to a file's location (e.g. `os.path.join(os.path.dirname(__file__), 'data')`)
- Naming variables: put the main thing first, followed by its modifiers (e.g. `X_train`, `acc_test`)
- Binary arguments should start with the word "use" (e.g. `--use_caching`) and take values 0 or 1
- Use logging instead of print
- Use argparse and sweep over hyperparams using python scripts (or custom things, like amulet)
- Note: arguments get passed as strings, so don't pass args that aren't primitives or lists of primitives (handle more complex structures inside the experiments code)
- Each run should save a single pickle file of its results
- All experiments that depend on each other should run end-to-end with one script (caching things along the way)
- Keep requirements updated in `pyproject.toml`
- Follow sklearn apis whenever possible
- Use Hugging Face whenever possible, then PyTorch
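The guidelines above can be sketched as one experiment script: logging instead of print, argparse with a `--use_caching` flag taking 0 or 1, paths relative to `__file__`, a cache key built from the non-default arguments, and a single pickle per run. The specific flags and the hash-based cache key are illustrative assumptions, not the repo's actual implementation:

```python
# experiments sketch: argparse + logging + caching on non-default args,
# saving one pickle per run (flags and cache-key scheme are illustrative)
import argparse
import hashlib
import logging
import os
import pickle

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.1)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--use_caching", type=int, default=1, choices=[0, 1])
    return parser


def cache_key(args, parser):
    """Hash only the arguments that differ from their argparse defaults."""
    non_default = {
        k: v for k, v in sorted(vars(args).items()) if v != parser.get_default(k)
    }
    return hashlib.md5(repr(non_default).encode()).hexdigest()


if __name__ == "__main__":
    parser = get_parser()
    args = parser.parse_args()
    results_dir = os.path.join(os.path.dirname(__file__), "..", "results")
    os.makedirs(results_dir, exist_ok=True)
    out_file = os.path.join(results_dir, cache_key(args, parser) + ".pkl")
    if args.use_caching and os.path.exists(out_file):
        logger.info("run already cached: %s", out_file)
    else:
        acc_test = 1.0 - args.lr  # placeholder for real training/evaluation
        results = {**vars(args), "acc_test": acc_test}
        with open(out_file, "wb") as f:
            pickle.dump(results, f)  # a single pickle file of this run's results
        logger.info("saved results to %s", out_file)
```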