@desilinguist
Collaborator

Last year, scikit-learn added functionality to include model fit times when computing learning curves since, in addition to the model's performance, it's also quite useful to know how long the model takes to train as more training data is added. This PR now adds the same functionality to SKLL.

  • The skll.utils.train_and_score() function now measures the model fit time for every model trained as part of a learning curve experiment.
  • We now generate two plots for each featureset for a learning_curve experiment. The first is the usual "score curve" that shows the training and cross-validation scores as more training data is added. The newly-added second plot is a "time curve" that shows how the model fit times change as more training data is added. The format for this new curve's name is: <experiment>_<featureset>_times.png.
  • The model fit times shown in the time curve are first averaged over all runs with the same training data size and then averaged over all output metrics (if multiple ones are specified), making the estimates a bit smoother.
  • While the score curve is faceted across both rows (output metrics) and columns (learners), the time curve is only faceted along columns (learners) since we already averaged over the metrics.
  • I refactored the skll.experiments.output.generate_learning_curve_plots function. It now only pre-processes the score and time data to create data frames. The two curves (score and time) are now generated by two private functions: skll.experiments.output._generate_learning_curve_score_plots and skll.experiments.output._generate_learning_curve_time_plots.
  • Updated existing tests to allow for the refactoring and to ensure that the new plots are checked.
  • Documentation has been updated to show the time curve in addition to the score curve. I modified the existing plot to show a more realistic example.
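
To make the two-stage averaging described above concrete, here is a minimal standard-library sketch. It is not SKLL's implementation (the PR builds data frames inside `generate_learning_curve_plots`); the record layout, the `average_fit_times` helper, and the timing values are all hypothetical and only illustrate the order of the two averages.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fit-time records, one per trained model:
# (learner, output metric, training size, fit time in seconds).
# The numbers are made up purely for illustration.
records = [
    ("LogisticRegression", "accuracy", 100, 0.10),
    ("LogisticRegression", "accuracy", 100, 0.14),
    ("LogisticRegression", "f1", 100, 0.12),
    ("LogisticRegression", "f1", 100, 0.16),
    ("LogisticRegression", "accuracy", 200, 0.20),
    ("LogisticRegression", "accuracy", 200, 0.24),
    ("LogisticRegression", "f1", 200, 0.22),
    ("LogisticRegression", "f1", 200, 0.26),
]


def average_fit_times(records):
    """Average over runs first, then over metrics (hypothetical helper)."""
    # Stage 1: group the runs that share learner, metric, and training size.
    per_run = defaultdict(list)
    for learner, metric, size, fit_time in records:
        per_run[(learner, metric, size)].append(fit_time)

    # Stage 2: collect the per-metric run averages for each (learner, size).
    per_metric = defaultdict(list)
    for (learner, _metric, size), times in per_run.items():
        per_metric[(learner, size)].append(mean(times))

    # One averaged fit time per learner and training size: a point on the time curve.
    return {key: mean(values) for key, values in per_metric.items()}


time_curve = average_fit_times(records)
for (learner, size), fit_time in sorted(time_curve.items()):
    print(f"{learner} @ {size} examples: {fit_time:.3f}s")
```

Because the metric dimension is averaged away in the second stage, each learner ends up with a single series of fit times, which is why the time curve only needs to be faceted by learner.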

As always, the best way to review is to try this out in the examples. As a starting point, if you want to replicate the same example, you can modify the Titanic example's learning_curve.cfg file as shown below and then look at the Titanic_Learning_Curve_all.png and Titanic_Learning_Curve_all_times.png files in the output directory.

[Screenshot: modified learning_curve.cfg for the Titanic example]

This PR closes #556.

@codecov

codecov bot commented Jun 26, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.05 🎉

Comparison is base (143ff09) 95.19% compared to head (10469c3) 95.24%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #745      +/-   ##
==========================================
+ Coverage   95.19%   95.24%   +0.05%     
==========================================
  Files          29       29              
  Lines        3538     3578      +40     
==========================================
+ Hits         3368     3408      +40     
  Misses        170      170              
Impacted Files Coverage Δ
skll/experiments/__init__.py 94.69% <100.00%> (ø)
skll/experiments/output.py 97.86% <100.00%> (+0.40%) ⬆️
skll/learner/__init__.py 97.18% <100.00%> (ø)
skll/learner/utils.py 93.37% <100.00%> (+0.05%) ⬆️
skll/learner/voting.py 98.54% <100.00%> (ø)


@desilinguist desilinguist merged commit 9e501a9 into main Jun 27, 2023
@delete-merged-branch delete-merged-branch bot deleted the 556-include-fit-times-for-learning-curves branch June 27, 2023 17:55
Development

Successfully merging this pull request may close these issues.

Include fit times in learning curve output
