# Add voting learners #665
- This is useful since it can then be used by both the regular learners and the voting learners.
- Add docstring for `cross_validate()`.
- Add `learning_curve()`.
- Use refactored code where possible.
- Use positional arguments instead of the `.args` and `.kwargs` accessors, which do not seem to be supported on Python 3.7, and fix the existing tests.
- Set BINDIR since we are now running tests without activating the conda environment.
- Set nose options as environment variables to make the command shorter.
- Use `travis_wait` for the longest-running test job to avoid early termination.
- Update the run_experiment documentation to include a detailed description of `VotingClassifier` and `VotingRegressor`, and add an entry for the `save_votes` configuration field.
- Add new voting configuration files for voting learners for the Iris and Boston examples.
- Update the contributing page for readability and fix links.
- Remove all top-level imports from documentation pages and the tutorial notebook.
- Update the Learner API documentation to include the `VotingLearner` class and improved sub-headings.
- Update docstrings for voting learners to make them more readable.
- Update the sphinx configuration (year and imports).
Codecov Report
|          | main   | #665   | +/-    |
|:---------|-------:|-------:|-------:|
| Coverage | 95.09% | 96.76% | +1.66% |
| Files    | 27     | 63     | +36    |
| Lines    | 3100   | 9077   | +5977  |
| Hits     | 2948   | 8783   | +5835  |
| Misses   | 152    | 294    | +142   |
Continue to review full report at Codecov.
mulhod left a comment:
I have a couple questions and comments. (I have not done a full review yet, though. I just want to break it up since I won't get back to this till tomorrow probably.)
It looks really awesome so far! Exciting.
- Update API and experiment tests to handle this new functionality.
mulhod left a comment:
Minor typos.
mulhod left a comment:
Really nice! Exciting changes!
aoifecahill left a comment:
Looks great, thanks! I've tried it out (including navigating the documentation) and managed to successfully run and revise an experiment based on the outputs/documentation.
This PR mainly closes #488. It also closes #524 and closes #584.
## New `VotingLearner` class

Relevant files: `learner/voting.py` and `learner/utils.py`.

The main contribution of this PR is to allow SKLL to use the `VotingClassifier` and `VotingRegressor` learners from scikit-learn. This was not straightforward since this is the first true meta-learner class we have added to SKLL. By meta-learner, I mean a class that builds on top of SKLL's `Learner` class. This meta-learner class is called `VotingLearner`.

The implementation of `VotingLearner` provides the same 7 methods as the original `Learner` class: `train()`, `cross_validate()`, `evaluate()`, `predict()`, `learning_curve()`, `from_file()`, and `save()`. There is some minor code duplication between these methods and the corresponding `Learner` methods, but most of it was avoided thanks to last year's refactoring of the common code into utility functions. Some of those functions were changed to accommodate the new meta-learner. In addition, there is some new refactoring, primarily for the `from_file()` and `save()` methods of the two classes, which now use the refactored functions `_save_learner_to_disk()` and `_load_learner_from_disk()`.

The implementation supports passing keyword arguments to the underlying learners, as well as samplers and sampler arguments.

The `train()` implementation nicely incorporates grid search such that the learners underlying the voting meta-learner are automatically tuned with grid search (assuming the user requests it) before their predictions are used for voting. Grid search also works with `cross_validate()`, although, as expected, this is much slower since the per-fold grid search is now done for each underlying learner. Both of these methods also accept a list of parameter grids for tuning the underlying learners.

The `evaluate()` and `predict()` methods support returning (and writing out) not only the meta-learner's predictions but also the predictions from the underlying learners that were used in the voting process.
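For illustration, a hypothetical usage sketch (the learner names are arbitrary, and keyword arguments such as `voting` and `individual_predictions` follow the description above rather than guaranteed final signatures):

```python
from skll.data import Reader
from skll.learner.voting import VotingLearner

# Load training and evaluation data into SKLL FeatureSets.
train_fs = Reader.for_path("train.jsonlines").read()
test_fs = Reader.for_path("test.jsonlines").read()

# A soft-voting meta-learner over three underlying SKLL learners.
learner = VotingLearner(
    ["LogisticRegression", "SVC", "RandomForestClassifier"],
    voting="soft",
)

# With grid search enabled, each underlying learner is tuned
# before its predictions are used for voting.
learner.train(train_fs, grid_objective="accuracy")

# Hypothetical flag for also returning the underlying learners' predictions.
predictions = learner.predict(test_fs, individual_predictions=True)
```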
## Integration with `run_experiment`

Relevant files: `config/__init__.py`, `experiments/__init__.py`, and `experiments/utils.py`.

The `VotingLearner` class can be used via the SKLL API as described in the previous section. This PR also includes hooks that allow users to specify `VotingClassifier` and `VotingRegressor` as their chosen learners in an experiment configuration file as input to `run_experiment`. The hooks require users to specify the underlying estimators as fixed parameters using the `estimator_names` key. The following additional fixed parameters can also be specified: `voting_type`, `estimator_fixed_parameters`, `estimator_samplers`, `estimator_sampler_parameters`, and `estimator_param_grids`. These parameters are fully documented, and example configuration files are also included (see the documentation section below for details).

A new configuration field called `save_votes` is added to allow the user to save the predictions from the underlying learners in addition to the predictions from the `VotingClassifier` or `VotingRegressor`. The default value for this field is `False`.
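Putting these fields together, a sketch of what such a configuration file might look like (paths and names are made up, and the placement of `save_votes` under `[Output]` is an assumption):

```ini
[General]
experiment_name = voting_iris
task = evaluate

[Input]
train_directory = train
test_directory = test
featuresets = [["iris_features"]]
learners = ["VotingClassifier"]
fixed_parameters = [{"estimator_names": ["LogisticRegression", "SVC"], "voting_type": "soft"}]

[Output]
save_votes = true
results = output
predictions = output
```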
A new JSON encoder called `PipelineTypeEncoder` was added to support serializing `VotingLearner` instances to JSON. This was necessary since these instances in turn contain `Pipeline` instances.
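A minimal sketch of the general pattern, assuming pipelines are simply rendered as strings (the actual encoder may use a richer representation):

```python
import json

from sklearn.pipeline import Pipeline

class PipelineTypeEncoder(json.JSONEncoder):
    """Sketch: a JSON encoder that can handle Pipeline instances."""

    def default(self, obj):
        if isinstance(obj, Pipeline):
            # Render pipelines as strings so serialization never fails.
            return str(obj)
        # Defer to the default behavior (which raises TypeError) otherwise.
        return super().default(obj)

# json.dumps({"pipeline": Pipeline([("noop", "passthrough")])},
#            cls=PipelineTypeEncoder)
```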
## New tests

Relevant files: `tests/test_voting_learners_api_*.py`, `tests/test_voting_learners_expts_*.py`, and `tests/utils.py`.

The artificially constructed datasets we use for the existing tests are not very useful for `VotingLearner` tests since they are either too small or too toy-ish. Therefore, we use the digits and housing datasets (included in scikit-learn) for the classification and regression tests, respectively. To make these datasets easy to use, two new utility functions are added: `make_digits_data()` and `make_california_housing_data()`.

Note that we were already using a version of the digits dataset in the learning curve tests for the `Learner` class. That use was refactored to use the new `make_digits_data()` utility function.
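A simplified sketch of what `make_digits_data()` might do (the feature-naming scheme is an assumption; the real helper lives in `tests/utils.py`):

```python
from sklearn.datasets import load_digits
from skll.data import FeatureSet

def make_digits_data():
    """Sketch: wrap scikit-learn's digits data in a SKLL FeatureSet."""
    digits = load_digits()
    # Turn each row of pixel values into a dictionary of named features.
    features = [{f"f{j:02d}": value for j, value in enumerate(row)}
                for row in digits.data]
    return FeatureSet("digits",
                      ids=list(range(len(digits.target))),
                      features=features,
                      labels=digits.target)
```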
All of the API tests exercise the methods of the `VotingLearner` class by first calling the method on an instantiated `VotingLearner` in SKLL space, then using only scikit-learn functions to perform (nearly) identical operations, and then comparing the two results to make sure they are as close as possible. Most of the classification tests only compare those results up to 2 decimal places because there are some inherent differences between scikit-learn and SKLL that make it difficult to replicate SKLL operations exactly in scikit-learn space. Most of the regression tests can compare more decimal places since there are no probabilities involved.
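As an illustration of the scikit-learn side of this pattern (with stand-in data instead of the digits set and arbitrary learner choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Stand-in data; the real tests use the digits dataset.
X, y = make_classification(n_samples=250, random_state=42)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# The scikit-learn-space equivalent of a soft-voting VotingLearner.
clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svc", SVC(probability=True))],
    voting="soft",
)
clf.fit(X_train, y_train)
sklearn_probs = clf.predict_proba(X_test)

# In the actual tests, probabilities from the SKLL VotingLearner would be
# compared against sklearn_probs only up to 2 decimal places, e.g.:
# np.testing.assert_array_almost_equal(skll_probs, sklearn_probs, decimal=2)
```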
Since the API tests are so comprehensive, there is no real reason for the experiment tests to run real experiments, since the same API methods are called from within `run_configuration()` anyway. For this reason, we focus all of the experiment tests on making sure that the right methods are called based on the "task" value and that they are called with the right arguments derived from the fields specified in the configuration file. To do so, we mock the appropriate API methods, call `run_configuration()` on different configuration files, and check that the mocked methods were called the expected number of times and with the expected arguments.
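The shape of such a mocked test (the patch target, configuration file name, and asserted keyword argument are all illustrative):

```python
from unittest.mock import patch

from skll.experiments import run_configuration

def test_voting_learner_train_called_correctly():
    """Sketch: verify the right method is called with the right arguments."""
    with patch("skll.learner.voting.VotingLearner.train") as mock_train:
        run_configuration("test_voting_learner_train.cfg", local=True)
        # One call per featureset, with arguments derived from the config.
        assert mock_train.call_count == 1
        assert mock_train.call_args[1]["grid_search"] is True
```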
To make experiment testing easier, we add a new utility function called `fill_in_config_options_for_voting_learners()` that takes an empty configuration file template (`tests/configs/test_voting_learner.template.cfg`) and populates it with the right values depending on the arguments with which it was called. In addition, a new class called `BoolDict` is added that returns `False` as the default value for key lookups rather than `None`. This simplifies the `fill_in_config_options_for_voting_learners()` function significantly.
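One way to get that behavior is a small `dict` subclass; a sketch with a made-up key:

```python
class BoolDict(dict):
    """Dictionary that returns ``False`` for missing keys."""

    def __missing__(self, key):
        return False

options = BoolDict(with_grid_search=True)
options["with_grid_search"]  # -> True
options["with_soft_voting"]  # -> False (the default)
```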
New tests were added to `test_input.py` for the `save_votes` field. In addition, existing tests in the same file were updated to accommodate this new field.

Note that there is still some code duplication between the tests, but putting them all into the same file would add a lot more complexity (`if` statements and the like). This way they are fully self-contained and can be run fully independently.
## Documentation

Relevant files: `doc/run_experiment.rst`, `doc/api/learner.rst`, `examples/iris/voting.cfg`, `examples/boston/voting.cfg`, and others (see below).

- Updated the `run_experiment` documentation to include a detailed description of `VotingClassifier` and `VotingRegressor` and added an entry for the `save_votes` configuration field. Note that some of the links will only work after the PR is merged.
- Updated `doc/api/learner.rst` to include the `VotingLearner` class and improved sub-headings.
- Added new example configuration files using `VotingClassifier` and `VotingRegressor` for the Iris and Boston examples, respectively.
- Updated the `doc/contributing.rst` page for readability and fixed links to existing methods. Note that some of the links will only work after the PR is merged.
- Removed all top-level imports from `api/skll.rst` and updated `api/quickstart.rst` and the Tutorial notebook.
- Updated `doc/conf.py` to fix imports and changed the year from 2019 to 2021.

## Other changes
There was a major bug when using samplers: we were calling `fit_transform()` on the test set rather than `transform()`. This was fixed, and the `test_sparse_predict_sampler()` sampler test in `tests/test_classification.py` was updated.
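In sketch form, with `Nystroem` standing in for whichever sampler is configured:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

rng = np.random.RandomState(42)
train_features = rng.rand(100, 10)
test_features = rng.rand(20, 10)

sampler = Nystroem(n_components=50, random_state=42)

# Fit the sampler on the training data only.
train_transformed = sampler.fit_transform(train_features)

# The bug: re-fitting on the test set produced a transformation
# inconsistent with the one learned from the training data.
# test_transformed = sampler.fit_transform(test_features)

# The fix: apply the already-fitted sampler to the test set.
test_transformed = sampler.transform(test_features)
```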
A new parallel build job was added to both the Travis and Azure CI builds to accommodate the new tests. Test files were redistributed across all 6 jobs to make sure the overall build time is still optimized.

The Travis CI configuration now creates a new conda environment rather than using the default miniconda one. In addition, it does not activate that environment when running the tests. Finally, it also configures `nosetests` via environment variables for simplicity.

Both Azure CI and Travis CI now set the logging level for tests to `WARNING` and no longer use `--nologcapture`, which significantly reduces the size of the logs produced.

The warning and error messages in `Learner.learning_curve()` have been tweaked to be more concise.