Conversation

@desilinguist desilinguist (Collaborator) commented Dec 3, 2020

This PR closes #648. More specifically, it:

  • Adds a warning in learning_curve() for probabilistic learners.
  • Modifies train_and_score() to handle probabilities by using argmax.
  • Modifies the learning curve implementation test to handle probabilities. The other tests do not need to be modified since they all test things downstream of the implementation.
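The argmax conversion described in the second bullet can be sketched as follows (a minimal illustration of the idea, not SKLL's actual `train_and_score()` code; the array shapes and values are assumed):

```python
import numpy as np

# Hypothetical probabilistic output: one row per example,
# one column per class (shape: n_examples x n_classes).
probabilities = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Collapse the probabilities to hard labels by taking the most
# likely class for each example, which is what the PR does before
# the learning-curve metric is computed.
predictions = np.argmax(probabilities, axis=1)
print(predictions)  # [1 0 2]
```

Once the predictions are hard labels, the rest of the scoring pipeline can proceed unchanged, which is why the downstream tests did not need modification.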

@desilinguist desilinguist requested review from a user, aoifecahill and mulhod December 3, 2020 20:21

codecov bot commented Dec 3, 2020

Codecov Report

Merging #649 (9bf85f6) into main (26b96c2) will decrease coverage by 0.05%.
The diff coverage is 60.00%.

@@            Coverage Diff             @@
##             main     #649      +/-   ##
==========================================
- Coverage   95.11%   95.06%   -0.06%     
==========================================
  Files          27       27              
  Lines        3093     3098       +5     
==========================================
+ Hits         2942     2945       +3     
- Misses        151      153       +2     
Impacted Files Coverage Δ
skll/learner/utils.py 94.33% <33.33%> (-0.62%) ⬇️
skll/learner/__init__.py 96.28% <100.00%> (+0.01%) ⬆️


- Clean up a docstring that Sphinx complains about.
@desilinguist desilinguist (Collaborator, Author) commented:
@aoifecahill @mulhod @bndgyawali This is ready for review. Codecov is wrong about the 2 lines in train_and_score() not being covered by tests. They are the lines that make the new test pass. However, I suspect nosetests can't detect that because train_and_score() is called via joblib.parallel().

@mulhod mulhod (Contributor) left a comment


Looks great!

I tested by making a new config named learning_curve.cfg for the Boston dataset:

[General]
experiment_name = Example_LearningCurve
task = learning_curve

[Input]
train_directory = all
featuresets = [["example_boston_features"]]
featureset_names = ["example_boston"]
feature_scaling = both
learners = ["RandomForestClassifier", "SVC", "LogisticRegression"]
suffix = .jsonlines

[Tuning]

[Output]
probability = true
metrics = ['unweighted_kappa']
results = output_lc
log = output_lc

probability is set to true. The data was created by running:

mkdir all; cat {train,test}/*jsonlines > all/example_boston_features.jsonlines

(The dataset has just a smidge over the minimum of 500 examples.)

I see this in the output:

$ run_experiment learning_curve.cfg
[...]
2020-12-03 22:39:12,411 - Example_LearningCurve_example_boston_LogisticRegression - INFO - Generating learning curve(s)
2020-12-03 22:39:12,411 - Example_LearningCurve_example_boston_LogisticRegression - WARNING - Since ``probability`` is set, the most likely class will be computed via an argmax before computing the curve.

I see that warning for each learner and it's prominent enough, I think.
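The warning shown in the log output could be emitted along these lines (a hypothetical sketch, not SKLL's actual implementation; the function name and signature are invented, while the message text follows the log lines above):

```python
import logging

def generate_learning_curve(learner_name, probability, logger=None):
    # Hypothetical helper illustrating the warning added in this PR;
    # the real logic lives in SKLL's learning_curve() code path.
    logger = logger or logging.getLogger(learner_name)
    if probability:
        logger.warning(
            "Since ``probability`` is set, the most likely class will be "
            "computed via an argmax before computing the curve."
        )
    logger.info("Generating learning curve(s)")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    generate_learning_curve("Example_LearningCurve", probability=True)
```

Emitting the warning once per learner, as the log output shows, keeps it visible without flooding the log for every training-set size on the curve.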

@desilinguist desilinguist merged commit 29614f3 into main Dec 4, 2020
@desilinguist desilinguist deleted the 648-learning-curve-fix-for-probability branch December 4, 2020 04:12

Development

Successfully merging this pull request may close these issues.

Learning curve won't work for probabilistic classifiers

4 participants