WIP sklearn compatible ThresholdOptimizer #282
adrinjalali wants to merge 8 commits into fairlearn:main from
Conversation
Signed-off-by: adrinjalali <adrin.jalali@gmail.com>
def __init__(self, *, estimator=None, constraints=DEMOGRAPHIC_PARITY,
             grid_size=1000, flip=True, plot=False, random_state=None):
We have lost unconstrained_predictor and all of the related logic. Was this intentional? The idea is that if this argument is present, we should use it in fit to initialize self._estimator (without calling self.unconstrained_predictor.fit(...)).
@adrinjalali can elaborate but from what I understand scikit-learn doesn't take already fitted models, but rather fits them itself. We can still have another version where we take an already fitted one, but it wouldn't be scikit-learn compatible. Is that about accurate?
Yes. GridSearchCV clones the estimators before fitting them, and that clears all the state they have. If we want to allow the user to re-fit using a pre-fitted estimator, we can implement a warm_start strategy for the estimator to reuse the previously fitted one. But by default all of these should be fit from scratch, since grid search is fitting them on a new part of the data each time anyway.
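To make the cloning point concrete, a minimal illustration (LogisticRegression is just a stand-in estimator): sklearn.base.clone copies only the constructor parameters, so all fitted state is discarded.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
est = LogisticRegression().fit(X, y)
print(hasattr(est, "coef_"))    # True: fitted state is present

# GridSearchCV does the equivalent of this before every fit:
fresh = clone(est)
print(hasattr(fresh, "coef_"))  # False: cloning cleared the fitted state
```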
That sounds reasonable. Would be sad to lose the existing functionality if we're planning to add it back in anyway... any chance we can add that warm start in this PR, or is that a major change?
Sorry for the delay on this, but I think that warm_start is confusing in this context.
Just to state the obvious: the high-level semantics of ThresholdOptimizer (after it is fitted) is really that of a two-stage pipeline. The first stage is implemented by self._estimator.predict(). It takes as input X and returns y. The returned y is then transformed into y' (inside self.predict()).
We would like to enable two styles of fitting this pipeline: (1) fitting both steps from the provided data; or (2) fitting only the second step from the provided data (and using the provided unconstrained_predictor as the first step). Now if I think about a warm start, I would expect re-fitting a previously fitted pipeline, using the previous pipeline as a warm start. This to me is quite different from (2), which is just fitting the second stage of the pipeline (from scratch) while keeping the first stage of the pipeline untouched.
I'm not quite sure I understand the issue with the interaction between the cloning behavior and the unconstrained_predictor argument.
maybe you're referring to this?
https://scikit-learn.org/stable/glossary.html#term-metaestimators
To help resolve this, I suggest that we also implement the pattern from CalibratedClassifierCV.
That would mean using prefit=False instead of warm_start=False.
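A minimal sketch of that prefit pattern, with hypothetical names (PostProcessorSketch is not the actual fairlearn class): prefit=True keeps the first stage untouched and only fits the second stage, while prefit=False clones and refits the estimator from scratch.

```python
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone


class PostProcessorSketch(BaseEstimator, MetaEstimatorMixin):
    """Illustrative two-stage post-processor following the
    CalibratedClassifierCV-style prefit pattern."""

    def __init__(self, estimator=None, prefit=False):
        self.estimator = estimator
        self.prefit = prefit

    def fit(self, X, y):
        if self.prefit:
            # stage 1 stays untouched; only stage 2 would be fitted here
            self.estimator_ = self.estimator
        else:
            # stage 1 is (re)fitted from scratch on the provided data
            self.estimator_ = clone(self.estimator).fit(X, y)
        # ... fit the thresholding (stage 2) on self.estimator_'s output ...
        return self
```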
Sorry, it took me a while to concentrate enough to understand what you mean.
I understand that warm_start is different from what we have here. I'm just not sure why it's really needed. In a usual pipeline, we usually don't have pre-fitted estimators to feed to meta-estimators. Let me open a separate issue to talk about it there.
Should there be a set of test cases to ensure we remain sklearn compatible?
Signed-off-by: adrinjalali <adrin.jalali@gmail.com>
I'll add the sklearn tests once the existing tests pass. I'm finding it difficult to make them pass since they're a bit obscure to me, but getting there. One question: the output of
I don't think I understand your question properly. Please don't hesitate to ask if tests seem weird. They need to be made less obscure or explained with a little comment if they are a huge obstacle.
:param estimator: An untrained estimator that will be trained, and
    subsequently its output will be postprocessed
:type estimator: An untrained estimator, default=SGDClassifier(loss='log')
Do we want to provide a default? If so, why that one? Just generally wondering, but I don't have a strong opinion
I thought we couldn't, since this was mutually exclusive with unconstrained_predictor?
Hmmm… this would be related to the warm_start discussion below.
fairlearn/postprocessing/__init__.py (outdated)

    learn how to adjust the predictor's output from the training data.
    """

from ._postprocessing import PostProcessing  # noqa: F401
this probably warrants a description in the CHANGES.md as well :-)
    EQUALIZED_ODDS: _threshold_optimization_equalized_odds}


def _get_soft_predictions(estimator, X):
Right now we're using predict, not predict_proba, for this. However, when the scores are binary this method doesn't work as well as with real-valued scores, so in some examples (such as one of the notebooks) we've actually done this manually by wrapping the estimator and mapping predict_proba to predict. I'm not familiar with decision_function, but would we just use predict if the others aren't available?
Interesting. As @romanlutz says, previously we were only using predict as the first stage of the pipeline before applying thresholding. I think it's reasonable to consider the back-off sequence predict_proba -> decision_function -> predict, where in predict_proba we only look at the predicted probability of the positive class.
So maybe two fixes to the below:
- for predict_proba, just extract the second column
- after else, don't throw an error, but instead return predict
predict has very different semantics than predict_proba or decision_function. The probabilities predict_proba gives usually come from the raw values that decision_function gives. You can think of the decision_function values as the distances from the hyperplane in the case of an SVM.
predict always gives categorical results, which are not smooth. Either the method (ThresholdOptimizer) knows how to deal with that or it doesn't. If it doesn't, I don't think falling back to predict is a good idea. The predict function of a classifier shouldn't be giving soft outputs.
Alternatively, we could ask the user to either pass a classifier with predict_proba or decision_function provided, or a regressor on a binary classification problem which gives soft outputs as predictions. One of sklearn's classifiers uses a regressor under the hood.
I'm lacking relevant sklearn knowledge here: does every classifier have either predict_proba or decision_function or are there any that don't?
I don't understand what you mean by regression on a binary classifier. How would that be called? In some examples we use predict_proba from LogisticRegression and take the second entry (assuming binary classification) which is the probability for label 1.
ThresholdOptimizer has no problem handling hard 0/1 predictions and that's in fact what the original paper was assuming. That's also the reason why we used predict in our original implementation. However, in those cases when the classifier is also providing a soft prediction (in the form of decision_function or predict_proba), it is obviously better to use that.
So I definitely think that we should allow users to run ThresholdOptimizer based on predict of the underlying classifier. What I wonder about is whether we should allow folks to force ThresholdOptimizer to use predict even in those cases when the classifier also implements the soft versions.
PROS: it allows you to run the "textbook version" of the algorithm
CONS: it's essentially guaranteed to lead to inferior performance
I don't understand what you mean by regression on a binary classifier.
one example is that the GradientBoostingClassifier uses DecisionTreeRegressor at each step instead of a DecisionTreeClassifier. In some cases people use a regressor on the binary classification problem with 0/1 as the target, and they use the soft predicted output as a proxy for probability.
We can have a predict_method parameter for the users to set it to whatever they want, and have it as auto by default.
I like the final suggestion! Let's go with predict_method parameter defaulted to auto. For auto, we can try predict_proba -> decision_function -> predict (also I don't mind dropping the final predict).
The only weirdness is that whenever predict_method is predict_proba, we will need to extract the second column of the output.
| """ | ||
|
|
||
| def __init__(self, *, estimator=SGDClassifier(loss='log'), | ||
| constraints=DEMOGRAPHIC_PARITY, |
If we're using SGDClassifier as a default, shall we run it with random_state=0 (or something of the sort) to make it reproducible? Alternatively, I don't think we necessarily need to have a default here.
I'm not very set on the SGDClassifier, really no hard feelings there. We could instead use HistGradientBoostingClassifier which is a much more modern one and people seem to like it.
Regarding the random state, since the user can set the estimator parameter, they can set it while passing a random state to it. It's the same in GridSearch and the underlying estimator for example. If the user needs a deterministic result, they should pass an rng to both.
Should we have a default at all? Is this commonly done?
I'm in favor of dropping the default value for the estimator. It might even make sense to drop a default on constraints.
sklearn's common tests assume all estimators can be constructed using default values, except for the estimator argument of a meta-estimator. So we could leave this as required, and follow the [private] protocol to make the common tests understand how it should be constructed. They'll use a Ridge if the meta-estimator is a regressor and a LinearDiscriminantAnalysis if it's a classifier.
Alternatively, the tests on the class can be disabled/skipped and only test instances.
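A sketch of that [private] protocol as I understand it: at the time of this PR, sklearn's common tests read a _required_parameters class attribute to learn which constructor arguments have no default, and then fill them in with a simple base estimator themselves. The class below is illustrative only, not the real implementation.

```python
from sklearn.base import BaseEstimator, MetaEstimatorMixin


class ThresholdOptimizerSketch(BaseEstimator, MetaEstimatorMixin):
    # tells sklearn's common tests that `estimator` must be supplied,
    # so they construct instances with a default sub-estimator themselves
    _required_parameters = ["estimator"]

    def __init__(self, estimator, constraints="demographic_parity"):
        self.estimator = estimator
        self.constraints = constraints
```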
I think that the right approach is to go the meta-estimator route; this is something we will also want to use in our reductions, so if we implement it here, we can propagate it there.
On the API level, I think that ThresholdOptimizer should be similar to CalibratedClassifierCV.
Re. defaults, in CalibratedClassifierCV, the base_estimator is defaulted to None, which becomes LinearSVC(random_state=0) during fit. I'm happy to go with whatever plays better with sklearn; if we need to have some default, I think that any linear model would work (LinearSVC, LinearDiscriminantAnalysis, or SGDClassifier), I just thought it would be nicer if the default was deterministic.
@adrinjalali this is closable because you did this in multiple smaller PRs, right?
#342 should close this one :)
You probably mean the issue. I meant this PR :-) Are you still using this or is the PR closable?
@adrinjalali can I support the completion of this PR in any way? Or is it at this point OK for it to be closed?
The code has changed quite a bit now; I think fresh PRs, as you're doing, are the better approach.
This PR moves the ThresholdOptimizer towards a sklearn compatible one. Still working on fixing the tests, and adding sklearn's common tests for the estimator.
The changes are substantial, but I hope none of them are controversial.
p.s. the line limit is 80 here, since longer ones don't easily fit on my screen :D