[Callbacks] FEA Add the ScoringMonitor callback #33407
FrancoisPgm wants to merge 76 commits into scikit-learn:callbacks from
Conversation
jeremiedbb
left a comment
Thanks for the PR @FrancoisPgm. Here's a first pass
Instead of passing a partial […], a callback can request a value by setting a class attribute. We can add a […].
I was thinking […]. EDIT: As discussed in the meeting, I ended up going with […].
ogrisel
left a comment
For the record, in today's meeting we raised the idea of exposing some kind of log index data structure that would record per-run metadata, such as the run id, the estimator name, and the run start UTC datetime (as an ISO-formatted string), to make it convenient to find which log the user is interested in and then make it possible to retrieve a particular log by its run id.
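A minimal sketch of what such a log index could look like (all names here are hypothetical, nothing below is part of this PR):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class RunRecord:
    """Per-run metadata recorded when a monitored fit starts."""
    run_id: str
    estimator_name: str
    started_at: str  # run start UTC datetime as an ISO-formatted string


class LogIndex:
    """Maps run ids to per-run metadata so a user can locate their log."""

    def __init__(self):
        self._runs = {}

    def register(self, estimator_name):
        run_id = uuid4().hex
        self._runs[run_id] = RunRecord(
            run_id=run_id,
            estimator_name=estimator_name,
            started_at=datetime.now(timezone.utc).isoformat(),
        )
        return run_id

    def find(self, estimator_name):
        # Convenience lookup: all recorded runs for a given estimator name.
        return [r for r in self._runs.values() if r.estimator_name == estimator_name]


index = LogIndex()
run_id = index.register("HistGradientBoostingRegressor")
records = index.find("HistGradientBoostingRegressor")
```

Retrieving a particular log would then only require the `run_id` recorded in the index.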
```python
else:  # eval_on == "both"
    train_log = [entry for entry in log if entry["eval_on"] == "train"]
    val_log = [entry for entry in log if entry["eval_on"] == "val"]
    assert len(train_log) == len(val_log) == max_iter
```
We could also compute the score(s) of the estimator (after fit) on the same data and check that they match the last entry of the log.
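The suggested check could look roughly like this (a toy sketch with a fake log; the real entry layout may differ — here we assume each entry stores its value under a "score" key, and `final_score` stands in for `estimator.score(X, y)`):

```python
import math

# Hypothetical log produced by ScoringMonitor during fit.
log = [{"eval_on": "train", "score": 0.5}, {"eval_on": "train", "score": 0.8}]

# Stand-in for the score computed on the fitted estimator after fit.
final_score = 0.8

# The check: the last logged score should match the post-fit score.
assert math.isclose(log[-1]["score"], final_score)
```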
```python
MetaEstimator(est).fit(X=X, y=y, X_val=X_val, y_val=y_val)

# with metadata-routing enabled and requested
est.set_fit_request(X_val=True, y_val=True)
```
I think it would make sense to set those requests on the ScoringMonitor estimator itself (maybe even by default), and then have the CallbackSupportMixin automatically request this metadata if there is a registered callback that requests them on the estimator instance?
I think it would make sense to set those requests on the ScoringMonitor estimator itself
You mean something like ScoringMonitor().set_on_fit_task_begin_request("X_val") ?
It would not work because the estimator might be able to pass some values for some tasks but not for others, so we'd get an error. I don't think that we should make the callbacks metadata routers.
I naively thought about something like:
`ScoringMonitor().set_on_fit_task_end_request(X_val=True, y_val=True)`
but in the end I don't like it, and it might not work.
Instead, we could have BaseEstimator.set_callbacks automatically call:
`self.set_fit_request(X_val=True, y_val=True)`
if one of the callbacks declares that it needs extra requests in some kind of public attribute.
We can probably defer this design discussion to a follow-up issue and keep the manual est.set_fit_request(X_val=True, y_val=True) for now.
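A rough sketch of that idea (everything here is hypothetical API, including the `required_fit_metadata` attribute name, just to illustrate the mechanism):

```python
class ScoringMonitorSketch:
    """Stand-in for ScoringMonitor; declares the extra fit params it
    needs in a public attribute (hypothetical name)."""
    required_fit_metadata = ("X_val", "y_val")


class BaseEstimatorSketch:
    """Stand-in for BaseEstimator with the set_callbacks hook."""

    def set_callbacks(self, *callbacks):
        self._callbacks = callbacks
        requested = set()
        for cb in callbacks:
            requested |= set(getattr(cb, "required_fit_metadata", ()))
        # The real implementation would instead call
        # self.set_fit_request(**{name: True for name in requested}) here.
        self._requested_fit_metadata = {name: True for name in requested}
        return self


est = BaseEstimatorSketch().set_callbacks(ScoringMonitorSketch())
```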
We can only set_fit_request on estimators (that inherit from BaseEstimator). Callbacks don't do this and don't expose the set_request methods. I agree with @jeremiedbb that it is fine to pass metadata as usual params, instead of routing them to the callback.
In a separate discussion, we came to the conclusion that we have 2 options regarding metadata-routing:
- don't make callbacks routers (as being done currently).
- make callbacks routers and turn every estimator into a router as well.
The former is obviously a lot simpler but has limitations. For instance, it doesn't allow passing different sample weights for the fitting and for the scoring.
I think that this use case is quite niche and that for now we can go with the first option. We can reconsider the second option later if we really want to enable the advanced use cases.
```python
    X=X, y=y
)
subcontext = callback_ctx.subcontext(task_id=i, task_name="iteration")
subcontext.call_on_fit_task_begin(X=X, y=y, metadata=metadata)
```
I would suggest not calling this "metadata", to not be confused with metadata routing, which is a different concept and which this doesn't relate to. I think this could otherwise be very confusing for contributors.
This was introduced in #33572 as

```python
VALID_HOOK_PARAMS_IN = ["X", "y", "metadata", "reconstruction_attributes"]
VALID_HOOK_PARAMS_OUT = ["X", "y", "metadata", "fitted_estimator"]
```

I was wondering if we could call it fit_extras instead.
An LLM also suggests hook_payload, which, after what I have read about what "payload" means in computing, fits even better, because it is generic and signals that this is data coming with this specific event.
to not be confused with metadata routing
But it is the metadata from metadata-routing, see https://github.com/scikit-learn/scikit-learn/pull/33407/changes#diff-9d7ab703cc6480acfdb50d80729bb33776dcf3269a07a8dedd1d4383dcb281abR372
But it is the metadata from metadata-routing,
I agree that currently this dict contains the routed metadata. What I mean is that it shouldn't be named as if it is part of the routing. What I am thinking is:
a) metadata is not part of the routing conceptually, since the sub-estimator's method is the end-consumer (even if it passes it forward to another function or to a callback).
b) Metadata routing generically supports any routed metadata, so users can define custom estimators and be sure their metadata gets routed to them if they use them in a scikit-learn meta-estimator. I think we wouldn't want to expose every param that gets passed through metadata routing to the callbacks, right? So if we pick information that gets used in implemented callbacks, there will be metadata that had been routed but which is not part of the metadata param that gets passed to a callback.
c) We could also pass something that was computed/defined on the sub-estimator's method directly for a different callback than ScoringMonitor. We actually have to, because there is no other param available (except the reconstruction attributes).
In the SLEP, I read:

> - "metadata": a dictionary containing training and validation metadata, e.g. sample weights, `X_val`, `y_val`, etc.
So my impression is that the idea of this parameter is to pass something that is known at the estimator level to the callback hook, and it could be something routed to the estimator or something only the estimator knows. For the reconstruction parameters, you have now defined a separate parameter, but what if the estimator splits data into train and validation sets internally, or what about passing n_iter, or loss, or best_score_so_far?
Regarding the name, that's what I put on the SLEP and people are currently voting on that, so we can't really change it now. I mean we could, but that would give a bad image.
I think we wouldn't want to expose every param, that gets passed though metadata routing, to the callbacks, right?
I think that we want to pass everything that is routed. Then it's up to the callback to do something with it (don't use any, use all, only use specific metadata). Maybe in sklearn we'll never use all the possible metadata in built-in callbacks, but we leave the door open for custom callbacks to use them.
We could also pass something that was computed/defined on the sub-estimator's method directly
That won't be included in metadata. Instead it will be added as an extension of the SLEP. For this SLEP we wrote the signature with the minimal set of params required for progress bars and score monitoring. We're confident that it will also be enough for early stopping and snapshots. For structured logging we might indeed want to pass new information and we'll extend the api then.
I think that we want to pass everything that is routed.
Something that was computed/defined on the sub-estimator's method directly
That won't be included in metadata.
I see. Thanks for your clarifications. If all routed params will be in metadata and we do not intend to put anything else into it, then metadata would be the best param name indeed. (I think I had missed the discussions and the reasons for the change scikit-learn/enhancement_proposals@3cdaa00 in the SLEP.)
What I understand from what you write, @jeremiedbb, is that metadata is supposed to be everything that gets routed into the estimator's fit.
But there are a few things that are still unclear to me. I will try to write down what I think, but no pressure for a long answer. I mainly want to write this down cleanly for myself:
For sample_weight, ScoringMonitor should use the sample_weight routed to the scorer, since it is a scoring callback. But instead it uses the sample_weight available in the estimator's fit method. Users can set different requests on fit and score, also a different sample_weight (I'm not sure if it is a good idea to do so, though). So there is a chance for things going wrong, I think.
Additionally, sample_weight can also be passed directly (as test_sample_weights_and_metadata_routing shows), so in this case, sample_weight comes without routing and is strictly not routed metadata.
If metadata only contains de facto routed params, passing X_val and y_val to the callback can currently only happen by setting a metadata request for them on HistGradientBoosting (which is the only estimator currently consuming these and the only one able to set a request). Does that mean only HistGradientBoosting can use ScoringMonitor with eval_on="val" or "both" if routing is enabled?
Would other estimators that calculate similar values then not be able to use ScoringMonitor with eval_on="val" or "both" or could they also pass their computed values directly, similar to sample_weight which can be passed directly by the user?
(I didn't dig deep, but I found SGD using a validation_mask, and a few estimators using X_val and y_val after an internal train_test_split. It seems like they use validation sets internally.)
Here is another pass of review. (I haven't fully finished reviewing and there are a few things undefined on the metadata param).
To sum up my comment #33407 (comment), I am not sure we should use it exclusively for routed params, because it would require users to call estimators differently.
Thanks so far for your work @FrancoisPgm and @jeremiedbb. ❤️
```python
# with metadata-routing enabled and requested
est.set_fit_request(X_val=True, y_val=True)
MetaEstimator(est, n_outer=2, n_inner=3).fit(X=X, y=y, X_val=X_val, y_val=y_val)
```
Maybe in the end we could check that the X_val and y_val the callback received are equal (the same object?) to what was passed?
I don't see how we could do that with only the ScoringMonitor callback. We need the TestingCallback that records the data passed to the hooks. But that's part of the bigger issue of adding more tests for the hooks: #33324
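Such a TestingCallback could be quite small — a hypothetical sketch (hook names borrowed from the discussion above; the real callback protocol in the PR may differ):

```python
class TestingCallback:
    """Records every payload passed to the hooks so tests can assert on it."""

    def __init__(self):
        self.calls = []

    def on_fit_task_begin(self, **kwargs):
        self.calls.append(("on_fit_task_begin", kwargs))

    def on_fit_task_end(self, **kwargs):
        self.calls.append(("on_fit_task_end", kwargs))


# Toy usage: simulate an estimator invoking the begin hook once.
cb = TestingCallback()
cb.on_fit_task_begin(X=[[0.0]], y=[0], metadata={"X_val": None})
event, payload = cb.calls[0]
```

A test could then assert that `payload["metadata"]["X_val"]` is the exact array that was passed to fit.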
```python
# Without metadata-routing enabled, passing X_val and y_val gives an error
msg = re.escape(
    "[X_val, y_val] are passed but are not explicitly set as requested or not "
    "requested for MaxIterEstimator.fit"
)
```
I think we would not need the two checks if UnsetMetadataPassedError raises here. We already test this for every meta-estimator (not this test class, of course) and it works consistently.
```python
# error if sample_weight not requested
scorer = make_scorer(r2_score)
callback = ScoringMonitor(eval_on="train", scoring={"r2": scorer})
est = MaxIterEstimator().set_callbacks(callback)
with pytest.raises(
    TypeError,
    match=re.escape("score got unexpected argument(s) {'sample_weight'}"),
):
    est.fit(X=X, y=y, sample_weight=sample_weight)
```
As in the other test, we don't necessarily need this here, since this behaviour is already tested for every meta-estimator (not the MetaEstimator test class, of course). What do you think about keeping this as a separate test for the test class?
I'm leaving them for now while we're figuring out how we want to handle metadata routing. It helps make sure that the testing meta-estimator correctly implements metadata routing during this time.
```python
scorer = make_scorer(r2_score)
scorer.set_score_request(sample_weight=True)
callback = ScoringMonitor(eval_on="train", scoring={"r2": scorer})
MaxIterEstimator().set_callbacks(callback).fit(
    X=X, y=y, sample_weight=sample_weight
)
```
Here, sample_weight is still not routed via metadata routing to MaxIterEstimator.fit; it is only routed to the scorer.
It is passed directly into MaxIterEstimator.fit, and it is this sample_weight (not the routed one) that gets used in the callback. This means that if we routed a different sample_weight to the scorer, the user could be oddly surprised by differing results. We need to document this well, since there is no solution to fix this without metadata routing, as far as I'm aware.
For fixing this test (if we still want to test whether directly passed versus routed sample_weight is the same): if we want to route sample_weight to the estimator, we need to do a MaxIterEstimator().set_fit_request(sample_weight=True) (and pass it as a sub-estimator in a meta-estimator).
On a general note: in the meetings we were only talking about using metadata routing to make metadata available on the ScoringMonitor. This is not the whole picture.
We need to find solutions that work without metadata routing first, since we will keep passing arguments like sample_weight directly (as this test shows).
Then, we can add routing functionality.
sklearn/callback/_scoring_monitor.py
Outdated
```python
    return

if self.eval_on in ("train", "both"):
    score_params = metadata.get("train", {}).copy()
```
```python
score_params = metadata.get("train", {}).copy()
```

This is very misleading. We don't retrieve the score_params from metadata routing here, we only pass what estimator.fit has.
The base problem is that this is not passing the correct information to the scorer.
As a first step, we should document it well, in docstrings as well as via the variable names used. Therefore, it would be much better to keep calling it sample_weight, or more correctly, fit_sample_weight, or fit_metadata (if we want to be open to custom scorers).
score_params is metadata routing vocabulary. We are outside of metadata routing here and the scorer instance that may be used to score the estimator otherwise (as part of a cv splitter for instance) may receive a different set of score_params via metadata routing.
In MetaEstimator, you made me rename metadata to fit_params because it is what we're passing to fit.
Here, this is what we're passing to score, so it is score_params. And it does have to do with metadata routing, because if routing is enabled, score will raise if I don't pass sample_weight while the scorer requested it.
If we want to be robust and explicit, we can inspect what the scorer requests and take it from metadata. I'm not sure how to do that properly though.
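One possible shape of that inspection, written without scikit-learn internals — a hedged sketch assuming we can obtain the scorer's request flags as a plain dict (how to extract them from the real routing machinery is exactly the open question here):

```python
def select_score_params(scorer_requests, fit_metadata):
    """Keep only the params the scorer explicitly requested.

    scorer_requests: mapping of param name -> True/False request flag
    fit_metadata:    the metadata dict available on the estimator's fit
    """
    return {
        name: fit_metadata[name]
        for name, requested in scorer_requests.items()
        if requested and name in fit_metadata
    }


# Toy usage: the scorer requested sample_weight but not groups.
requests = {"sample_weight": True, "groups": False}
metadata = {"sample_weight": [1.0, 2.0], "groups": [0, 1]}
params = select_score_params(requests, metadata)
```

The callback would then pass only `params` to the scorer instead of everything fit received.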
sklearn/callback/_scoring_monitor.py
Outdated
```python
def __init__(self, *, eval_on="train", scoring):
    self.eval_on = eval_on
    self.scoring = scoring
    # Turn the scorer into a MultimetricScorer which can route score params
```
```diff
- # Turn the scorer into a MultimetricScorer which can route score params
+ # Turn the scorer into a _MultimetricScorer
```
Nit.
sklearn/callback/_scoring_monitor.py
Outdated
```python
scores = {"eval_on": eval_on}
scorer = self._estimator_scorers[context.estimator_name]
score_params = {k: v for k, v in score_params.items() if v is not None}
score = scorer(fitted_estimator, X, y, **score_params)
```
To verify my understanding: when we get the fitted_estimator from reconstruction attributes here, it is desirable that each fitted_estimator is a bit different from another, depending on the attributes learned up until the time this hook is called in fit.
I can see that for MaxIterEstimator, n_iter changes between the hook calls. But what would differ in other estimators? Would they differ by the param(s) passed as reconstruction_attributes into call_on_fit_task_end, or maybe also by some attribute of the estimator instance via the copy() in _from_reconstruction_attributes (accidentally or on purpose)?
Could you give some examples?
I think I can see: we only need to pass a subset of all the learned attributes (ending in _) to reconstruction_attributes, and only if we want to overwrite those the copied estimator exposes at this point in time anyway. Copying the estimator does most of the work of adjusting the estimator.
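To illustrate the mechanism, here is a simplified sketch of what `_from_reconstruction_attributes` conceptually does (not the actual implementation: the helper name and toy class below are illustrative only):

```python
import copy


def from_reconstruction_attributes(estimator, reconstruction_attributes):
    """Copy the partially fitted estimator and overwrite only the
    attributes that must differ at this point of fit (e.g. n_iter_)."""
    fitted = copy.copy(estimator)
    for name, value in reconstruction_attributes.items():
        setattr(fitted, name, value)
    return fitted


class ToyEstimator:
    pass


est = ToyEstimator()
est.coef_ = [1.0, 2.0]  # learned attribute already present on the instance

# Snapshot at iteration 3: only n_iter_ needs overwriting; the copy
# carries coef_ over as-is, and the original estimator is untouched.
snapshot = from_reconstruction_attributes(est, {"n_iter_": 3})
```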
```diff
  inner_ctx.call_on_fit_task_begin(X=X, y=y)

- est.fit(X, y)
+ est.fit(X=X, y=y, **fit_params)
```
I wonder if passing **params (containing **fit_params and **score_params) to each consumer would do the trick and prevent us from making every estimator and every callback into a metadata router.
We'd route a bit more information than before, and we'd need to have an additional **params kwarg on consuming estimators.
But since score() can now be called from inside fit when a ScoringMonitor callback is set (or any other callback that uses sample_weight or X_val or user-specified kwargs), that would not be so surprising.
The callback could then use the routed information (for instance if a sample_weight=True/False request is set on a scorer) or require the user to set it on its scorer if sample_weight is not set yet.
```python
if select == "most_recent":
    return logs[-1] if logs else {}

return logs
```
Do we want to give users the option to delete the logs? Currently, they accumulate and the users don't have any control over this.
```diff
  return logs
+
+ def clear_logs():
+     ....
```
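A hedged sketch of how such cleanup could fit alongside the existing log retrieval (class and method names are hypothetical; only the `select="most_recent"` behaviour mirrors the quoted diff):

```python
class LogStore:
    """Accumulates per-run log entries; users can clear them to bound memory."""

    def __init__(self):
        self._logs = []

    def append(self, entry):
        self._logs.append(entry)

    def get_logs(self, select="all"):
        # Mirrors the quoted diff: "most_recent" returns the last entry,
        # or an empty dict when nothing has been logged yet.
        if select == "most_recent":
            return self._logs[-1] if self._logs else {}
        return list(self._logs)

    def clear_logs(self):
        self._logs.clear()


store = LogStore()
store.append({"run": 1})
store.clear_logs()
```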
```
- "data": the recorded scores for the run. Each score value is associated
  with the detailed context of the score computation.
```
Just a suggestion (since I had struggled with this), feel free to not apply.
```diff
- - "data": the recorded scores for the run. Each score value is associated
-   with the detailed context of the score computation.
+ - "data": the recorded scores for each step of the run. Each score
+   value is annotated with the path of the task leading to score
+   computation.
```
Reference Issues/PRs
Towards #27676
What does this implement/fix? Explain your changes.
Add the ~~MetricMonitor~~ `ScoringMonitor` callback, which evaluates a ~~metric~~ scorer on an estimator during fit and logs the values.

AI usage disclosure
I used AI assistance for:
Any other comments?
ping @jeremiedbb
EDIT: the callback previously called `MetricMonitor` is now renamed `ScoringMonitor` to avoid confusion, since it uses scorers to compute the metric. An actual `MetricMonitor`, which would take a scikit-learn metric as an argument, is considered for a future PR.