Autologging functionality for scikit-learn integration with XGBoost (Part 2)#5078
Conversation
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
|
@jwyyy Thank you for your updates. I'll take a look first thing tomorrow! |
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
remove additional lines Signed-off-by: Junwen Yao <jwyiao@gmail.com>
564a2c5 to
6bc2a8e
Compare
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
|
Hi @dbczumar, I made some updates in this PR based on your review. The main change is what you suggested here. We reuse the Please let me know your ideas and suggestions when you have to review it again. Thank you very much! |
| @@ -0,0 +1,48 @@ | |||
| from pprint import pprint | |||
|
|
|||
There was a problem hiding this comment.
Awesome example! Can we add a brief README to this directory explaining what this example covers? E.g. Usage of XGBoost's scikit-learn integration with MLflow Tracking, particularly autologging?
| # params of xgboost sklearn models are logged in train() in mlflow.xgboost.autolog() | ||
| if flavor_name == FLAVOR_NAME: | ||
| _log_posttraining_metadata(autologging_client, self, *args, **kwargs) | ||
| autologging_client.flush(synchronous=True) |
There was a problem hiding this comment.
@jwyyy Instead of special casing XGBoost logic in fit_mlflow, can we define a new method called fit_mlflow_xgboost that just calls original(self, *args, **kwargs) and then logs self using mlflow.xgboost.log_model()? This will also allow us to revert changes to XGBoost autologging's train() method, since we can control how the model gets logged here.
We can then add a parameter to patched_fit (
mlflow/mlflow/sklearn/__init__.py
Line 1372 in 6e4b64b
fit_mlflow (for sklearn models) or fit_mlflow_xgboost (for xgboost sklearn models). Perhaps we can call this parameter fit_fn.
Let me know if you have questions here!
There was a problem hiding this comment.
@dbczumar Thank you for your suggestion! There is a small issue on model logging in the train() method when calling fit(). The fit() method calls train() internally and assigns a Booster object to the internal _Booster in a XGBoost sklearn model [see L1331]. The current train() in mlflow.xgboost.autolog() logs models before returning the model object, which means (1) the logged model is a Booster object; (2) we cannot log XGBoost sklearn models before assigning the trained Booster to _Booster. The changes in mlflow.xgboost.autolog() try to log sklearn models directly. I think we definitely can log sklearn models in fit_mlflow_xgboost() but it is extra work. Because models are logged when calling train(), and calling fit_mlflow_xgboost() just logs new information to replace old ones. However, I also think adopting your suggestion makes the code logic easier to read. Please let me know which solution sounds better to you. Thank you!
There was a problem hiding this comment.
@jwyyy Ah, thanks for letting me know! Can we decompose train() into two methods - one for parameter, metric, & non-model artifact logging, and one for model logging? We can then use the former method to patch xgboost.sklearn.train().
There was a problem hiding this comment.
Sounds good to me! I will make some adjustments.
| safe_patch_call_count = ( | ||
| safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count | ||
| ) | ||
| else: |
There was a problem hiding this comment.
On the subject of test coverage, can we add a test case to https://github.com/mlflow/mlflow/blob/master/tests/xgboost/test_xgboost_autolog.py ensuring that autologging works as expected for XGBoost scikit-learn models? Feel free to use code from your excellent example above.
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
| early_stopping = ( | ||
| num_pos_args >= early_stopping_index + 1 or "early_stopping_rounds" in kwargs | ||
| early_stopping = num_pos_args >= early_stopping_index + 1 or ( | ||
| "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"] |
There was a problem hiding this comment.
| "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"] | |
| in kwargs.get("early_stopping_rounds") |
Can we use get here?
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
|
Hi @dbczumar @harupy, I made some updates in this PR. Thank you for your review and suggestions! A new test file was added (to highlight some differences in XGBoost sklearn model testing), and it may contain tests more than what we need. We can remove them later. I also modified doc and re-organized xgboost examples. Please let me know your suggestions when you have time to review it again.Thank you very much! Happy Holidays to you! 😄 |
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
|
@jwyyy I pulled your PR branch and ran Looks like I just saw what |
|
@harupy Thank you for running the example! Yes, there is a line logging the model after training. We can remove it if it looks redundant. |
|
Let's remove it because the model is automatically logged. |
| @@ -0,0 +1,498 @@ | |||
| from packaging.version import Version | |||
There was a problem hiding this comment.
I think adding a simple test like below in tests/xgboost/test_xgboost_autolog.py should be enough.
@pytest.mark.large
def test_xgb_autolog_sklearn():
mlflow.xgboost.autolog()
X, y = datasets.load_iris(return_X_y=True)
params = {"n_estimators": 10, "reg_lambda": 1}
model = xgb.XGBRegressor(**params)
with mlflow.start_run() as run:
model.fit(X, y)
model_uri = mlflow.get_artifact_uri("model")
client = mlflow.tracking.MlflowClient()
run = client.get_run(run.info.run_id)
assert run.data.metrics.items() <= params.items()
artifacts = set(x.path for x in client.list_artifacts(run.info.run_id))
assert artifacts >= set(["feature_importance_weight.png", "feature_importance_weight.json"])
loaded_model = mlflow.xgboost.load_model(model_uri)
np.testing.assert_allclose(loaded_model.predict(X), model.predict(X))| # are done in `train()` in `mlflow.xgboost.autolog()` | ||
| fit_output = original(self, *args, **kwargs) | ||
| # log models after training | ||
| (X, _, _) = _get_args_for_metrics(self.fit, args, kwargs) |
There was a problem hiding this comment.
Does _get_args_for_metrics always return a tuple with 3 elements?
There was a problem hiding this comment.
Never mind. It does:
mlflow/mlflow/sklearn/utils.py
Line 48 in 22e69c3
There was a problem hiding this comment.
X = _get_args_for_metrics(self.fit, args, kwargs)[0] might be safer.
There was a problem hiding this comment.
We might want to rename _get_args_for_metrics to something like _get_X_y_and_sample_weight. I'll take care of this.
| mse = mean_squared_error(y_test, y_pred) | ||
| run_id = run.info.run_id | ||
| print("Logged data and model in run {}".format(run_id)) | ||
| mlflow.xgboost.log_model(regressor, artifact_path="log_model") |
| @@ -0,0 +1,12 @@ | |||
| # XGBoost Scikit-learn Model Example | |||
|
|
|||
| This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model. | |||
There was a problem hiding this comment.
| This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model. | |
| This example trains an [`XGBoost.XGBRegressor`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model. |
| @@ -0,0 +1,44 @@ | |||
| from pprint import pprint | |||
There was a problem hiding this comment.
Nit: can we rename this file to train.py because other examples use this convetion?
| - mlflow | ||
| - mypy-extensions==0.4.3 | ||
| - pandas==1.3.4 | ||
| - scikit-learn==0.24.2 | ||
| - typing-extensions==4.0.0 | ||
| - xgboost==1.5.0 |
There was a problem hiding this comment.
| - mlflow | |
| - mypy-extensions==0.4.3 | |
| - pandas==1.3.4 | |
| - scikit-learn==0.24.2 | |
| - typing-extensions==4.0.0 | |
| - xgboost==1.5.0 | |
| - mlflow | |
| - pandas==1.3.4 | |
| - scikit-learn==0.24.2 | |
| - xgboost==1.5.0 |
Can we remove mypy-extensions and typing-extensions?
| diabetes = load_diabetes() | ||
| X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names) | ||
| y = pd.Series(diabetes.target) |
There was a problem hiding this comment.
| diabetes = load_diabetes() | |
| X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names) | |
| y = pd.Series(diabetes.target) | |
| X, y = load_diabetes(return_X_y=True, as_frame=True) |
Let's use return_X_y and as_frame to simplify the code.
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
|
@harupy Thank you for the review! |
|
@dbczumar Thank you for your review! |

What changes are proposed in this pull request?
This is the second PR to add autologging for XGBoost sklearn models using
mlflow.sklearnautologging routine.(Previous PR: #4954)
(Draft + discussion: #4885)
How is this patch tested?
A new example is provided. Tests will be added later.
Does this PR change the documentation?
ci/circleci: build_doccheck. If it's successful, proceed to thenext step, otherwise fix it.
Detailson the right to open the job page of CircleCI.Artifactstab.docs/build/html/index.html.Release Notes
Is this a user-facing change?
Success merge of this PR will enable autologging for XGBoost scikit-learn models.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts: Artifact stores and artifact loggingarea/build: Build and test infrastructure for MLflowarea/docs: MLflow documentation pagesarea/examples: Example codearea/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registryarea/models: MLmodel format, model serialization/deserialization, flavorsarea/projects: MLproject format, project running backendsarea/scoring: MLflow Model server, model deployment tools, Spark UDFsarea/server-infra: MLflow Tracking server backendarea/tracking: Tracking Service, tracking client APIs, autologgingInterface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Modelsarea/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registryarea/windows: Windows supportLanguage
language/r: R APIs and clientslanguage/java: Java APIs and clientslanguage/new: Proposals for new client languagesIntegrations
integrations/azure: Azure and Azure ML integrationsintegrations/sagemaker: SageMaker integrationsintegrations/databricks: Databricks integrationsHow should the PR be classified in the release notes? Choose one:
rn/breaking-change- The PR will be mentioned in the "Breaking Changes" sectionrn/none- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/feature- A new user-facing feature worth mentioning in the release notesrn/bug-fix- A user-facing bug fix worth mentioning in the release notesrn/documentation- A user-facing documentation change worth mentioning in the release notes