Evaluation Default evaluator#5092
Conversation
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
| if not _matplotlib_initialized: | ||
| import matplotlib.pyplot as pyplot | ||
|
|
||
| pyplot.rcParams["figure.dpi"] = 144 |
There was a problem hiding this comment.
I don't think we should mutate rcParams. Can we just specify dpi and figsize when saving plots?
There was a problem hiding this comment.
seems cannot. We need set it when creating figure, if so we need modify shap source code.
There was a problem hiding this comment.
maybe we could try this
https://stackoverflow.com/a/35394637
There was a problem hiding this comment.
Yes. I just also found it. Use with matplotlib.rc_context(...)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
| _last_failed_evaluator = None | ||
|
|
||
|
|
||
| def _get_last_failed_evaluator(): |
There was a problem hiding this comment.
In usage log, we want to log error raised from which evaluator (TODO), for debug purpose. so I add this.
There was a problem hiding this comment.
why do we only store the last failed evaluator?
There was a problem hiding this comment.
The evaluate API will raise error when it hit the first failed evaluator. (The API designed to be so.) When any evaluator fail evaluate raise error, otherwise return all merged success results
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
| sampled_X = sampled_X.rename(columns=truncated_feature_name_map, copy=False) | ||
|
|
||
| if algorithm: | ||
| supported_algos = ['exact', 'permutation', 'partition'] |
There was a problem hiding this comment.
Note:
in shap 0.40, the shap.Explainer argument algo , only when setting 'exact', 'permutation', 'partition', it works fine, the remaining algo (listed in shap doc) are all buggy in my test. So I limit them to be the 3 values.
| try: | ||
| pyplot.clf() | ||
| do_plot() | ||
| pyplot.savefig(artifact_file_local_path) |
There was a problem hiding this comment.
Can we set the dpi value to something higher than the default of 100? The images look really blurry and out of place in the tracking UI. 300 might be a better option.
pyplot.savefig(artifact_file_local_path, dpi=300)There was a problem hiding this comment.
Or perhaps save in .svg format to ensure that scalability in different zoom levels in browsers will work well?
There was a problem hiding this comment.
I increase dpi to 288 (previous in databricks notebook it is 72)
Save fig as svg is not good choice, the Pillow package does not support svg. But we need return artifact as the pillow Image instance.
| sk_metrics.ConfusionMatrixDisplay( | ||
| confusion_matrix=confusion_matrix, | ||
| display_labels=self.label_list, | ||
| ).plot(cmap="Blues") |
There was a problem hiding this comment.
We need to scale these plots for class count > 6. The example for multiclass that has 10 classes is illegible with numeric wrapping overwrites of values. Setting figsize overrides on the canvas and changing the dpi at figure save time will make a much more clear visualization.
There was a problem hiding this comment.
The autologging code also generate confusion matrix plot for training dataset and has similar issue. Let's do a follow-up PR to improve for both cases.
There was a problem hiding this comment.
Create a follow-up ticket for this:
https://databricks.atlassian.net/browse/ML-19817
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
jinzhang21
left a comment
There was a problem hiding this comment.
LGTM. Nice implementation! Sorry for the delayed review, Weichen.
|
|
||
| @experimental | ||
| def evaluate( | ||
| *, |
There was a problem hiding this comment.
What does the "*" mean here? Do you intend to make it a nameless positional arg like *args? It may worth to spell it out even if it's not used, and put it in the end of the argument list, e.g.
def _evaluate(model, model_type, dataset, actual_run_id, evaluator_name_list, evaluator_name_to_conf_map, *unused_args):
There was a problem hiding this comment.
No. The * is a python grammar, not a argument, it means all following arguments must be a key-word-only arguments. https://www.python.org/dev/peps/pep-3102/
Mark all argument as key-word-only is a proposal raised by @dbczumar , so that in future if mlflow containerized model provide "model_type" property, we can remove the "model_type" argument here but keep backwards compatibility
There was a problem hiding this comment.
Wait, removing the model_type argumnet is a breaking change, right?
There was a problem hiding this comment.
In future, we can change API to be :
def evaluate(
*, model, dataset, run_id, evaluators=None, evaluator_config=None, **kwargs
):
if "model_type" in kwargs:
model_type = kwargs["kwargs"]
logger.warning("deprecation: ....")
else:
model_type = model.model_type
...
Then we can keep backwards compatiblity.
There was a problem hiding this comment.
I didn't know about this usage pattern. Interesting!
Another way to keep backward compatibility is to use keyword arg for model_type, e.g.
def evaluate(
model, dataset, run_id, model_type=None, evaluators=None, evaluator_config=None, **kwargs
):
What's the downside of going this more "conventional" way of specifying optional arguments?
There was a problem hiding this comment.
In current evaluate implementation, the model_type argument is required. In future we will make it "optional and deprecated" and then remove this argument.
| a nested dictionary whose key is the evaluator name. | ||
| :return: An :py:class:`mlflow.models.evaluation.EvaluationDataset` instance containing | ||
| evaluation results. | ||
|
|
There was a problem hiding this comment.
Nice! I really like the documentation here to spell out the expected outcome and limitations. Hopefully we'll make it shorter and easier to navigate over time.
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
What changes are proposed in this pull request?
Prototype Evaluation Default evaluator
Test code:
How is this patch tested?
(Details)
Does this PR change the documentation?
ci/circleci: build_doccheck. If it's successful, proceed to thenext step, otherwise fix it.
Detailson the right to open the job page of CircleCI.Artifactstab.docs/build/html/index.html.Release Notes
Is this a user-facing change?
Implement builtin default evaluator for model evaluation.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts: Artifact stores and artifact loggingarea/build: Build and test infrastructure for MLflowarea/docs: MLflow documentation pagesarea/examples: Example codearea/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registryarea/models: MLmodel format, model serialization/deserialization, flavorsarea/projects: MLproject format, project running backendsarea/scoring: MLflow Model server, model deployment tools, Spark UDFsarea/server-infra: MLflow Tracking server backendarea/tracking: Tracking Service, tracking client APIs, autologgingInterface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Modelsarea/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registryarea/windows: Windows supportLanguage
language/r: R APIs and clientslanguage/java: Java APIs and clientslanguage/new: Proposals for new client languagesIntegrations
integrations/azure: Azure and Azure ML integrationsintegrations/sagemaker: SageMaker integrationsintegrations/databricks: Databricks integrationsHow should the PR be classified in the release notes? Choose one:
rn/breaking-change- The PR will be mentioned in the "Breaking Changes" sectionrn/none- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/feature- A new user-facing feature worth mentioning in the release notesrn/bug-fix- A user-facing bug fix worth mentioning in the release notesrn/documentation- A user-facing documentation change worth mentioning in the release notes