Skip to content

Evaluation Default evaluator#5092

Merged
WeichenXu123 merged 128 commits into
mlflow:masterfrom
WeichenXu123:eval-api-default-eval
Jan 10, 2022
Merged

Evaluation Default evaluator#5092
WeichenXu123 merged 128 commits into
mlflow:masterfrom
WeichenXu123:eval-api-default-eval

Conversation

@WeichenXu123

@WeichenXu123 WeichenXu123 commented Nov 22, 2021

Copy link
Copy Markdown
Collaborator

What changes are proposed in this pull request?

Prototype Evaluation Default evaluator

Test code:

import numpy as np
from sklearn.linear_model import LogisticRegression
import shap
import matplotlib.pyplot as plt
import sklearn
import pandas as pd

import xgboost
import shap
from mlflow.models.evaluation import evaluate, EvaluationDataset
import mlflow
from mlflow.tracking.artifact_utils import get_artifact_uri

# train XGBoost model
X, y = shap.datasets.adult()
new_feature_names_map = {f: f + '12345678901234' for f in X.columns}
X = X.rename(columns=new_feature_names_map)

model = xgboost.XGBClassifier().fit(X, y)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, 'xgb_model')

model_uri = get_artifact_uri(run.info.run_id, 'xgb_model')


def predict_proba(self, data):
    return self._model_impl.predict_proba(data)


mlflow.pyfunc.PyFuncModel.predict_proba = predict_proba

y = y.astype(np.int32)


X['label'] = y

dataset = EvaluationDataset(data=X, labels='label', name='adult')

"""
        evaluator_config={'default': {
            'explainality.algorithm': 'partition',
            'explainality.nsamples': 1000
        }}
"""

with mlflow.start_run() as run:
    result = evaluate(
        model_uri,
        model_type='classifier',
        dataset=dataset,
        evaluators=['default'],
    )

print(f'run_id={run.info.run_id}\n')
client = mlflow.tracking.MlflowClient()
artifacts = [f.path for f in client.list_artifacts(run.info.run_id)]
print(f'artifacts={artifacts}')

How is this patch tested?

(Details)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Implement builtin default evaluator for model evaluation.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 marked this pull request as draft November 22, 2021 16:17
Comment thread mlflow/models/evaluation/base.py Outdated
Comment thread mlflow/models/evaluation/base.py Outdated
Comment thread mlflow/models/evaluation/base.py Outdated
Comment thread mlflow/models/evaluation/base.py Outdated
Comment thread mlflow/models/evaluation/default_evaluator.py Outdated
Comment thread mlflow/models/evaluation/default_evaluator.py Outdated
Comment thread mlflow/models/evaluation/default_evaluator.py Outdated
if not _matplotlib_initialized:
import matplotlib.pyplot as pyplot

pyplot.rcParams["figure.dpi"] = 144

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should mutate rcParams. Can we just specify dpi and figsize when saving plots?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems cannot. We need set it when creating figure, if so we need modify shap source code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could try this
https://stackoverflow.com/a/35394637

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I just also found it. Use with matplotlib.rc_context(...)

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Comment thread tests/models/test_default_evaluator.py Outdated
_last_failed_evaluator = None


def _get_last_failed_evaluator():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we use this function?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In usage log, we want to log error raised from which evaluator (TODO), for debug purpose. so I add this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we only store the last failed evaluator?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The evaluate API will raise error when it hit the first failed evaluator. (The API designed to be so.) When any evaluator fail evaluate raise error, otherwise return all merged success results

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
sampled_X = sampled_X.rename(columns=truncated_feature_name_map, copy=False)

if algorithm:
supported_algos = ['exact', 'permutation', 'partition']

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note:
in shap 0.40, the shap.Explainer argument algo , only when setting 'exact', 'permutation', 'partition', it works fine, the remaining algo (listed in shap doc) are all buggy in my test. So I limit them to be the 3 values.

try:
pyplot.clf()
do_plot()
pyplot.savefig(artifact_file_local_path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set the dpi value to something higher than the default of 100? The images look really blurry and out of place in the tracking UI. 300 might be a better option.

pyplot.savefig(artifact_file_local_path, dpi=300)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps save in .svg format to ensure that scalability in different zoom levels in browsers will work well?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I increase dpi to 288 (previous in databricks notebook it is 72)

Save fig as svg is not good choice, the Pillow package does not support svg. But we need return artifact as the pillow Image instance.

Comment thread mlflow/models/evaluation/lift_curve.py Outdated
Comment thread mlflow/models/evaluation/default_evaluator.py
Comment thread mlflow/models/evaluation/default_evaluator.py
sk_metrics.ConfusionMatrixDisplay(
confusion_matrix=confusion_matrix,
display_labels=self.label_list,
).plot(cmap="Blues")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to scale these plots for class count > 6. The example for multiclass that has 10 classes is illegible with numeric wrapping overwrites of values. Setting figsize overrides on the canvas and changing the dpi at figure save time will make a much more clear visualization.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The autologging code also generate confusion matrix plot for training dataset and has similar issue. Let's do a follow-up PR to improve for both cases.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a follow-up ticket for this:
https://databricks.atlassian.net/browse/ML-19817

Comment thread mlflow/models/evaluation/default_evaluator.py
Comment thread mlflow/models/evaluation/default_evaluator.py
Comment thread mlflow/models/evaluation/default_evaluator.py
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

@jinzhang21 jinzhang21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice implementation! Sorry for the delayed review, Weichen.

Comment thread mlflow/models/evaluation/artifacts.py

@experimental
def evaluate(
*,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the "*" mean here? Do you intend to make it a nameless positional arg like *args? It may worth to spell it out even if it's not used, and put it in the end of the argument list, e.g.
def _evaluate(model, model_type, dataset, actual_run_id, evaluator_name_list, evaluator_name_to_conf_map, *unused_args):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The * is a python grammar, not a argument, it means all following arguments must be a key-word-only arguments. https://www.python.org/dev/peps/pep-3102/

Mark all argument as key-word-only is a proposal raised by @dbczumar , so that in future if mlflow containerized model provide "model_type" property, we can remove the "model_type" argument here but keep backwards compatibility

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, removing the model_type argumnet is a breaking change, right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, we can change API to be :

def evaluate(
    *, model, dataset, run_id, evaluators=None, evaluator_config=None, **kwargs
):
   if "model_type" in kwargs:
      model_type = kwargs["kwargs"]
      logger.warning("deprecation: ....")
   else:
      model_type = model.model_type
   ...

Then we can keep backwards compatiblity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about this usage pattern. Interesting!
Another way to keep backward compatibility is to use keyword arg for model_type, e.g.

def evaluate(
    model, dataset, run_id, model_type=None, evaluators=None, evaluator_config=None, **kwargs
):

What's the downside of going this more "conventional" way of specifying optional arguments?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current evaluate implementation, the model_type argument is required. In future we will make it "optional and deprecated" and then remove this argument.

a nested dictionary whose key is the evaluator name.
:return: An :py:class:`mlflow.models.evaluation.EvaluationDataset` instance containing
evaluation results.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I really like the documentation here to spell out the expected outcome and limitations. Hopefully we'll make it shorter and easier to navigate over time.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 marked this pull request as ready for review January 10, 2022 10:02
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 merged commit 964f5ab into mlflow:master Jan 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants