
[ML-59816][1/n] Add new fields to the genai_evaluation event telemetry#19018

Merged
xsh310 merged 2 commits into mlflow:master from xsh310:stack/ML-59816
Nov 26, 2025

Conversation

@xsh310
Collaborator

@xsh310 xsh310 commented Nov 25, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

We would like to add a few new fields to the genai_evaluate event logging. These logging changes have been reviewed and approved in http://go/.c441888. We hope the new fields will give us a better understanding of the distribution of eval dataset sizes and of usage across different scorer types. This information will help us prioritize the features that add the most user value.

Logging Spec (Part One Covers Bold Items):

  • eval_data_size: int
  • eval_data_type: one of the str values below
    • "genai.EvaluationDataset"
    • "entities.EvaluationDataset"
    • "list[Trace]"
    • "list[dict]"
    • "pyspark.DataFrame"
    • "pandas.DataFrame"
  • eval_data_provided_fields: set{"inputs" | "outputs" | "expectations" | "trace"}?
  • predict_fn_provided: bool
  • scorer_kind_count: dict<
    one of the str values: "builtin" | "decorator" | "instructions" | "guidelines",
    int
    >
  • Existing builtin_scorers: str[]
    • str is the ClassName of the built-in scorer (including Guidelines)
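To make the spec concrete, here is a sketch of what a full record could look like once all parts land. Field names come from the spec above; every value is purely illustrative, not output from a real run:

```python
# Illustrative genai_evaluation telemetry record following the spec above.
# All values are hypothetical examples, not from an actual evaluation.
record = {
    "eval_data_size": 25,
    "eval_data_type": "pandas.DataFrame",
    "eval_data_provided_fields": {"inputs", "expectations"},
    "predict_fn_provided": True,
    "scorer_kind_count": {"builtin": 2, "decorator": 1, "guidelines": 1},
    "builtin_scorers": ["Guidelines", "RelevanceToQuery", "Safety"],
}
```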

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Manual Test Plan

Tested the telemetry by running the following Python notebook cell:

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery, Guidelines


@mlflow.genai.scorer(name="custom_scorer_2")
def custom_scorer_2():
    return 2.0


result = mlflow.genai.evaluate(
    data=_EVALUATION_TEST_CASES,
    predict_fn=predict_fn,
    scorers=[
        Safety(name="my_secret_app's_safety_scorer"),
        RelevanceToQuery(),
        Guidelines(
            name="test_guidelines",
            guidelines=["The response must be helpful and accurate"],
        ),
        custom_scorer_2,
    ],
)

Record Parameter

Before:
{'builtin_scorers': ['Safety', 'RelevanceToQuery', 'Guidelines']}

After:
{'predict_fn_provided': True, 'builtin_scorers': ['Guidelines', 'RelevanceToQuery', 'Safety'], 'scorer_kind_count': {'builtin': 2, 'decorator': 1, 'guidelines': 1}}

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)


# If the input dataset is a managed dataset, we pass the original dataset
# to the evaluate function to preserve metadata like dataset name.
data = data if is_managed_dataset else df
Collaborator Author

Moving this logic into _run_harness so that we can log the original data type passed to the genai.evaluate call

Collaborator

It's not efficient to convert data to df twice (line 246 and inside _run_harness), can you refactor?

# Check if records are loaded to avoid triggering expensive load
if data.has_records():
df = data.to_df()
eval_data_size = len(df)
Collaborator Author

Let me know if there's a better way to get the dataset size

Collaborator

Computing the dataset size by converting the data to a dataframe is too heavy. Reading the code, this data is converted to a pandas dataframe during evaluation anyway; can we calculate the dataset size after that conversion instead?
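The reviewer's suggestion amounts to reading the size off the already-converted frame rather than converting a second time; a minimal sketch (the helper name here is hypothetical, not from the PR):

```python
import pandas as pd


def eval_data_size_from_df(df: pd.DataFrame) -> int:
    # Once evaluation has already converted the data to pandas,
    # the dataset size is just the row count -- no second conversion.
    return len(df)
```

For example, `eval_data_size_from_df(pd.DataFrame({"inputs": [1, 2, 3]}))` returns 3.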

CLASS = "class"
BUILTIN = "builtin"
DECORATOR = "decorator"
INSTRUCTIONS = "instructions"
Collaborator Author

@xsh310 xsh310 Nov 25, 2025

Expanding the kind property of Scorer to distinguish the different scorer types more easily.

Let me know if there's a better kind name for InstructionsJudge (i.e., the judge created by make_judge).

@xsh310 xsh310 changed the title [ML-59816] Add eval_data_size and eval_data_type fields to genai_evaluate event logging [ML-59816] Add new fields to the genai_evaluation event telemetry Nov 25, 2025
@xsh310 xsh310 marked this pull request as ready for review November 25, 2025 05:32
Comment on lines +74 to +75
from mlflow.entities.evaluation_dataset import EvaluationDataset as EntityEvaluationDataset
from mlflow.genai.datasets import EvaluationDataset as ManagedEvaluationDataset
Collaborator

Are these imported here to avoid circular imports? If so can you add comments, otherwise let's move it at top level

else:
# Records not loaded, don't trigger load for telemetry
eval_data_size = None
elif isinstance(data, EntityEvaluationDataset):
Collaborator

This is basically the same as the branch above except for the data type; can we unify them as if isinstance(data, (xxx, xxx))?

Comment on lines +102 to +107
if data and isinstance(data[0], Trace):
eval_data_type = "list[Trace]"
elif data and isinstance(data[0], dict):
eval_data_type = "list[dict]"
else:
eval_data_type = "list"
Collaborator

Suggested change
if data and isinstance(data[0], Trace):
eval_data_type = "list[Trace]"
elif data and isinstance(data[0], dict):
eval_data_type = "list[dict]"
else:
eval_data_type = "list"
eval_data_type = f"list[{type(data[0]).__name__}]" if data else ...

Is it possible data is empty list though?
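A version of the suggestion that also covers the empty-list case the reviewer raises might look like this (the helper name is hypothetical; element types resolve via their class names, so `Trace` and `dict` items produce the spec's strings):

```python
from typing import Any


def infer_list_data_type(data: list[Any]) -> str:
    # Empty list: the element type is unknown, so fall back to plain "list"
    # rather than indexing data[0] and raising IndexError.
    if not data:
        return "list"
    return f"list[{type(data[0]).__name__}]"
```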

eval_data_type = "pyspark.DataFrame"
else:
eval_data_type = "pandas.DataFrame"
eval_data_size = len(data) if hasattr(data, "__len__") else data.count()
Collaborator

data.count() will trigger a Spark job to calculate the total count, which is expensive; do we really need to know the count for Spark dataframes?

if isinstance(data, (ManagedEvaluationDataset, EntityEvaluationDataset)):
try:
if data.has_records():
data = data.to_df()
Collaborator

Similar as above, let's avoid multiple conversions to pandas dataframe on user's data


# Check pandas DataFrame
try:
import pandas as pd
Collaborator

When would this fail? I thought all datasets are converted to pandas dataframes?

try:
if not data[field].isna().all():
provided_fields.add(field)
except Exception:
Collaborator

Do we need this?

# Check which columns are present in the DataFrame
for field in target_fields:
if field in data.columns:
# Check if the column has any non-null values
Collaborator

Actually, why does this matter? What signal do we get from knowing whether inputs/outputs are None or not? I assume the scorers have expectations on the data anyway

if not data:
return provided_fields
# List of Trace objects
if hasattr(data[0], "__class__") and data[0].__class__.__name__ == "Trace":
Collaborator

Suggested change
if hasattr(data[0], "__class__") and data[0].__class__.__name__ == "Trace":
if isinstance(data[0], Trace):

Comment on lines +185 to +187
for field in target_fields:
if field in data[0]:
provided_fields.add(field)
Collaborator

Suggested change
for field in target_fields:
if field in data[0]:
provided_fields.add(field)
provided_fields = data[0].keys() & target_fields


@property
def kind(self) -> ScorerKind:
"""Get the kind of this scorer."""
Collaborator

Suggested change
"""Get the kind of this scorer."""

@github-actions github-actions bot added area/evaluation MLflow Evaluation area/tracking Tracking service, tracking client APIs, autologging labels Nov 25, 2025
@github-actions github-actions bot added the rn/none List under Small Changes in Changelogs. label Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816] Add new fields to the genai_evaluation event telemetry [ML-59816][Part One] Add new fields to the genai_evaluation event telemetry Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816][Part One] Add new fields to the genai_evaluation event telemetry [ML-59816] Add new fields to the genai_evaluation event telemetry: Part ONE Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816] Add new fields to the genai_evaluation event telemetry: Part ONE [ML-59816] Add new fields to the genai_evaluation event telemetry: Part One Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816] Add new fields to the genai_evaluation event telemetry: Part One [ML-59816][1/n] Add new fields to the genai_evaluation event telemetry Nov 26, 2025
Comment on lines +96 to +102
scorer_kind_count = Counter()
for scorer in scorers:
if isinstance(scorer, Scorer):
try:
scorer_kind_count[scorer.kind.value] += 1
except Exception:
pass
Collaborator

Does this work:

scorer_kind_count = Counter(
    scorer.kind.value
    for scorer in scorers
    if isinstance(scorer, Scorer)
)

Why do we need try catch for the assignment?
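The reviewer's Counter comprehension is self-contained and needs no try/except, since non-Scorer entries are filtered out before any attribute access. A runnable sketch (the `ScorerKind`/`Scorer` definitions here are simplified stand-ins, not MLflow's real classes):

```python
from collections import Counter
from enum import Enum


class ScorerKind(Enum):
    BUILTIN = "builtin"
    DECORATOR = "decorator"


class Scorer:
    def __init__(self, kind: ScorerKind):
        self.kind = kind


scorers = [
    Scorer(ScorerKind.BUILTIN),
    Scorer(ScorerKind.BUILTIN),
    Scorer(ScorerKind.DECORATOR),
    "not-a-scorer",  # filtered out by the isinstance check
]

# Counter over a generator replaces the manual loop-and-increment;
# the isinstance filter removes anything that lacks a .kind attribute.
scorer_kind_count = Counter(
    s.kind.value for s in scorers if isinstance(s, Scorer)
)
```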

Comment on lines +698 to +702
"""Test that built-in scorers have the correct kind property.

Most built-in scorers should return ScorerKind.BUILTIN, except:
- Guidelines scorer which returns ScorerKind.GUIDELINES
"""
Collaborator

Suggested change
"""Test that built-in scorers have the correct kind property.
Most built-in scorers should return ScorerKind.BUILTIN, except:
- Guidelines scorer which returns ScorerKind.GUIDELINES
"""

Comment on lines +397 to +398
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import Guidelines, RelevanceToQuery
Collaborator

Let's avoid importing within the test

Collaborator

@serena-ruan serena-ruan left a comment

Overall LGTM! Left some nits

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
…_evaluate event logging

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
@github-actions
Contributor

Documentation preview for d605f37 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.

@xsh310 xsh310 added this pull request to the merge queue Nov 26, 2025
Merged via the queue into mlflow:master with commit 1964fae Nov 26, 2025
48 checks passed
@xsh310 xsh310 deleted the stack/ML-59816 branch November 26, 2025 06:42

Labels

area/evaluation MLflow Evaluation area/tracking Tracking service, tracking client APIs, autologging rn/none List under Small Changes in Changelogs. v3.6.1
