
[ML-59816][1/n] Add new fields to the genai_evaluation event telemetry#19018

Merged
xsh310 merged 2 commits into mlflow:master from xsh310:stack/ML-59816
Nov 26, 2025

Conversation

@xsh310
Collaborator

@xsh310 xsh310 commented Nov 25, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

We would like to add a few new fields to the genai_evaluate event logging. These logging changes have been reviewed and approved in http://go/.c441888. We hope the new fields will give us a better understanding of the distribution of eval dataset sizes and of usage across different scorer types. This information will help us prioritize the features that add the most user value.

Logging Spec (Part One Covers Bold Items):

  • eval_data_size: int
  • eval_data_type: one of the str values below
    • "genai.EvaluationDataset"
    • "entities.EvaluationDataset"
    • "list[Trace]"
    • "list[dict]"
    • "pyspark.DataFrame"
    • "pandas.DataFrame"
  • eval_data_provided_fields: set{"inputs" | "outputs" | "expectations" | "trace"}?
  • predict_fn_provided: bool
  • scorer_kind_count: dict<
    one of the str values: "builtin" | "decorator" | "instructions" | "guidelines",
    int
    >
  • Existing builtin_scorers: str[]
    • str is the ClassName of the built-in scorer (including Guidelines)
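To make the spec concrete, here is a sketch of what a full record could look like once all parts land. Field names come from the spec above; every value is purely illustrative, not output from a real run:

```python
# Illustrative genai_evaluation telemetry record following the spec above.
# All values are hypothetical examples, not from an actual evaluation.
record = {
    "eval_data_size": 25,
    "eval_data_type": "pandas.DataFrame",
    "eval_data_provided_fields": {"inputs", "expectations"},
    "predict_fn_provided": True,
    "scorer_kind_count": {"builtin": 2, "decorator": 1, "guidelines": 1},
    "builtin_scorers": ["Guidelines", "RelevanceToQuery", "Safety"],
}
```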

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Manual Test Plan

Tested the telemetry by running the following Python notebook cell:

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery, Guidelines


@mlflow.genai.scorer(name="custom_scorer_2")
def custom_scorer_2():
    return 2.0


result = mlflow.genai.evaluate(
    data=_EVALUATION_TEST_CASES,
    predict_fn=predict_fn,
    scorers=[
        Safety(name="my_secret_app's_safety_scorer"),
        RelevanceToQuery(),
        Guidelines(
            name="test_guidelines",
            guidelines=["The response must be helpful and accurate"],
        ),
        custom_scorer_2,
    ],
)

Record Parameter

Before:
{'builtin_scorers': ['Safety', 'RelevanceToQuery', 'Guidelines']}

After:
{'predict_fn_provided': True, 'builtin_scorers': ['Guidelines', 'RelevanceToQuery', 'Safety'], 'scorer_kind_count': {'builtin': 2, 'decorator': 1, 'guidelines': 1}}

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)


# If the input dataset is a managed dataset, we pass the original dataset
# to the evaluate function to preserve metadata like dataset name.
data = data if is_managed_dataset else df
Collaborator Author

Moving this logic into _run_harness so that we can log the original data type passed to the genai.evaluate call

Collaborator

It's not efficient to convert data to df twice (line 246 and inside _run_harness), can you refactor?

# Check if records are loaded to avoid triggering expensive load
if data.has_records():
df = data.to_df()
eval_data_size = len(df)
Collaborator Author

Let me know if there's a better way to get the dataset size

Collaborator

Computing the dataset size by converting the data to a dataframe is too heavy. Reading the code, this data is converted to a pandas dataframe during evaluation anyway; can we calculate the dataset size after that conversion instead?
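The reviewer's suggestion amounts to reading the size off the already-converted frame rather than converting a second time; a minimal sketch (the helper name here is hypothetical, not from the PR):

```python
import pandas as pd


def eval_data_size_from_df(df: pd.DataFrame) -> int:
    # Once evaluation has already converted the data to pandas,
    # the dataset size is just the row count -- no second conversion.
    return len(df)
```

For example, `eval_data_size_from_df(pd.DataFrame({"inputs": [1, 2, 3]}))` returns 3.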

CLASS = "class"
BUILTIN = "builtin"
DECORATOR = "decorator"
INSTRUCTIONS = "instructions"
Collaborator Author

@xsh310 xsh310 Nov 25, 2025

Expanding the kind property of Scorer to distinguish the different scorer types more easily.

Let me know if there's a better kind name for InstructionsJudge (i.e., the judge created by make_judge).

@xsh310 xsh310 changed the title [ML-59816] Add eval_data_size and eval_data_type fields to genai_evaluate event logging [ML-59816] Add new fields to the genai_evaluation event telemetry Nov 25, 2025
@xsh310 xsh310 marked this pull request as ready for review November 25, 2025 05:32
Comment on lines +74 to +75
from mlflow.entities.evaluation_dataset import EvaluationDataset as EntityEvaluationDataset
from mlflow.genai.datasets import EvaluationDataset as ManagedEvaluationDataset
Collaborator

Are these imported here to avoid circular imports? If so can you add comments, otherwise let's move it at top level

else:
# Records not loaded, don't trigger load for telemetry
eval_data_size = None
elif isinstance(data, EntityEvaluationDataset):
Collaborator

This is basically the same as the branch above except for the data type; can we unify them as if isinstance(data, (xxx, xxx))?

Comment on lines +102 to +107
if data and isinstance(data[0], Trace):
eval_data_type = "list[Trace]"
elif data and isinstance(data[0], dict):
eval_data_type = "list[dict]"
else:
eval_data_type = "list"
Collaborator

Suggested change
if data and isinstance(data[0], Trace):
eval_data_type = "list[Trace]"
elif data and isinstance(data[0], dict):
eval_data_type = "list[dict]"
else:
eval_data_type = "list"
eval_data_type = f"list[{type(data[0]).__name__}]" if data else ...

Is it possible data is empty list though?
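A version of the suggestion that also covers the empty-list case the reviewer raises might look like this (the helper name is hypothetical; element types resolve via their class names, so `Trace` and `dict` items produce the spec's strings):

```python
from typing import Any


def infer_list_data_type(data: list[Any]) -> str:
    # Empty list: the element type is unknown, so fall back to plain "list"
    # rather than indexing data[0] and raising IndexError.
    if not data:
        return "list"
    return f"list[{type(data[0]).__name__}]"
```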

eval_data_type = "pyspark.DataFrame"
else:
eval_data_type = "pandas.DataFrame"
eval_data_size = len(data) if hasattr(data, "__len__") else data.count()
Collaborator

data.count() will trigger a Spark job to calculate the total count, which is expensive; do we really need to know the count for Spark dataframes?

if isinstance(data, (ManagedEvaluationDataset, EntityEvaluationDataset)):
try:
if data.has_records():
data = data.to_df()
Collaborator

Similar as above, let's avoid multiple conversions to pandas dataframe on user's data


# Check pandas DataFrame
try:
import pandas as pd
Collaborator

When would this fail? I thought all datasets are converted to pandas dataframes?

try:
if not data[field].isna().all():
provided_fields.add(field)
except Exception:
Collaborator

Do we need this?

# Check which columns are present in the DataFrame
for field in target_fields:
if field in data.columns:
# Check if the column has any non-null values
Collaborator

Actually, why does this matter? What signal do we get from knowing whether inputs/outputs are None or not? I assume the scorers have expectations on the data anyway

if not data:
return provided_fields
# List of Trace objects
if hasattr(data[0], "__class__") and data[0].__class__.__name__ == "Trace":
Collaborator

Suggested change
if hasattr(data[0], "__class__") and data[0].__class__.__name__ == "Trace":
if isinstance(data[0], Trace):

Comment on lines +185 to +187
for field in target_fields:
if field in data[0]:
provided_fields.add(field)
Collaborator

Suggested change
for field in target_fields:
if field in data[0]:
provided_fields.add(field)
provided_fields = data[0].keys() & target_fields


@property
def kind(self) -> ScorerKind:
"""Get the kind of this scorer."""
Collaborator

Suggested change
"""Get the kind of this scorer."""

@github-actions github-actions bot added area/evaluation MLflow Evaluation area/tracking Tracking service, tracking client APIs, autologging labels Nov 25, 2025
@github-actions github-actions bot added the rn/none List under Small Changes in Changelogs. label Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816] Add new fields to the genai_evaluation event telemetry [ML-59816][Part One] Add new fields to the genai_evaluation event telemetry Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816][Part One] Add new fields to the genai_evaluation event telemetry [ML-59816] Add new fields to the genai_evaluation event telemetry: Part ONE Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816] Add new fields to the genai_evaluation event telemetry: Part ONE [ML-59816] Add new fields to the genai_evaluation event telemetry: Part One Nov 25, 2025
@xsh310 xsh310 changed the title [ML-59816] Add new fields to the genai_evaluation event telemetry: Part One [ML-59816][1/n] Add new fields to the genai_evaluation event telemetry Nov 26, 2025
Comment on lines +96 to +102
scorer_kind_count = Counter()
for scorer in scorers:
if isinstance(scorer, Scorer):
try:
scorer_kind_count[scorer.kind.value] += 1
except Exception:
pass
Collaborator

Does this work:

scorer_kind_count = Counter(
    scorer.kind.value
    for scorer in scorers
    if isinstance(scorer, Scorer)
)

Why do we need try catch for the assignment?
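The reviewer's Counter comprehension is self-contained and needs no try/except, since non-Scorer entries are filtered out before any attribute access. A runnable sketch (the `ScorerKind`/`Scorer` definitions here are simplified stand-ins, not MLflow's real classes):

```python
from collections import Counter
from enum import Enum


class ScorerKind(Enum):
    BUILTIN = "builtin"
    DECORATOR = "decorator"


class Scorer:
    def __init__(self, kind: ScorerKind):
        self.kind = kind


scorers = [
    Scorer(ScorerKind.BUILTIN),
    Scorer(ScorerKind.BUILTIN),
    Scorer(ScorerKind.DECORATOR),
    "not-a-scorer",  # filtered out by the isinstance check
]

# Counter over a generator replaces the manual loop-and-increment;
# the isinstance filter removes anything that lacks a .kind attribute.
scorer_kind_count = Counter(
    s.kind.value for s in scorers if isinstance(s, Scorer)
)
```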

Comment on lines +698 to +702
"""Test that built-in scorers have the correct kind property.

Most built-in scorers should return ScorerKind.BUILTIN, except:
- Guidelines scorer which returns ScorerKind.GUIDELINES
"""
Collaborator

Suggested change
"""Test that built-in scorers have the correct kind property.
Most built-in scorers should return ScorerKind.BUILTIN, except:
- Guidelines scorer which returns ScorerKind.GUIDELINES
"""

Comment on lines +397 to +398
from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import Guidelines, RelevanceToQuery
Collaborator

Let's avoid importing within the test

Collaborator

@serena-ruan serena-ruan left a comment

Overall LGTM! Left some nits

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
…_evaluate event logging

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
@github-actions
Contributor

Documentation preview for d605f37 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.

@xsh310 xsh310 added this pull request to the merge queue Nov 26, 2025
Merged via the queue into mlflow:master with commit 1964fae Nov 26, 2025
48 checks passed
@xsh310 xsh310 deleted the stack/ML-59816 branch November 26, 2025 06:42

Labels

area/evaluation MLflow Evaluation area/tracking Tracking service, tracking client APIs, autologging rn/none List under Small Changes in Changelogs. v3.6.1
