
Support evaluating list of traces#18695

Merged
B-Step62 merged 8 commits into mlflow:master from B-Step62:add-trace-id-params-for-search-traces
Nov 19, 2025
Conversation


@B-Step62 commented Nov 5, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18695/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18695/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18695/merge

What changes are proposed in this pull request?

Support passing a list of Trace objects to evaluation. This is requested by a CUJ, and it is also useful when we implement a UI trigger for running evaluation on traces (we will need a way to run evaluation on a set of trace IDs).

trace_ids = ["tr-1", "tr-2"]
traces = [mlflow.get_trace(trace_id) for trace_id in trace_ids]
mlflow.genai.evaluate(data=traces, scorers=[...])

By the way, we could also add something like mlflow.get_traces(trace_id=[...]) to make this even easier. However, that is not trivial given that we now have v3 and v4 backends, so I consider it YAGNI for now.
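As a sketch of what such a helper could look like (purely hypothetical: get_traces is not part of this PR, and the single-trace getter is injected here so the example is self-contained rather than calling MLflow directly):

```python
# Hypothetical sketch only: mlflow.get_traces(trace_id=[...]) does not exist in this PR.
# The single-trace getter is passed in so the example stands alone; in real code it
# would be mlflow.get_trace.
from typing import Any, Callable


def get_traces(trace_ids: list[str], get_trace: Callable[[str], Any]) -> list[Any]:
    """Fetch multiple traces by ID using a single-trace getter."""
    return [get_trace(tid) for tid in trace_ids]


# Example with a stub getter standing in for mlflow.get_trace:
traces = get_traces(["tr-1", "tr-2"], lambda tid: {"trace_id": tid})
```

This is essentially the loop users would write today; the open question in the PR is only whether it is worth baking into the API across both backends.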

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@github-actions bot added the v3.6.0, area/evaluation (MLflow Evaluation), and rn/none (List under Small Changes in Changelogs) labels Nov 5, 2025

github-actions bot commented Nov 5, 2025

Documentation preview for e3b57ec is available at:

Changed Pages (1)

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@B-Step62 removed the v3.6.0 label Nov 6, 2025
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Comment on lines +90 to +92
    f"Expected 6 assessments, got {len(trace.info.assessments)}"
    f"Assessments: {[a.name for a in trace.info.assessments]}"
)  # 2 expectations + 4 feedbacks

@harupy Nov 11, 2025


do we need an assert message?


@B-Step62 (Author) Nov 12, 2025


It's for convenience because we cannot see the actual assessment names from the variable logger (the trace repr is minimal). Any concern with having this?


makes sense!
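To make the pattern under discussion concrete, here is a self-contained sketch of an assertion whose message lists the actual assessment names (the Assessment/TraceInfo stand-in classes and the helper name are assumptions for illustration, not MLflow's real entities):

```python
# Minimal stand-ins for illustration; MLflow's real Trace/Assessment entities differ.
from dataclasses import dataclass, field


@dataclass
class Assessment:
    name: str


@dataclass
class TraceInfo:
    assessments: list = field(default_factory=list)


def check_assessment_count(info: TraceInfo, expected: int) -> None:
    # The message lists the actual assessment names, since the trace repr is minimal.
    assert len(info.assessments) == expected, (
        f"Expected {expected} assessments, got {len(info.assessments)}. "
        f"Assessments: {[a.name for a in info.assessments]}"
    )


names = ["exp_1", "exp_2", "fb_1", "fb_2", "fb_3", "fb_4"]
info = TraceInfo([Assessment(n) for n in names])
check_assessment_count(info, expected=6)  # 2 expectations + 4 feedbacks
```

On failure, the AssertionError carries both the count mismatch and the offending names, which is exactly the debugging convenience the author describes.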

Comment on lines +264 to +268
new_expectations = []
for exp in eval_item.get_expectation_assessments():
    if exp.name not in existing_expectations:
        new_expectations.append(exp)
return new_expectations

@harupy Nov 11, 2025


Suggested change:

    return [
        exp
        for exp in eval_item.get_expectation_assessments()
        if exp.name not in existing_expectations
    ]

can we use a list comprehension?


@harupy left a comment


Left a couple comments, otherwise LGTM!

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
TomeHirata
TomeHirata previously approved these changes Nov 12, 2025

@TomeHirata left a comment


Left some suggestions, otherwise LGTM

from mlflow.entities.evaluation_dataset import EvaluationDataset as EntityEvaluationDataset
from mlflow.genai.datasets.evaluation_dataset import EvaluationDataset

if isinstance(data, (EvaluationDataset, EntityEvaluationDataset)):

I think this is not necessary since it's handled inside _convert_eval_set_to_df

if isinstance(data, (EvaluationDataset, EntityEvaluationDataset)):
    return data.to_df()

if isinstance(data, list) and all(isinstance(item, Trace) for item in data):

@TomeHirata Nov 12, 2025


Similarly, can we move this logic to _convert_eval_set_to_df? It's better to consolidate the data conversion part in that method so that we can reuse the logic.


Ah, nice catch! Sure, it should be handled in the ..._to_df func.
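A rough sketch of the consolidation idea being agreed on here, with all input formats dispatched inside one conversion function (the function name, the duck-typed to_df check, and the "trace" column are assumptions for illustration, not MLflow's internal API):

```python
# Illustrative sketch of consolidating the data-conversion dispatch in one place.
# Names here (the function, the "trace" column) are assumptions, not MLflow's API.
import pandas as pd


def convert_eval_set_to_df(data) -> pd.DataFrame:
    # Evaluation-dataset objects expose to_df(); defer to it.
    if hasattr(data, "to_df"):
        return data.to_df()
    # A list of Trace-like objects becomes a single-column DataFrame.
    if isinstance(data, list) and data and not isinstance(data[0], dict):
        return pd.DataFrame({"trace": data})
    # Fall back to pandas' normal constructor for lists of dicts, etc.
    return pd.DataFrame(data)
```

Keeping the isinstance branches here, rather than in the caller, means every entry point that accepts eval data reuses the same conversion logic, which is the reviewer's point.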

_convert_to_eval_set(df)


def test_convert_to_eval_set_evaluation_dataset():

Shouldn't we add a new fixture for EvaluationDataset to _ALL_DATA_FIXTURES if we remove this test case?

@TomeHirata dismissed their stale review November 12, 2025 08:02

have some minor suggestions

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>

@TomeHirata left a comment


LGTM, can we fix tests?

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Takes in a dataset in any of the multiple formats that mlflow.genai.evaluate() expects and
converts it into a standardized Pandas DataFrame.
"""
column_mapping = {

Finally 🙂

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@B-Step62 added this pull request to the merge queue Nov 19, 2025
Merged via the queue into mlflow:master with commit 80cc350 Nov 19, 2025
45 checks passed
@B-Step62 deleted the add-trace-id-params-for-search-traces branch November 19, 2025 14:21
Tian-Sky-Lan pushed a commit to Tian-Sky-Lan/mlflow that referenced this pull request Nov 24, 2025
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>

Labels

area/evaluation (MLflow Evaluation), rn/none (List under Small Changes in Changelogs), team-review (Trigger a team review request), v3.6.1
