
Support evaluating list of traces#18695

Merged
B-Step62 merged 8 commits into mlflow:master from B-Step62:add-trace-id-params-for-search-traces
Nov 19, 2025
Conversation


@B-Step62 commented Nov 5, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18695/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18695/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18695/merge

What changes are proposed in this pull request?

Support passing a list of Trace objects to evaluation. This is requested by a CUJ, and it is also useful when we implement a UI trigger for running evaluation on traces (we will need a way to run evaluation on a set of trace IDs).

trace_ids = ["tr-1", "tr-2"]
traces = [mlflow.get_trace(trace_id) for trace_id in trace_ids]
mlflow.genai.evaluate(data=traces, scorers=[...])

By the way, we could also add something like mlflow.get_traces(trace_id=[...]) to make this even easier. However, that is not trivial given that we now have v3 and v4 backends, so I consider it YAGNI for now.
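As a sketch of what such a helper could look like (purely hypothetical: get_traces is not part of this PR, and the single-trace getter is injected here so the example is self-contained rather than calling MLflow directly):

```python
# Hypothetical sketch only: mlflow.get_traces(trace_id=[...]) does not exist in this PR.
# The single-trace getter is passed in so the example stands alone; in real code it
# would be mlflow.get_trace.
from typing import Any, Callable


def get_traces(trace_ids: list[str], get_trace: Callable[[str], Any]) -> list[Any]:
    """Fetch multiple traces by ID using a single-trace getter."""
    return [get_trace(tid) for tid in trace_ids]


# Example with a stub getter standing in for mlflow.get_trace:
traces = get_traces(["tr-1", "tr-2"], lambda tid: {"trace_id": tid})
```

This is essentially the loop users would write today; the open question in the PR is only whether it is worth baking into the API across both backends.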

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@github-actions bot added the v3.6.0, area/evaluation (MLflow Evaluation), and rn/none (List under Small Changes in Changelogs) labels Nov 5, 2025

github-actions bot commented Nov 5, 2025

Documentation preview for e3b57ec is available at:

Changed Pages (1)

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@B-Step62 removed the v3.6.0 label Nov 6, 2025
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Comment on lines +90 to +92
    f"Expected 6 assessments, got {len(trace.info.assessments)}"
    f"Assessments: {[a.name for a in trace.info.assessments]}"
)  # 2 expectations + 4 feedbacks

@harupy Nov 11, 2025


do we need an assert message?


@B-Step62 (Author) Nov 12, 2025


It's for convenience because we cannot see the actual assessment names from the variable logger (the trace repr is minimal). Any concern with having this?


makes sense!
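To make the pattern under discussion concrete, here is a self-contained sketch of an assertion whose message lists the actual assessment names (the Assessment/TraceInfo stand-in classes and the helper name are assumptions for illustration, not MLflow's real entities):

```python
# Minimal stand-ins for illustration; MLflow's real Trace/Assessment entities differ.
from dataclasses import dataclass, field


@dataclass
class Assessment:
    name: str


@dataclass
class TraceInfo:
    assessments: list = field(default_factory=list)


def check_assessment_count(info: TraceInfo, expected: int) -> None:
    # The message lists the actual assessment names, since the trace repr is minimal.
    assert len(info.assessments) == expected, (
        f"Expected {expected} assessments, got {len(info.assessments)}. "
        f"Assessments: {[a.name for a in info.assessments]}"
    )


names = ["exp_1", "exp_2", "fb_1", "fb_2", "fb_3", "fb_4"]
info = TraceInfo([Assessment(n) for n in names])
check_assessment_count(info, expected=6)  # 2 expectations + 4 feedbacks
```

On failure, the AssertionError carries both the count mismatch and the offending names, which is exactly the debugging convenience the author describes.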

Comment on lines +264 to +268
new_expectations = []
for exp in eval_item.get_expectation_assessments():
    if exp.name not in existing_expectations:
        new_expectations.append(exp)
return new_expectations

@harupy Nov 11, 2025


Suggested change:

    return [
        exp
        for exp in eval_item.get_expectation_assessments()
        if exp.name not in existing_expectations
    ]

can we use a list comprehension?


@harupy left a comment


Left a couple comments, otherwise LGTM!

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
TomeHirata
TomeHirata previously approved these changes Nov 12, 2025

@TomeHirata left a comment


Left some suggestions, otherwise LGTM

from mlflow.entities.evaluation_dataset import EvaluationDataset as EntityEvaluationDataset
from mlflow.genai.datasets.evaluation_dataset import EvaluationDataset

if isinstance(data, (EvaluationDataset, EntityEvaluationDataset)):

I think this is not necessary since it's handled inside _convert_eval_set_to_df

if isinstance(data, (EvaluationDataset, EntityEvaluationDataset)):
    return data.to_df()

if isinstance(data, list) and all(isinstance(item, Trace) for item in data):

@TomeHirata Nov 12, 2025


Similarly, can we move this logic to _convert_eval_set_to_df? It's better to consolidate the data conversion part in that method so that we can reuse the logic.


Ah, nice catch! Sure, it should be handled in the ..._to_df func.
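A rough sketch of the consolidation idea being agreed on here, with all input formats dispatched inside one conversion function (the function name, the duck-typed to_df check, and the "trace" column are assumptions for illustration, not MLflow's internal API):

```python
# Illustrative sketch of consolidating the data-conversion dispatch in one place.
# Names here (the function, the "trace" column) are assumptions, not MLflow's API.
import pandas as pd


def convert_eval_set_to_df(data) -> pd.DataFrame:
    # Evaluation-dataset objects expose to_df(); defer to it.
    if hasattr(data, "to_df"):
        return data.to_df()
    # A list of Trace-like objects becomes a single-column DataFrame.
    if isinstance(data, list) and data and not isinstance(data[0], dict):
        return pd.DataFrame({"trace": data})
    # Fall back to pandas' normal constructor for lists of dicts, etc.
    return pd.DataFrame(data)
```

Keeping the isinstance branches here, rather than in the caller, means every entry point that accepts eval data reuses the same conversion logic, which is the reviewer's point.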

_convert_to_eval_set(df)


def test_convert_to_eval_set_evaluation_dataset():

Shouldn't we add a new fixture for EvaluationDataset to _ALL_DATA_FIXTURES if we remove this test case?

@TomeHirata dismissed their stale review November 12, 2025 08:02

have some minor suggestions

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>

@TomeHirata left a comment


LGTM, can we fix tests?

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Takes in a dataset in any of the multiple formats that mlflow.genai.evaluate() expects and
converts it into a standardized Pandas DataFrame.
"""
column_mapping = {

Finally 🙂

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@B-Step62 added this pull request to the merge queue Nov 19, 2025
Merged via the queue into mlflow:master with commit 80cc350 Nov 19, 2025
45 checks passed
@B-Step62 deleted the add-trace-id-params-for-search-traces branch November 19, 2025 14:21
Tian-Sky-Lan pushed a commit to Tian-Sky-Lan/mlflow that referenced this pull request Nov 24, 2025
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>

Labels

area/evaluation (MLflow Evaluation), rn/none (List under Small Changes in Changelogs), team-review (Trigger a team review request), v3.6.1
