[ML-59303] Add helper functions for multi-turn evaluation session processing by AveshCSingh · Pull Request #18898 · mlflow/mlflow

AveshCSingh · 2025-11-18T22:12:21Z

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18898/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18898/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18898/merge

This PR is stacked on top of #18897. Click here to see a clean diff.

What changes are proposed in this pull request?

This PR adds three internal helper functions to support multi-turn evaluation in mlflow.genai.evaluate. These are pure, side-effect-free utility functions that will be used in subsequent PRs to implement the full multi-turn evaluation feature.

Functions Added:

_classify_scorers() - Separates scorers into single-turn and multi-turn categories based on the is_multi_turn property (introduced in PR #18897)
_group_traces_by_session() - Groups evaluation items by session_id, extracting from trace metadata using TraceMetadataKey.TRACE_SESSION
_get_first_trace_in_session() - Identifies the chronologically first trace in a session using trace.info.request_time
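As a rough illustration of the first two helpers, here is a minimal sketch rather than the code merged in this PR. The `Stub*` classes are hypothetical stand-ins for MLflow's scorer, trace, and eval item types. The function names drop the leading underscore to mark them as illustrations. `TRACE_SESSION_KEY` is an assumed value for `TraceMetadataKey.TRACE_SESSION` and should be checked against the MLflow source.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Assumed value of TraceMetadataKey.TRACE_SESSION; verify against mlflow.
TRACE_SESSION_KEY = "mlflow.trace.session"


@dataclass
class StubTraceInfo:
    request_time: int
    trace_metadata: dict = field(default_factory=dict)


@dataclass
class StubTrace:
    info: StubTraceInfo


@dataclass
class StubEvalItem:
    trace: Optional[StubTrace] = None


@dataclass
class StubScorer:
    name: str
    is_multi_turn: bool = False


def classify_scorers(scorers):
    """Split scorers into (single_turn, multi_turn) lists."""
    single_turn, multi_turn = [], []
    for scorer in scorers:
        # Treating a scorer without the property as single-turn is an assumption.
        if getattr(scorer, "is_multi_turn", False):
            multi_turn.append(scorer)
        else:
            single_turn.append(scorer)
    return single_turn, multi_turn


def group_items_by_session(eval_items):
    """Group eval items by the session id stored in their trace metadata."""
    session_groups = defaultdict(list)
    for item in eval_items:
        # Items without a trace cannot belong to a session, so skip them.
        if getattr(item, "trace", None) is None:
            continue
        session_id = item.trace.info.trace_metadata.get(TRACE_SESSION_KEY)
        # Items whose trace carries no session id are skipped as well.
        if session_id is None:
            continue
        session_groups[session_id].append(item)
    # Return a plain dict so callers do not create keys by accident.
    return dict(session_groups)
```

Both functions only read their inputs and build new containers, which matches the "pure, side-effect-free" description above.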

How is this PR tested?

New unit tests (12 tests added)

Does this PR require documentation update?

No. These are internal helper functions not exposed in the public API.

Release Notes

Is this a user-facing change?

No. This PR adds internal helper functions for an upcoming feature.

Components

area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows

How should the PR be classified in the release notes?

rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section

Should this PR be included in the next patch release?

Yes (this PR will be cherry-picked and included in the next patch release)
No (this PR will be included in the next minor release)

Additional Context

This is Part 2 of the multi-turn evaluation implementation plan:

✅ PR #1: Infrastructure setup (completed in #18897 by @xsh310)
✅ PR #2: Helper functions (this PR)
🔄 PR #3: Core multi-turn evaluation logic (upcoming)
🔄 PR #4: Integration into main evaluate() (upcoming)
🔄 PR #5: Documentation and examples (upcoming)

🤖 Generated with Claude Code

github-actions · 2025-11-18T22:12:37Z

@AveshCSingh Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

This commit adds three internal helper functions to support multi-turn evaluation in mlflow.genai.evaluate:

1. _classify_scorers(): Separates scorers into single-turn and multi-turn categories based on the is_multi_turn property (added in PR mlflow#18897)
2. _group_traces_by_session(): Groups evaluation items by session_id, extracting from trace metadata using TraceMetadataKey.TRACE_SESSION
3. _get_first_trace_in_session(): Identifies the chronologically first trace in a session using trace.info.request_time

Comprehensive unit tests added covering:
- Scorer classification (4 tests)
- Session grouping (5 tests)
- First trace identification (3 tests)

All functions are pure (no side effects) and handle edge cases gracefully including None traces, missing session_ids, and empty lists.

This is part 2 of the multi-turn evaluation implementation plan. Related to PR mlflow#18897 which added the is_multi_turn property to base Scorer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

B-Step62

LGTM!

B-Step62 · 2025-11-19T03:33:34Z

mlflow/genai/evaluation/utils.py

+    return dict(session_groups)
+
+
+def _get_first_trace_in_session(


nit: Q: Any reason we don't pass list[EvalItem] here? Trace is a part of it.

AveshCSingh · 2025-11-20

That's a better idea. Updated.
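The suggestion in this thread, passing the eval items themselves instead of bare traces, could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged implementation: the `Stub*` classes and the function name are hypothetical, `request_time` is assumed to be a sortable timestamp, and the handling of empty sessions and items without a trace is a guess, since the PR only says such edge cases are handled gracefully.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubTraceInfo:
    trace_id: str
    request_time: int  # assumed to be a sortable timestamp, e.g. epoch milliseconds


@dataclass
class StubTrace:
    info: StubTraceInfo


@dataclass
class StubEvalItem:
    trace: Optional[StubTrace] = None


def get_first_item_in_session(
    session_items: list[StubEvalItem],
) -> Optional[StubEvalItem]:
    """Return the item whose trace has the smallest request_time, or None."""
    # Assumption: items without a trace are ignored, and an empty session
    # yields None instead of raising.
    with_trace = [item for item in session_items if item.trace is not None]
    if not with_trace:
        return None
    return min(with_trace, key=lambda item: item.trace.info.request_time)
```

Returning the whole item keeps the trace reachable through `item.trace`, so callers do not need a separate `(item, trace)` tuple.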

github-actions · 2025-11-19T03:49:54Z

Documentation preview for b7cc388 is available at:

https://pr-18898--mlflow-docs-preview.netlify.app/docs/latest/

More info

Ignore this comment if this PR does not change the documentation.
The preview is updated when a new commit is pushed to this PR.
This comment was created by this workflow run.
The documentation was built by this workflow run.

smoorjani · 2025-11-19T16:25:31Z

mlflow/genai/evaluation/utils.py

+        if not hasattr(item, "trace") or item.trace is None:
+            continue
+
+        session_id = item.trace.info.trace_metadata.get(TraceMetadataKey.TRACE_SESSION)


Should we also sort the traces by timestamp, or is this done in the forward pass of the scorer?

AveshCSingh · 2025-11-20

It's done in Scorer.__call__, so I don't think we need to sort here.
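The exchange above implies that grouping keeps items in input order and that chronological ordering is applied later, when the scorer is called. A minimal sketch of that later step, assuming `trace.info.request_time` is the sort key, is shown below. The helper names are hypothetical and `SimpleNamespace` objects stand in for real eval items.

```python
from types import SimpleNamespace


def make_item(trace_id, request_time):
    """Build a stand-in eval item exposing trace.info.request_time."""
    info = SimpleNamespace(trace_id=trace_id, request_time=request_time)
    return SimpleNamespace(trace=SimpleNamespace(info=info))


def sort_session_items(session_items):
    """Return a new list ordered by request_time; the input is not mutated."""
    return sorted(session_items, key=lambda item: item.trace.info.request_time)


# Items arrive in arbitrary order within a session group.
session = [make_item("t3", 300), make_item("t1", 100), make_item("t2", 200)]
ordered_ids = [item.trace.info.trace_id for item in sort_session_items(session)]
```

Using `sorted` instead of `list.sort` leaves the grouped list untouched, which keeps the grouping helper and its consumers free of shared mutable state.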

smoorjani · 2025-11-19T16:28:33Z

mlflow/genai/judges/instructions_judge/__init__.py

        return self._instructions_prompt.variables

+    @property
+    def is_multi_turn(self) -> bool:


You can rebase off of the PR from @xsh310 (#18912). Prefer consistent naming across our codebase so we don't get confused. Specifically, let's use only "session" internally; for user-facing interfaces, we can use "conversation" or, if absolutely necessary, "multi-turn".

…ality

- Move imports to top level in _group_traces_by_session
- Fix type hint: change EvalResult to EvalItem
- Use walrus operator for cleaner session_id assignment
- Simplify return types: use EvalItem instead of (item, trace) tuples
- Update all tests to match new API

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

- Move TraceInfo, TraceLocation, TraceState, TraceData imports to top
- Move TraceMetadataKey import to top
- Move EvalItem import to top
- Remove local imports from _create_mock_trace and _create_mock_eval_item helpers

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

…cessing (mlflow#18898)

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Co-authored-by: Xiang Shen <xshen.shc@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Kevin Wang <kevinwang2040@gmail.com>

…cessing (mlflow#18898)

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Co-authored-by: Xiang Shen <xshen.shc@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>

github-actions bot added the area/evaluation (MLflow Evaluation) and rn/none (List under Small Changes in Changelogs) labels Nov 18, 2025

xsh310 and others added 4 commits November 18, 2025 23:17

[ML-59303] Support multiturn judge creation with make_judge api and d…

714a963

…irect judge invocation Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

[ML-59303] Support multiturn judge creation with make_judge api and d…

b42d3d9

…irect judge invocation Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

update tests

ddd2cba

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

AveshCSingh force-pushed the ML-59303_genai-eval-helpers branch from bc644b7 to ddd2cba Compare November 18, 2025 23:18

AveshCSingh requested review from B-Step62 and smoorjani November 19, 2025 00:46

B-Step62 self-assigned this Nov 19, 2025

B-Step62 approved these changes Nov 19, 2025

smoorjani reviewed Nov 19, 2025

AveshCSingh added 4 commits November 20, 2025 02:19

resolve merge conflicts

6acf71a

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

Fix linter errors: shorten long docstrings to meet 100-char limit

b7cc388

- Break _MultiTurnTestScorer docstring into multiple lines - Break test_classify_scorers_all_multi_turn docstring into multiple lines Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

AveshCSingh added this pull request to the merge queue Nov 20, 2025

Merged via the queue into mlflow:master with commit ce7cd94 Nov 20, 2025
46 checks passed

AveshCSingh deleted the ML-59303_genai-eval-helpers branch November 20, 2025 23:39

AveshCSingh merged 8 commits into mlflow:master from AveshCSingh:ML-59303_genai-eval-helpers
