[ML-59303] Add helper functions for multi-turn evaluation session processing#18898

Merged
AveshCSingh merged 8 commits into mlflow:master from AveshCSingh:ML-59303_genai-eval-helpers
Nov 20, 2025

Conversation

@AveshCSingh AveshCSingh commented Nov 18, 2025

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18898/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18898/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18898/merge

This PR is stacked on top of #18897.

What changes are proposed in this pull request?

This PR adds three internal helper functions to support multi-turn evaluation in mlflow.genai.evaluate. These are pure, side-effect-free utility functions that will be used in subsequent PRs to implement the full multi-turn evaluation feature.

Functions Added:

  1. _classify_scorers() - Separates scorers into single-turn and multi-turn categories based on the is_multi_turn property (introduced in #18897, "[ML-59303] Support multiturn judge creation with make_judge api and direct judge invocation")

  2. _group_traces_by_session() - Groups evaluation items by session_id, extracting from trace metadata using TraceMetadataKey.TRACE_SESSION

  3. _get_first_trace_in_session() - Identifies the chronologically first trace in a session using trace.info.request_time
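
As a rough illustration of the first helper in the list above, the classification step can be sketched as below. This is a minimal sketch, not the merged implementation: StubScorer and classify_scorers are hypothetical stand-ins, and treating a missing is_multi_turn attribute as single-turn is an assumption made here.

```python
from dataclasses import dataclass


@dataclass
class StubScorer:
    """Hypothetical stand-in for an MLflow scorer; only the fields used here."""

    name: str
    is_multi_turn: bool = False


def classify_scorers(scorers):
    """Split scorers into (single_turn, multi_turn) lists, preserving input order."""
    single_turn, multi_turn = [], []
    for scorer in scorers:
        # Assumption of this sketch: a missing attribute means single-turn.
        if getattr(scorer, "is_multi_turn", False):
            multi_turn.append(scorer)
        else:
            single_turn.append(scorer)
    return single_turn, multi_turn


scorers = [
    StubScorer("relevance"),
    StubScorer("conversation_quality", is_multi_turn=True),
]
single_turn, multi_turn = classify_scorers(scorers)
```

Keeping the function pure (it only reads its input and returns new lists) matches the side-effect-free design described in this PR.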

How is this PR tested?

  • New unit tests (12 tests added)

Does this PR require documentation update?

  • No. These are internal helper functions not exposed in the public API.

Release Notes

Is this a user-facing change?

  • No. This PR adds internal helper functions for an upcoming feature.

Components

  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows

How should the PR be classified in the release notes?

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section

Should this PR be included in the next patch release?

  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Additional Context

This is Part 2 of the multi-turn evaluation implementation plan:

🤖 Generated with Claude Code

@github-actions github-actions bot added labels area/evaluation (MLflow Evaluation) and rn/none (List under Small Changes in Changelogs) Nov 18, 2025
@github-actions

@AveshCSingh Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

xsh310 and others added 4 commits November 18, 2025 23:17
…irect judge invocation

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
This commit adds three internal helper functions to support multi-turn
evaluation in mlflow.genai.evaluate:

1. _classify_scorers(): Separates scorers into single-turn and multi-turn
   categories based on the is_multi_turn property (added in PR mlflow#18897)

2. _group_traces_by_session(): Groups evaluation items by session_id,
   extracting from trace metadata using TraceMetadataKey.TRACE_SESSION

3. _get_first_trace_in_session(): Identifies the chronologically first
   trace in a session using trace.info.request_time

Comprehensive unit tests added covering:
- Scorer classification (4 tests)
- Session grouping (5 tests)
- First trace identification (3 tests)

All functions are pure (no side effects) and handle edge cases gracefully
including None traces, missing session_ids, and empty lists.
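
The edge cases named above (None traces, missing session_ids, empty lists) might be handled along these lines. This is a sketch under stated assumptions: the Stub* classes and group_items_by_session are hypothetical, SESSION_KEY is an assumed value standing in for TraceMetadataKey.TRACE_SESSION, and skipping items that have no session id is a guess at the intended behavior.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Assumed value; stands in for TraceMetadataKey.TRACE_SESSION.
SESSION_KEY = "mlflow.trace.session"


@dataclass
class StubTraceInfo:
    trace_metadata: dict = field(default_factory=dict)


@dataclass
class StubTrace:
    info: StubTraceInfo


@dataclass
class StubEvalItem:
    label: str
    trace: Optional[StubTrace] = None


def group_items_by_session(eval_items):
    """Map session id -> list of items, skipping items that cannot be grouped."""
    session_groups = defaultdict(list)
    for item in eval_items:
        trace = getattr(item, "trace", None)
        if trace is None:
            continue  # no trace: nothing to read a session id from
        session_id = trace.info.trace_metadata.get(SESSION_KEY)
        if session_id is None:
            continue  # trace exists but carries no session id
        session_groups[session_id].append(item)
    return dict(session_groups)


def make_item(label, session_id=None):
    """Build a stub item, optionally tagged with a session id."""
    metadata = {SESSION_KEY: session_id} if session_id is not None else {}
    return StubEvalItem(label, StubTrace(StubTraceInfo(metadata)))


items = [
    make_item("turn-1", "session-a"),
    make_item("turn-2", "session-a"),
    make_item("other", "session-b"),
    make_item("no-session"),
    StubEvalItem("no-trace"),
]
groups = group_items_by_session(items)
```

Items within each group keep their input order; nothing here sorts by time.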

This is part 2 of the multi-turn evaluation implementation plan.
Related to PR mlflow#18897 which added the is_multi_turn property to base Scorer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
…irect judge invocation

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@AveshCSingh AveshCSingh force-pushed the ML-59303_genai-eval-helpers branch from bc644b7 to ddd2cba Compare November 18, 2025 23:18
@B-Step62 B-Step62 self-assigned this Nov 19, 2025

@B-Step62 B-Step62 left a comment

LGTM!

return dict(session_groups)


def _get_first_trace_in_session(

nit: Q: Any reason we don't pass list[EvalItem] here? Trace is a part of it.

That's a better idea. Updated.
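
To illustrate the suggestion in this thread, a version that accepts evaluation items rather than bare traces could look like the sketch below. The Stub* classes and get_first_item_in_session are hypothetical, and request_time is assumed to be an integer timestamp in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class StubTraceInfo:
    request_time: int  # assumed: milliseconds since epoch


@dataclass
class StubTrace:
    info: StubTraceInfo


@dataclass
class StubEvalItem:
    label: str
    trace: StubTrace


def get_first_item_in_session(session_items):
    """Return the item whose trace has the smallest request_time."""
    # min() raises ValueError on an empty list; callers must pass a non-empty session.
    return min(session_items, key=lambda item: item.trace.info.request_time)


session_items = [
    StubEvalItem("second", StubTrace(StubTraceInfo(request_time=2_000))),
    StubEvalItem("first", StubTrace(StubTraceInfo(request_time=1_000))),
    StubEvalItem("third", StubTrace(StubTraceInfo(request_time=3_000))),
]
first = get_first_item_in_session(session_items)
```

Because the trace is reachable from the item, returning the item gives callers both without a separate lookup.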

github-actions bot commented Nov 19, 2025

Documentation preview for b7cc388 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

if not hasattr(item, "trace") or item.trace is None:
    continue

session_id = item.trace.info.trace_metadata.get(TraceMetadataKey.TRACE_SESSION)

Should we also sort the traces by timestamp, or is this done in the forward pass of the scorer?

It's done in Scorer.__call__, so I don't think we need to sort here.

return self._instructions_prompt.variables

@property
def is_multi_turn(self) -> bool:

You can rebase off of the PR from @xsh310 (#18912). We prefer consistent naming across our codebase to avoid confusion: use only "session" internally, and for user-facing interfaces use "conversation" or, if absolutely necessary, "multi-turn".

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
…ality

- Move imports to top level in _group_traces_by_session
- Fix type hint: change EvalResult to EvalItem
- Use walrus operator for cleaner session_id assignment
- Simplify return types: use EvalItem instead of (item, trace) tuples
- Update all tests to match new API
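
The walrus-operator change listed above is roughly the following refactor. This sketch is illustrative only: both function names are hypothetical, and SESSION_KEY is an assumed value standing in for TraceMetadataKey.TRACE_SESSION.

```python
# Assumed value; stands in for TraceMetadataKey.TRACE_SESSION.
SESSION_KEY = "mlflow.trace.session"


def session_ids_two_step(metadata_list):
    """Assign first, then test: the shape before the refactor."""
    ids = []
    for metadata in metadata_list:
        session_id = metadata.get(SESSION_KEY)
        if session_id:
            ids.append(session_id)
    return ids


def session_ids_walrus(metadata_list):
    """Assign and test in one expression using := (Python 3.8+)."""
    ids = []
    for metadata in metadata_list:
        if session_id := metadata.get(SESSION_KEY):
            ids.append(session_id)
    return ids


metadata_list = [{SESSION_KEY: "session-a"}, {}, {SESSION_KEY: "session-b"}]
```

Both forms test truthiness, so an empty-string session id is skipped along with a missing one.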

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
- Move TraceInfo, TraceLocation, TraceState, TraceData imports to top
- Move TraceMetadataKey import to top
- Move EvalItem import to top
- Remove local imports from _create_mock_trace and _create_mock_eval_item helpers

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
- Break _MultiTurnTestScorer docstring into multiple lines
- Break test_classify_scorers_all_multi_turn docstring into multiple lines

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@AveshCSingh AveshCSingh added this pull request to the merge queue Nov 20, 2025
Merged via the queue into mlflow:master with commit ce7cd94 Nov 20, 2025
46 checks passed
@AveshCSingh AveshCSingh deleted the ML-59303_genai-eval-helpers branch November 20, 2025 23:39
kevin-wangg pushed a commit to kevin-wangg/mlflow that referenced this pull request Nov 21, 2025
…cessing (mlflow#18898)

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Co-authored-by: Xiang Shen <xshen.shc@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Kevin Wang <kevinwang2040@gmail.com>
Tian-Sky-Lan pushed a commit to Tian-Sky-Lan/mlflow that referenced this pull request Nov 24, 2025
…cessing (mlflow#18898)

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Co-authored-by: Xiang Shen <xshen.shc@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>
Labels

area/evaluation (MLflow Evaluation), rn/none (List under Small Changes in Changelogs)

4 participants