[ML-59303] Add helper functions for multi-turn evaluation session processing #18898
@AveshCSingh Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s). See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.
…irect judge invocation Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
This commit adds three internal helper functions to support multi-turn evaluation in mlflow.genai.evaluate:

1. _classify_scorers(): Separates scorers into single-turn and multi-turn categories based on the is_multi_turn property (added in PR mlflow#18897)
2. _group_traces_by_session(): Groups evaluation items by session_id, extracting from trace metadata using TraceMetadataKey.TRACE_SESSION
3. _get_first_trace_in_session(): Identifies the chronologically first trace in a session using trace.info.request_time

Comprehensive unit tests added covering:
- Scorer classification (4 tests)
- Session grouping (5 tests)
- First trace identification (3 tests)

All functions are pure (no side effects) and handle edge cases gracefully, including None traces, missing session_ids, and empty lists.

This is part 2 of the multi-turn evaluation implementation plan. Related to PR mlflow#18897, which added the is_multi_turn property to base Scorer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
…irect judge invocation Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
bc644b7 to ddd2cba
mlflow/genai/evaluation/utils.py (Outdated)

    return dict(session_groups)


def _get_first_trace_in_session(

nit: Q: Any reason we don't pass list[EvalItem] here? Trace is a part of it.

That's a better idea. Updated.
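
To illustrate the suggestion above, here is a minimal sketch of a first-in-session helper that accepts evaluation items rather than bare traces. It is not the code merged in this PR: the `EvalItem`, `Trace`, and `TraceInfo` classes below are simplified stand-ins, the function name is hypothetical, and raising `ValueError` on an empty session is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


# Simplified stand-ins for the MLflow types (assumptions, not the real classes).
@dataclass
class TraceInfo:
    request_time: int  # request timestamp, e.g. milliseconds since epoch
    trace_metadata: dict = field(default_factory=dict)


@dataclass
class Trace:
    info: TraceInfo


@dataclass
class EvalItem:
    request_id: str
    trace: Optional[Trace] = None


def get_first_item_in_session(items: list[EvalItem]) -> EvalItem:
    """Return the item whose trace has the smallest request_time.

    Items without a trace are skipped. Raising on an empty result is an
    assumption of this sketch, not documented behavior of the PR.
    """
    traced = [item for item in items if item.trace is not None]
    if not traced:
        raise ValueError("session contains no items with a trace")
    return min(traced, key=lambda item: item.trace.info.request_time)


session_items = [
    EvalItem("turn-2", Trace(TraceInfo(request_time=2000))),
    EvalItem("turn-1", Trace(TraceInfo(request_time=1000))),
    EvalItem("no-trace"),
]
first = get_first_item_in_session(session_items)
```

Passing the whole item keeps the caller from having to unpack `(item, trace)` pairs, since the trace is already reachable as `item.trace`.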
mlflow/genai/evaluation/utils.py (Outdated)

        if not hasattr(item, "trace") or item.trace is None:
            continue

        session_id = item.trace.info.trace_metadata.get(TraceMetadataKey.TRACE_SESSION)

Should we also sort the traces by timestamp? Or is this done in the forward pass of the scorer?

It's done in Scorer.__call__, so I don't think we need to sort here.
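
A rough sketch of the grouping step discussed in this thread, assuming simplified stand-in types rather than the real MLflow classes. `TRACE_SESSION_KEY` is a placeholder for `TraceMetadataKey.TRACE_SESSION` (its actual string value is not shown on this page), and the function name is hypothetical. Consistent with the reply above, the sketch does not sort: each group keeps the input order.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Placeholder for TraceMetadataKey.TRACE_SESSION (assumed value, illustration only).
TRACE_SESSION_KEY = "example.trace.session"


# Simplified stand-ins for the MLflow types (assumptions, not the real classes).
@dataclass
class TraceInfo:
    request_time: int
    trace_metadata: dict = field(default_factory=dict)


@dataclass
class Trace:
    info: TraceInfo


@dataclass
class EvalItem:
    request_id: str
    trace: Optional[Trace] = None


def group_items_by_session(items: list[EvalItem]) -> dict[str, list[EvalItem]]:
    """Group items by the session id stored in their trace metadata.

    Items with no trace or no session id are skipped. No sorting is done here.
    """
    session_groups: dict[str, list[EvalItem]] = defaultdict(list)
    for item in items:
        if getattr(item, "trace", None) is None:
            continue
        # Walrus operator: bind the session id and test it in one expression.
        if session_id := item.trace.info.trace_metadata.get(TRACE_SESSION_KEY):
            session_groups[session_id].append(item)
    return dict(session_groups)


items = [
    EvalItem("a", Trace(TraceInfo(300, {TRACE_SESSION_KEY: "s1"}))),
    EvalItem("b", Trace(TraceInfo(100, {TRACE_SESSION_KEY: "s2"}))),
    EvalItem("c", Trace(TraceInfo(200, {TRACE_SESSION_KEY: "s1"}))),
    EvalItem("d", Trace(TraceInfo(400))),  # no session id in metadata
    EvalItem("e"),  # no trace at all
]
groups = group_items_by_session(items)
```

Returning a plain `dict` instead of the `defaultdict` avoids silently creating empty groups when a caller looks up an unknown session id.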
        return self._instructions_prompt.variables

    @property
    def is_multi_turn(self) -> bool:
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
…ality
- Move imports to top level in _group_traces_by_session
- Fix type hint: change EvalResult to EvalItem
- Use walrus operator for cleaner session_id assignment
- Simplify return types: use EvalItem instead of (item, trace) tuples
- Update all tests to match new API

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

- Move TraceInfo, TraceLocation, TraceState, TraceData imports to top
- Move TraceMetadataKey import to top
- Move EvalItem import to top
- Remove local imports from _create_mock_trace and _create_mock_eval_item helpers

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>

- Break _MultiTurnTestScorer docstring into multiple lines
- Break test_classify_scorers_all_multi_turn docstring into multiple lines

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
…cessing (mlflow#18898) Signed-off-by: Avesh Singh <aveshcsingh@gmail.com> Co-authored-by: Xiang Shen <xshen.shc@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Kevin Wang <kevinwang2040@gmail.com>
…cessing (mlflow#18898) Signed-off-by: Avesh Singh <aveshcsingh@gmail.com> Co-authored-by: Xiang Shen <xshen.shc@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Tian Lan <sky.blue266000@gmail.com>
This PR is stacked on top of #18897.
What changes are proposed in this pull request?
This PR adds three internal helper functions to support multi-turn evaluation in mlflow.genai.evaluate. These are pure, side-effect-free utility functions that will be used in subsequent PRs to implement the full multi-turn evaluation feature.

Functions Added:

- _classify_scorers() - Separates scorers into single-turn and multi-turn categories based on the is_multi_turn property (introduced in PR #18897, "[ML-59303] Support multiturn judge creation with make_judge api and direct judge invocation")
- _group_traces_by_session() - Groups evaluation items by session_id, extracting from trace metadata using TraceMetadataKey.TRACE_SESSION
- _get_first_trace_in_session() - Identifies the chronologically first trace in a session using trace.info.request_time

How is this PR tested?
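
As a rough illustration of the kind of check the scorer-classification tests might make, here is a minimal sketch. `StubScorer` is a hypothetical stand-in for the real Scorer class, the function name is illustrative, and the `(single_turn, multi_turn)` return order is an assumption of this example rather than something this page confirms.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a scorer exposing the is_multi_turn property.
@dataclass
class StubScorer:
    name: str
    is_multi_turn: bool = False


def classify_scorers(scorers: list[StubScorer]) -> tuple[list[StubScorer], list[StubScorer]]:
    """Split scorers into (single_turn, multi_turn), preserving input order.

    The return order is an assumption made for this sketch.
    """
    single_turn: list[StubScorer] = []
    multi_turn: list[StubScorer] = []
    for scorer in scorers:
        # Route each scorer to the list matching its is_multi_turn flag.
        (multi_turn if scorer.is_multi_turn else single_turn).append(scorer)
    return single_turn, multi_turn


single, multi = classify_scorers(
    [
        StubScorer("relevance"),
        StubScorer("conversation_coherence", is_multi_turn=True),
        StubScorer("safety"),
    ]
)
assert [s.name for s in single] == ["relevance", "safety"]
assert [s.name for s in multi] == ["conversation_coherence"]
assert classify_scorers([]) == ([], [])
```

The function neither mutates its input nor touches external state, which matches the "pure, side-effect-free" property the description claims for these helpers.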
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Components
area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
How should the PR be classified in the release notes?
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
Should this PR be included in the next patch release?
Additional Context
This is Part 2 of the multi-turn evaluation implementation plan:
🤖 Generated with Claude Code