
[3/4] Add support for multi-turn deepeval scorers#19263

Merged
smoorjani merged 27 commits into mlflow:master from smoorjani:stack/deepeval-multiturn
Dec 16, 2025

Conversation

@smoorjani (Collaborator) commented Dec 8, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19263/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19263/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19263/merge

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Adding support for multi-turn deepeval scorers.
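For context, single-turn deepeval scorers operate on one trace at a time, while multi-turn (conversational) scorers consume an entire session: all traces that share a session ID in their trace metadata. As an illustration only (the `group_traces_by_session` helper and the plain-dict trace shape are hypothetical stand-ins for MLflow's Trace objects and `TraceMetadataKey.TRACE_SESSION`, not this PR's actual API), grouping a flat trace list into sessions might look like:

```python
# Hypothetical sketch: group a flat list of trace records into sessions.
# The dict-based trace shape and the "mlflow.trace.session" key are
# illustrative stand-ins, not MLflow's real objects.
from collections import defaultdict

SESSION_KEY = "mlflow.trace.session"

def group_traces_by_session(traces):
    """Return {session_id: [trace, ...]}, preserving trace order."""
    sessions = defaultdict(list)
    for trace in traces:
        session_id = trace.get("metadata", {}).get(SESSION_KEY)
        if session_id:  # traces without a session ID are excluded
            sessions[session_id].append(trace)
    return dict(sessions)

traces = [
    {"id": "t1", "metadata": {SESSION_KEY: "s1"}},
    {"id": "t2", "metadata": {SESSION_KEY: "s1"}},
    {"id": "t3", "metadata": {}},  # no session -> excluded
    {"id": "t4", "metadata": {SESSION_KEY: "s2"}},
]
sessions = group_traces_by_session(traces)
```

Each session's trace list is then what a conversational scorer receives, as shown by the `scorer(session=session)` call in the manual test below.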

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests
import mlflow
from mlflow.entities.trace_info import TraceMetadataKey
from mlflow.genai.scorers.deepeval import ConversationCompleteness

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment(experiment_id="3011836326718646")

traces = mlflow.search_traces(return_type="list")
session_id = None
for trace in traces:
    session_id = trace.info.trace_metadata.get(TraceMetadataKey.TRACE_SESSION)
    if session_id:
        break

session = mlflow.search_traces(
    return_type="list",
    filter_string=f"metadata.`{TraceMetadataKey.TRACE_SESSION}` = '{session_id}'",
)

print(f"Testing session {session_id} with {len(session)} traces")

scorer = ConversationCompleteness(model="databricks", threshold=0.5)
feedback = scorer(session=session)

if feedback.error:
    print(f"Error: {feedback.error}")
else:
    print(f"Value: {feedback.value}")
    print(f"Score: {feedback.metadata.get('score')}")
    print(f"Rationale: {feedback.rationale}")

outputs:

Testing session 9a14d892-2315-408c-a30e-81edce461425 with 6 traces
Value: yes
Score: 0.6
Rationale: The score is 0.6 because while the LLM addressed some of the user's requests, such as modifying documentation and removing specific emojis, it failed to simplify explanations as the user intended. Additionally, the LLM did not adequately resolve the user's parsing error, merely stating that it should be fixed without confirming the specific issue or providing steps for verification.
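A conversational metric needs the session's traces in chronological order to reconstruct the dialogue before scoring it. As a hedged sketch (the `session_to_turns` helper and the dict-based trace shape are hypothetical, not part of this PR's implementation), ordering traces by timestamp and extracting (input, output) turn pairs could look like:

```python
# Hypothetical sketch: turn an unordered session of traces into ordered
# conversation turns. The {"timestamp_ms", "request", "response"} dicts
# stand in for MLflow Trace objects.
def session_to_turns(session):
    """Sort traces by timestamp and return (input, output) turn pairs."""
    ordered = sorted(session, key=lambda t: t["timestamp_ms"])
    return [(t["request"], t["response"]) for t in ordered]

session = [
    {"timestamp_ms": 200, "request": "Remove the emojis", "response": "Done."},
    {"timestamp_ms": 100, "request": "Simplify the docs", "response": "Sure."},
]
turns = session_to_turns(session)
```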

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Will add a follow-up PR for this.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
github-actions bot added the area/evaluation (MLflow Evaluation) and rn/none (List under Small Changes in Changelogs) labels Dec 8, 2025
github-actions bot (Contributor) commented Dec 8, 2025

Documentation preview for f6f2472 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@smoorjani force-pushed the stack/deepeval-multiturn branch from c3033da to 5a548bf on December 10, 2025 13:48
assert result.error.error_code == "RuntimeError"
assert result.error.error_message == "Test error"
assert result.source.source_type == AssessmentSourceType.LLM_JUDGE

Collaborator

This is perhaps out of scope for this PR, but are you planning on adding integration tests that install the latest deepeval and confirm that a single-turn and a multi-turn scorer work?

Collaborator Author

Good question - I think we'll need to do this for all integrations, did you have a specific code pointer/place in mind? I can file a follow-up ticket.

Collaborator

Yes, check out how langchain integration testing works: https://sourcegraph.prod.databricks-corp.com/mlflow/mlflow/-/tree/tests/langchain

From Claude:

  Great question! Yes, integration tests that install the actual deepeval package would be valuable. Here's where they should go:

  Location: tests/genai/scorers/deepeval/

  You'd create a new directory structure similar to how other integrations are organized (e.g., tests/langchain/, tests/openai/). Based on MLflow's patterns, I'd recommend:

  tests/genai/scorers/deepeval/
  ├── __init__.py
  ├── conftest.py  # For fixtures and deepeval-specific setup
  └── test_deepeval_integration.py  # Integration tests with real deepeval

  CI Integration: The integration tests would run as part of the existing genai CI job in .github/workflows/master.yml (around line 380-420). You'd need to:

  1. Add deepeval to the pip install line in the genai job:
  - name: Install dependencies
    run: |
      source ./dev/install-common-deps.sh
      pip install openai dspy deepeval  # Add deepeval here

  2. The tests would automatically run with pytest tests/genai

Claude also commends you on your decision to include this in a follow-up PR :D

@AveshCSingh (Collaborator) left a comment

Left one small comment, otherwise LGTM. Please track the follow-ups.


@smoorjani added this pull request to the merge queue Dec 16, 2025
Merged via the queue into mlflow:master with commit b5ee71a Dec 16, 2025
47 checks passed
@smoorjani deleted the stack/deepeval-multiturn branch December 16, 2025 22:19
WeichenXu123 pushed a commit to WeichenXu123/mlflow that referenced this pull request Dec 19, 2025
WeichenXu123 pushed a commit that referenced this pull request Dec 19, 2025

Labels

area/evaluation (MLflow Evaluation), rn/none (List under Small Changes in Changelogs), v3.8.0

3 participants