
Add TruLens third-party scorer integration#19492

Merged
smoorjani merged 41 commits into mlflow:master from debu-sinha:feature/trulens-scorers
Feb 1, 2026
Conversation


@debu-sinha debu-sinha commented Dec 18, 2025

Related Issues/PRs

Partial fix for #19062
Consolidates #19328 (agent trace scorers)

What changes are proposed in this pull request?

Implements TruLens feedback functions as MLflow GenAI scorers for evaluating LLM applications. This enables users to leverage TruLens' evaluation capabilities directly within MLflow's evaluation framework.

RAG Evaluation Scorers:

  • Groundedness: Evaluates whether outputs are grounded in the provided context
  • ContextRelevance: Evaluates relevance of the retrieved context to the query
  • AnswerRelevance: Evaluates relevance of the answer to the query
  • Coherence: Evaluates the logical flow of outputs

Agent Trace Evaluation Scorers:

  • LogicalConsistency: Evaluates the reasoning quality of agent traces
  • ExecutionEfficiency: Evaluates execution efficiency
  • PlanAdherence: Evaluates adherence to the plan
  • PlanQuality: Evaluates plan quality
  • ToolSelection: Evaluates tool selection quality
  • ToolCalling: Evaluates tool calling quality

Based on TruLens' benchmarked goal-plan-action alignment evaluations, which achieve 95% error coverage on the TRAIL benchmark.


Usage Examples

RAG Evaluation

import mlflow
from mlflow.genai.scorers.trulens import Groundedness, ContextRelevance

# Direct scorer call (`trace` is an MLflow Trace captured from your application)
scorer = Groundedness(model="openai:/gpt-4o-mini")
feedback = scorer(trace=trace)

# With mlflow.genai.evaluate()
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        Groundedness(model="openai:/gpt-4o-mini"),
        ContextRelevance(model="openai:/gpt-4o-mini"),
    ],
)

Agent Trace Evaluation

import mlflow
from mlflow.genai.scorers.trulens import LogicalConsistency, PlanAdherence

# Evaluate agent traces
traces = mlflow.search_traces(experiment_ids=["1"])
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        LogicalConsistency(model="openai:/gpt-4o-mini"),
        PlanAdherence(model="openai:/gpt-4o-mini"),
    ],
)

Model Support:

  • OpenAI: model="openai:/gpt-4o-mini"
  • Databricks managed judge: model="databricks"
  • Databricks serving endpoint: model="databricks:/endpoint-name"
  • Any LiteLLM provider: model="anthropic:/claude-3", model="bedrock:/model-id", etc.
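The `provider:/model` URI scheme above can be illustrated with a minimal parser. This is a hypothetical stand-in for the shared `_parse_model_uri` helper mentioned later in the thread; the function name and validation here are illustrative, and the bare `"databricks"` URI is assumed to be special-cased elsewhere:

```python
def parse_model_uri(model_uri: str) -> tuple[str, str]:
    # Split "provider:/model" into (provider, model name).
    # Illustrative sketch only, not MLflow's actual helper.
    prefix, sep, model_name = model_uri.partition(":/")
    if not sep or not prefix or not model_name:
        raise ValueError(f"Malformed model uri: {model_uri!r}")
    return prefix, model_name


print(parse_model_uri("openai:/gpt-4o-mini"))  # ('openai', 'gpt-4o-mini')
```

Note that `provider:model` (without the slash) is rejected, matching the error format the PR's tests assert against.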

How is this PR tested?

  • 46 unit tests (all passing)
  • Manual integration tests with real OpenAI API calls

Real API Integration Test Results

RAG Scorers (with OpenAI gpt-4o-mini)

1. Groundedness
   Input: "The Eiffel Tower is 330 meters tall and located in Paris."
   Context: "The Eiffel Tower stands at 330 meters in Paris, France."
   Value: yes
   Score: 1.0
   Threshold: 0.5

2. ContextRelevance
   Query: "What is the height of the Eiffel Tower?"
   Context: "The Eiffel Tower is 330 meters tall. It was built in 1889."
   Value: yes
   Score: 1.0

3. AnswerRelevance
   Query: "What is the capital of France?"
   Response: "Paris is the capital of France, known for the Eiffel Tower."
   Value: yes
   Score: 1.0

4. Coherence
   Text: "MLflow is an open-source platform. It manages the ML lifecycle..."
   Value: yes
   Score: 0.67

Agent Trace Scorers (with OpenAI gpt-4o-mini)

Agent Trace: research_agent with planning → doc_search → summarize spans

5. LogicalConsistency
   Value: 1.0

6. ExecutionEfficiency
   Value: 1.0

7. PlanAdherence
   Value: 1.0

8. PlanQuality
   Value: 0.67

9. ToolSelection
   Value: 1.0

10. ToolCalling
    Value: 1.0
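The yes/no values and scores above are related by the default threshold of 0.5. A minimal sketch of that conversion (a hypothetical helper; the real scorers return MLflow Feedback objects carrying the same metadata keys, and the boundary behavior at exactly the threshold is an assumption):

```python
def score_to_feedback(score: float, threshold: float = 0.5) -> dict:
    # Convert a numeric TruLens score into a categorical pass/fail result.
    # Illustrative only; the actual integration returns a Feedback object.
    return {
        "value": "yes" if score >= threshold else "no",
        "metadata": {
            "mlflow.scorer.framework": "trulens",
            "score": score,
            "threshold": threshold,
        },
    }
```

With this logic, the Coherence score of 0.67 above still yields "yes", while a score of 0.3 would yield "no".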

Batch Evaluation with mlflow.genai.evaluate()

# RAG scorers batch evaluation
results = mlflow.genai.evaluate(data=eval_data, scorers=[...])

Metrics:
   AnswerRelevance/mean: 1.00
   Coherence/mean: 1.00
   ContextRelevance/mean: 1.00
   Groundedness/mean: 1.00

# Agent trace scorers batch evaluation
results = mlflow.genai.evaluate(data=[trace], scorers=[...])

Metrics:
   execution_efficiency/mean: 1.00
   logical_consistency/mean: 1.00
   plan_adherence/mean: 1.00

Does this PR require documentation update?

  • Yes. I've updated:
    • API docstrings with examples

Release Notes

Is this a user-facing change?

  • Yes

Added TruLens third-party scorer integration for MLflow GenAI evaluation:

RAG Scorers: Groundedness, ContextRelevance, AnswerRelevance, Coherence

Agent Trace Scorers: LogicalConsistency, ExecutionEfficiency, PlanAdherence, PlanQuality, ToolSelection, ToolCalling

Install dependencies:

pip install trulens trulens-providers-litellm

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?

  • rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes
  • No (this PR will be included in the next minor release)

Implements TruLens feedback functions as MLflow GenAI scorers for
evaluating LLM applications within MLflow's evaluation framework.

TruLens scorers:
- Groundedness: Evaluates if outputs are grounded in context
- ContextRelevance: Evaluates context relevance to query
- AnswerRelevance: Evaluates answer relevance to query
- Coherence: Evaluates logical flow of outputs

Features:
- Multiple model providers: OpenAI, LiteLLM, Bedrock, Cortex
- Databricks managed judge support via call_chat_completions
- Databricks serving endpoint support
- Trace input extraction for RAG evaluation
- Configurable pass/fail threshold
- Consistent scorer interface returning Feedback objects

Partial fix for mlflow#19062

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@github-actions

@debu-sinha Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template

This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. labels Dec 18, 2025
- Move TruLensScorer and metric classes from trulens.py to __init__.py
- Delete trulens.py (implementation now in __init__.py)

This matches the module structure used in mlflow/genai/scorers/deepeval/

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha
Contributor Author

@B-Step62 This TruLens PR is ready for review. It follows the same patterns as the Phoenix PR (#19473) based on your feedback - using call_chat_completions for Databricks managed judge, match/case syntax, version 3.9.0, and the deepeval module structure.

debu-sinha added a commit to debu-sinha/mlflow that referenced this pull request Dec 18, 2025
Based on feedback from PRs mlflow#19473 and mlflow#19492, apply consistent patterns:
- Update experimental version from 3.8.0 to 3.9.0
- Remove score clamping (pass through third-party scores directly)
- Use match/case syntax for provider selection
- Return None from _format_rationale when no reasoning available
- Simplify module docstring to match deepeval pattern
- Remove score clamping test and update empty rationale test

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Adds test coverage for the TruLens scorer integration following
the same patterns established in the DeepEval tests:

- test_trulens.py: Core scorer functionality tests for all metric types
  (Groundedness, ContextRelevance, AnswerRelevance, Coherence)
- test_models.py: Provider creation tests for Databricks, OpenAI, LiteLLM
- test_registry.py: Metric registry lookup tests
- test_utils.py: Input mapping and rationale formatting tests

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha commented Dec 26, 2025

Added unit tests covering:

  • Core scorer functionality (Groundedness, ContextRelevance, AnswerRelevance, Coherence)
  • Provider creation (Databricks managed judge, serving endpoints, OpenAI, LiteLLM)
  • Metric registry lookup
  • Input mapping and rationale formatting

All 42 tests passing.

- Remove one-liner docstring from mock_trulens_dependencies fixture
  to align with Phoenix PR review feedback
- Add 'Tru' to typos extend-words for TruLens library name

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha commented Jan 8, 2026

Based on Phoenix PR review feedback:

  • Removed one-liner docstring from mock_trulens_dependencies fixture
  • Added Tru to typos allowlist for TruLens library name

@smoorjani @B-Step62 - Ready for review. Thanks!

- Remove file docstrings from registry.py and utils.py
- Use pytest.importorskip at module level (tests fail if trulens not installed)
- Reduce mocking in tests: use real providers, only mock feedback method calls
- Add test_trulens_scorer_provider_is_real_instance to verify real providers
- Fix lint issues: parameterize dict type, use walrus operator
- Flatten nested with blocks in tests

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>

github-actions bot commented Jan 13, 2026

Documentation preview for 56f703c is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

- Convert class-based tests to function-based in test_utils.py (MLF0029)
- Use @pytest.mark.parametrize for test cases with similar patterns
- Use monkeypatch.setenv for OpenAI API key in test_create_trulens_provider_openai

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@smoorjani (Collaborator) left a comment
Thanks for writing this integration! Directionally, looks mostly good to me! Left comments to address, LMK if you have questions.

_logger = logging.getLogger(__name__)

# Threshold for determining pass/fail
DEFAULT_THRESHOLD = 0.5
Collaborator:

Suggested change
DEFAULT_THRESHOLD = 0.5
_DEFAULT_THRESHOLD = 0.5

nit: let's make this internal

Contributor Author:

Done - renamed to _DEFAULT_THRESHOLD with underscore prefix (line 43).

TruLensScorer instance that can be called with MLflow's scorer interface

Examples:
>>> scorer = get_scorer("Groundedness", model="openai:/gpt-4")
Collaborator:

can we use the .. code-block:: python format for documenting these?

Contributor Author:

Done - all examples now use .. code-block:: python format.
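For reference, a docstring using the `.. code-block:: python` directive might look like the following (an illustrative class, not the actual MLflow source):

```python
class ExampleScorer:
    """Evaluates whether outputs are grounded in the provided context.

    Example:

    .. code-block:: python

        scorer = ExampleScorer(model="openai:/gpt-4o-mini")
        feedback = scorer(trace=trace)
    """
```

Sphinx renders the indented block under the directive as highlighted Python, which is the format the reviewer asked for here.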

self,
metric_name: str | None = None,
model: str | None = None,
threshold: float = DEFAULT_THRESHOLD,
Collaborator:

do we need a kwargs here to pass into the trulens class? same for the get_scorer method below

Contributor Author:

Done - added **kwargs: Any to both TruLensScorer.__init__ and get_scorer, passed through to create_trulens_provider.

scorer = Groundedness(model="openai:/gpt-4")
feedback = scorer(
outputs="The Eiffel Tower is 330 meters tall.",
expectations={"context": "The Eiffel Tower stands at 330 meters."},
Collaborator:

it's a bit odd to pass context via expectations since context is not a ground-truth - can we have this example use trace instead? same above on L162

Contributor Author:

Done - all examples now use trace=trace instead of expectations.

supported by the source material.

Args:
model: Model URI (e.g., "openai:/gpt-4", "databricks", "databricks:/endpoint")
Collaborator:

can we use @format_docstring(_MODEL_API_DOC) for this as well? same for all scorers below

Contributor Author:

Done - added @format_docstring(_MODEL_API_DOC) to all scorers.

get_feedback_method_name("InvalidMetric")


def test_get_metric_config_groundedness():
Collaborator:

let's update these tests after updating registry.py above, but generally aim to parameterize repetitive test patterns.

Contributor Author:

Done. Simplified tests to match the new minimal registry. Tests are now parameterized to verify each metric maps to its correct feedback method name.

)

assert result.value == CategoricalRating.NO
assert result.metadata["score"] == 0.3
Collaborator:

let's assert over the entire result object so we know the exact output. same for tests below/above (e.g., include threshold)

Contributor Author:

Done. Integration tests use real scorer instances with only the provider's _create_chat_completion method mocked. Tests verify Feedback objects are created correctly and scores are returned.

Collaborator:

can we assert something like:

assert result.metadata == {
  "score": 0.3,
  "threshold": ...,
  ...
}

just so we know everything inside the metadata? same for all the tests below.

even stronger (and most preferable) is to do:

assert result == Feedback(
  value=...,
  rationale=...,
  metadata=...,
)

Contributor Author:

Done - all tests now assert exact metadata values using:

assert result.metadata == {
    'mlflow.scorer.framework': 'trulens',
    'score': score,
    'threshold': 0.5,
}

Contributor Author:

Regarding the full assert result == Feedback(...) pattern: The Feedback class has auto-generated timestamps (create_time_ms, last_update_time_ms) set in __post_init__, making exact object comparison non-deterministic. We assert all meaningful fields instead:

  • result.name
  • result.value
  • result.rationale
  • result.source.source_type
  • result.source.source_id
  • result.metadata (full dict)

This provides equivalent coverage while being deterministic.

Contributor Author:

Updated tests to assert all meaningful fields in each test:

assert isinstance(result, Feedback)
assert result.name == "Groundedness"
assert result.value == CategoricalRating.NO
assert result.rationale == "reason: Low score"
assert result.source.source_type == AssessmentSourceType.LLM_JUDGE
assert result.source.source_id == "openai:/gpt-4"
assert result.metadata == {
    "mlflow.scorer.framework": "trulens",
    "score": 0.3,
    "threshold": 0.5,
}

All 10 tests now follow this pattern, providing equivalent coverage to full object comparison.

Contributor Author:

Done - updated all tests to assert all meaningful fields:

assert isinstance(result, Feedback)
assert result.name == "Groundedness"
assert result.value == CategoricalRating.NO
assert result.rationale == "reason: Low score"
assert result.source.source_type == AssessmentSourceType.LLM_JUDGE
assert result.source.source_id == "openai:/gpt-4"
assert result.metadata == {
    "mlflow.scorer.framework": "trulens",
    "score": 0.3,
    "threshold": 0.5,
}

The full assert result == Feedback(...) pattern isn't practical because Feedback has auto-generated timestamps (create_time_ms, last_update_time_ms) set in __post_init__. The above approach provides equivalent deterministic coverage.



def test_map_scorer_inputs_with_trace():
mock_trace = Mock()
Collaborator:

let's use a real trace similar to the Arize tests

Contributor Author:

Done. Tests use real Trace objects created with create_test_trace() helper to verify trace-based context extraction works correctly with actual MLflow trace structures.

assert format_trulens_rationale(reasons) == expected


def test_format_trulens_rationale_multiple_reasons():
Collaborator:

can we include these in the parameterized test above (test_format_trulens_rationale) - same for test_format_trulens_rationale_dict_reason

Contributor Author:

Done. Tests are parameterized using @pytest.mark.parametrize to test all metrics with their expected TruLens argument names, avoiding repetitive test code.

Resolve conflict in CI workflow by combining Phoenix and TruLens deps.
Switch to trulens-providers-litellm per reviewer feedback.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Make DEFAULT_THRESHOLD internal with _ prefix
- Add **kwargs support to pass to TruLens providers
- Simplify to use LiteLLM for all non-Databricks providers
- Update docstring examples to use trace instead of expectations
- Add @format_docstring decorator to all scorer classes
- Simplify registry by removing unused get_metric_config
- Simplify rationale formatting in utils
- Update tests to use LiteLLM and assert sub-method calls
- Fix pip install message to reference trulens-providers-litellm

Signed-off-by: Debu Sinha <debu.sinha@example.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha commented Jan 14, 2026

@smoorjani Thanks for the review. Addressed everything:

  • Made _DEFAULT_THRESHOLD internal with underscore prefix
  • Added **kwargs support to pass args through to TruLens classes
  • Switched to LiteLLM for all non-Databricks providers (cleaner than separate OpenAI/Bedrock/Cortex cases)
  • Updated examples to use trace instead of expectations
  • Added @format_docstring(_MODEL_API_DOC) to all scorers
  • Simplified registry by removing unused get_metric_config
  • Fixed pip install message to reference trulens-providers-litellm

Tests: 33 passed, 2 skipped. The skips are due to a TruLens LiteLLM instrumentation bug - filed issue #2327 and fix PR truera/trulens#2328 (approved, pending merge).

Kept the metric-specific argument mapping in utils.py since each TruLens feedback function expects different arg names. Happy to refactor if you prefer a different pattern.

- Update model examples from openai:/gpt-4 to openai:/gpt-5
- Remove "databricks" from OSS docstring examples
- Move serialize_chat_messages_to_prompts from scorer_utils.py to message_utils.py
- Update test_utils.py to use mlflow.start_span instead of constructing traces directly

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author)

Addressed all remaining feedback:

  1. Model examples updated - Changed all openai:/gpt-4 to openai:/gpt-5 in docstrings

  2. Removed "databricks" from OSS docs - All examples now use openai:/gpt-5 instead. Databricks-specific docs can be added separately.

  3. Moved serialize_chat_messages_to_prompts to message_utils.py - The function that converts dict messages to Databricks prompts is now in mlflow/genai/utils/message_utils.py alongside the litellm Message version.

  4. Updated test_utils.py - Now uses mlflow.start_span() instead of manually constructing trace objects.

All 46 tests pass. Ready for re-review!

- Merge test_trulens_scorer_fail into parameterized test
- Remove redundant metric_name parameter (same as scorer_name)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author)

Also addressed the test feedback:

  • Merged test_trulens_scorer_fail into the parameterized test_trulens_scorer test with a new case for score=0.3 returning CategoricalRating.NO

  • Simplified parameterization - Removed redundant metric_name parameter since it's always the same as scorer_name

All 46 tests pass.

…tils

- Use _parse_model_uri from mlflow/metrics/genai/model_utils.py in TruLens, DeepEval, Phoenix, and Ragas models
- Move serialize_messages_to_databricks_prompts to mlflow/genai/utils/message_utils.py
- Update imports in databricks_managed_judge_adapter.py and simulator.py
- Update tests to match new error message format from _parse_model_uri

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author)

Pushed additional fixes:

  1. Using _parse_model_uri across all 4 integrations - TruLens, DeepEval, Phoenix, and Ragas now all use the shared helper from mlflow/metrics/genai/model_utils.py instead of inline parsing.

  2. Consolidated serialize_messages_to_databricks_prompts - Removed duplicate from databricks_managed_judge_adapter.py, now imported from mlflow/genai/utils/message_utils.py by both judges and simulators modules.

  3. Updated test error messages - Tests now match the "Malformed model uri" error format from _parse_model_uri.

All tests pass.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author)

Added the missing tests/genai/utils/test_message_utils.py with 8 tests covering:

  • Basic, system, assistant, and multi-user messages
  • Tool calls and tool response messages
  • Full conversation serialization
  • Edge cases (empty messages)

All tests pass.

… provider test

Test now validates that the Databricks managed judge provider uses call_chat_completions
as expected when _create_chat_completion is invoked.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author)

All review comments have been addressed:

  1. Model examples updated - Changed from openai:/gpt-4 to openai:/gpt-5 across all docstrings
  2. Removed "databricks" from OSS docs - Examples now use model="openai:/gpt-5" consistently
  3. Using _parse_model_uri helper - Replaced inline parsing with shared helper from mlflow/metrics/genai/model_utils.py across TruLens, DeepEval, Phoenix, and Ragas integrations
  4. Consolidated message utilities - Moved serialize_messages_to_databricks_prompts and serialize_chat_messages_to_prompts to mlflow/genai/utils/message_utils.py and updated all imports
  5. Added tests for message utils - Created tests/genai/utils/test_message_utils.py with 8 unit tests
  6. Test improvements:
    • Merged pass/fail test cases into parameterized test
    • Simplified trace creation using mlflow.start_span()
    • Added assertion to verify call_chat_completions is called in Databricks provider test

All TruLens tests pass (46 tests). Ready for re-review.

@smoorjani (Collaborator) left a comment

left a few nits to address before merging, otherwise looks great!

from typing import Any


def serialize_messages_to_databricks_prompts(
Collaborator:

can we merge these two functions? they look quite similar

from mlflow.genai.utils.message_utils import serialize_messages_to_databricks_prompts


class TestSerializeMessagesToDatabricksPrompts:
Collaborator:

can we follow the format for other pytest files? claude does this a lot, but we don't use this pattern of creating a test class.


class TestSerializeMessagesToDatabricksPrompts:
def test_basic_user_message(self):
msg = Mock()
Collaborator:

Can we use the ChatMessage object directly instead of a mock?

assert user_prompt == "Hello"
assert system_prompt is None

def test_system_message(self):
Collaborator:

I think we can just parameterize all these tests into a handful or single test

@AveshCSingh (Collaborator) left a comment

I mostly defer to Samraj's thorough review; the implementation and validations described in the PR look reasonable. Left one small comment inline.

One thing we should consider is whether to hook up 3p scorers with the MLflow AI Gateway. This does not block the PR merge though, and is a potential future improvement. cc @BenWilson2 @B-Step62



def test_serialize_chat_messages_to_prompts_basic():
from mlflow.genai.scorers.scorer_utils import serialize_chat_messages_to_prompts
Collaborator:

Shouldn't serialize_chat_messages_to_prompts be imported from message_utils, here and below?

)

# Parse provider:/model format using shared helper
provider, model_name = _parse_model_uri(model_uri)
Collaborator:

👍

- Merge serialize_messages_to_databricks_prompts and serialize_chat_messages_to_prompts
  into a unified serialize_messages_to_prompts function that handles both Message
  objects and dicts
- Add backwards compatibility aliases for existing imports
- Refactor test_message_utils.py: remove test class, use real ChatMessage objects
  instead of mocks, parameterize tests
- Remove misplaced serialization tests from test_scorer_utils.py (now covered in
  test_message_utils.py)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Update tests to expect MlflowException when model URI lacks required slash.
The _parse_model_uri function requires format provider:/model, not provider:model.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha debu-sinha requested a review from smoorjani January 31, 2026 01:20
Resolve CI workflow conflict by including all packages:
- deepeval, ragas, arize-phoenix-evals (existing)
- trulens, trulens-providers-litellm (TruLens PR)
- guardrails-ai (from master)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
