Add TruLens third-party scorer integration #19492
Implements TruLens feedback functions as MLflow GenAI scorers for evaluating LLM applications within MLflow's evaluation framework.

TruLens scorers:
- Groundedness: Evaluates if outputs are grounded in context
- ContextRelevance: Evaluates context relevance to query
- AnswerRelevance: Evaluates answer relevance to query
- Coherence: Evaluates logical flow of outputs

Features:
- Multiple model providers: OpenAI, LiteLLM, Bedrock, Cortex
- Databricks managed judge support via call_chat_completions
- Databricks serving endpoint support
- Trace input extraction for RAG evaluation
- Configurable pass/fail threshold
- Consistent scorer interface returning Feedback objects

Partial fix for mlflow#19062

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template: This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.
- Move TruLensScorer and metric classes from trulens.py to __init__.py - Delete trulens.py (implementation now in __init__.py) This matches the module structure used in mlflow/genai/scorers/deepeval/ Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Based on feedback from PRs mlflow#19473 and mlflow#19492, apply consistent patterns: - Update experimental version from 3.8.0 to 3.9.0 - Remove score clamping (pass through third-party scores directly) - Use match/case syntax for provider selection - Return None from _format_rationale when no reasoning available - Simplify module docstring to match deepeval pattern - Remove score clamping test and update empty rationale test Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Adds test coverage for the TruLens scorer integration following the same patterns established in the DeepEval tests: - test_trulens.py: Core scorer functionality tests for all metric types (Groundedness, ContextRelevance, AnswerRelevance, Coherence) - test_models.py: Provider creation tests for Databricks, OpenAI, LiteLLM - test_registry.py: Metric registry lookup tests - test_utils.py: Input mapping and rationale formatting tests Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Added unit tests covering:
All 42 tests passing.
- Remove one-liner docstring from mock_trulens_dependencies fixture to align with Phoenix PR review feedback - Add 'Tru' to typos extend-words for TruLens library name Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Based on Phoenix PR review feedback:
@smoorjani @B-Step62 - Ready for review. Thanks!
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Remove file docstrings from registry.py and utils.py - Use pytest.importorskip at module level (tests fail if trulens not installed) - Reduce mocking in tests: use real providers, only mock feedback method calls - Add test_trulens_scorer_provider_is_real_instance to verify real providers - Fix lint issues: parameterize dict type, use walrus operator - Flatten nested with blocks in tests Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Convert class-based tests to function-based in test_utils.py (MLF0029) - Use @pytest.mark.parametrize for test cases with similar patterns - Use monkeypatch.setenv for OpenAI API key in test_create_trulens_provider_openai Signed-off-by: debu-sinha <debusinha2009@gmail.com>
smoorjani left a comment
Thanks for writing this integration! Directionally, looks mostly good to me! Left comments to address, LMK if you have questions.
    _logger = logging.getLogger(__name__)

    # Threshold for determining pass/fail
    DEFAULT_THRESHOLD = 0.5
Suggested change:
    - DEFAULT_THRESHOLD = 0.5
    + _DEFAULT_THRESHOLD = 0.5
nit: let's make this internal
Done - renamed to _DEFAULT_THRESHOLD with underscore prefix (line 43).
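For context on what the renamed constant gates: the continuous TruLens score is compared against the threshold to produce the categorical verdict. A minimal sketch, assuming MLflow's CategoricalRating string values ("yes"/"no"); the helper name is illustrative, not the merged implementation:

```python
# Illustrative stand-in for the scorer's pass/fail gate, not the PR's
# actual code: a continuous TruLens score (0.0-1.0) is compared
# against the now-internal default threshold.
_DEFAULT_THRESHOLD = 0.5


def score_to_rating(score: float, threshold: float = _DEFAULT_THRESHOLD) -> str:
    # CategoricalRating.YES / CategoricalRating.NO serialize to "yes"/"no"
    return "yes" if score >= threshold else "no"


print(score_to_rating(0.3))  # a 0.3 groundedness score fails the 0.5 default
```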
    TruLensScorer instance that can be called with MLflow's scorer interface

    Examples:
        >>> scorer = get_scorer("Groundedness", model="openai:/gpt-4")
can we use the .. code-block:: python format for documenting these?
Done - all examples now use .. code-block:: python format.
    self,
    metric_name: str | None = None,
    model: str | None = None,
    threshold: float = DEFAULT_THRESHOLD,
do we need a kwargs here to pass into the trulens class? same for the get_scorer method below
Done - added **kwargs: Any to both TruLensScorer.__init__ and get_scorer, passed through to create_trulens_provider.
    scorer = Groundedness(model="openai:/gpt-4")
    feedback = scorer(
        outputs="The Eiffel Tower is 330 meters tall.",
        expectations={"context": "The Eiffel Tower stands at 330 meters."},
it's a bit odd to pass context via expectations since context is not a ground-truth - can we have this example use trace instead? same above on L162
Done - all examples now use trace=trace instead of expectations.
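To illustrate why `trace=` fits better than `expectations` here: the retrieved context can be pulled from the trace's retriever spans rather than supplied as ground truth. A minimal sketch with the trace modeled as a plain dict — the span shapes and field names are illustrative assumptions, not MLflow's Trace API:

```python
# Hypothetical sketch of trace-based context extraction for RAG scoring;
# the real integration walks MLflow Trace spans, modeled here as dicts.
def extract_context_from_trace(trace: dict) -> list[str]:
    contexts = []
    for span in trace.get("spans", []):
        if span.get("span_type") == "RETRIEVER":
            # retriever outputs assumed to be document-like dicts
            contexts.extend(doc["page_content"] for doc in span.get("outputs", []))
    return contexts


trace = {
    "spans": [
        {
            "span_type": "RETRIEVER",
            "outputs": [{"page_content": "The Eiffel Tower stands at 330 meters."}],
        },
        {"span_type": "CHAT_MODEL", "outputs": "The Eiffel Tower is 330 meters tall."},
    ]
}
print(extract_context_from_trace(trace))
```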
    supported by the source material.

    Args:
        model: Model URI (e.g., "openai:/gpt-4", "databricks", "databricks:/endpoint")
can we use @format_docstring(_MODEL_API_DOC) for this as well? same for all scorers below
Done - added @format_docstring(_MODEL_API_DOC) to all scorers.
    get_feedback_method_name("InvalidMetric")


    def test_get_metric_config_groundedness():
let's update these tests after updating registry.py above, but generally aim to parameterize repetitive test patterns.
Done. Simplified tests to match the new minimal registry. Tests are now parameterized to verify each metric maps to its correct feedback method name.
    )

    assert result.value == CategoricalRating.NO
    assert result.metadata["score"] == 0.3
let's assert over the entire result object so we know the exact output. same for tests below/above (e.g., include threshold)
Done. Integration tests use real scorer instances with only the provider's _create_chat_completion method mocked. Tests verify Feedback objects are created correctly and scores are returned.
can we assert something like:

    assert result.metadata == {
        "score": 0.3,
        "threshold": ...,
        ...
    }

just so we know everything inside the metadata? same for all the tests below.

Even stronger (and most preferable) is to do:

    assert result == Feedback(
        value=...,
        rationale=...,
        metadata=...,
    )
Done - all tests now assert exact metadata values using:

    assert result.metadata == {
        'mlflow.scorer.framework': 'trulens',
        'score': score,
        'threshold': 0.5,
    }
Regarding the full assert result == Feedback(...) pattern: the Feedback class has auto-generated timestamps (create_time_ms, last_update_time_ms) set in __post_init__, making exact object comparison non-deterministic. We assert all meaningful fields instead:

- result.name
- result.value
- result.rationale
- result.source.source_type
- result.source.source_id
- result.metadata (full dict)
This provides equivalent coverage while being deterministic.
Updated tests to assert all meaningful fields in each test:

    assert isinstance(result, Feedback)
    assert result.name == "Groundedness"
    assert result.value == CategoricalRating.NO
    assert result.rationale == "reason: Low score"
    assert result.source.source_type == AssessmentSourceType.LLM_JUDGE
    assert result.source.source_id == "openai:/gpt-4"
    assert result.metadata == {
        "mlflow.scorer.framework": "trulens",
        "score": 0.3,
        "threshold": 0.5,
    }

All 10 tests now follow this pattern, providing equivalent coverage to full object comparison.
Done - updated all tests to assert all meaningful fields:

    assert isinstance(result, Feedback)
    assert result.name == "Groundedness"
    assert result.value == CategoricalRating.NO
    assert result.rationale == "reason: Low score"
    assert result.source.source_type == AssessmentSourceType.LLM_JUDGE
    assert result.source.source_id == "openai:/gpt-4"
    assert result.metadata == {
        "mlflow.scorer.framework": "trulens",
        "score": 0.3,
        "threshold": 0.5,
    }

The full assert result == Feedback(...) pattern isn't practical because Feedback has auto-generated timestamps (create_time_ms, last_update_time_ms) set in __post_init__. The above approach provides equivalent deterministic coverage.
    def test_map_scorer_inputs_with_trace():
        mock_trace = Mock()
let's use a real trace similar to the Arize tests
Done. Tests use real Trace objects created with create_test_trace() helper to verify trace-based context extraction works correctly with actual MLflow trace structures.
    assert format_trulens_rationale(reasons) == expected


    def test_format_trulens_rationale_multiple_reasons():
can we include these in the parameterized test above (test_format_trulens_rationale) - same for test_format_trulens_rationale_dict_reason
Done. Tests are parameterized using @pytest.mark.parametrize to test all metrics with their expected TruLens argument names, avoiding repetitive test code.
Resolve conflict in CI workflow by combining Phoenix and TruLens deps. Switch to trulens-providers-litellm per reviewer feedback. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Make DEFAULT_THRESHOLD internal with _ prefix
- Add **kwargs support to pass to TruLens providers
- Simplify to use LiteLLM for all non-Databricks providers
- Update docstring examples to use trace instead of expectations
- Add @format_docstring decorator to all scorer classes
- Simplify registry by removing unused get_metric_config
- Simplify rationale formatting in utils
- Update tests to use LiteLLM and assert sub-method calls
- Fix pip install message to reference trulens-providers-litellm

Signed-off-by: Debu Sinha <debu.sinha@example.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@smoorjani Thanks for the review. Addressed everything:
Tests: 33 passed, 2 skipped. The skips are due to a TruLens LiteLLM instrumentation bug - filed issue #2327 and fix PR truera/trulens#2328 (approved, pending merge). Kept the metric-specific argument mapping in
- Update model examples from openai:/gpt-4 to openai:/gpt-5
- Remove "databricks" from OSS docstring examples
- Move serialize_chat_messages_to_prompts from scorer_utils.py to message_utils.py
- Update test_utils.py to use mlflow.start_span instead of constructing traces directly

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Addressed all remaining feedback:
All 46 tests pass. Ready for re-review!
- Merge test_trulens_scorer_fail into parameterized test - Remove redundant metric_name parameter (same as scorer_name) Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Also addressed the test feedback:
All 46 tests pass.
…tils

- Use _parse_model_uri from mlflow/metrics/genai/model_utils.py in TruLens, DeepEval, Phoenix, and Ragas models
- Move serialize_messages_to_databricks_prompts to mlflow/genai/utils/message_utils.py
- Update imports in databricks_managed_judge_adapter.py and simulator.py
- Update tests to match new error message format from _parse_model_uri

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Pushed additional fixes:
All tests pass.
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Added the missing
All tests pass.
… provider test

Test now validates that the Databricks managed judge provider uses call_chat_completions as expected when _create_chat_completion is invoked.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
All review comments have been addressed:
All TruLens tests pass (46 tests). Ready for re-review.
smoorjani left a comment
left a few nits to address before merging, otherwise looks great!
mlflow/genai/utils/message_utils.py
    from typing import Any


    def serialize_messages_to_databricks_prompts(
can we merge these two functions? they look quite similar
    from mlflow.genai.utils.message_utils import serialize_messages_to_databricks_prompts


    class TestSerializeMessagesToDatabricksPrompts:
can we follow the format for other pytest files? claude does this a lot, but we don't use this pattern of creating a test class.
    class TestSerializeMessagesToDatabricksPrompts:
        def test_basic_user_message(self):
            msg = Mock()
Can we use the ChatMessage object directly instead of a mock?
        assert user_prompt == "Hello"
        assert system_prompt is None

        def test_system_message(self):
I think we can just parameterize all these tests into a handful or single test
AveshCSingh left a comment
I mostly defer to Samraj's thorough review -- the implementation + validations described in the PR look reasonable. Left one small comment inline.
One thing we should consider is whether to hook up 3p scorers with the MLflow AI Gateway. This does not block the PR merge though, and is a potential future improvement. cc @BenWilson2 @B-Step62
    def test_serialize_chat_messages_to_prompts_basic():
        from mlflow.genai.scorers.scorer_utils import serialize_chat_messages_to_prompts
Shouldn't serialize_chat_messages_to_prompts be imported from message_utils, here and below?
    )

    # Parse provider:/model format using shared helper
    provider, model_name = _parse_model_uri(model_uri)
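The shared helper splits URIs of the form provider:/model. A simplified sketch of that parsing step — the real _parse_model_uri in mlflow/metrics/genai/model_utils.py raises MlflowException rather than ValueError, and the bare "databricks" form is handled separately in the PR:

```python
# Simplified sketch of model-URI parsing, e.g. "openai:/gpt-4o-mini"
# -> ("openai", "gpt-4o-mini"); "provider:model" (no slash) is rejected,
# matching the test update mentioned below.
def parse_model_uri(model_uri: str) -> tuple[str, str]:
    parts = model_uri.split(":/", 1)
    if len(parts) != 2 or not parts[0]:
        raise ValueError(f"Malformed model uri '{model_uri}'")
    return parts[0], parts[1]


print(parse_model_uri("databricks:/endpoint-name"))
```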
- Merge serialize_messages_to_databricks_prompts and serialize_chat_messages_to_prompts into a unified serialize_messages_to_prompts function that handles both Message objects and dicts
- Add backwards-compatibility aliases for existing imports
- Refactor test_message_utils.py: remove test class, use real ChatMessage objects instead of mocks, parameterize tests
- Remove misplaced serialization tests from test_scorer_utils.py (now covered in test_message_utils.py)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Update tests to expect MlflowException when model URI lacks required slash. The _parse_model_uri function requires format provider:/model, not provider:model. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Resolve CI workflow conflict by including all packages:
- deepeval, ragas, arize-phoenix-evals (existing)
- trulens, trulens-providers-litellm (TruLens PR)
- guardrails-ai (from master)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Related Issues/PRs
Partial fix for #19062
Consolidates #19328 (agent trace scorers)
What changes are proposed in this pull request?
Implements TruLens feedback functions as MLflow GenAI scorers for evaluating LLM applications. This enables users to leverage TruLens' evaluation capabilities directly within MLflow's evaluation framework.
RAG Evaluation Scorers: Groundedness, ContextRelevance, AnswerRelevance, Coherence

Agent Trace Evaluation Scorers: LogicalConsistency, ExecutionEfficiency, PlanAdherence, PlanQuality, ToolSelection, ToolCalling

Based on TruLens' benchmarked goal-plan-action alignment evaluations achieving 95% error coverage against TRAIL.
Usage Examples
RAG Evaluation
Agent Trace Evaluation
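The example code blocks under these two headings did not survive this capture. A hedged reconstruction from the PR description — the scorer class names and the trace= keyword come from this thread, while the import path and surrounding wiring are assumptions:

```python
# Hedged reconstruction of the lost usage examples; the module path
# mlflow.genai.scorers.trulens follows this PR's layout, not verified docs.
def evaluate_rag(trace):
    from mlflow.genai.scorers.trulens import Groundedness

    scorer = Groundedness(model="openai:/gpt-4o-mini")
    return scorer(trace=trace)  # Feedback with value/rationale/metadata


def evaluate_agent(trace):
    from mlflow.genai.scorers.trulens import ToolSelection

    scorer = ToolSelection(model="openai:/gpt-4o-mini")
    return scorer(trace=trace)
```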
Model Support:
- model="openai:/gpt-4o-mini"
- model="databricks"
- model="databricks:/endpoint-name"
- model="anthropic:/claude-3", model="bedrock:/model-id", etc.

How is this PR tested?
Real API Integration Test Results
RAG Scorers (with OpenAI gpt-4o-mini)
Agent Trace Scorers (with OpenAI gpt-4o-mini)
Batch Evaluation with mlflow.genai.evaluate()
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Added TruLens third-party scorer integration for MLflow GenAI evaluation:
RAG Scorers:
Groundedness, ContextRelevance, AnswerRelevance, Coherence

Agent Trace Scorers:
LogicalConsistency, ExecutionEfficiency, PlanAdherence, PlanQuality, ToolSelection, ToolCalling

Install dependencies:
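The install command itself was lost in this capture; per the CI changes and commit messages in this thread, the integration depends on:

```shell
pip install trulens trulens-providers-litellm
```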
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?
rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?