Add TruLens third-party scorer integration by debu-sinha · Pull Request #19492 · mlflow/mlflow

debu-sinha · 2025-12-18T16:43:07Z

Related Issues/PRs

Partial fix for #19062
Consolidates #19328 (agent trace scorers)

What changes are proposed in this pull request?

Implements TruLens feedback functions as MLflow GenAI scorers for evaluating LLM applications. This enables users to leverage TruLens' evaluation capabilities directly within MLflow's evaluation framework.

RAG Evaluation Scorers:

Scorer	Description
`Groundedness`	Evaluates if outputs are grounded in context
`ContextRelevance`	Evaluates context relevance to query
`AnswerRelevance`	Evaluates answer relevance to query
`Coherence`	Evaluates logical flow of outputs

Agent Trace Evaluation Scorers:

Scorer	Description
`LogicalConsistency`	Evaluates reasoning quality of agent traces
`ExecutionEfficiency`	Evaluates execution efficiency
`PlanAdherence`	Evaluates plan adherence
`PlanQuality`	Evaluates plan quality
`ToolSelection`	Evaluates tool selection quality
`ToolCalling`	Evaluates tool calling quality

Based on TruLens' benchmarked goal-plan-action alignment evaluations achieving 95% error coverage against TRAIL.

Usage Examples

RAG Evaluation

from mlflow.genai.scorers.trulens import Groundedness, ContextRelevance

# Direct scorer call
scorer = Groundedness(model="openai:/gpt-4o-mini")
feedback = scorer(trace=trace)

# With mlflow.genai.evaluate()
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[
        Groundedness(model="openai:/gpt-4o-mini"),
        ContextRelevance(model="openai:/gpt-4o-mini"),
    ],
)

Agent Trace Evaluation

from mlflow.genai.scorers.trulens import LogicalConsistency, PlanAdherence

# Evaluate agent traces
traces = mlflow.search_traces(experiment_ids=["1"])
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        LogicalConsistency(model="openai:/gpt-4o-mini"),
        PlanAdherence(model="openai:/gpt-4o-mini"),
    ],
)

Model Support:

OpenAI: model="openai:/gpt-4o-mini"
Databricks managed judge: model="databricks"
Databricks serving endpoint: model="databricks:/endpoint-name"
Any LiteLLM provider: model="anthropic:/claude-3", model="bedrock:/model-id", etc.

How is this PR tested?

46 unit tests (all passing)
Manual integration tests with real OpenAI API calls

Real API Integration Test Results

RAG Scorers (with OpenAI gpt-4o-mini)

1. Groundedness
   Input: "The Eiffel Tower is 330 meters tall and located in Paris."
   Context: "The Eiffel Tower stands at 330 meters in Paris, France."
   Value: yes
   Score: 1.0
   Threshold: 0.5

2. ContextRelevance
   Query: "What is the height of the Eiffel Tower?"
   Context: "The Eiffel Tower is 330 meters tall. It was built in 1889."
   Value: yes
   Score: 1.0

3. AnswerRelevance
   Query: "What is the capital of France?"
   Response: "Paris is the capital of France, known for the Eiffel Tower."
   Value: yes
   Score: 1.0

4. Coherence
   Text: "MLflow is an open-source platform. It manages the ML lifecycle..."
   Value: yes
   Score: 0.67

Agent Trace Scorers (with OpenAI gpt-4o-mini)

Agent Trace: research_agent with planning → doc_search → summarize spans

5. LogicalConsistency
   Value: 1.0

6. ExecutionEfficiency
   Value: 1.0

7. PlanAdherence
   Value: 1.0

8. PlanQuality
   Value: 0.67

9. ToolSelection
   Value: 1.0

10. ToolCalling
    Value: 1.0

Batch Evaluation with mlflow.genai.evaluate()

# RAG scorers batch evaluation
results = mlflow.genai.evaluate(data=eval_data, scorers=[...])

Metrics:
   AnswerRelevance/mean: 1.00
   Coherence/mean: 1.00
   ContextRelevance/mean: 1.00
   Groundedness/mean: 1.00

# Agent trace scorers batch evaluation
results = mlflow.genai.evaluate(data=[trace], scorers=[...])

Metrics:
   execution_efficiency/mean: 1.00
   logical_consistency/mean: 1.00
   plan_adherence/mean: 1.00

Does this PR require documentation update?

Yes. I've updated:
- API docstrings with examples

Release Notes

Is this a user-facing change?

Yes

Added TruLens third-party scorer integration for MLflow GenAI evaluation:

RAG Scorers: Groundedness, ContextRelevance, AnswerRelevance, Coherence

Agent Trace Scorers: LogicalConsistency, ExecutionEfficiency, PlanAdherence, PlanQuality, ToolSelection, ToolCalling

Install dependencies:

pip install trulens trulens-providers-litellm

What component(s), interfaces, languages, and integrations does this PR affect?

Components

area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?

rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes
No (this PR will be included in the next minor release)

Implements TruLens feedback functions as MLflow GenAI scorers for evaluating LLM applications within MLflow's evaluation framework. TruLens scorers: - Groundedness: Evaluates if outputs are grounded in context - ContextRelevance: Evaluates context relevance to query - AnswerRelevance: Evaluates answer relevance to query - Coherence: Evaluates logical flow of outputs Features: - Multiple model providers: OpenAI, LiteLLM, Bedrock, Cortex - Databricks managed judge support via call_chat_completions - Databricks serving endpoint support - Trace input extraction for RAG evaluation - Configurable pass/fail threshold - Consistent scorer interface returning Feedback objects Partial fix for mlflow#19062 Signed-off-by: debu-sinha <debusinha2009@gmail.com>

github-actions · 2025-12-18T16:43:23Z

@debu-sinha Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template

This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

- Move TruLensScorer and metric classes from trulens.py to __init__.py - Delete trulens.py (implementation now in __init__.py) This matches the module structure used in mlflow/genai/scorers/deepeval/ Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-18T18:59:10Z

@B-Step62 This TruLens PR is ready for review. It follows the same patterns as the Phoenix PR (#19473) based on your feedback - using call_chat_completions for Databricks managed judge, match/case syntax, version 3.9.0, and the deepeval module structure.

Based on feedback from PRs mlflow#19473 and mlflow#19492, apply consistent patterns: - Update experimental version from 3.8.0 to 3.9.0 - Remove score clamping (pass through third-party scores directly) - Use match/case syntax for provider selection - Return None from _format_rationale when no reasoning available - Simplify module docstring to match deepeval pattern - Remove score clamping test and update empty rationale test Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Adds test coverage for the TruLens scorer integration following the same patterns established in the DeepEval tests: - test_trulens.py: Core scorer functionality tests for all metric types (Groundedness, ContextRelevance, AnswerRelevance, Coherence) - test_models.py: Provider creation tests for Databricks, OpenAI, LiteLLM - test_registry.py: Metric registry lookup tests - test_utils.py: Input mapping and rationale formatting tests Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-26T06:33:22Z

Added unit tests covering:

Core scorer functionality (Groundedness, ContextRelevance, AnswerRelevance, Coherence)
Provider creation (Databricks managed judge, serving endpoints, OpenAI, LiteLLM)
Metric registry lookup
Input mapping and rationale formatting

All 42 tests passing.

- Remove one-liner docstring from mock_trulens_dependencies fixture to align with Phoenix PR review feedback - Add 'Tru' to typos extend-words for TruLens library name Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2026-01-08T23:24:46Z

Based on Phoenix PR review feedback:

Removed one-liner docstring from mock_trulens_dependencies fixture
Added Tru to typos allowlist for TruLens library name

@smoorjani @B-Step62 - Ready for review. Thanks!

…rers

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

- Remove file docstrings from registry.py and utils.py - Use pytest.importorskip at module level (tests fail if trulens not installed) - Reduce mocking in tests: use real providers, only mock feedback method calls - Add test_trulens_scorer_provider_is_real_instance to verify real providers - Fix lint issues: parameterize dict type, use walrus operator - Flatten nested with blocks in tests Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

github-actions · 2026-01-13T23:39:20Z

Documentation preview for 56f703c is available at:

https://pr-19492--mlflow-docs-preview.netlify.app/docs/latest/

More info

Ignore this comment if this PR does not change the documentation.
The preview is updated when a new commit is pushed to this PR.
This comment was created by this workflow run.
The documentation was built by this workflow run.

- Convert class-based tests to function-based in test_utils.py (MLF0029) - Use @pytest.mark.parametrize for test cases with similar patterns - Use monkeypatch.setenv for OpenAI API key in test_create_trulens_provider_openai Signed-off-by: debu-sinha <debusinha2009@gmail.com>

smoorjani

Thanks for writing this integration! Directionally, looks mostly good to me! Left comments to address, LMK if you have questions.

smoorjani · 2026-01-14T02:33:18Z