Add Phoenix and TruLens third-party scorer integrations #19237
debu-sinha wants to merge 10 commits into mlflow:master
Conversation
@debu-sinha Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template: This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

@AveshCSingh Would you be able to take a look at this PR when you have a chance? It adds third-party scorer integrations for the Phoenix (Arize) and TruLens evaluation frameworks, enabling seamless use of established LLM evaluation tools within MLflow's GenAI evaluation pipeline. Happy to address any feedback.
Force-pushed from 09faf86 to 9a740c6.
mlflow/genai/scorers/__init__.py (outdated)

```python
# Third-party scorer integrations (Phoenix, TruLens)
_THIRDPARTY_IMPORTS = {
```
Instead of a generic third-party module, it would be better to split these into two separate folders: one for Phoenix and one for TruLens.
Thanks for the suggestion @joelrobin18! I've restructured the code as you recommended - Phoenix and TruLens now have their own separate folders:

- mlflow/genai/scorers/phoenix/
- mlflow/genai/scorers/trulens/

Each integration has its own __init__.py with proper exports. I've updated the lazy loading in the main scorers/__init__.py accordingly.

Let me know if there's anything else you'd like me to adjust!
Force-pushed from 9a740c6 to 2e0cb24.
Hi @joelrobin18, thanks for the feedback on the folder structure. I've restructured the code as suggested:

Would appreciate your re-review when you get a chance. Also cc @AveshCSingh for visibility.
This PR now focuses exclusively on TruLens agent trace scorers for goal-plan-action alignment evaluation. Basic TruLens scorers (Groundedness, ContextRelevance, AnswerRelevance, Coherence) are already provided in PR mlflow#19237 (Phoenix/TruLens third-party scorer integrations).

Changes:
- Remove mlflow/genai/scorers/trulens/basic.py (moved to mlflow#19237)
- Update trulens/__init__.py with comprehensive examples for all 6 agent trace scorers
- Update scorers/__init__.py to only export agent trace scorers
- Update tests to only test agent trace scorers (19 tests remain)

Agent trace scorers provided:
- TruLensLogicalConsistencyScorer
- TruLensExecutionEfficiencyScorer
- TruLensPlanAdherenceScorer
- TruLensPlanQualityScorer
- TruLensToolSelectionScorer
- TruLensToolCallingScorer

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Force-pushed from 255af44 to a0fff3b.
Note: I also contributed a fix to the TruLens project (truera/trulens#2308) to add

Update: Based on feedback from TruLens maintainers on the integration issue I opened (truera/trulens#2302), they requested adding their goal-plan-action alignment evaluators for agent trace analysis. I've created a follow-up PR for this: #19328

The TruLens team has been collaborative throughout - they merged my companion PR (truera/trulens#2308) to fix Databricks compatibility, and @sfc-gh-nvytla approved the integration on #19328.
```python
@@ -0,0 +1,51 @@
"""
```

The docstring in the module will be shown in the API docs, but not visible on the docs website. I think we can keep this simple like #19345 and file a follow-up PR to add a proper documentation page. @smoorjani is adding docs for the DeepEval scorers in #19409, so let's address this once that PR is merged.
My current docs PR is here: #19409. I'll be adding another one for RAGAS, but yes, we can do this in a follow-up.
```python
from phoenix.evals import OpenAIModel

return OpenAIModel(model=self.model_name or "gpt-4o-mini")
```

Phoenix supports several different models, not only OpenAI: https://arize-phoenix.readthedocs.io/en/arize-phoenix-v4.10.1/api/evals.models.html

Can we support them? A LiteLLM model could be the fallback for models that Phoenix does not support natively, e.g., Gemini.
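The native-plus-fallback routing the reviewer suggests can be sketched as a small dispatch on a provider prefix. This is illustrative only: the provider set, URI scheme, and fallback policy are assumptions, and Phoenix's actual model classes (`OpenAIModel`, `LiteLLMModel`, etc.) live in `phoenix.evals`.

```python
# Illustrative dispatch for picking a Phoenix model wrapper by provider prefix.
# Providers not wrapped natively (e.g. Gemini) fall back to LiteLLM.
_NATIVE_PROVIDERS = {"openai", "anthropic", "bedrock", "vertex"}


def resolve_model_backend(model_uri: str) -> tuple[str, str]:
    """Split 'provider:/model-name' and choose a backend, defaulting to LiteLLM."""
    provider, _, model_name = model_uri.partition(":/")
    if provider in _NATIVE_PROVIDERS:
        return provider, model_name
    # LiteLLM accepts the full URI-style model string for unsupported providers.
    return "litellm", model_uri


print(resolve_model_backend("openai:/gpt-4o-mini"))  # ('openai', 'gpt-4o-mini')
print(resolve_model_backend("gemini:/gemini-1.5-pro"))
```

The caller would then map the returned backend name to the corresponding `phoenix.evals` model class.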
```python
"""Get the Phoenix OpenAI model instance."""
_check_phoenix_installed()

from phoenix.evals import OpenAIModel
```

We need to support the Databricks judge and model serving endpoints. Could you read #19345 and apply the same approach?
```python
)
normalized_score = min(1.0, max(0.0, normalized_score))
else:
    normalized_score = 1.0 if label == positive_label else 0.0
```

Can we use CategoricalRating.YES and CategoricalRating.NO?
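The suggested change maps the evaluator's binary label to a categorical rating instead of a float. A minimal sketch follows; `CategoricalRating` here is a stand-in enum so the example is self-contained - MLflow exposes a similar enum with `YES`/`NO` members for its judges, but the exact import path varies by version.

```python
# Sketch: map a binary evaluator label to a categorical rating rather than 0.0/1.0.
# CategoricalRating is a stand-in for MLflow's enum of the same name.
from enum import Enum


class CategoricalRating(str, Enum):
    YES = "yes"
    NO = "no"


def label_to_rating(label: str, positive_label: str) -> CategoricalRating:
    # The raw numeric score can still travel in the feedback metadata.
    return CategoricalRating.YES if label == positive_label else CategoricalRating.NO


print(label_to_rating("factual", positive_label="factual"))  # CategoricalRating.YES
```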
```python
if normalized_score < 0.0 or normalized_score > 1.0:
    import logging

    logging.getLogger(__name__).warning(
```

Can we define the logger at the top level so we can reuse it?

```python
import logging

logger = logging.getLogger(__name__)


class _PhoenixScorerBase(Scorer):
    ...
```
```python
logging.getLogger(__name__).warning(
    f"Phoenix returned score {normalized_score} outside expected 0-1 range. "
    "This may indicate a version incompatibility. Clamping to valid range."
)
```

Do we need to crop the score to 0-1?

Good question - I've added clamping with a warning log. Phoenix metrics typically return 0-1, but defensive clamping ensures we don't pass unexpected values downstream. The warning helps surface any edge cases during debugging.

Do we have any example of when the score wasn't between 0-1?
```python
evaluator = HallucinationEvaluator(model=model)

# Build record dict as expected by Phoenix
query = inputs.get("query", str(inputs)) if inputs else ""
```

We have to handle more input patterns (e.g. trace, messages, etc.). Can you update the logic similarly to map_scorer_inputs_to_ragas_sample in #18988?
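Handling more input patterns might look like the following normalization helper. It is a sketch modeled on the `map_scorer_inputs_to_ragas_sample` idea, not MLflow's actual helper: the function name and the set of shapes handled (plain string, dict with `query`, chat-style `messages`) are assumptions.

```python
# Illustrative input normalization for a scorer: extract a query string from
# several common input shapes instead of only checking inputs["query"].
from typing import Any


def extract_query(inputs: Any) -> str:
    if inputs is None:
        return ""
    if isinstance(inputs, str):
        return inputs
    if isinstance(inputs, dict):
        if "query" in inputs:
            return str(inputs["query"])
        # Chat-style payload: use the most recent user message.
        messages = inputs.get("messages")
        if isinstance(messages, list):
            for message in reversed(messages):
                if message.get("role") == "user":
                    return str(message.get("content", ""))
        return str(inputs)
    return str(inputs)


print(extract_query({"messages": [{"role": "user", "content": "What is MLflow?"}]}))
```

A real implementation would also accept a `trace` object and pull the root span's request, per the reviewer's comment.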
```python
context: str | None = None,
**kwargs,
```

A scorer should only have inputs, outputs, expectations, and trace as arguments. Other things like context should be derived from one of these.
```python
# Already aligned with MLflow convention (higher = better)
score, rationale = self._parse_result(result, positive_label="factual")

return Feedback(name=self.name, value=score, rationale=rationale)
```

Let's set the assessment source as well.

Can we also support the get_judge() API?

Small update - we call it get_scorer() now.
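Setting the assessment source means the returned feedback records which judge produced it. The dataclasses below are stand-ins for MLflow's `Feedback`/`AssessmentSource` entities (in `mlflow.entities`) so the example is self-contained; the field names mirror that API, but the exact signatures are not verified here.

```python
# Shape of a feedback result carrying an assessment source (stand-in dataclasses).
from dataclasses import dataclass


@dataclass
class AssessmentSource:
    source_type: str  # e.g. "LLM_JUDGE"
    source_id: str    # e.g. the judge model URI


@dataclass
class Feedback:
    name: str
    value: str
    rationale: str
    source: AssessmentSource


feedback = Feedback(
    name="hallucination",
    value="yes",
    rationale="Answer is supported by the retrieved context.",
    source=AssessmentSource(source_type="LLM_JUDGE", source_id="openai:/gpt-4o-mini"),
)
print(feedback.source.source_type)  # LLM_JUDGE
```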
smoorjani left a comment:

Left some nits in addition to Yuki's comments - I think the general ask here is to follow the existing pattern from DeepEval/RAGAS, where we have a generalized implementation for any scorer (get_scorer) and the classes themselves (e.g., FaithfulnessScorer) are super lightweight wrappers.
mlflow/genai/scorers/__init__.py (outdated)

```python
    get_all_scorers,
)
from mlflow.genai.scorers.phoenix import (
    PhoenixHallucinationScorer,
```

Nit: can we follow the existing pattern of from mlflow.genai.scorers.phoenix import HallucinationScorer, so as not to repeat "Phoenix"? We can follow up later if this is verbose and re-namespace as from mlflow.genai.scorers import PhoenixHallucinationScorer.
```python
try:
    import phoenix.evals  # noqa: F401

    return True
```

Nit: does this need to return anything?

This was for _check_phoenix_installed() - it either raises an exception or returns implicitly. In the refactored code, the check happens in models.py during model creation, so it fails fast with a clear error message before any evaluation runs.
```python
def _check_phoenix_installed():
    """Check if phoenix.evals is installed and raise a helpful error if not."""
```

Nit: let's avoid a one-line docstring; I think the linter will complain about this anyway.
```python
return OpenAIModel(model=self.model_name or "gpt-4o-mini")


def _parse_result(
```

Maybe this should be _construct_mlflow_assessment, and it should return an assessment?

Agreed - I've refactored this entirely. The method now directly constructs and returns a Feedback object with a proper AssessmentSource. No separate helper method is needed since each scorer's evaluation logic is self-contained.
Thanks for the detailed feedback @B-Step62 @smoorjani! I've addressed all the comments in the latest push:

Structural changes:

Model support:

Interface standardization:

Ready for another look!
Force-pushed from a33545e to 4938e31.
Implements support for Phoenix (Arize) and TruLens evaluation frameworks as MLflow GenAI scorers, enabling seamless integration of established LLM evaluation tools within MLflow's evaluation pipeline.

Phoenix scorers (mlflow/genai/scorers/phoenix/):
- PhoenixHallucinationScorer: Detects hallucinations in model outputs
- PhoenixRelevanceScorer: Evaluates response relevance to queries
- PhoenixToxicityScorer: Assesses content toxicity
- PhoenixQAScorer: Evaluates QA correctness
- PhoenixSummarizationScorer: Assesses summarization quality

TruLens scorers (mlflow/genai/scorers/trulens/):
- TruLensGroundednessScorer: Evaluates groundedness in context
- TruLensContextRelevanceScorer: Assesses context relevance
- TruLensAnswerRelevanceScorer: Evaluates answer relevance
- TruLensCoherenceScorer: Evaluates logical flow of outputs

Features:
- Lazy loading to avoid import overhead when not used
- Configurable model providers (OpenAI, LiteLLM for TruLens)
- Consistent scorer interface returning Feedback objects
- Helpful error messages when optional dependencies are missing

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Key changes:

1. Phoenix scorers: Remove incorrect score inversion for the Hallucination and Toxicity evaluators. Phoenix natively returns scores aligned with the MLflow convention (1.0 = good, 0.0 = bad).
2. Both Phoenix and TruLens scorers: Replace silent score clamping with validation that logs warnings when scores are outside the expected 0-1 range. This helps detect potential version incompatibilities.
3. Update tests to reflect correct Phoenix score semantics:
   - factual = 1.0 (not inverted from 0)
   - non-toxic = 1.0 (not inverted from 0)
4. Update docstrings and comments to accurately describe score semantics.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Align with existing DeepEval/RAGAS patterns for consistency
- Add get_scorer() API for both Phoenix and TruLens integrations
- Create lightweight wrapper classes (Hallucination, Relevance, etc.)
- Add multi-model support: Databricks managed judge, serving endpoints, LiteLLM
- Use standard scorer arguments: inputs, outputs, expectations, trace
- Add AssessmentSource for proper attribution
- Use CategoricalRating.YES/NO for binary feedback values
- Add models.py, registry.py, utils.py for each integration
- Update tests to match new API patterns

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Use function-level patching instead of module-level mocking
- Import classes before applying mocks to ensure the module is loaded
- Fix line-length issues in test files

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix ToxicityEvaluator expects text in the 'input' field, but MLflow scorers typically pass text via the 'outputs' parameter. When Toxicity is called with only outputs (no inputs), the text should be mapped to Phoenix's 'input' field.

This fix ensures Toxicity()(outputs='text') works correctly by mapping the output to Phoenix's expected 'input' field when no input is provided.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
The call_chat_completions function from databricks.rag_eval requires the @context.eval_context decorator, which sets up internal state. When called directly from the Phoenix/TruLens model adapters without this context, it fails with 'cannot access local variable' errors.

Switch to using _invoke_databricks_serving_endpoint with the databricks-meta-llama-3-3-70b-instruct foundation model endpoint, which works in both notebook and external environments.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix evaluators use a set_verbosity context manager that expects model adapters to have both _verbose and _rate_limiter._verbose attributes. Without these, the context manager fails with UnboundLocalError.

Added:
- _NoOpRateLimiter stub class with a _verbose attribute
- _verbose = False on both Databricks model adapter classes
- _rate_limiter = _NoOpRateLimiter() on both classes

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix:
- Add _verbose attribute required by the set_verbosity context manager
- Add _rate_limiter with a _NoOpRateLimiter stub for set_verbosity

TruLens:
- Create a dynamic provider class that inherits from LLMProvider
- Implement the _create_chat_completion method required by TruLens
- TruLens feedback methods (groundedness, relevance, etc.) are inherited from the LLMProvider base class

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix fix:
- Handle MultimodalPrompt objects by converting them to strings before passing to the Databricks endpoint (Phoenix evaluators pass MultimodalPrompt, not str)

TruLens fix:
- Properly initialize the Pydantic base class with super().__init__(model_engine=...)
- Use class-level attributes for endpoint config to avoid Pydantic field issues

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Add an Endpoint object to the TruLens Databricks provider initialization. TruLens LLMProvider requires an endpoint to be set for feedback methods like groundedness_measure_with_cot_reasons to work correctly.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Force-pushed from 4b1eb0a to a1058df.
Hi @B-Step62, thanks for the feedback! I've split this PR as requested:

The Phoenix PR includes:

Please review #19473 when you have a chance. I'll close this PR and create a separate TruLens PR once Phoenix is merged.
Related Issues/PRs
Partial fix for #19062 (TruLens portion)
Note: This PR has been split per reviewer feedback. Phoenix scorers are now in #19473.
What changes are proposed in this pull request?
This PR adds integration for TruLens evaluation framework as MLflow GenAI scorers.
TruLens scorers (mlflow.genai.scorers.trulens):

- Groundedness: Evaluates groundedness in context
- ContextRelevance: Assesses context relevance to the query
- AnswerRelevance: Evaluates answer relevance to the query
- Coherence: Evaluates coherence and logical flow

Implementation details:

- Scorers return CategoricalRating.YES/NO values with numeric scores in metadata
- get_scorer() API for dynamic metric selection

Usage Examples with Real Output
Direct Scorer Call
Output:
With mlflow.genai.evaluate()
Output:
All 4 TruLens Scorers
Output:
get_scorer() API
How is this PR tested?

tests/genai/scorers/trulens/test_trulens.py

Test coverage includes:

- get_scorer() API for all metrics
- mlflow.genai.evaluate() batch evaluation
- CategoricalRating.YES/NO value validation
- AssessmentSource.LLM_JUDGE type validation

Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Added TruLens third-party scorer integration for MLflow GenAI evaluation. 4 new scorers available:
TruLens scorers: Groundedness, ContextRelevance, AnswerRelevance, Coherence
Install dependencies:
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?

- rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?