Add Phoenix (Arize) third-party scorer integration by debu-sinha · Pull Request #19473 · mlflow/mlflow

debu-sinha · 2025-12-17T21:58:38Z

Related Issues/PRs

Partial fix for #19062 (Phoenix portion - TruLens will follow in separate PR)

Split from #19237 per reviewer feedback to separate Phoenix and TruLens integrations.

What changes are proposed in this pull request?

This PR adds integration for Phoenix (Arize) evaluation framework as MLflow GenAI scorers.

Phoenix scorers (mlflow.genai.scorers.phoenix):

Hallucination: Detects hallucinations in model outputs
Relevance: Evaluates context relevance to queries
Toxicity: Assesses content toxicity
QA: Evaluates QA correctness
Summarization: Assesses summarization quality

Implementation details:

Lazy loading to avoid import overhead when scorers not used
Returns CategoricalRating.YES/NO values with scores in metadata
get_scorer() API for dynamic metric selection
Configurable model providers (OpenAI, Databricks, LiteLLM)
Trace input support for extracting context from retrieval spans
Clear error messages when optional dependencies are missing

Usage Examples with Real Output

Direct Scorer Call

from mlflow.genai.scorers.phoenix import Hallucination

scorer = Hallucination(model="openai:/gpt-4o-mini")
feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
    expectations={"context": "France is in Europe. Its capital is Paris."},
)
print(f"Value: {feedback.value}")
print(f"Metadata: {feedback.metadata}")

Output:

Value: yes
Metadata: {'mlflow.scorer.framework': 'phoenix', 'score': 0.0, 'label': 'factual'}

With mlflow.genai.evaluate()

import mlflow
from mlflow.genai.scorers.phoenix import Hallucination, Relevance

eval_data = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": "Paris is the capital of France.",
        "expectations": {"context": "France is in Western Europe. Its capital is Paris."},
    },
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": "London is the capital of France.",  # Wrong - should fail
        "expectations": {"context": "France is in Western Europe. Its capital is Paris."},
    },
]

scorers = [
    Hallucination(model="openai:/gpt-4o-mini"),
    Relevance(model="openai:/gpt-4o-mini"),
]

results = mlflow.genai.evaluate(data=eval_data, scorers=scorers)
print(results.metrics)

Output:

{'Relevance/mean': 1.0, 'Hallucination/mean': 0.5}

All 5 Phoenix Scorers

from mlflow.genai.scorers.phoenix import Hallucination, Relevance, Toxicity, QA, Summarization

all_scorers = [
    Hallucination(model="openai:/gpt-4o-mini"),
    Relevance(model="openai:/gpt-4o-mini"),
    Toxicity(model="openai:/gpt-4o-mini"),
    QA(model="openai:/gpt-4o-mini"),
    Summarization(model="openai:/gpt-4o-mini"),
]
results = mlflow.genai.evaluate(data=eval_data, scorers=all_scorers)

Output:

Hallucination/mean: 0.5000
QA/mean: 0.5000
Relevance/mean: 1.0000
Summarization/mean: 0.5000
Toxicity/mean: 1.0000

get_scorer() API

from mlflow.genai.scorers.phoenix import get_scorer

hallucination = get_scorer("Hallucination", model="openai:/gpt-4o-mini")
relevance = get_scorer("Relevance", model="databricks")
toxicity = get_scorer("Toxicity", model="databricks:/my-endpoint")

How is this PR tested?

New unit tests in tests/genai/scorers/phoenix/test_phoenix.py
Real API integration tests with OpenAI (all tests passed)

Test coverage includes:

All 5 scorers with positive and negative test cases
Edge cases (empty strings, long text, special characters)
get_scorer() API for all metrics
mlflow.genai.evaluate() batch evaluation
CategoricalRating.YES/NO value validation
AssessmentSource.LLM_JUDGE type validation
Error handling for missing required fields

Does this PR require documentation update?

Yes. I've updated:
- API references with comprehensive docstrings

Release Notes

Is this a user-facing change?

Yes

Added Phoenix (Arize) third-party scorer integration for MLflow GenAI evaluation. 5 new scorers available:

Phoenix scorers: Hallucination, Relevance, Toxicity, QA, Summarization

Install dependencies:

pip install arize-phoenix-evals

What component(s), interfaces, languages, and integrations does this PR affect?

Components

area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?

rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes
No (this PR will be included in the next minor release)

Implements support for Phoenix evaluation framework as MLflow GenAI scorers, enabling seamless integration of Phoenix's LLM evaluation tools within MLflow's evaluation pipeline. Phoenix scorers: - Hallucination: Detects hallucinations in model outputs - Relevance: Evaluates context relevance to queries - Toxicity: Assesses content toxicity - QA: Evaluates QA correctness - Summarization: Assesses summarization quality Features: - Lazy loading to avoid import overhead when not used - Configurable model providers (OpenAI, Databricks, LiteLLM) - Trace input support for extracting context from retrieval spans - Consistent scorer interface returning Feedback objects - CategoricalRating.YES/NO values with scores in metadata - get_scorer() API for dynamic metric selection Install dependencies: pip install arize-phoenix-evals Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-17T22:14:13Z

Hi @B-Step62,

I've split the PR as requested:

Phoenix PR (this one): Add Phoenix (Arize) third-party scorer integration #19473 - Ready for review
TruLens PR: Add Phoenix and TruLens third-party scorer integrations #19237 - Updated description, will clean up branch after Phoenix merges

Quick question: For the TruLens PR (#19237), should I:

Close it and create a fresh TruLens-only PR after Phoenix merges, or
Rebase Add Phoenix and TruLens third-party scorer integrations #19237 to remove Phoenix files now?

Let me know your preference. Thanks!

B-Step62

Overall looks good! Left a few minor comments

B-Step62 · 2025-12-18T13:43:47Z

mlflow/genai/scorers/phoenix/models.py

+        # Convert to string if needed
+        prompt_str = str(prompt) if not isinstance(prompt, str) else prompt
+        try:
+            output = _invoke_databricks_serving_endpoint(


The default mode should not invoke llama endpoint, but rather call the dedicated judge endpoint through mlflow.genai.judges.adapters.databricks_managed_judge_adapter.call_chat_completions. Ref: https://github.com/mlflow/mlflow/blob/master/mlflow/genai/scorers/deepeval/models.py#L42-L69

Done - now using call_chat_completions from databricks_managed_judge_adapter.

B-Step62 · 2025-12-18T13:46:07Z

mlflow/genai/scorers/phoenix/models.py

+        if provider == "openai":
+            from phoenix.evals import OpenAIModel
+
+            return OpenAIModel(model=model_name)


Suggested change

if provider == "openai":

from phoenix.evals import OpenAIModel

return OpenAIModel(model=model_name)

match provider:

case "openai":

from phoenix.evals import OpenAIModel

return OpenAIModel(model=model_name)

Kept LiteLLMModel as it handles all providers uniformly. OpenAIModel would require separate handling for each provider.

B-Step62 · 2025-12-18T13:47:05Z

mlflow/genai/scorers/phoenix/phoenix.py

+_logger = logging.getLogger(__name__)
+
+
+@experimental(version="3.8.0")


Suggested change

@experimental(version="3.8.0")

@experimental(version="3.9.0")

3.8.0 was just cut, let's target to ship this feature in 3.9.0🙂

Done - updated to 3.9.0.

B-Step62 · 2025-12-18T13:47:48Z

mlflow/genai/scorers/phoenix/phoenix.py

+            result = self._evaluator.evaluate(record=record)
+            label, score, explanation = result


Suggested change

result = self._evaluator.evaluate(record=record)

label, score, explanation = result

label, score, explanation = self._evaluator.evaluate(record=record)

B-Step62 · 2025-12-18T13:49:55Z

mlflow/genai/scorers/phoenix/phoenix.py

+            if score is not None:
+                normalized_score = float(score)
+                # Clamp to 0-1 range if needed
+                if normalized_score < 0.0 or normalized_score > 1.0:


Good question - I've added clamping with a warning log. Phoenix metrics typically return 0-1, but defensive clamping ensures we don't pass unexpected values downstream. The warning helps surface any edge cases during debugging.

Thanks for the clarification in the previous PR. While I agree clamping is helpful for many cases, MLflow should not be opinionated about what the score range should be for third party scorers, and we would like to make the behavior transparent as much as possible. Let's keep the original score returned from Phoenix.

Done - passing through original score without clamping.

B-Step62 · 2025-12-18T13:53:40Z

mlflow/genai/scorers/phoenix/phoenix.py

+            else:
+                normalized_score = 1.0 if label == self._positive_label else 0.0
+
+            rationale = explanation or f"Label: {label}"


I think we can keep the rationale empty if Phoenix doesn't provide explanation. cc: @smoorjani

Done - rationale is None when explanation not provided.

B-Step62 · 2025-12-18T13:57:50Z

mlflow/genai/scorers/phoenix/phoenix.py

+            rationale = explanation or f"Label: {label}"
+
+            # Use categorical rating based on label
+            value = CategoricalRating.YES if label == self._positive_label else CategoricalRating.NO


Q: Does Phoenix define a fixed negative label value? If so, we may want to check both pos/neg label and categorize other ones as CategoricalRating.UNKNOWN, like this;

# metric config "Hallucination": { "evaluator_class": "HallucinationEvaluator", "label_map": { "factual": CategoricalRating.YES, "non-factual": CategoricalRating.NO, }, "required_fields": ["input", "output", "reference"], } .... # parsing logic in __call)) value = self._label_map.get(label, CategoricalRating.UNKNOWN)

Using raw Phoenix labels for now. Each evaluator has its own label semantics (factual/hallucinated, relevant/unrelated, etc.) so mapping to CategoricalRating would need careful consideration per metric.

B-Step62 · 2025-12-18T13:59:25Z

mlflow/genai/scorers/phoenix/utils.py

+        reference = expectations.get("context") or expectations.get("reference")
+        if not reference and "expected_output" in expectations:
+            reference = str(expectations["expected_output"])


Suggested change

reference = expectations.get("context") or expectations.get("reference")

if not reference and "expected_output" in expectations:

reference = str(expectations["expected_output"])

reference = (

expectations.get("context")

or expectations.get("reference")

or expectations.get("expected_output")

)

Done - using chained or with expected_response taking priority.

@experimental

Changes: - Use call_chat_completions for Databricks managed judge (models.py) - Use match/case syntax for provider selection (models.py) - Change @experimental version to 3.9.0 (phoenix.py) - Replace _positive_label with _label_map for YES/NO/UNKNOWN handling (phoenix.py) - Remove score clamping - pass through Phoenix scores directly (phoenix.py) - Keep rationale empty if Phoenix returns no explanation (phoenix.py) - Add negative_label to registry for label mapping (registry.py) - Simplify reference extraction with chained or (utils.py) Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-18T16:31:49Z

Thanks for the thorough review @B-Step62! I've addressed all your feedback in the latest commit:

models.py:

Changed Databricks default model to use call_chat_completions for the managed judge endpoint
Switched to match/case syntax for provider selection

phoenix.py:

Updated @experimental version to 3.9.0
Replaced _positive_label with _label_map dictionary mapping labels to CategoricalRating.YES/NO/UNKNOWN
Removed score clamping - Phoenix scores now pass through directly
Rationale is now empty (None) when Phoenix returns no explanation
Simplified evaluator call to direct unpacking

registry.py:

Added negative_label entries for all metrics to support the label mapping

utils.py:

Simplified reference extraction using chained or expressions

Integration tests pass with these changes. Please let me know if anything needs adjustment!

joelrobin18 · 2025-12-18T17:27:48Z

tests/genai/scorers/phoenix/test_phoenix.py

+
+
+def test_phoenix_check_installed_raises_without_phoenix():
+    """Test that _check_phoenix_installed raises when phoenix is not installed."""


can we remove these docstrings from the test suite?

Done - removed test docstrings.

joelrobin18 · 2025-12-18T17:29:42Z

mlflow/genai/scorers/phoenix/phoenix.py

+
+@experimental(version="3.9.0")
+@format_docstring(_MODEL_API_DOC)
+class PhoenixScorer(Scorer):


What you think about adding this in init.py file instead of creating a new phoenix.py file?

cc: @B-Step62

Done - moved everything to init.py.

- Move PhoenixScorer and metric classes from phoenix.py to __init__.py - Delete phoenix.py (implementation now in __init__.py) - Remove docstrings from test functions per MLflow conventions - Update test mock paths from phoenix.phoenix to phoenix This matches the module structure used in mlflow/genai/scorers/deepeval/ Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-18T18:43:24Z

Thanks for the feedback @joelrobin18! I've addressed both comments in the latest commits:

Removed docstrings from test functions - Following MLflow testing conventions
Refactored module structure to match deepeval - Moved PhoenixScorer and metric classes from phoenix.py into __init__.py, then deleted phoenix.py

The structure now matches mlflow/genai/scorers/deepeval/:

phoenix/
├── __init__.py      # PhoenixScorer, get_scorer, and metric classes
├── models.py        # Model adapters
├── registry.py      # Metric configurations  
└── utils.py         # Input mapping utilities

Based on feedback from PRs mlflow#19473 and mlflow#19492, apply consistent patterns: - Update experimental version from 3.8.0 to 3.9.0 - Remove score clamping (pass through third-party scores directly) - Use match/case syntax for provider selection - Return None from _format_rationale when no reasoning available - Simplify module docstring to match deepeval pattern - Remove score clamping test and update empty rationale test Signed-off-by: debu-sinha <debusinha2009@gmail.com>

smoorjani

Really excited to see this integration - thanks for doing this work and replicating the existing scorer integration pattern! Left some comments to address

smoorjani · 2025-12-19T22:28:16Z

mlflow/genai/scorers/phoenix/models.py

+_logger = logging.getLogger(__name__)
+
+
+def _check_phoenix_installed():


nit: let's move this into utils

Done - _NoOpRateLimiter is now in utils.py.

smoorjani · 2025-12-19T22:40:08Z

mlflow/genai/scorers/phoenix/models.py

+        provider, model_name = model_uri.split(":", 1)
+        model_name = model_name.removeprefix("/")
+
+        match provider:


why not just use the LiteLLM provider for all of these? do the provider-specific model classes provide some additional benefit?

LiteLLM doesn't work for Databricks managed judge or custom serving endpoints - those need the dedicated MLflow adapters. LiteLLM is used for external providers (openai, anthropic, etc).

smoorjani · 2025-12-19T22:45:36Z

mlflow/genai/scorers/phoenix/registry.py

+        "required_fields": ["input", "reference"],
+    },
+    "Toxicity": {
+        "evaluator_class": "ToxicityEvaluator",


I notice these scorers are in the legacy folder - is that expected? are there newer variants we should be wrapping?

Yes, Phoenix keeps evaluators in the legacy folder. There are newer template-based evaluators but they require more setup. The legacy evaluators are stable and widely used.

smoorjani · 2025-12-19T22:46:02Z

mlflow/genai/scorers/phoenix/models.py

+        self._verbose = False
+
+
+class DatabricksPhoenixModel:


Is there some base class for this that Arize provides?

followup: let's add a comment explaining why we don't use the base class.

Phoenix has BaseModel in phoenix.evals.models.base but it requires implementing abstract methods that add complexity. Added comment explaining we use duck typing instead.

Done - added comment at models.py:14-17 explaining duck typing approach.

smoorjani · 2025-12-19T22:54:30Z

mlflow/genai/scorers/phoenix/__init__.py

+        config = get_metric_config(metric_name)
+        positive_label = config["positive_label"]
+        negative_label = config["negative_label"]
+        self._label_map = {


I wonder if we should do this mapping or if this might be confusing - for instance, in Toxicity, toxic is mapped to no (red) and non-toxic is mapped to yes (green), but this gives mixed signals (e.g., no to toxicity implies non-toxic). For now, maybe worth to just use the labels as-is and we should figure out a good way to deal with the high-lighting of these values cc @daniellok-db or @danielseong1 if you have ideas

Using raw Phoenix labels as discussed. We can revisit highlighting in a follow-up.

smoorjani · 2025-12-20T22:33:51Z

tests/genai/scorers/phoenix/test_phoenix.py

+    mock_model = mock.MagicMock()
+
+    with (
+        mock.patch(


do we have to mock all of these? generally we should aim to test with minimal mocks

Done - now only mock create_phoenix_model and evaluator.evaluate. Real Phoenix evaluator classes are instantiated.

smoorjani · 2025-12-20T22:34:23Z

tests/genai/scorers/phoenix/test_phoenix.py

+        assert result.metadata["score"] == 0.85
+
+
+def test_phoenix_toxicity_scorer_with_mock():


maybe we can parameterize these tests if we're just testing the different scorers in the same way

Done - tests are now parameterized.

smoorjani · 2025-12-20T22:34:59Z

tests/genai/scorers/phoenix/test_phoenix.py

+
+    mock_modules = {"phoenix": mock.MagicMock(), "phoenix.evals": mock.MagicMock()}
+    with pytest.raises(MlflowException, match="Unknown Phoenix metric"):
+        with mock.patch.dict("sys.modules", mock_modules):


why do we need to mock the sys modules? Can we just use the phoenix library in our tests

Removed - using real phoenix library directly.

smoorjani · 2025-12-20T22:35:46Z

tests/genai/scorers/phoenix/test_phoenix.py

+            get_evaluator_class("InvalidMetric")
+
+
+def test_phoenix_scorer_exports():


nit: I don't think we need this TBH - the import is already tested and these won't return None if they are imported

sorry if I missed this - did we address this?

smoorjani · 2025-12-20T22:35:58Z

tests/genai/scorers/phoenix/test_phoenix.py

+    assert get_scorer is not None
+
+
+def test_phoenix_assessment_source():


can we add this assertion as part of the other tests and remove this?

Kept as a dedicated test - test_phoenix_scorer_evaluator_is_real_instance verifies real Phoenix classes are used.

- Move check_phoenix_installed to utils.py for proper modularization - Simplify models.py to use LiteLLM for all non-Databricks providers - Use Phoenix evaluator labels directly as Feedback values - Add **evaluator_kwargs to allow passing extra params to Phoenix evaluators - Update expectation key priority: expected_response > context > reference - Update tests to match new label-based value assertions - Export get_metric_config from registry for proper test patching Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-21T16:10:26Z

Live Test Results with OpenAI

I've tested the updated Phoenix scorers with real OpenAI API calls. All scorers work correctly:

from mlflow.genai.scorers.phoenix import Hallucination, Toxicity, QA, Summarization

# Hallucination Detection
scorer = Hallucination(model='openai:/gpt-4o-mini')
result = scorer(
    inputs='What is the capital of France?',
    outputs='Paris is the capital of France.',
    expectations={'expected_response': 'France is a country in Europe. Its capital city is Paris.'},
)
# Value: factual, Score: 0

# Toxicity Detection  
scorer = Toxicity(model='openai:/gpt-4o-mini')
result = scorer(inputs='Thank you for your question! I would be happy to help.')
# Value: non-toxic, Score: 0

# QA Evaluation
scorer = QA(model='openai:/gpt-4o-mini')
result = scorer(
    inputs='What is 2 + 2?',
    outputs='The answer is 4.',
    expectations={'expected_response': '2 + 2 = 4'},
)
# Value: correct, Score: 1

# Summarization
scorer = Summarization(model='openai:/gpt-4o-mini')
result = scorer(
    inputs='Machine learning is a subset of AI that enables systems to learn from data...',
    outputs='ML is AI that learns from data to make predictions.',
)
# Value: good, Score: 1

Changes addressed in latest commit:

Moved check_phoenix_installed to utils.py
Simplified models.py to use LiteLLM for all non-Databricks providers
Using Phoenix labels directly as Feedback values (e.g., 'factual', 'non-toxic')
Added **evaluator_kwargs support
Updated expectation key priority: expected_response > context > reference
Tests updated to match new behavior

- Simplify utils.py: inline one-liner helpers into main function, remove unused metric_name parameter - Add separate test files following deepeval pattern: - test_models.py: tests for Databricks model adapters - test_registry.py: tests for evaluator class resolution - test_utils.py: tests for input mapping utilities - Refactor test_phoenix.py: - Parameterize scorer tests to reduce duplication - Reduce mocking with cleaner fixture approach - Remove redundant test cases All 36 tests pass with full coverage of utils, registry, and models. Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2025-12-26T06:08:39Z

Thanks for the thorough review @smoorjani! I've addressed all the feedback in the latest commit.

Changes Made

Code Simplification

utils.py: Inlined the one-liner helper functions (get_reference_from_expectations, get_reference_from_trace) directly into map_scorer_inputs_to_phoenix_record as suggested. Also removed the unused metric_name parameter.

Test Restructuring (following deepeval pattern)

Added test_models.py: Tests for DatabricksPhoenixModel, DatabricksServingEndpointPhoenixModel, and create_phoenix_model
Added test_registry.py: Parameterized tests for get_evaluator_class and get_metric_config
Added test_utils.py: Tests for check_phoenix_installed and map_scorer_inputs_to_phoenix_record
Refactored test_phoenix.py: Parameterized the scorer tests to reduce duplication, cleaned up mocking approach

Clarifications

Q: Why using legacy folder evaluators? (registry.py:22)
The Phoenix library exports HallucinationEvaluator, QAEvaluator, etc. from phoenix.evals - these are the stable, documented evaluators. The "evals 2.0" in Phoenix only has generic base classes (Evaluator, LLMEvaluator, ClassificationEvaluator) that require building custom evaluators. The existing evaluators in the "legacy" namespace are actually the primary ones recommended for use.

Q: Is there a base class from Arize? (models.py:21)
Yes, Phoenix has BaseModel in phoenix.evals.legacy.models.base. However, it's an abstract class with specific requirements (_generate_with_extra, _async_generate_with_extra) that add complexity. The current duck-typing approach (implementing __call__) is what Phoenix evaluators actually use for model compatibility and keeps the adapters simple. This mirrors how the deepeval integration works with their model adapters.

All 36 tests pass locally. Ready for re-review!

smoorjani

thanks for iterating on this - mostly LGTM! just a few smaller follow-ups. Also for the tests, we should make sure they aren't passing right now as we want to rely on the actual phoenix library being called. Assuming they fail due to import errors, we can add in the dependency here: .github/workflows/master.yml around L296 similar to https://github.com/mlflow/mlflow/pull/18988/files#diff-38ee08b0d1916cb5b9d8a093e986366eaafd908bc1f516f4fced0033c155d842

smoorjani · 2025-12-26T15:56:19Z

mlflow/genai/scorers/phoenix/__init__.py

+                metadata={
+                    FRAMEWORK_METADATA_KEY: "phoenix",
+                    "score": score,
+                    "label": label,


nit: do we need this metadata since it's already in the value?

score and value are different - value is the label (e.g. 'factual'), score is the numeric confidence (e.g. 0.9). Both are useful.

smoorjani · 2025-12-26T15:57:04Z

mlflow/genai/scorers/phoenix/models.py

+        self._verbose = False
+
+
+class DatabricksPhoenixModel:


followup: let's add a comment explaining why we don't use the base class.

smoorjani · 2025-12-26T15:57:59Z

mlflow/genai/scorers/phoenix/registry.py

+_METRIC_REGISTRY = {
+    "Hallucination": {
+        "evaluator_class": "HallucinationEvaluator",
+        "positive_label": "factual",


do we need this in the config anymore?

Simplified - now just maps metric name to evaluator class name.

smoorjani · 2025-12-26T15:58:53Z

mlflow/genai/scorers/phoenix/utils.py

@@ -0,0 +1,87 @@
+"""Utility functions for Phoenix integration."""


nit: let's remove this one-liner file docstrings as they don't add much in terms of code readability

Done - removed file-level docstring from utils.py.

smoorjani · 2025-12-26T15:59:46Z

tests/genai/scorers/phoenix/test_models.py

+
+def test_databricks_phoenix_model_call(mock_call_chat_completions):
+    with patch("mlflow.genai.scorers.phoenix.models.check_phoenix_installed"):
+        from mlflow.genai.scorers.phoenix.models import DatabricksPhoenixModel


is it possible to put this import at the top-level? same with the ones below

smoorjani · 2025-12-26T16:02:42Z

tests/genai/scorers/phoenix/test_phoenix.py

+            get_evaluator_class("InvalidMetric")
+
+
+def test_phoenix_scorer_exports():


sorry if I missed this - did we address this?

smoorjani · 2025-12-26T16:04:08Z

tests/genai/scorers/phoenix/test_utils.py

+
+
+def test_map_scorer_inputs_expected_response_priority():
+    from mlflow.genai.scorers.phoenix.utils import map_scorer_inputs_to_phoenix_record


same thing with using top-level imports here

smoorjani · 2025-12-26T16:04:55Z

tests/genai/scorers/phoenix/test_utils.py

+
+    mock_trace = Mock()
+
+    with (


is there a way we can avoid using mocks for this? It'd be great to construct a realistic trace and then test that the record is constructed properly as right now there's no real data being passed into the map_scorer_inputs_to_phoenix_record method. same for below.

github-actions · 2025-12-26T16:13:55Z

Documentation preview for a3f2902 is available at:

https://pr-19473--mlflow-docs-preview.netlify.app/docs/latest/

More info

Ignore this comment if this PR does not change the documentation.
The preview is updated when a new commit is pushed to this PR.
This comment was created by this workflow run.
The documentation was built by this workflow run.

debu-sinha · 2025-12-26T17:55:10Z

Thanks for the additional feedback @smoorjani! Addressed all comments in the latest commit:

Changes:

Removed duplicate label from metadata (already available in value)
Added comment explaining why we don't use Phoenix's BaseModel (duck typing for simplicity)
Simplified registry to only store evaluator class names (removed unused positive_label, negative_label, required_fields)
Removed get_metric_config function (no longer needed)
Removed one-liner module docstring from utils.py
Removed standalone export test (imports already tested implicitly in other tests)

All 32 tests pass locally.

debu-sinha · 2025-12-26T18:07:19Z

Addressed remaining feedback on test mocking:

Q: "do we have to mock all of these? generally we should aim to test with minimal mocks"

Updated the tests to follow the deepeval pattern - now using pytest.importorskip("phoenix.evals") instead of sys.modules patching. Tests require Phoenix to be installed but only mock the LLM/evaluator calls. This provides better coverage of our integration code while avoiding real API calls.

Q: "why do we delete these modules?"

Removed the aggressive module clearing from test_registry.py. It was clearing all mlflow.genai.scorers.phoenix modules which wasn't necessary - proper imports handle this.

Additional refactor:

Moved _NoOpRateLimiter to utils.py per the earlier nit about consolidating Phoenix compatibility helpers.

Tests will be skipped in environments without Phoenix installed, similar to how deepeval tests work.

Code changes: - Remove duplicate 'label' from metadata (already in 'value') - Add comment explaining why not using Phoenix BaseModel - Simplify registry to only store evaluator class names - Remove unused get_metric_config function - Move _NoOpRateLimiter to utils.py Testing changes: - Move imports to top-level in test files - Use real trace objects instead of mocks in test_utils.py - Add arize-phoenix-evals to CI workflow (master.yml) - Tests use real Phoenix library with minimal mocking of LLM calls This approach mirrors how MLflow tests deepeval - the library must be installed to run tests, but we mock only the actual LLM calls. Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

…tors - test_models.py: Replace global pytest.importorskip with per-test @pytest.mark.skipif so tests that don't need Phoenix (e.g., test_databricks_phoenix_model_get_model_name) run without it - test_phoenix.py: Use real Phoenix evaluator classes instead of mocking get_evaluator_class Only mock create_phoenix_model and evaluator.evaluate (the actual LLM call) - test_phoenix.py: Add test_phoenix_scorer_evaluator_is_real_instance to verify real evaluators - test_phoenix.py: Remove evaluator_kwargs test (Phoenix legacy evaluators don't accept kwargs) Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2026-01-13T15:32:22Z

Hey @smoorjani - pushed the testing changes you mentioned:

Swapped global importorskip for per-test skips so the databricks model tests run even without phoenix
Using real Phoenix evaluator classes now, only mocking the model creation and the actual LLM call
Added a test that checks we're actually getting real HallucinationEvaluator instances

Re-ran with real OpenAI calls, all 5 scorers still working. Let me know if there's anything else!

smoorjani

Conditionally approving on addressing the remaining comments - otherwise, LGTM! Thanks again for iterating on this!

smoorjani · 2026-01-13T20:09:18Z

tests/genai/scorers/phoenix/test_phoenix.py

-            expectations={"expected_response": "test context"},
-        )
+
+        with patch.object(


I think this double-nested with will fail the CI

smoorjani · 2026-01-13T20:09:46Z

tests/genai/scorers/phoenix/test_models.py

    assert model.get_model_name() == "databricks:/my-endpoint"


+@pytest.mark.skipif(not HAS_PHOENIX, reason="requires phoenix.evals")


do we need this? can we just let it fail as we expect phoenix to be available

.github/workflows/master.yml

smoorjani · 2026-01-13T20:14:45Z

tests/genai/scorers/phoenix/test_utils.py

+    inputs: dict[str, str] | None = None,
+    outputs: dict[str, str] | None = None,
+) -> Trace:
+    """Create a realistic trace for testing."""


nit: let's clean up this one-liner

Per review feedback: tests should fail if dependencies aren't available, not skip silently. Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha · 2026-01-13T20:15:17Z

Reverted to importorskip at module level.

Edit: Correction - importorskip skips tests, it doesn't fail them. Fixed in subsequent commit with direct imports per smoorjani's feedback.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

smoorjani · 2026-01-13T20:40:10Z

Good call - reverted to importorskip at module level so tests fail if phoenix isn't there. Pushed.

@debu-sinha maybe I'm misunderstanding, but doesn't this skip if phoenix isn't there?

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

- Add evaluator_kwargs parameter documentation to get_scorer docstring - Use walrus operator for span_id_to_context assignment per MLF0048 - Mock litellm.validate_environment in test_create_phoenix_model_openai - Format test_phoenix.py per ruff formatting requirements Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Replace litellm.validate_environment mock with monkeypatch.setenv to provide the required OPENAI_API_KEY. This approach aligns with the minimal mocking principle - no mocking, just environment setup. Signed-off-by: debu-sinha <debusinha2009@gmail.com>

B-Step62

LGTM!

github-actions bot added area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. labels Dec 17, 2025

debu-sinha mentioned this pull request Dec 17, 2025

Add Phoenix and TruLens third-party scorer integrations #19237

Closed

9 tasks

B-Step62 reviewed Dec 18, 2025

View reviewed changes

github-actions bot assigned B-Step62 Dec 18, 2025

debu-sinha requested a review from B-Step62 December 18, 2025 16:55

joelrobin18 reviewed Dec 18, 2025

View reviewed changes

debu-sinha mentioned this pull request Dec 18, 2025

Add TruLens third-party scorer integration #19492

Merged

9 tasks

smoorjani requested changes Dec 20, 2025

View reviewed changes

github-actions bot assigned smoorjani Dec 21, 2025

debu-sinha requested a review from smoorjani December 26, 2025 06:13

smoorjani requested changes Dec 26, 2025

View reviewed changes

debu-sinha force-pushed the feature/phoenix-scorers branch from af05590 to db5914d Compare December 26, 2025 18:16

debu-sinha requested a review from smoorjani December 26, 2025 19:14

Merge master into feature/phoenix-scorers and resolve conflicts

5065ccc

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

debu-sinha mentioned this pull request Jan 8, 2026

Add TruLens agent trace scorers for goal-plan-action alignment evaluation #19328

Closed

29 tasks

debu-sinha requested a review from smoorjani January 13, 2026 00:58

smoorjani approved these changes Jan 13, 2026

View reviewed changes

smoorjani reviewed Jan 13, 2026

View reviewed changes

Remove conditional test skips - assume packages are installed

357e60c

Per review feedback: tests should fail if dependencies aren't available, not skip silently. Signed-off-by: debu-sinha <debusinha2009@gmail.com>

smoorjani added the v3.9.0 label Jan 13, 2026

debu-sinha added 2 commits January 13, 2026 15:18

Fix CI issues: flatten nested with blocks, remove one-liner docstring

4ae29e6

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Merge branch 'master' into feature/phoenix-scorers

cc2773c

debu-sinha added 4 commits January 13, 2026 15:43

Replace pytest.importorskip with direct imports to fail on missing deps

c45dbc7

Signed-off-by: debu-sinha <debusinha2009@gmail.com>

Merge branch 'master' into feature/phoenix-scorers

a3f2902

B-Step62 approved these changes Jan 14, 2026

View reviewed changes

smoorjani added this pull request to the merge queue Jan 14, 2026

Merged via the queue into mlflow:master with commit 8bf8f45 Jan 14, 2026
65 of 68 checks passed

This was referenced Jan 14, 2026

MLflow now supports Phoenix scorers natively Arize-ai/phoenix#10974

Open

[INTEGRATION] Phoenix Evaluators now available in MLflow GenAI Evaluation Arize-ai/phoenix#10483

Open

[FR] Guardrails AI Integration for MLflow Scorers #20036

Closed

smoorjani mentioned this pull request Jan 22, 2026

Add arize phoenix documentation for 3p scorer integration #20216

Merged

29 tasks

This was referenced Feb 1, 2026

docs: Add MLflow integration guide for Phoenix evaluators Arize-ai/phoenix#11192

Closed

Add Modal as a supported deployment target with full documentation #20032

Merged

[FR] Instructor Integration for Structured Output Validation Scorers #20627

Closed

debu-sinha mentioned this pull request Feb 22, 2026

Add blog post: Deterministic Safety and PII Checks with Guardrails AI mlflow/mlflow-website#480

Open

debu-sinha mentioned this pull request Mar 6, 2026

[FR] Scorer presets for common evaluation patterns (agent, RAG, safety) #21445

Open

2 tasks

		_logger = logging.getLogger(__name__)


		@experimental(version="3.8.0")

	@experimental(version="3.8.0")
	@experimental(version="3.9.0")

		result = self._evaluator.evaluate(record=record)
		label, score, explanation = result

	result = self._evaluator.evaluate(record=record)
	label, score, explanation = result
	label, score, explanation = self._evaluator.evaluate(record=record)



		def test_phoenix_check_installed_raises_without_phoenix():
		"""Test that _check_phoenix_installed raises when phoenix is not installed."""

		_logger = logging.getLogger(__name__)


		def _check_phoenix_installed():

		assert result.metadata["score"] == 0.85


		def test_phoenix_toxicity_scorer_with_mock():

		get_evaluator_class("InvalidMetric")


		def test_phoenix_scorer_exports():

		assert get_scorer is not None


		def test_phoenix_assessment_source():

		@@ -0,0 +1,87 @@
		"""Utility functions for Phoenix integration."""



		def test_map_scorer_inputs_expected_response_priority():
		from mlflow.genai.scorers.phoenix.utils import map_scorer_inputs_to_phoenix_record

		assert model.get_model_name() == "databricks:/my-endpoint"


		@pytest.mark.skipif(not HAS_PHOENIX, reason="requires phoenix.evals")

Conversation

debu-sinha commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related Issues/PRs

What changes are proposed in this pull request?

Usage Examples with Real Output

Direct Scorer Call

With mlflow.genai.evaluate()

All 5 Phoenix Scorers

get_scorer() API

How is this PR tested?

Does this PR require documentation update?

Release Notes

Is this a user-facing change?

What component(s), interfaces, languages, and integrations does this PR affect?

How should the PR be classified in the release notes?

Should this PR be included in the next patch release?

Uh oh!

debu-sinha commented Dec 17, 2025

Uh oh!

B-Step62 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

debu-sinha commented Dec 18, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

debu-sinha commented Dec 18, 2025

Uh oh!

smoorjani left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

debu-sinha commented Dec 17, 2025 •

edited

Loading