
Add Phoenix (Arize) third-party scorer integration#19473

Merged
smoorjani merged 16 commits into mlflow:master from debu-sinha:feature/phoenix-scorers
Jan 14, 2026

Conversation

@debu-sinha
Contributor

@debu-sinha commented Dec 17, 2025

Related Issues/PRs

Partial fix for #19062 (Phoenix portion - TruLens will follow in separate PR)

Split from #19237 per reviewer feedback to separate Phoenix and TruLens integrations.

What changes are proposed in this pull request?

This PR adds integration for Phoenix (Arize) evaluation framework as MLflow GenAI scorers.

Phoenix scorers (mlflow.genai.scorers.phoenix):

  • Hallucination: Detects hallucinations in model outputs
  • Relevance: Evaluates context relevance to queries
  • Toxicity: Assesses content toxicity
  • QA: Evaluates QA correctness
  • Summarization: Assesses summarization quality

Implementation details:

  • Lazy loading to avoid import overhead when scorers are not used
  • Returns CategoricalRating.YES/NO values with scores in metadata
  • get_scorer() API for dynamic metric selection
  • Configurable model providers (OpenAI, Databricks, LiteLLM)
  • Trace input support for extracting context from retrieval spans
  • Clear error messages when optional dependencies are missing
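The lazy-loading and clear-error-message points can be sketched together; `check_installed` and the scorer shell below are illustrative stand-ins, not the merged MLflow code:

```python
from importlib import util


def check_installed(module_name: str, pip_package: str) -> None:
    # Raise a clear, actionable error only when the optional dependency
    # is actually needed, so importing mlflow itself stays cheap.
    if util.find_spec(module_name) is None:
        raise ImportError(
            f"This scorer requires `{pip_package}`. "
            f"Install it with: pip install {pip_package}"
        )


class Hallucination:
    # Illustrative shell: nothing Phoenix-related is imported at module
    # import time; the check (and the heavy import) happen on first call.
    def __init__(self, model: str):
        self.model = model

    def __call__(self, inputs, outputs=None, expectations=None):
        check_installed("phoenix", "arize-phoenix-evals")
        # ...the real implementation would import phoenix.evals here
        # and run the configured evaluator.
```

The point of the pattern is that a user who never touches Phoenix scorers pays no import cost and a user missing the dependency gets an actionable message rather than a bare ModuleNotFoundError.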

Usage Examples with Real Output

Direct Scorer Call

from mlflow.genai.scorers.phoenix import Hallucination

scorer = Hallucination(model="openai:/gpt-4o-mini")
feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
    expectations={"context": "France is in Europe. Its capital is Paris."},
)
print(f"Value: {feedback.value}")
print(f"Metadata: {feedback.metadata}")

Output:

Value: yes
Metadata: {'mlflow.scorer.framework': 'phoenix', 'score': 0.0, 'label': 'factual'}

With mlflow.genai.evaluate()

import mlflow
from mlflow.genai.scorers.phoenix import Hallucination, Relevance

eval_data = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": "Paris is the capital of France.",
        "expectations": {"context": "France is in Western Europe. Its capital is Paris."},
    },
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": "London is the capital of France.",  # Wrong - should fail
        "expectations": {"context": "France is in Western Europe. Its capital is Paris."},
    },
]

scorers = [
    Hallucination(model="openai:/gpt-4o-mini"),
    Relevance(model="openai:/gpt-4o-mini"),
]

results = mlflow.genai.evaluate(data=eval_data, scorers=scorers)
print(results.metrics)

Output:

{'Relevance/mean': 1.0, 'Hallucination/mean': 0.5}
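The aggregate metrics are per-scorer means over the row-level scores. For the two rows above, assuming (for illustration) that the correct answer passes both metrics while the wrong answer fails Hallucination but its query is still relevant:

```python
# Hypothetical row-level scores for the two eval_data rows above.
rows = [
    {"Hallucination": 1.0, "Relevance": 1.0},  # correct answer
    {"Hallucination": 0.0, "Relevance": 1.0},  # wrong answer
]

# Each "<Scorer>/mean" metric is the mean of that scorer's row scores.
metrics = {
    f"{name}/mean": sum(r[name] for r in rows) / len(rows)
    for name in ("Hallucination", "Relevance")
}
```

This reproduces the reported output: Hallucination/mean 0.5 (one of two rows passes) and Relevance/mean 1.0.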

All 5 Phoenix Scorers

from mlflow.genai.scorers.phoenix import Hallucination, Relevance, Toxicity, QA, Summarization

all_scorers = [
    Hallucination(model="openai:/gpt-4o-mini"),
    Relevance(model="openai:/gpt-4o-mini"),
    Toxicity(model="openai:/gpt-4o-mini"),
    QA(model="openai:/gpt-4o-mini"),
    Summarization(model="openai:/gpt-4o-mini"),
]
results = mlflow.genai.evaluate(data=eval_data, scorers=all_scorers)

Output:

Hallucination/mean: 0.5000
QA/mean: 0.5000
Relevance/mean: 1.0000
Summarization/mean: 0.5000
Toxicity/mean: 1.0000

get_scorer() API

from mlflow.genai.scorers.phoenix import get_scorer

hallucination = get_scorer("Hallucination", model="openai:/gpt-4o-mini")
relevance = get_scorer("Relevance", model="databricks")
toxicity = get_scorer("Toxicity", model="databricks:/my-endpoint")
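The get_scorer() lookup is essentially registry dispatch by metric name. A minimal sketch (stub classes, not the real scorer implementations):

```python
class PhoenixScorerStub:
    # Stand-in for the real Phoenix scorer classes; just records the
    # configured model URI.
    def __init__(self, model: str = "openai:/gpt-4o-mini"):
        self.model = model


# One class per supported metric name, mirroring the five scorers above.
_SCORER_REGISTRY = {
    name: type(name, (PhoenixScorerStub,), {})
    for name in ("Hallucination", "Relevance", "Toxicity", "QA", "Summarization")
}


def get_scorer(metric_name: str, **kwargs):
    # Fail fast with the list of valid names on a typo.
    try:
        cls = _SCORER_REGISTRY[metric_name]
    except KeyError:
        raise ValueError(
            f"Unknown Phoenix metric: {metric_name!r}. "
            f"Available: {sorted(_SCORER_REGISTRY)}"
        ) from None
    return cls(**kwargs)
```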

How is this PR tested?

  • New unit tests in tests/genai/scorers/phoenix/test_phoenix.py
  • Real API integration tests with OpenAI (all tests passed)

Test coverage includes:

  • All 5 scorers with positive and negative test cases
  • Edge cases (empty strings, long text, special characters)
  • get_scorer() API for all metrics
  • mlflow.genai.evaluate() batch evaluation
  • CategoricalRating.YES/NO value validation
  • AssessmentSource.LLM_JUDGE type validation
  • Error handling for missing required fields

Does this PR require documentation update?

  • Yes. I've updated:
    • API references with comprehensive docstrings

Release Notes

Is this a user-facing change?

  • Yes

Added Phoenix (Arize) third-party scorer integration for MLflow GenAI evaluation. 5 new scorers available:

Phoenix scorers: Hallucination, Relevance, Toxicity, QA, Summarization

Install dependencies:

pip install arize-phoenix-evals

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?

  • rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes
  • No (this PR will be included in the next minor release)

Implements support for Phoenix evaluation framework as MLflow GenAI scorers,
enabling seamless integration of Phoenix's LLM evaluation tools within
MLflow's evaluation pipeline.

Phoenix scorers:
- Hallucination: Detects hallucinations in model outputs
- Relevance: Evaluates context relevance to queries
- Toxicity: Assesses content toxicity
- QA: Evaluates QA correctness
- Summarization: Assesses summarization quality

Features:
- Lazy loading to avoid import overhead when not used
- Configurable model providers (OpenAI, Databricks, LiteLLM)
- Trace input support for extracting context from retrieval spans
- Consistent scorer interface returning Feedback objects
- CategoricalRating.YES/NO values with scores in metadata
- get_scorer() API for dynamic metric selection

Install dependencies:
  pip install arize-phoenix-evals

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@github-actions bot added the area/evaluation (MLflow Evaluation) and rn/feature (Mention under Features in Changelogs) labels Dec 17, 2025
@debu-sinha
Contributor Author

Hi @B-Step62,

I've split the PR as requested:

Quick question: For the TruLens PR (#19237), should I:

  1. Close it and create a fresh TruLens-only PR after Phoenix merges, or
  2. Rebase #19237 (Add Phoenix and TruLens third-party scorer integrations) to remove Phoenix files now?

Let me know your preference. Thanks!

Collaborator

@B-Step62 left a comment

Overall looks good! Left a few minor comments

# Convert to string if needed
prompt_str = str(prompt) if not isinstance(prompt, str) else prompt
try:
    output = _invoke_databricks_serving_endpoint(
Collaborator

The default mode should not invoke llama endpoint, but rather call the dedicated judge endpoint through mlflow.genai.judges.adapters.databricks_managed_judge_adapter.call_chat_completions. Ref: https://github.com/mlflow/mlflow/blob/master/mlflow/genai/scorers/deepeval/models.py#L42-L69

Contributor Author

Done - now using call_chat_completions from databricks_managed_judge_adapter.

Comment on lines +122 to +125
if provider == "openai":
    from phoenix.evals import OpenAIModel

    return OpenAIModel(model=model_name)
Collaborator

Suggested change
- if provider == "openai":
-     from phoenix.evals import OpenAIModel
-     return OpenAIModel(model=model_name)
+ match provider:
+     case "openai":
+         from phoenix.evals import OpenAIModel
+         return OpenAIModel(model=model_name)

Contributor Author

Kept LiteLLMModel as it handles all providers uniformly. OpenAIModel would require separate handling for each provider.

_logger = logging.getLogger(__name__)


@experimental(version="3.8.0")
Collaborator

Suggested change
- @experimental(version="3.8.0")
+ @experimental(version="3.9.0")

3.8.0 was just cut, let's target to ship this feature in 3.9.0🙂

Contributor Author

Done - updated to 3.9.0.

Comment on lines +94 to +95
result = self._evaluator.evaluate(record=record)
label, score, explanation = result
Collaborator

Suggested change
- result = self._evaluator.evaluate(record=record)
- label, score, explanation = result
+ label, score, explanation = self._evaluator.evaluate(record=record)

Contributor Author

Done.

if score is not None:
    normalized_score = float(score)
    # Clamp to 0-1 range if needed
    if normalized_score < 0.0 or normalized_score > 1.0:
Collaborator

Good question - I've added clamping with a warning log. Phoenix metrics typically return 0-1, but defensive clamping ensures we don't pass unexpected values downstream. The warning helps surface any edge cases during debugging.

Thanks for the clarification in the previous PR. While I agree clamping is helpful for many cases, MLflow should not be opinionated about what the score range should be for third party scorers, and we would like to make the behavior transparent as much as possible. Let's keep the original score returned from Phoenix.

Contributor Author

Done - passing through original score without clamping.

else:
    normalized_score = 1.0 if label == self._positive_label else 0.0

rationale = explanation or f"Label: {label}"
Collaborator

I think we can keep the rationale empty if Phoenix doesn't provide explanation. cc: @smoorjani

Contributor Author

Done - rationale is None when explanation not provided.

rationale = explanation or f"Label: {label}"

# Use categorical rating based on label
value = CategoricalRating.YES if label == self._positive_label else CategoricalRating.NO
Collaborator

Q: Does Phoenix define a fixed negative label value? If so, we may want to check both pos/neg label and categorize other ones as CategoricalRating.UNKNOWN, like this;

    # metric config
    "Hallucination": {
        "evaluator_class": "HallucinationEvaluator",
        "label_map": {
            "factual": CategoricalRating.YES,
            "non-factual": CategoricalRating.NO,
        },
        "required_fields": ["input", "output", "reference"],
    }
    ....
    
    # parsing logic in __call__
    value = self._label_map.get(label, CategoricalRating.UNKNOWN)

Contributor Author

Using raw Phoenix labels for now. Each evaluator has its own label semantics (factual/hallucinated, relevant/unrelated, etc.) so mapping to CategoricalRating would need careful consideration per metric.

Comment on lines +64 to +66
reference = expectations.get("context") or expectations.get("reference")
if not reference and "expected_output" in expectations:
    reference = str(expectations["expected_output"])
Collaborator

Suggested change
- reference = expectations.get("context") or expectations.get("reference")
- if not reference and "expected_output" in expectations:
-     reference = str(expectations["expected_output"])
+ reference = (
+     expectations.get("context")
+     or expectations.get("reference")
+     or expectations.get("expected_output")
+ )

Contributor Author

Done - using chained or with expected_response taking priority.
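Not from the PR, just a note on the semantics: chained `or` skips falsy values, so an empty-string context falls through to the next key, which is usually the desired "treat empty context as missing" behavior:

```python
expectations = {"context": "", "reference": "France's capital is Paris."}

# Empty string is falsy, so `or` falls through to "reference".
reference = (
    expectations.get("context")
    or expectations.get("reference")
    or expectations.get("expected_output")
)
assert reference == "France's capital is Paris."

# If empty strings ever need to be preserved, use an explicit None check:
first_present = next(
    (expectations[k]
     for k in ("context", "reference", "expected_output")
     if expectations.get(k) is not None),
    None,
)
assert first_present == ""  # the empty "context" is kept
```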

Changes:
- Use call_chat_completions for Databricks managed judge (models.py)
- Use match/case syntax for provider selection (models.py)
- Change @experimental version to 3.9.0 (phoenix.py)
- Replace _positive_label with _label_map for YES/NO/UNKNOWN handling (phoenix.py)
- Remove score clamping - pass through Phoenix scores directly (phoenix.py)
- Keep rationale empty if Phoenix returns no explanation (phoenix.py)
- Add negative_label to registry for label mapping (registry.py)
- Simplify reference extraction with chained or (utils.py)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha
Contributor Author

Thanks for the thorough review @B-Step62! I've addressed all your feedback in the latest commit:

models.py:

  • Changed Databricks default model to use call_chat_completions for the managed judge endpoint
  • Switched to match/case syntax for provider selection

phoenix.py:

  • Updated @experimental version to 3.9.0
  • Replaced _positive_label with _label_map dictionary mapping labels to CategoricalRating.YES/NO/UNKNOWN
  • Removed score clamping - Phoenix scores now pass through directly
  • Rationale is now empty (None) when Phoenix returns no explanation
  • Simplified evaluator call to direct unpacking

registry.py:

  • Added negative_label entries for all metrics to support the label mapping

utils.py:

  • Simplified reference extraction using chained or expressions

Integration tests pass with these changes. Please let me know if anything needs adjustment!

@debu-sinha requested a review from B-Step62 on December 18, 2025 16:55


def test_phoenix_check_installed_raises_without_phoenix():
"""Test that _check_phoenix_installed raises when phoenix is not installed."""
Collaborator

can we remove these docstrings from the test suite?

Contributor Author

Done - removed test docstrings.


@experimental(version="3.9.0")
@format_docstring(_MODEL_API_DOC)
class PhoenixScorer(Scorer):
Collaborator

What do you think about adding this to the __init__.py file instead of creating a new phoenix.py file?

cc: @B-Step62

Contributor Author

Done - moved everything to __init__.py.

- Move PhoenixScorer and metric classes from phoenix.py to __init__.py
- Delete phoenix.py (implementation now in __init__.py)
- Remove docstrings from test functions per MLflow conventions
- Update test mock paths from phoenix.phoenix to phoenix

This matches the module structure used in mlflow/genai/scorers/deepeval/

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha
Contributor Author

Thanks for the feedback @joelrobin18! I've addressed both comments in the latest commits:

  1. Removed docstrings from test functions - Following MLflow testing conventions
  2. Refactored module structure to match deepeval - Moved PhoenixScorer and metric classes from phoenix.py into __init__.py, then deleted phoenix.py

The structure now matches mlflow/genai/scorers/deepeval/:

phoenix/
├── __init__.py      # PhoenixScorer, get_scorer, and metric classes
├── models.py        # Model adapters
├── registry.py      # Metric configurations  
└── utils.py         # Input mapping utilities

debu-sinha added a commit to debu-sinha/mlflow that referenced this pull request Dec 18, 2025
Based on feedback from PRs mlflow#19473 and mlflow#19492, apply consistent patterns:
- Update experimental version from 3.8.0 to 3.9.0
- Remove score clamping (pass through third-party scores directly)
- Use match/case syntax for provider selection
- Return None from _format_rationale when no reasoning available
- Simplify module docstring to match deepeval pattern
- Remove score clamping test and update empty rationale test

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Collaborator

@smoorjani left a comment

Really excited to see this integration - thanks for doing this work and replicating the existing scorer integration pattern! Left some comments to address

_logger = logging.getLogger(__name__)


def _check_phoenix_installed():
Collaborator

nit: let's move this into utils

Contributor Author

Done - _NoOpRateLimiter is now in utils.py.

provider, model_name = model_uri.split(":", 1)
model_name = model_name.removeprefix("/")

match provider:
Collaborator

why not just use the LiteLLM provider for all of these? do the provider-specific model classes provide some additional benefit?

Contributor Author

LiteLLM doesn't work for Databricks managed judge or custom serving endpoints - those need the dedicated MLflow adapters. LiteLLM is used for external providers (openai, anthropic, etc).

    "required_fields": ["input", "reference"],
},
"Toxicity": {
    "evaluator_class": "ToxicityEvaluator",
Collaborator

I notice these scorers are in the legacy folder - is that expected? are there newer variants we should be wrapping?

Contributor Author

Yes, Phoenix keeps evaluators in the legacy folder. There are newer template-based evaluators but they require more setup. The legacy evaluators are stable and widely used.

self._verbose = False


class DatabricksPhoenixModel:
Collaborator

Is there some base class for this that Arize provides?

Collaborator

followup: let's add a comment explaining why we don't use the base class.

Contributor Author

Phoenix has BaseModel in phoenix.evals.models.base but it requires implementing abstract methods that add complexity. Added comment explaining we use duck typing instead.

Contributor Author

Done - added comment at models.py:14-17 explaining duck typing approach.
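The duck-typing decision can be sketched like this; the `invoke` hook stands in for the real `call_chat_completions` wiring and is illustrative, not the merged adapter:

```python
class DatabricksPhoenixModel:
    # Duck-typed adapter: Phoenix's legacy evaluators only need a callable
    # that maps a prompt to a completion string, so we implement __call__
    # rather than subclass phoenix's BaseModel (whose abstract generation
    # methods would add complexity we don't need here).

    def __init__(self, invoke):
        # `invoke` stands in for the real judge call (e.g. a thin wrapper
        # around mlflow's call_chat_completions adapter).
        self._invoke = invoke

    def __call__(self, prompt) -> str:
        # Evaluators may pass non-string prompt objects; coerce defensively.
        prompt_str = prompt if isinstance(prompt, str) else str(prompt)
        return self._invoke(prompt_str)


# Toy backend to show the call shape:
model = DatabricksPhoenixModel(lambda p: f"echo: {p}")
```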

config = get_metric_config(metric_name)
positive_label = config["positive_label"]
negative_label = config["negative_label"]
self._label_map = {
Collaborator

I wonder if we should do this mapping or if this might be confusing - for instance, in Toxicity, toxic is mapped to no (red) and non-toxic is mapped to yes (green), but this gives mixed signals (e.g., no to toxicity implies non-toxic). For now, maybe worth to just use the labels as-is and we should figure out a good way to deal with the high-lighting of these values cc @daniellok-db or @danielseong1 if you have ideas

Contributor Author

Using raw Phoenix labels as discussed. We can revisit highlighting in a follow-up.

mock_model = mock.MagicMock()

with (
    mock.patch(
Collaborator

do we have to mock all of these? generally we should aim to test with minimal mocks

Contributor Author

Done - now only mock create_phoenix_model and evaluator.evaluate. Real Phoenix evaluator classes are instantiated.

assert result.metadata["score"] == 0.85


def test_phoenix_toxicity_scorer_with_mock():
Collaborator

maybe we can parameterize these tests if we're just testing the different scorers in the same way

Contributor Author

Done - tests are now parameterized.
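The parameterization looks roughly like this (a self-contained sketch with a fake scorer and assumed labels; the real suite mocks create_phoenix_model and the evaluator instead):

```python
import pytest


def fake_scorer(metric_name: str) -> str:
    # Stand-in for calling Hallucination(...), Toxicity(...), etc. on a
    # passing example; returns the label each metric would produce.
    labels = {"Hallucination": "factual", "Toxicity": "non-toxic", "QA": "correct"}
    return labels[metric_name]


@pytest.mark.parametrize(
    ("metric_name", "expected_label"),
    [
        ("Hallucination", "factual"),
        ("Toxicity", "non-toxic"),
        ("QA", "correct"),
    ],
)
def test_phoenix_scorer_label(metric_name, expected_label):
    assert fake_scorer(metric_name) == expected_label
```

One parameterized test replaces a copy-pasted test function per scorer, and pytest reports each (metric, label) pair as its own case.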


mock_modules = {"phoenix": mock.MagicMock(), "phoenix.evals": mock.MagicMock()}
with pytest.raises(MlflowException, match="Unknown Phoenix metric"):
    with mock.patch.dict("sys.modules", mock_modules):
Collaborator

why do we need to mock the sys modules? Can we just use the phoenix library in our tests

Contributor Author

Removed - using real phoenix library directly.

get_evaluator_class("InvalidMetric")


def test_phoenix_scorer_exports():
Collaborator

nit: I don't think we need this TBH - the import is already tested and these won't return None if they are imported

Collaborator

sorry if I missed this - did we address this?

Contributor Author

Removed.

assert get_scorer is not None


def test_phoenix_assessment_source():
Collaborator

can we add this assertion as part of the other tests and remove this?

Contributor Author

Kept as a dedicated test - test_phoenix_scorer_evaluator_is_real_instance verifies real Phoenix classes are used.

- Move check_phoenix_installed to utils.py for proper modularization
- Simplify models.py to use LiteLLM for all non-Databricks providers
- Use Phoenix evaluator labels directly as Feedback values
- Add **evaluator_kwargs to allow passing extra params to Phoenix evaluators
- Update expectation key priority: expected_response > context > reference
- Update tests to match new label-based value assertions
- Export get_metric_config from registry for proper test patching

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha
Contributor Author

Live Test Results with OpenAI

I've tested the updated Phoenix scorers with real OpenAI API calls. All scorers work correctly:

from mlflow.genai.scorers.phoenix import Hallucination, Toxicity, QA, Summarization

# Hallucination Detection
scorer = Hallucination(model='openai:/gpt-4o-mini')
result = scorer(
    inputs='What is the capital of France?',
    outputs='Paris is the capital of France.',
    expectations={'expected_response': 'France is a country in Europe. Its capital city is Paris.'},
)
# Value: factual, Score: 0

# Toxicity Detection  
scorer = Toxicity(model='openai:/gpt-4o-mini')
result = scorer(inputs='Thank you for your question! I would be happy to help.')
# Value: non-toxic, Score: 0

# QA Evaluation
scorer = QA(model='openai:/gpt-4o-mini')
result = scorer(
    inputs='What is 2 + 2?',
    outputs='The answer is 4.',
    expectations={'expected_response': '2 + 2 = 4'},
)
# Value: correct, Score: 1

# Summarization
scorer = Summarization(model='openai:/gpt-4o-mini')
result = scorer(
    inputs='Machine learning is a subset of AI that enables systems to learn from data...',
    outputs='ML is AI that learns from data to make predictions.',
)
# Value: good, Score: 1

Changes addressed in latest commit:

  • Moved check_phoenix_installed to utils.py
  • Simplified models.py to use LiteLLM for all non-Databricks providers
  • Using Phoenix labels directly as Feedback values (e.g., 'factual', 'non-toxic')
  • Added **evaluator_kwargs support
  • Updated expectation key priority: expected_response > context > reference
  • Tests updated to match new behavior

- Simplify utils.py: inline one-liner helpers into main function,
  remove unused metric_name parameter
- Add separate test files following deepeval pattern:
  - test_models.py: tests for Databricks model adapters
  - test_registry.py: tests for evaluator class resolution
  - test_utils.py: tests for input mapping utilities
- Refactor test_phoenix.py:
  - Parameterize scorer tests to reduce duplication
  - Reduce mocking with cleaner fixture approach
  - Remove redundant test cases

All 36 tests pass with full coverage of utils, registry, and models.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha
Contributor Author

Thanks for the thorough review @smoorjani! I've addressed all the feedback in the latest commit.

Changes Made

Code Simplification

  • utils.py: Inlined the one-liner helper functions (get_reference_from_expectations, get_reference_from_trace) directly into map_scorer_inputs_to_phoenix_record as suggested. Also removed the unused metric_name parameter.

Test Restructuring (following deepeval pattern)

  • Added test_models.py: Tests for DatabricksPhoenixModel, DatabricksServingEndpointPhoenixModel, and create_phoenix_model
  • Added test_registry.py: Parameterized tests for get_evaluator_class and get_metric_config
  • Added test_utils.py: Tests for check_phoenix_installed and map_scorer_inputs_to_phoenix_record
  • Refactored test_phoenix.py: Parameterized the scorer tests to reduce duplication, cleaned up mocking approach

Clarifications

Q: Why using legacy folder evaluators? (registry.py:22)
The Phoenix library exports HallucinationEvaluator, QAEvaluator, etc. from phoenix.evals - these are the stable, documented evaluators. The "evals 2.0" in Phoenix only has generic base classes (Evaluator, LLMEvaluator, ClassificationEvaluator) that require building custom evaluators. The existing evaluators in the "legacy" namespace are actually the primary ones recommended for use.

Q: Is there a base class from Arize? (models.py:21)
Yes, Phoenix has BaseModel in phoenix.evals.legacy.models.base. However, it's an abstract class with specific requirements (_generate_with_extra, _async_generate_with_extra) that add complexity. The current duck-typing approach (implementing __call__) is what Phoenix evaluators actually use for model compatibility and keeps the adapters simple. This mirrors how the deepeval integration works with their model adapters.

All 36 tests pass locally. Ready for re-review!

@debu-sinha requested a review from smoorjani on December 26, 2025 06:13
Collaborator

@smoorjani left a comment

thanks for iterating on this - mostly LGTM! Just a few smaller follow-ups. Also, for the tests, we should make sure they aren't currently passing without the actual phoenix library being called. Assuming they fail due to import errors, we can add the dependency in .github/workflows/master.yml around L296, similar to https://github.com/mlflow/mlflow/pull/18988/files#diff-38ee08b0d1916cb5b9d8a093e986366eaafd908bc1f516f4fced0033c155d842

metadata={
    FRAMEWORK_METADATA_KEY: "phoenix",
    "score": score,
    "label": label,
Collaborator

nit: do we need this metadata since it's already in the value?

Contributor Author

score and value are different - value is the label (e.g. 'factual'), score is the numeric confidence (e.g. 0.9). Both are useful.

self._verbose = False


class DatabricksPhoenixModel:
Collaborator

followup: let's add a comment explaining why we don't use the base class.

_METRIC_REGISTRY = {
    "Hallucination": {
        "evaluator_class": "HallucinationEvaluator",
        "positive_label": "factual",
Collaborator

do we need this in the config anymore?

Contributor Author

Simplified - now just maps metric name to evaluator class name.

@@ -0,0 +1,87 @@
"""Utility functions for Phoenix integration."""
Collaborator

nit: let's remove this one-liner file docstrings as they don't add much in terms of code readability

Contributor Author

Done - removed file-level docstring from utils.py.


def test_databricks_phoenix_model_call(mock_call_chat_completions):
    with patch("mlflow.genai.scorers.phoenix.models.check_phoenix_installed"):
        from mlflow.genai.scorers.phoenix.models import DatabricksPhoenixModel
Collaborator

is it possible to put this import at the top-level? same with the ones below

get_evaluator_class("InvalidMetric")


def test_phoenix_scorer_exports():
Collaborator:

sorry if I missed this - did we address this?



def test_map_scorer_inputs_expected_response_priority():
from mlflow.genai.scorers.phoenix.utils import map_scorer_inputs_to_phoenix_record
Collaborator:

same thing with using top-level imports here


mock_trace = Mock()

with (
Collaborator:

is there a way we can avoid using mocks for this? It'd be great to construct a realistic trace and then test that the record is constructed properly; right now there's no real data being passed into the map_scorer_inputs_to_phoenix_record method. Same for below.


github-actions bot commented Dec 26, 2025

Documentation preview for a3f2902 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@debu-sinha (Contributor Author):

Thanks for the additional feedback @smoorjani! Addressed all comments in the latest commit:

Changes:

  • Removed duplicate label from metadata (already available in value)
  • Added comment explaining why we don't use Phoenix's BaseModel (duck typing for simplicity)
  • Simplified registry to only store evaluator class names (removed unused positive_label, negative_label, required_fields)
  • Removed get_metric_config function (no longer needed)
  • Removed one-liner module docstring from utils.py
  • Removed standalone export test (imports already tested implicitly in other tests)

All 32 tests pass locally.

@debu-sinha (Contributor Author):

Addressed remaining feedback on test mocking:

Q: "do we have to mock all of these? generally we should aim to test with minimal mocks"

Updated the tests to follow the deepeval pattern - now using pytest.importorskip("phoenix.evals") instead of sys.modules patching. Tests require Phoenix to be installed but only mock the LLM/evaluator calls. This provides better coverage of our integration code while avoiding real API calls.
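Roughly, `pytest.importorskip("phoenix.evals")` at module scope skips the whole test file when a presence probe fails; a stdlib-only sketch of that check:

```python
import importlib.util

# Probe for phoenix.evals without importing the package's full module tree.
# find_spec raises ModuleNotFoundError when the parent "phoenix" package is
# absent entirely, so guard for that case too.
try:
    HAS_PHOENIX = importlib.util.find_spec("phoenix.evals") is not None
except ModuleNotFoundError:
    HAS_PHOENIX = False
```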

Q: "why do we delete these modules?"

Removed the aggressive module clearing from test_registry.py. It was clearing all mlflow.genai.scorers.phoenix modules which wasn't necessary - proper imports handle this.

Additional refactor:

  • Moved _NoOpRateLimiter to utils.py per the earlier nit about consolidating Phoenix compatibility helpers.

Tests will be skipped in environments without Phoenix installed, similar to how deepeval tests work.

@debu-sinha force-pushed the feature/phoenix-scorers branch from af05590 to db5914d on December 26, 2025 18:16
@debu-sinha debu-sinha requested a review from smoorjani December 26, 2025 19:14
Code changes:
- Remove duplicate 'label' from metadata (already in 'value')
- Add comment explaining why not using Phoenix BaseModel
- Simplify registry to only store evaluator class names
- Remove unused get_metric_config function
- Move _NoOpRateLimiter to utils.py

Testing changes:
- Move imports to top-level in test files
- Use real trace objects instead of mocks in test_utils.py
- Add arize-phoenix-evals to CI workflow (master.yml)
- Tests use real Phoenix library with minimal mocking of LLM calls

This approach mirrors how MLflow tests deepeval - the library must be
installed to run tests, but we mock only the actual LLM calls.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
…tors

- test_models.py: Replace global pytest.importorskip with per-test @pytest.mark.skipif
  so tests that don't need Phoenix (e.g., test_databricks_phoenix_model_get_model_name) run without it
- test_phoenix.py: Use real Phoenix evaluator classes instead of mocking get_evaluator_class
  Only mock create_phoenix_model and evaluator.evaluate (the actual LLM call)
- test_phoenix.py: Add test_phoenix_scorer_evaluator_is_real_instance to verify real evaluators
- test_phoenix.py: Remove evaluator_kwargs test (Phoenix legacy evaluators don't accept kwargs)

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author):

Hey @smoorjani - pushed the testing changes you mentioned:

  • Swapped global importorskip for per-test skips so the databricks model tests run even without phoenix
  • Using real Phoenix evaluator classes now, only mocking the model creation and the actual LLM call
  • Added a test that checks we're actually getting real HallucinationEvaluator instances

Re-ran with real OpenAI calls, all 5 scorers still working. Let me know if there's anything else!
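A hedged sketch of the per-test skip pattern described above (test bodies are placeholders, not the PR's actual tests):

```python
import pytest

# In the real test file this flag comes from probing for phoenix.evals.
HAS_PHOENIX = True


@pytest.mark.skipif(not HAS_PHOENIX, reason="requires phoenix.evals")
def test_phoenix_evaluator_behavior():
    # Collected but skipped when Phoenix is unavailable.
    assert True


def test_databricks_phoenix_model_get_model_name():
    # No skip marker: runs even without Phoenix installed.
    assert True
```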

@smoorjani (Collaborator) left a comment:

Conditionally approving on addressing the remaining comments - otherwise, LGTM! Thanks again for iterating on this!

expectations={"expected_response": "test context"},
)

with patch.object(
Collaborator:

I think this double-nested with will fail the CI

assert model.get_model_name() == "databricks:/my-endpoint"


@pytest.mark.skipif(not HAS_PHOENIX, reason="requires phoenix.evals")
Collaborator:

do we need this? can we just let it fail as we expect phoenix to be available

inputs: dict[str, str] | None = None,
outputs: dict[str, str] | None = None,
) -> Trace:
"""Create a realistic trace for testing."""
Collaborator:

nit: let's clean up this one-liner

Per review feedback: tests should fail if dependencies aren't available,
not skip silently.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha (Contributor Author):

debu-sinha commented Jan 13, 2026

Reverted to importorskip at module level.

Edit: Correction - importorskip skips tests, it doesn't fail them. Fixed in subsequent commit with direct imports per smoorjani's feedback.

@smoorjani (Collaborator):

> Good call - reverted to importorskip at module level so tests fail if phoenix isn't there. Pushed.

@debu-sinha maybe I'm misunderstanding, but doesn't this skip if phoenix isn't there?

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Add evaluator_kwargs parameter documentation to get_scorer docstring
- Use walrus operator for span_id_to_context assignment per MLF0048
- Mock litellm.validate_environment in test_create_phoenix_model_openai
- Format test_phoenix.py per ruff formatting requirements

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
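The walrus-operator item refers to folding a lookup and its None-check into a single expression; a hypothetical sketch (function and variable names are illustrative, not the PR's actual code):

```python
def context_for_span(span_id, span_id_to_context):
    # Assign and test in one expression instead of a separate lookup line.
    if (context := span_id_to_context.get(span_id)) is not None:
        return context
    return ""
```

For example, `context_for_span("s1", {"s1": "retrieved docs"})` returns `"retrieved docs"`, while a missing span id returns `""`.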
Replace litellm.validate_environment mock with monkeypatch.setenv
to provide the required OPENAI_API_KEY. This approach aligns with
the minimal mocking principle - no mocking, just environment setup.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
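The environment-only setup above is the stdlib equivalent of pytest's `monkeypatch.setenv` (which additionally auto-restores at test teardown); a minimal sketch:

```python
import os
from unittest.mock import patch

# Supply the key litellm's environment validation looks for; no code paths mocked.
with patch.dict(os.environ, {"OPENAI_API_KEY": "test-key"}):
    key_inside = os.environ["OPENAI_API_KEY"]
# patch.dict restores the previous environment on exit.
key_after = os.environ.get("OPENAI_API_KEY")
```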
@B-Step62 (Collaborator) left a comment:

LGTM!


Labels

area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. v3.9.0

4 participants