Add Phoenix (Arize) third-party scorer integration#19473
Add Phoenix (Arize) third-party scorer integration#19473smoorjani merged 16 commits intomlflow:masterfrom
Conversation
Implements support for Phoenix evaluation framework as MLflow GenAI scorers, enabling seamless integration of Phoenix's LLM evaluation tools within MLflow's evaluation pipeline. Phoenix scorers: - Hallucination: Detects hallucinations in model outputs - Relevance: Evaluates context relevance to queries - Toxicity: Assesses content toxicity - QA: Evaluates QA correctness - Summarization: Assesses summarization quality Features: - Lazy loading to avoid import overhead when not used - Configurable model providers (OpenAI, Databricks, LiteLLM) - Trace input support for extracting context from retrieval spans - Consistent scorer interface returning Feedback objects - CategoricalRating.YES/NO values with scores in metadata - get_scorer() API for dynamic metric selection Install dependencies: pip install arize-phoenix-evals Signed-off-by: debu-sinha <debusinha2009@gmail.com>
|
Hi @B-Step62, I've split the PR as requested:
Quick question: For the TruLens PR (#19237), should I:
Let me know your preference. Thanks! |
B-Step62
left a comment
There was a problem hiding this comment.
Overall looks good! Left a few minor comments
| # Convert to string if needed | ||
| prompt_str = str(prompt) if not isinstance(prompt, str) else prompt | ||
| try: | ||
| output = _invoke_databricks_serving_endpoint( |
There was a problem hiding this comment.
The default mode should not invoke llama endpoint, but rather call the dedicated judge endpoint through mlflow.genai.judges.adapters.databricks_managed_judge_adapter.call_chat_completions. Ref: https://github.com/mlflow/mlflow/blob/master/mlflow/genai/scorers/deepeval/models.py#L42-L69
There was a problem hiding this comment.
Done - now using call_chat_completions from databricks_managed_judge_adapter.
| if provider == "openai": | ||
| from phoenix.evals import OpenAIModel | ||
|
|
||
| return OpenAIModel(model=model_name) |
There was a problem hiding this comment.
| if provider == "openai": | |
| from phoenix.evals import OpenAIModel | |
| return OpenAIModel(model=model_name) | |
| match provider: | |
| case "openai": | |
| from phoenix.evals import OpenAIModel | |
| return OpenAIModel(model=model_name) |
There was a problem hiding this comment.
Kept LiteLLMModel as it handles all providers uniformly. OpenAIModel would require separate handling for each provider.
| _logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| @experimental(version="3.8.0") |
There was a problem hiding this comment.
| @experimental(version="3.8.0") | |
| @experimental(version="3.9.0") |
3.8.0 was just cut, let's target to ship this feature in 3.9.0🙂
There was a problem hiding this comment.
Done - updated to 3.9.0.
| result = self._evaluator.evaluate(record=record) | ||
| label, score, explanation = result |
There was a problem hiding this comment.
| result = self._evaluator.evaluate(record=record) | |
| label, score, explanation = result | |
| label, score, explanation = self._evaluator.evaluate(record=record) |
| if score is not None: | ||
| normalized_score = float(score) | ||
| # Clamp to 0-1 range if needed | ||
| if normalized_score < 0.0 or normalized_score > 1.0: |
There was a problem hiding this comment.
Good question - I've added clamping with a warning log. Phoenix metrics typically return 0-1, but defensive clamping ensures we don't pass unexpected values downstream. The warning helps surface any edge cases during debugging.
Thanks for the clarification in the previous PR. While I agree clamping is helpful for many cases, MLflow should not be opinionated about what the score range should be for third party scorers, and we would like to make the behavior transparent as much as possible. Let's keep the original score returned from Phoenix.
There was a problem hiding this comment.
Done - passing through original score without clamping.
| else: | ||
| normalized_score = 1.0 if label == self._positive_label else 0.0 | ||
|
|
||
| rationale = explanation or f"Label: {label}" |
There was a problem hiding this comment.
I think we can keep the rationale empty if Phoenix doesn't provide explanation. cc: @smoorjani
There was a problem hiding this comment.
Done - rationale is None when explanation not provided.
| rationale = explanation or f"Label: {label}" | ||
|
|
||
| # Use categorical rating based on label | ||
| value = CategoricalRating.YES if label == self._positive_label else CategoricalRating.NO |
There was a problem hiding this comment.
Q: Does Phoenix define a fixed negative label value? If so, we may want to check both pos/neg label and categorize other ones as CategoricalRating.UNKNOWN, like this;
# metric config
"Hallucination": {
"evaluator_class": "HallucinationEvaluator",
"label_map": {
"factual": CategoricalRating.YES,
"non-factual": CategoricalRating.NO,
},
"required_fields": ["input", "output", "reference"],
}
....
# parsing logic in __call))
value = self._label_map.get(label, CategoricalRating.UNKNOWN)
There was a problem hiding this comment.
Using raw Phoenix labels for now. Each evaluator has its own label semantics (factual/hallucinated, relevant/unrelated, etc.) so mapping to CategoricalRating would need careful consideration per metric.
| reference = expectations.get("context") or expectations.get("reference") | ||
| if not reference and "expected_output" in expectations: | ||
| reference = str(expectations["expected_output"]) |
There was a problem hiding this comment.
| reference = expectations.get("context") or expectations.get("reference") | |
| if not reference and "expected_output" in expectations: | |
| reference = str(expectations["expected_output"]) | |
| reference = ( | |
| expectations.get("context") | |
| or expectations.get("reference") | |
| or expectations.get("expected_output") | |
| ) |
There was a problem hiding this comment.
Done - using chained or with expected_response taking priority.
Changes: - Use call_chat_completions for Databricks managed judge (models.py) - Use match/case syntax for provider selection (models.py) - Change @experimental version to 3.9.0 (phoenix.py) - Replace _positive_label with _label_map for YES/NO/UNKNOWN handling (phoenix.py) - Remove score clamping - pass through Phoenix scores directly (phoenix.py) - Keep rationale empty if Phoenix returns no explanation (phoenix.py) - Add negative_label to registry for label mapping (registry.py) - Simplify reference extraction with chained or (utils.py) Signed-off-by: debu-sinha <debusinha2009@gmail.com>
|
Thanks for the thorough review @B-Step62! I've addressed all your feedback in the latest commit: models.py:
phoenix.py:
registry.py:
utils.py:
Integration tests pass with these changes. Please let me know if anything needs adjustment! |
|
|
||
|
|
||
| def test_phoenix_check_installed_raises_without_phoenix(): | ||
| """Test that _check_phoenix_installed raises when phoenix is not installed.""" |
There was a problem hiding this comment.
can we remove these docstrings from the test suite?
There was a problem hiding this comment.
Done - removed test docstrings.
|
|
||
| @experimental(version="3.9.0") | ||
| @format_docstring(_MODEL_API_DOC) | ||
| class PhoenixScorer(Scorer): |
There was a problem hiding this comment.
What you think about adding this in init.py file instead of creating a new phoenix.py file?
cc: @B-Step62
There was a problem hiding this comment.
Done - moved everything to init.py.
- Move PhoenixScorer and metric classes from phoenix.py to __init__.py - Delete phoenix.py (implementation now in __init__.py) - Remove docstrings from test functions per MLflow conventions - Update test mock paths from phoenix.phoenix to phoenix This matches the module structure used in mlflow/genai/scorers/deepeval/ Signed-off-by: debu-sinha <debusinha2009@gmail.com>
|
Thanks for the feedback @joelrobin18! I've addressed both comments in the latest commits:
The structure now matches |
Based on feedback from PRs mlflow#19473 and mlflow#19492, apply consistent patterns: - Update experimental version from 3.8.0 to 3.9.0 - Remove score clamping (pass through third-party scores directly) - Use match/case syntax for provider selection - Return None from _format_rationale when no reasoning available - Simplify module docstring to match deepeval pattern - Remove score clamping test and update empty rationale test Signed-off-by: debu-sinha <debusinha2009@gmail.com>
smoorjani
left a comment
There was a problem hiding this comment.
Really excited to see this integration - thanks for doing this work and replicating the existing scorer integration pattern! Left some comments to address
| _logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def _check_phoenix_installed(): |
There was a problem hiding this comment.
nit: let's move this into utils
There was a problem hiding this comment.
Done - _NoOpRateLimiter is now in utils.py.
| provider, model_name = model_uri.split(":", 1) | ||
| model_name = model_name.removeprefix("/") | ||
|
|
||
| match provider: |
There was a problem hiding this comment.
why not just use the LiteLLM provider for all of these? do the provider-specific model classes provide some additional benefit?
There was a problem hiding this comment.
LiteLLM doesn't work for Databricks managed judge or custom serving endpoints - those need the dedicated MLflow adapters. LiteLLM is used for external providers (openai, anthropic, etc).
| "required_fields": ["input", "reference"], | ||
| }, | ||
| "Toxicity": { | ||
| "evaluator_class": "ToxicityEvaluator", |
There was a problem hiding this comment.
I notice these scorers are in the legacy folder - is that expected? are there newer variants we should be wrapping?
There was a problem hiding this comment.
Yes, Phoenix keeps evaluators in the legacy folder. There are newer template-based evaluators but they require more setup. The legacy evaluators are stable and widely used.
| self._verbose = False | ||
|
|
||
|
|
||
| class DatabricksPhoenixModel: |
There was a problem hiding this comment.
Is there some base class for this that Arize provides?
There was a problem hiding this comment.
followup: let's add a comment explaining why we don't use the base class.
There was a problem hiding this comment.
Phoenix has BaseModel in phoenix.evals.models.base but it requires implementing abstract methods that add complexity. Added comment explaining we use duck typing instead.
There was a problem hiding this comment.
Done - added comment at models.py:14-17 explaining duck typing approach.
| config = get_metric_config(metric_name) | ||
| positive_label = config["positive_label"] | ||
| negative_label = config["negative_label"] | ||
| self._label_map = { |
There was a problem hiding this comment.
I wonder if we should do this mapping or if this might be confusing - for instance, in Toxicity, toxic is mapped to no (red) and non-toxic is mapped to yes (green), but this gives mixed signals (e.g., no to toxicity implies non-toxic). For now, maybe worth to just use the labels as-is and we should figure out a good way to deal with the high-lighting of these values cc @daniellok-db or @danielseong1 if you have ideas
There was a problem hiding this comment.
Using raw Phoenix labels as discussed. We can revisit highlighting in a follow-up.
| mock_model = mock.MagicMock() | ||
|
|
||
| with ( | ||
| mock.patch( |
There was a problem hiding this comment.
do we have to mock all of these? generally we should aim to test with minimal mocks
There was a problem hiding this comment.
Done - now only mock create_phoenix_model and evaluator.evaluate. Real Phoenix evaluator classes are instantiated.
| assert result.metadata["score"] == 0.85 | ||
|
|
||
|
|
||
| def test_phoenix_toxicity_scorer_with_mock(): |
There was a problem hiding this comment.
maybe we can parameterize these tests if we're just testing the different scorers in the same way
There was a problem hiding this comment.
Done - tests are now parameterized.
|
|
||
| mock_modules = {"phoenix": mock.MagicMock(), "phoenix.evals": mock.MagicMock()} | ||
| with pytest.raises(MlflowException, match="Unknown Phoenix metric"): | ||
| with mock.patch.dict("sys.modules", mock_modules): |
There was a problem hiding this comment.
why do we need to mock the sys modules? Can we just use the phoenix library in our tests
There was a problem hiding this comment.
Removed - using real phoenix library directly.
| get_evaluator_class("InvalidMetric") | ||
|
|
||
|
|
||
| def test_phoenix_scorer_exports(): |
There was a problem hiding this comment.
nit: I don't think we need this TBH - the import is already tested and these won't return None if they are imported
There was a problem hiding this comment.
sorry if I missed this - did we address this?
| assert get_scorer is not None | ||
|
|
||
|
|
||
| def test_phoenix_assessment_source(): |
There was a problem hiding this comment.
can we add this assertion as part of the other tests and remove this?
There was a problem hiding this comment.
Kept as a dedicated test - test_phoenix_scorer_evaluator_is_real_instance verifies real Phoenix classes are used.
- Move check_phoenix_installed to utils.py for proper modularization - Simplify models.py to use LiteLLM for all non-Databricks providers - Use Phoenix evaluator labels directly as Feedback values - Add **evaluator_kwargs to allow passing extra params to Phoenix evaluators - Update expectation key priority: expected_response > context > reference - Update tests to match new label-based value assertions - Export get_metric_config from registry for proper test patching Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Live Test Results with OpenAII've tested the updated Phoenix scorers with real OpenAI API calls. All scorers work correctly: from mlflow.genai.scorers.phoenix import Hallucination, Toxicity, QA, Summarization
# Hallucination Detection
scorer = Hallucination(model='openai:/gpt-4o-mini')
result = scorer(
inputs='What is the capital of France?',
outputs='Paris is the capital of France.',
expectations={'expected_response': 'France is a country in Europe. Its capital city is Paris.'},
)
# Value: factual, Score: 0
# Toxicity Detection
scorer = Toxicity(model='openai:/gpt-4o-mini')
result = scorer(inputs='Thank you for your question! I would be happy to help.')
# Value: non-toxic, Score: 0
# QA Evaluation
scorer = QA(model='openai:/gpt-4o-mini')
result = scorer(
inputs='What is 2 + 2?',
outputs='The answer is 4.',
expectations={'expected_response': '2 + 2 = 4'},
)
# Value: correct, Score: 1
# Summarization
scorer = Summarization(model='openai:/gpt-4o-mini')
result = scorer(
inputs='Machine learning is a subset of AI that enables systems to learn from data...',
outputs='ML is AI that learns from data to make predictions.',
)
# Value: good, Score: 1Changes addressed in latest commit:
|
- Simplify utils.py: inline one-liner helpers into main function, remove unused metric_name parameter - Add separate test files following deepeval pattern: - test_models.py: tests for Databricks model adapters - test_registry.py: tests for evaluator class resolution - test_utils.py: tests for input mapping utilities - Refactor test_phoenix.py: - Parameterize scorer tests to reduce duplication - Reduce mocking with cleaner fixture approach - Remove redundant test cases All 36 tests pass with full coverage of utils, registry, and models. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
|
Thanks for the thorough review @smoorjani! I've addressed all the feedback in the latest commit. Changes MadeCode Simplification
Test Restructuring (following deepeval pattern)
ClarificationsQ: Why using Q: Is there a base class from Arize? (models.py:21) All 36 tests pass locally. Ready for re-review! |
There was a problem hiding this comment.
thanks for iterating on this - mostly LGTM! just a few smaller follow-ups. Also for the tests, we should make sure they aren't passing right now as we want to rely on the actual phoenix library being called. Assuming they fail due to import errors, we can add in the dependency here: .github/workflows/master.yml around L296 similar to https://github.com/mlflow/mlflow/pull/18988/files#diff-38ee08b0d1916cb5b9d8a093e986366eaafd908bc1f516f4fced0033c155d842
| metadata={ | ||
| FRAMEWORK_METADATA_KEY: "phoenix", | ||
| "score": score, | ||
| "label": label, |
There was a problem hiding this comment.
nit: do we need this metadata since it's already in the value?
There was a problem hiding this comment.
score and value are different - value is the label (e.g. 'factual'), score is the numeric confidence (e.g. 0.9). Both are useful.
| self._verbose = False | ||
|
|
||
|
|
||
| class DatabricksPhoenixModel: |
There was a problem hiding this comment.
followup: let's add a comment explaining why we don't use the base class.
| _METRIC_REGISTRY = { | ||
| "Hallucination": { | ||
| "evaluator_class": "HallucinationEvaluator", | ||
| "positive_label": "factual", |
There was a problem hiding this comment.
do we need this in the config anymore?
There was a problem hiding this comment.
Simplified - now just maps metric name to evaluator class name.
| @@ -0,0 +1,87 @@ | |||
| """Utility functions for Phoenix integration.""" | |||
There was a problem hiding this comment.
nit: let's remove this one-liner file docstrings as they don't add much in terms of code readability
There was a problem hiding this comment.
Done - removed file-level docstring from utils.py.
|
|
||
| def test_databricks_phoenix_model_call(mock_call_chat_completions): | ||
| with patch("mlflow.genai.scorers.phoenix.models.check_phoenix_installed"): | ||
| from mlflow.genai.scorers.phoenix.models import DatabricksPhoenixModel |
There was a problem hiding this comment.
is it possible to put this import at the top-level? same with the ones below
| get_evaluator_class("InvalidMetric") | ||
|
|
||
|
|
||
| def test_phoenix_scorer_exports(): |
There was a problem hiding this comment.
sorry if I missed this - did we address this?
|
|
||
|
|
||
| def test_map_scorer_inputs_expected_response_priority(): | ||
| from mlflow.genai.scorers.phoenix.utils import map_scorer_inputs_to_phoenix_record |
There was a problem hiding this comment.
same thing with using top-level imports here
|
|
||
| mock_trace = Mock() | ||
|
|
||
| with ( |
There was a problem hiding this comment.
is there a way we can avoid using mocks for this? It'd be great to construct a realistic trace and then test that the record is constructed properly as right now there's no real data being passed into the map_scorer_inputs_to_phoenix_record method. same for below.
|
Documentation preview for a3f2902 is available at: More info
|
|
Thanks for the additional feedback @smoorjani! Addressed all comments in the latest commit: Changes:
All 32 tests pass locally. |
|
Addressed remaining feedback on test mocking: Q: "do we have to mock all of these? generally we should aim to test with minimal mocks" Updated the tests to follow the deepeval pattern - now using Q: "why do we delete these modules?" Removed the aggressive module clearing from Additional refactor:
Tests will be skipped in environments without Phoenix installed, similar to how deepeval tests work. |
af05590 to
db5914d
Compare
Code changes: - Remove duplicate 'label' from metadata (already in 'value') - Add comment explaining why not using Phoenix BaseModel - Simplify registry to only store evaluator class names - Remove unused get_metric_config function - Move _NoOpRateLimiter to utils.py Testing changes: - Move imports to top-level in test files - Use real trace objects instead of mocks in test_utils.py - Add arize-phoenix-evals to CI workflow (master.yml) - Tests use real Phoenix library with minimal mocking of LLM calls This approach mirrors how MLflow tests deepeval - the library must be installed to run tests, but we mock only the actual LLM calls. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
…tors - test_models.py: Replace global pytest.importorskip with per-test @pytest.mark.skipif so tests that don't need Phoenix (e.g., test_databricks_phoenix_model_get_model_name) run without it - test_phoenix.py: Use real Phoenix evaluator classes instead of mocking get_evaluator_class Only mock create_phoenix_model and evaluator.evaluate (the actual LLM call) - test_phoenix.py: Add test_phoenix_scorer_evaluator_is_real_instance to verify real evaluators - test_phoenix.py: Remove evaluator_kwargs test (Phoenix legacy evaluators don't accept kwargs) Signed-off-by: debu-sinha <debusinha2009@gmail.com>
|
Hey @smoorjani - pushed the testing changes you mentioned:
Re-ran with real OpenAI calls, all 5 scorers still working. Let me know if there's anything else! |
| expectations={"expected_response": "test context"}, | ||
| ) | ||
|
|
||
| with patch.object( |
There was a problem hiding this comment.
I think this double-nested with will fail the CI
| assert model.get_model_name() == "databricks:/my-endpoint" | ||
|
|
||
|
|
||
| @pytest.mark.skipif(not HAS_PHOENIX, reason="requires phoenix.evals") |
There was a problem hiding this comment.
do we need this? can we just let it fail as we expect phoenix to be available
| inputs: dict[str, str] | None = None, | ||
| outputs: dict[str, str] | None = None, | ||
| ) -> Trace: | ||
| """Create a realistic trace for testing.""" |
There was a problem hiding this comment.
nit: let's clean up this one-liner
Per review feedback: tests should fail if dependencies aren't available, not skip silently. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
|
Reverted to Edit: Correction - |
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha maybe I'm misunderstanding, but doesn't this skip if phoenix isn't there? |
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Add evaluator_kwargs parameter documentation to get_scorer docstring - Use walrus operator for span_id_to_context assignment per MLF0048 - Mock litellm.validate_environment in test_create_phoenix_model_openai - Format test_phoenix.py per ruff formatting requirements Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Replace litellm.validate_environment mock with monkeypatch.setenv to provide the required OPENAI_API_KEY. This approach aligns with the minimal mocking principle - no mocking, just environment setup. Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Related Issues/PRs
Partial fix for #19062 (Phoenix portion - TruLens will follow in separate PR)
Split from #19237 per reviewer feedback to separate Phoenix and TruLens integrations.
What changes are proposed in this pull request?
This PR adds integration for Phoenix (Arize) evaluation framework as MLflow GenAI scorers.
Phoenix scorers (
mlflow.genai.scorers.phoenix):Hallucination: Detects hallucinations in model outputsRelevance: Evaluates context relevance to queriesToxicity: Assesses content toxicityQA: Evaluates QA correctnessSummarization: Assesses summarization qualityImplementation details:
CategoricalRating.YES/NOvalues with scores in metadataget_scorer()API for dynamic metric selectionUsage Examples with Real Output
Direct Scorer Call
Output:
With mlflow.genai.evaluate()
Output:
All 5 Phoenix Scorers
Output:
get_scorer() API
How is this PR tested?
tests/genai/scorers/phoenix/test_phoenix.pyTest coverage includes:
get_scorer()API for all metricsmlflow.genai.evaluate()batch evaluationCategoricalRating.YES/NOvalue validationAssessmentSource.LLM_JUDGEtype validationDoes this PR require documentation update?
Release Notes
Is this a user-facing change?
Added Phoenix (Arize) third-party scorer integration for MLflow GenAI evaluation. 5 new scorers available:
Phoenix scorers: Hallucination, Relevance, Toxicity, QA, Summarization
Install dependencies:
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/evaluation: MLflow model evaluation featuresHow should the PR be classified in the release notes?
rn/feature- A new user-facing feature worth mentioning in the release notesShould this PR be included in the next patch release?