Add Phoenix and TruLens third-party scorer integrations #19237

Closed
debu-sinha wants to merge 10 commits into mlflow:master from
debu-sinha:feature/phoenix-trulens-judge-integrations

Conversation

@debu-sinha
Contributor

@debu-sinha debu-sinha commented Dec 5, 2025

Related Issues/PRs

Partial fix for #19062 (TruLens portion)

Note: This PR has been split per reviewer feedback. Phoenix scorers are now in #19473.

What changes are proposed in this pull request?

This PR adds an integration for the TruLens evaluation framework as MLflow GenAI scorers.

TruLens scorers (mlflow.genai.scorers.trulens):

  • Groundedness: Evaluates groundedness in context
  • ContextRelevance: Assesses context relevance to query
  • AnswerRelevance: Evaluates answer relevance to query
  • Coherence: Evaluates coherence and logical flow

Implementation details:

  • Lazy loading to avoid import overhead when scorers are not used
  • Returns CategoricalRating.YES/NO values with scores in metadata
  • get_scorer() API for dynamic metric selection
  • Configurable model providers (OpenAI, Databricks, LiteLLM)
  • Trace input support for extracting context from retrieval spans
  • Clear error messages when optional dependencies are missing

Usage Examples with Real Output

Direct Scorer Call

from mlflow.genai.scorers.trulens import Groundedness

scorer = Groundedness(model="openai:/gpt-4o-mini")
feedback = scorer(
    outputs="The Eiffel Tower is 330 meters tall.",
    expectations={"context": "The Eiffel Tower is 330 metres tall and located in Paris."},
)
print(f"Value: {feedback.value}")
print(f"Metadata: {feedback.metadata}")

Output:

Value: yes
Metadata: {'mlflow.scorer.framework': 'trulens', 'score': 1.0, 'threshold': 0.5}
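Given the metadata above (score 1.0, threshold 0.5), the categorical value is plausibly derived by thresholding the framework's numeric score; this is a sketch of the assumed mapping, not MLflow's actual code:

```python
def to_categorical(score, threshold=0.5):
    # At or above the threshold -> "yes"; below -> "no". The raw score
    # and threshold stay available in the feedback metadata.
    return "yes" if score >= threshold else "no"
```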

With mlflow.genai.evaluate()

import mlflow
from mlflow.genai.scorers.trulens import Groundedness, AnswerRelevance

eval_data = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": "Paris is the capital of France.",
        "expectations": {"context": "France is in Western Europe. Its capital is Paris."},
    },
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": "London is the capital of France.",  # Wrong - should fail
        "expectations": {"context": "France is in Western Europe. Its capital is Paris."},
    },
]

scorers = [
    Groundedness(model="openai:/gpt-4o-mini"),
    AnswerRelevance(model="openai:/gpt-4o-mini"),
]

results = mlflow.genai.evaluate(data=eval_data, scorers=scorers)
print(results.metrics)

Output:

{'Groundedness/mean': 0.5, 'AnswerRelevance/mean': 0.5}
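The 0.5 means follow from averaging the per-row binary scores; the second row is intentionally ungrounded:

```python
# Row 1 is correct (scores 1.0); row 2 claims London, so it scores 0.0.
per_row_scores = [1.0, 0.0]
mean = sum(per_row_scores) / len(per_row_scores)
print(mean)  # 0.5, reported as 'Groundedness/mean'
```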

All 4 TruLens Scorers

from mlflow.genai.scorers.trulens import Groundedness, ContextRelevance, AnswerRelevance, Coherence

all_scorers = [
    Groundedness(model="openai:/gpt-4o-mini"),
    ContextRelevance(model="openai:/gpt-4o-mini"),
    AnswerRelevance(model="openai:/gpt-4o-mini"),
    Coherence(model="openai:/gpt-4o-mini"),
]
results = mlflow.genai.evaluate(data=eval_data, scorers=all_scorers)

Output:

AnswerRelevance/mean: 0.5000
Coherence/mean: 0.5000
ContextRelevance/mean: 1.0000
Groundedness/mean: 0.5000

get_scorer() API

from mlflow.genai.scorers.trulens import get_scorer

groundedness = get_scorer("Groundedness", model="openai:/gpt-4o-mini")
context_relevance = get_scorer("ContextRelevance", model="databricks")
answer_relevance = get_scorer("AnswerRelevance", model="databricks:/my-endpoint")
coherence = get_scorer("Coherence", model="openai:/gpt-4o-mini")
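One plausible implementation shape for get_scorer() is a simple name-to-class registry; everything below (the stub class, the registry contents, the error wording) is illustrative, not MLflow source:

```python
class _StubScorer:
    """Stand-in for a lightweight wrapper class such as Groundedness."""

    def __init__(self, model=None):
        self.model = model


_SCORER_REGISTRY = {"Groundedness": _StubScorer}


def get_scorer(name, **kwargs):
    # Fail fast with the list of known metrics on a bad name.
    try:
        cls = _SCORER_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown scorer {name!r}. Available: {sorted(_SCORER_REGISTRY)}"
        ) from None
    return cls(**kwargs)
```

This keeps the named classes as thin wrappers while dynamic selection goes through one lookup path.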

How is this PR tested?

  • New unit tests in tests/genai/scorers/trulens/test_trulens.py
  • Real API integration tests with OpenAI (all tests passed)

Test coverage includes:

  • All 4 scorers with positive and negative test cases
  • Edge cases (empty strings, long text, special characters)
  • get_scorer() API for all metrics
  • mlflow.genai.evaluate() batch evaluation
  • CategoricalRating.YES/NO value validation
  • AssessmentSource.LLM_JUDGE type validation
  • Error handling for missing required fields

Does this PR require documentation update?

  • Yes. I've updated:
    • API references with comprehensive docstrings

Release Notes

Is this a user-facing change?

  • Yes

Added TruLens third-party scorer integration for MLflow GenAI evaluation. Four new scorers are available:

TruLens scorers: Groundedness, ContextRelevance, AnswerRelevance, Coherence

Install dependencies:

pip install trulens trulens-providers-openai

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/evaluation: MLflow model evaluation features

How should the PR be classified in the release notes?

  • rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes
  • No (this PR will be included in the next minor release)

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

@debu-sinha Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template

This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

@debu-sinha
Contributor Author

@AveshCSingh Would you be able to take a look at this PR when you have a chance? It adds third-party scorer integrations for Phoenix (Arize) and TruLens evaluation frameworks, enabling seamless use of established LLM evaluation tools within MLflow's GenAI evaluation pipeline. Happy to address any feedback.

@debu-sinha debu-sinha force-pushed the feature/phoenix-trulens-judge-integrations branch 3 times, most recently from 09faf86 to 9a740c6 Compare December 6, 2025 15:02
}

# Third-party scorer integrations (Phoenix, TruLens)
_THIRDPARTY_IMPORTS = {
Collaborator

@joelrobin18 joelrobin18 Dec 6, 2025

Instead of a shared third-party folder, it would be better to split them into two different folders: one for Phoenix and the other for TruLens.

Contributor Author

Thanks for the suggestion @joelrobin18! I've restructured the code as you recommended - Phoenix and TruLens now have their own separate folders:

  • mlflow/genai/scorers/phoenix/
  • mlflow/genai/scorers/trulens/

Each integration has its own __init__.py with proper exports. Updated the lazy loading in the main scorers/__init__.py accordingly.

Let me know if there's anything else you'd like me to adjust!

@debu-sinha debu-sinha force-pushed the feature/phoenix-trulens-judge-integrations branch from 9a740c6 to 2e0cb24 Compare December 6, 2025 15:13
@debu-sinha
Contributor Author

Hi @joelrobin18,

Thanks for the feedback on the folder structure. I've restructured the code as suggested:

  • Created separate mlflow/genai/scorers/phoenix/ directory
  • Created separate mlflow/genai/scorers/trulens/ directory
  • Each integration has its own __init__.py with proper exports
  • Updated the lazy loading in the main scorers/__init__.py accordingly

Would appreciate your re-review when you get a chance. Also cc @AveshCSingh for visibility.

debu-sinha added a commit to debu-sinha/mlflow that referenced this pull request Dec 11, 2025
This PR now focuses exclusively on TruLens agent trace scorers for
goal-plan-action alignment evaluation. Basic TruLens scorers (Groundedness,
ContextRelevance, AnswerRelevance, Coherence) are already provided in PR mlflow#19237
(Phoenix/TruLens third-party scorer integrations).

Changes:
- Remove mlflow/genai/scorers/trulens/basic.py (moved to mlflow#19237)
- Update trulens/__init__.py with comprehensive examples for all 6 agent trace scorers
- Update scorers/__init__.py to only export agent trace scorers
- Update tests to only test agent trace scorers (19 tests remain)

Agent trace scorers provided:
- TruLensLogicalConsistencyScorer
- TruLensExecutionEfficiencyScorer
- TruLensPlanAdherenceScorer
- TruLensPlanQualityScorer
- TruLensToolSelectionScorer
- TruLensToolCallingScorer

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: debu-sinha <debusinha2009@gmail.com>
debu-sinha added a commit to debu-sinha/mlflow that referenced this pull request Dec 11, 2025
This PR now focuses exclusively on TruLens agent trace scorers for
goal-plan-action alignment evaluation. Basic TruLens scorers (Groundedness,
ContextRelevance, AnswerRelevance, Coherence) are already provided in PR mlflow#19237
(Phoenix/TruLens third-party scorer integrations).

Changes:
- Remove mlflow/genai/scorers/trulens/basic.py (moved to mlflow#19237)
- Update trulens/__init__.py with comprehensive examples for all 6 agent trace scorers
- Update scorers/__init__.py to only export agent trace scorers
- Update tests to only test agent trace scorers (19 tests remain)

Agent trace scorers provided:
- TruLensLogicalConsistencyScorer
- TruLensExecutionEfficiencyScorer
- TruLensPlanAdherenceScorer
- TruLensPlanQualityScorer
- TruLensToolSelectionScorer
- TruLensToolCallingScorer

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha debu-sinha force-pushed the feature/phoenix-trulens-judge-integrations branch from 255af44 to a0fff3b Compare December 11, 2025 09:29
@debu-sinha
Contributor Author

Note: I also contributed a fix to the TruLens project (truera/trulens#2308) to add additionalProperties: false to Pydantic schemas for Databricks structured output compatibility. This was merged today and ensures TruLens works correctly with Databricks endpoints - relevant context for this integration.

@debu-sinha
Contributor Author

Update: Based on feedback from TruLens maintainers on the integration issue I opened (truera/trulens#2302), they requested adding their goal-plan-action alignment evaluators for agent trace analysis. I've created a follow-up PR for this: #19328

The TruLens team has been collaborative throughout - they merged my companion PR (truera/trulens#2308) to fix Databricks compatibility, and @sfc-gh-nvytla approved the integration on #19328.

@@ -0,0 +1,51 @@
"""
Collaborator

The docstring in the module will be shown in API doc, but not visible from the doc website. I think we can keep this simple like #19345 and file a follow-up PR to have a proper documentation page. @smoorjani is adding doc for DeepEval scorers in #19409, so let's address this once the PR is merged.

Collaborator

My current docs PR is here: #19409
I'll be adding another one for RAGAS, but yes we can do this in a follow-up.

Comment on lines +85 to +88
from phoenix.evals import OpenAIModel

return OpenAIModel(model=self.model_name or "gpt-4o-mini")

Collaborator

Phoenix supports several different models, not only OpenAI: https://arize-phoenix.readthedocs.io/en/arize-phoenix-v4.10.1/api/evals.models.html

Can we support them? LiteLLM can be the fallback for models that are not natively supported by them, e.g., Gemini.

"""Get the Phoenix OpenAI model instance."""
_check_phoenix_installed()

from phoenix.evals import OpenAIModel
Collaborator

We need to support Databricks judge and model serving endpoint. Could you read #19345 and apply the same approach?

)
normalized_score = min(1.0, max(0.0, normalized_score))
else:
normalized_score = 1.0 if label == positive_label else 0.0
Collaborator

Can we use CategoricalRating.YES and CategoricalRating.NO?

if normalized_score < 0.0 or normalized_score > 1.0:
import logging

logging.getLogger(__name__).warning(
Collaborator

Can we define logger at top level so we can reuse?

import logging

logger = logging.getLogger(__name__)


class _PhoenixScorerBase(Scorer):
   ...


logging.getLogger(__name__).warning(
f"Phoenix returned score {normalized_score} outside expected 0-1 range. "
"This may indicate a version incompatibility. Clamping to valid range."
Collaborator

Do we need to crop the score to 0-1?

Contributor Author

Good question - I've added clamping with a warning log. Phoenix metrics typically return 0-1, but defensive clamping ensures we don't pass unexpected values downstream. The warning helps surface any edge cases during debugging.

Collaborator

Do we have any example on when the score wasn't between 0-1?

evaluator = HallucinationEvaluator(model=model)

# Build record dict as expected by Phoenix
query = inputs.get("query", str(inputs)) if inputs else ""
Collaborator

We have to handle more input patterns (e.g. trace, messages, etc). Can you update the logic similarly to the map_scorer_inputs_to_ragas_sample in #18988?

Comment on lines +167 to +168
context: str | None = None,
**kwargs,
Collaborator

A scorer should only have inputs, outputs, expectations, and trace as arguments. Other things like context should be derived from one of these.

# Already aligned with MLflow convention (higher = better)
score, rationale = self._parse_result(result, positive_label="factual")

return Feedback(name=self.name, value=score, rationale=rationale)
Collaborator

Let's set assessment source as well.

Collaborator

Can we also support get_judge() API?

Collaborator

small update - we called it get_scorer() now

Collaborator

@smoorjani smoorjani left a comment

left some nits in addition to Yuki's comments - I think the general ask here is to follow the existing pattern from DeepEval/RAGAS where we have a generalized implementation for any scorer (get_scorer) and the classes themselves (e.g., FaithfulnessScorer) are super lightweight wrappers.

get_all_scorers,
)
from mlflow.genai.scorers.phoenix import (
PhoenixHallucinationScorer,
Collaborator

nit: can we follow the existing pattern of mlflow.genai.scorers.phoenix import HallucinationScorer so as to not repeat Phoenix. We can follow up later if this is verbose and renamespace as mlflow.genai.scorers import PhoenixHallucinationScorer.


try:
import phoenix.evals # noqa: F401

return True
Collaborator

nit: does this need to return anything?

Contributor Author

This was for _check_phoenix_installed() - it either raises an exception or returns implicitly. In the refactored code, the check happens in models.py during model creation, so it fails fast with a clear error message before any evaluation runs.



def _check_phoenix_installed():
"""Check if phoenix.evals is installed and raise a helpful error if not."""
Collaborator

nit: let's avoid one-line docstring, I think the linter will complain about this anyhow


return OpenAIModel(model=self.model_name or "gpt-4o-mini")

def _parse_result(
Collaborator

maybe this should be _construct_mlflow_assessment and this should return an assessment?

Contributor Author

@debu-sinha debu-sinha Dec 17, 2025

Agreed - I've refactored this entirely. The method now directly constructs and returns a Feedback object with a proper AssessmentSource. No separate helper method is needed since each scorer's evaluation logic is self-contained.

@debu-sinha
Contributor Author

Thanks for the detailed feedback @B-Step62 @smoorjani! I've addressed all the comments in the latest push:

Structural changes:

  • Added get_scorer() API following the RAGAS pattern
  • Created lightweight wrapper classes: Hallucination, Relevance, Toxicity, QA, Summarization (without prefix)
  • Separated concerns into models.py, registry.py, utils.py for each integration

Model support:

  • Databricks managed judge: model="databricks"
  • Databricks serving endpoints: model="databricks:/<endpoint>"
  • Native providers: OpenAI, Azure, Bedrock, Anthropic, Gemini, Mistral
  • LiteLLM fallback for unsupported providers

Interface standardization:

  • Scorer args: inputs, outputs, expectations, trace only
  • Context derived from expectations or trace retrieval spans
  • CategoricalRating.YES/NO for binary feedback
  • Proper AssessmentSource attribution
  • Score clamping to 0-1 with warning logs

Ready for another look!

@debu-sinha debu-sinha force-pushed the feature/phoenix-trulens-judge-integrations branch from a33545e to 4938e31 Compare December 17, 2025 18:35
@debu-sinha
Contributor Author

Re-ran tests on Databricks:
(six screenshots of Databricks test runs, Dec 17, 2025, attached)

Implements support for Phoenix (Arize) and TruLens evaluation frameworks
as MLflow GenAI scorers, enabling seamless integration of established
LLM evaluation tools within MLflow's evaluation pipeline.

Phoenix scorers (mlflow/genai/scorers/phoenix/):
- PhoenixHallucinationScorer: Detects hallucinations in model outputs
- PhoenixRelevanceScorer: Evaluates response relevance to queries
- PhoenixToxicityScorer: Assesses content toxicity
- PhoenixQAScorer: Evaluates QA correctness
- PhoenixSummarizationScorer: Assesses summarization quality

TruLens scorers (mlflow/genai/scorers/trulens/):
- TruLensGroundednessScorer: Evaluates groundedness in context
- TruLensContextRelevanceScorer: Assesses context relevance
- TruLensAnswerRelevanceScorer: Evaluates answer relevance
- TruLensCoherenceScorer: Evaluates logical flow of outputs

Features:
- Lazy loading to avoid import overhead when not used
- Configurable model providers (OpenAI, LiteLLM for TruLens)
- Consistent scorer interface returning Feedback objects
- Helpful error messages when optional dependencies missing

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Key changes:
1. Phoenix scorers: Remove incorrect score inversion for Hallucination and
   Toxicity evaluators. Phoenix natively returns scores aligned with MLflow
   convention (1.0 = good, 0.0 = bad).

2. Both Phoenix and TruLens scorers: Replace silent score clamping with
   validation that logs warnings when scores are outside expected 0-1 range.
   This helps detect potential version incompatibilities.

3. Update tests to reflect correct Phoenix score semantics:
   - factual = 1.0 (not inverted from 0)
   - non-toxic = 1.0 (not inverted from 0)

4. Update docstrings and comments to accurately describe score semantics.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Align with existing DeepEval/RAGAS patterns for consistency
- Add get_scorer() API for both Phoenix and TruLens integrations
- Create lightweight wrapper classes (Hallucination, Relevance, etc.)
- Add multi-model support: Databricks managed judge, serving endpoints, LiteLLM
- Use standard scorer arguments: inputs, outputs, expectations, trace
- Add AssessmentSource for proper attribution
- Use CategoricalRating.YES/NO for binary feedback values
- Add models.py, registry.py, utils.py for each integration
- Update tests to match new API patterns

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
- Use function-level patching instead of module-level mocking
- Import classes before applying mocks to ensure module is loaded
- Fix line length issues in test files

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix ToxicityEvaluator expects text in the 'input' field, but MLflow
scorers typically pass text via 'outputs' parameter. When Toxicity is
called with only outputs (no inputs), the text should be mapped to
Phoenix's 'input' field.

This fix ensures Toxicity()(outputs='text') works correctly by mapping
the output to Phoenix's expected 'input' field when no input is provided.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
The call_chat_completions function from databricks.rag_eval requires
the @context.eval_context decorator which sets up internal state. When
called directly from Phoenix/TruLens model adapters without this context,
it fails with 'cannot access local variable' errors.

Switch to using _invoke_databricks_serving_endpoint with the
databricks-meta-llama-3-3-70b-instruct foundation model endpoint,
which works in both notebook and external environments.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix evaluators use a set_verbosity context manager that expects
model adapters to have both _verbose and _rate_limiter._verbose attributes.
Without these, the context manager fails with UnboundLocalError.

Added:
- _NoOpRateLimiter stub class with _verbose attribute
- _verbose = False to both Databricks model adapter classes
- _rate_limiter = _NoOpRateLimiter() to both classes

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix:
- Add _verbose attribute required by set_verbosity context manager
- Add _rate_limiter with _NoOpRateLimiter stub for set_verbosity

TruLens:
- Create dynamic provider class that inherits from LLMProvider
- Implement _create_chat_completion method required by TruLens
- TruLens feedback methods (groundedness, relevance, etc.) are inherited
  from LLMProvider base class

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Phoenix fix:
- Handle MultimodalPrompt objects by converting to string before passing
  to Databricks endpoint (Phoenix evaluators pass MultimodalPrompt, not str)

TruLens fix:
- Properly initialize Pydantic base class with super().__init__(model_engine=...)
- Use class-level attributes for endpoint config to avoid Pydantic field issues

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
Add Endpoint object to TruLens Databricks provider initialization.
TruLens LLMProvider requires an endpoint to be set for feedback methods
like groundedness_measure_with_cot_reasons to work correctly.

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha debu-sinha force-pushed the feature/phoenix-trulens-judge-integrations branch from 4b1eb0a to a1058df Compare December 17, 2025 21:47
@debu-sinha
Contributor Author

Hi @B-Step62,

Thanks for the feedback! I've split this PR as requested:

The Phoenix PR includes:

  • All 5 Phoenix scorers (Hallucination, Relevance, Toxicity, QA, Summarization)
  • Simplified module docstrings per your feedback
  • get_scorer() API following RAGAS pattern
  • Trace input support using trace_utils
  • Databricks model support (managed judge and serving endpoints)
  • CategoricalRating.YES/NO values
  • AssessmentSource with LLM_JUDGE type

Please review #19473 when you have a chance. I'll close this PR and create a separate TruLens PR once Phoenix is merged.

@B-Step62 B-Step62 removed the v3.8.0 label Dec 18, 2025
@B-Step62 B-Step62 closed this Dec 18, 2025

Labels

area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs.
