
Fix scorers issue in metaprompting #20173

Merged
chenmoneygithub merged 6 commits into mlflow:master from chenmoneygithub:metaprompting-fix-2
Jan 22, 2026

Conversation

@chenmoneygithub (Contributor) commented Jan 21, 2026

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

We need to raise an explicit exception when only one of train_data and scorers is set in mlflow.genai.optimize_prompts(). Additionally, we allow scorers=None for a better developer experience.
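As a rough sketch of the check described in this paragraph (illustrative only; the function name and error message are invented here, not the actual MLflow source):

```python
# Illustrative sketch only -- not the actual MLflow implementation.
def validate_data_and_scorers(train_data, scorers):
    has_train_data = train_data is not None and len(train_data) > 0
    has_scorers = scorers is not None and len(scorers) > 0
    # Raise only when exactly one of the two is set: both-None is the
    # zero-shot path, both-set is the scored optimization path.
    if has_train_data != has_scorers:
        raise ValueError(
            "`train_data` and `scorers` must be set together: pass both for "
            "scored optimization, or neither for zero-shot metaprompting."
        )
    return has_train_data and has_scorers
```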

When the metaprompting optimizer receives a dataset but no scorers, metaprompting still works; the generated metaprompt looks like the example below:


You are an expert prompt engineer. Your task is to improve the following prompts to achieve better performance.
CURRENT PROMPTS: Prompt name: medical_section_classifier Template: Classify this medical research paper sentence into one of these sections: CONCLUSIONS, RESULTS, METHODS, OBJECTIVE, BACKGROUND.
Sentence: {{sentence}}
EVALUATION EXAMPLES: Below are examples showing how the current prompts performed. Study these to identify patterns in what worked and what failed.
Example 1: Input: {"sentence": "The emergence of HIV as a chronic condition means that people living with HIV are required to take more responsibility for the self-management of their condition , including making physical , emotional and social adjustments ."} Output: BACKGROUND Expected: {'expected_response': 'BACKGROUND'}
Example 2: Input: {"sentence": "This paper describes the design and evaluation of Positive Outlook , an online program aiming to enhance the self-management skills of gay men living with HIV ."} Output: METHODS Expected: {'expected_response': 'BACKGROUND'}
Example 3: Input: {"sentence": "This study is designed as a randomised controlled trial in which men living with HIV in Australia will be assigned to either an intervention group or usual care control group ."} Output: METHODS Expected: {'expected_response': 'METHODS'}
Example 4: Input: {"sentence": "The intervention group will participate in the online group program ` Positive Outlook ' ."} Output: METHODS Expected: {'expected_response': 'METHODS'}
Example 5: Input: {"sentence": "The program is based on self-efficacy theory and uses a self-management approach to enhance skills , confidence and abilities to manage the psychosocial issues associated with HIV in daily life ."} Output: BACKGROUND Expected: {'expected_response': 'METHODS'}
Example 6: Input: {"sentence": "Participants will access the program for a minimum of 90 minutes per week over seven weeks ."} Output: METHODS Expected: {'expected_response': 'METHODS'}
Example 7: Input: {"sentence": "Primary outcomes are domain specific self-efficacy , HIV related quality of life , and outcomes of health education ."} Output: METHODS Expected: {'expected_response': 'METHODS'}
Example 8: Input: {"sentence": "Secondary outcomes include : depression , anxiety and stress ; general health and quality of life ; adjustment to HIV ; and social support ."} Output: METHODS
Reason: It describes the outcomes being measured in the study (secondary outcomes), which is part of the study design/methods rather than results, objectives, background, or conclusions. Expected: {'expected_response': 'METHODS'}
Example 9: Input: {"sentence": "Data collection will take place at baseline , completion of the intervention ( or eight weeks post randomisation ) and at 12 week follow-up ."} Output: METHODS Expected: {'expected_response': 'METHODS'}

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Copilot AI review requested due to automatic review settings January 21, 2026 03:56
@github-actions (bot)

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20173/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20173/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20173/merge

@github-actions (bot)

@chenmoneygithub Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions in the contributing guide. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

@github-actions bot added the area/prompts (MLflow Prompt Registry and Optimization) and rn/bug-fix (Mention under Bug Fixes in Changelogs) labels Jan 21, 2026
Copilot AI left a comment

Pull request overview

This PR enhances the prompt optimization API to support zero-shot mode by making the scorers parameter optional and adding validation to ensure train_data and scorers are set together (both provided or both None/empty).

Changes:

  • Made scorers parameter optional (defaults to None) in optimize_prompts()
  • Added validation to ensure train_data and scorers are mutually required
  • Updated validate_train_data() to handle None scorers for zero-shot mode

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
mlflow/genai/optimize/optimize.py Made scorers optional, added mutual validation, set eval_fn to None in zero-shot mode
mlflow/genai/optimize/util.py Updated validate_train_data to accept None scorers
tests/genai/optimize/test_optimize.py Updated MockPromptOptimizer to handle None eval_fn, added validation tests


@github-actions bot commented Jan 21, 2026

Documentation preview for 2aad211 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.

has_train_data = train_data is not None and len(train_data) > 0
has_scorers = scorers is not None and len(scorers) > 0

if has_train_data and not has_scorers:
Collaborator:

Isn't it possible to run few-shot metaprompting if tracing data exists and scorers is None?

Contributor Author:

Technically yes, but it did not work well in my earlier experiments, potentially because no new information gets generated.

However, for the model-switching use case, where the inference model differs from the model that generated the traces, this setup (train_data + no scorers) does work. Since we use the same API to cover both scenarios, let me remove this validation.


if train_data is None or len(train_data) == 0:
# Validate that train_data and scorers are set together
has_train_data = train_data is not None and len(train_data) > 0
Collaborator:

nit: do we allow users to pass train_data=None? The type hint does not support None.

Contributor Author:

EvaluationDatasetTypes could be None:

        EvaluationDatasetTypes = (
            pd.DataFrame
            | pyspark.sql.dataframe.DataFrame
            | list[dict]
            | list[Trace]
            | ManagedEvaluationDataset
            | EntityEvaluationDataset
            | ConversationSimulator
            | None
        )

I went with this approach because `"EvaluationDatasetTypes" | None` is invalid: the `|` union operator cannot be applied to a string forward reference.

metric_fn = create_metric_from_scorers(scorers, aggregation)
eval_fn = _build_eval_fn(predict_fn, metric_fn)
# Create metric function only if scorers are provided (few-shot mode)
if has_scorers:
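A hedged sketch of the guarded construction in this hunk; `create_metric_from_scorers` and `_build_eval_fn` are stubbed with toy bodies here, since only the control flow (skip evaluation entirely in zero-shot mode) is the point:

```python
# Toy stand-ins for the helpers named in the diff; these bodies are
# invented for the sketch and do not reflect the MLflow implementation.
def create_metric_from_scorers(scorers, aggregation=None):
    return lambda output: sum(scorer(output) for scorer in scorers) / len(scorers)

def _build_eval_fn(predict_fn, metric_fn):
    return lambda inputs: metric_fn(predict_fn(inputs))

def build_eval_fn(predict_fn, scorers, aggregation=None):
    has_scorers = scorers is not None and len(scorers) > 0
    if not has_scorers:
        # Zero-shot mode: no scorers means nothing to score, so skip
        # building the metric and evaluation functions entirely.
        return None
    metric_fn = create_metric_from_scorers(scorers, aggregation)
    return _build_eval_fn(predict_fn, metric_fn)
```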
Collaborator:

What happens if users don't pass a dataset and scorers, and use GEPA? Maybe we should add validation in each optimizer, since the required fields may vary across optimizers?

Contributor Author:

I realized the old code was a bit broken, so I refactored it to make validation work better and to ensure that metaprompting with train_data but without scorers works well.

@chenmoneygithub chenmoneygithub added this pull request to the merge queue Jan 22, 2026
Merged via the queue into mlflow:master with commit 25833b7 Jan 22, 2026
46 of 47 checks passed
@chenmoneygithub chenmoneygithub deleted the metaprompting-fix-2 branch January 22, 2026 05:05
harupy pushed a commit to harupy/mlflow that referenced this pull request Jan 28, 2026
harupy pushed a commit that referenced this pull request Jan 28, 2026

Labels

area/prompts (MLflow Prompt Registry and Optimization), rn/bug-fix (Mention under Bug Fixes in Changelogs), v3.9.0

3 participants