
Support metaprompting in mlflow.genai.optimize_prompts()#19762

Merged
chenmoneygithub merged 14 commits into mlflow:master from chenmoneygithub:metaprompting
Jan 12, 2026

Conversation

@chenmoneygithub
Contributor

@chenmoneygithub chenmoneygithub commented Jan 6, 2026

🛠 DevTools 🛠


Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19762/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19762/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19762/merge

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Support metaprompting in mlflow.genai.optimize_prompts(). There are two modes:

  • zero-shot metaprompting when no train_data is provided.
  • few-shot metaprompting when train_data is provided.

Zero-shot is less useful on the SDK side, but it will be useful in the optimization UI. I will update the tutorial in a separate PR to avoid a gigantic PR.
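The mode selection described above can be sketched as a small hypothetical helper (this illustrates the selection rule only, not MLflow's actual internals; the function name is made up for the example):

```python
def select_mode(train_data: list[dict]) -> str:
    """Illustrative rule: empty train_data triggers zero-shot metaprompting,
    otherwise few-shot metaprompting that learns from evaluation results."""
    return "zero-shot" if len(train_data) == 0 else "few-shot"


print(select_mode([]))                                  # zero-shot
print(select_mode([{"inputs": {"question": "1+1?"}}]))  # few-shot
```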

End-to-end tested with the script below:

import random

import litellm
from datasets import load_dataset

import mlflow
from mlflow.genai.optimize.optimizers import MetaPromptOptimizer
from mlflow.genai.scorers import Correctness


def load_aime_data(num_train=10, num_test=32, seed=42):
    """Load AIME dataset from HuggingFace and split into train/test."""
    dataset = load_dataset("gneubig/aime-1983-2024", split="train")

    # Shuffle with fixed seed for reproducibility
    indices = list(range(len(dataset)))
    random.seed(seed)
    random.shuffle(indices)

    # Split into train and test
    train_indices = indices[:num_train]
    test_indices = indices[num_train : num_train + num_test]

    train_dataset = dataset.select(train_indices)
    test_dataset = dataset.select(test_indices)

    # Convert to MLflow format
    def convert_to_mlflow_format(dataset):
        mlflow_data = []
        for example in dataset:
            mlflow_data.append(
                {
                    "inputs": {"question": example["Question"]},
                    "expectations": {"expected_response": str(example["Answer"])},
                }
            )
        return mlflow_data

    return convert_to_mlflow_format(train_dataset), convert_to_mlflow_format(
        test_dataset
    )


def run_benchmark(
    reflection_model="gpt-4o",
    eval_model="gpt-4o-mini",
    num_train_examples=10,
    num_test_examples=32,
):
    """
    Benchmark MetaPromptOptimizer on AIME dataset.

    Args:
        reflection_model: Model to use for prompt optimization
        eval_model: Model to use for predictions
        num_train_examples: Number of training examples
        num_test_examples: Number of test examples for final evaluation
    """

    print("=" * 80)
    print("MetaPromptOptimizer Benchmark on AIME")
    print("=" * 80)

    # Load data
    print(
        f"\nLoading {num_train_examples} train examples and {num_test_examples} test examples..."
    )
    train_data, test_data = load_aime_data(
        num_train=num_train_examples, num_test=num_test_examples
    )

    print(f"Train size: {len(train_data)}")
    print(f"Test size: {len(test_data)}")

    try:
        prompt = mlflow.genai.load_prompt("prompts:/aime_solver/6")
    except Exception:
        prompt = mlflow.genai.register_prompt(
            name="aime_solver",
            template="Solve the following math problem. Provide only the numerical answer:\n\n{{question}}",
        )

    print(f"\nInitial prompt: {prompt.template}")

    # Define predict function
    def predict_fn(question: str) -> str:
        """Prediction function that uses the registered prompt."""
        prompt = mlflow.genai.load_prompt("prompts:/aime_solver/6")
        formatted_prompt = prompt.format(question=question)
        response = litellm.completion(
            model=f"openai/{eval_model}",
            messages=[{"role": "user", "content": formatted_prompt}],
            temperature=1.0,
        )
        return response.choices[0].message.content

    # Optimize prompts
    print("\n" + "-" * 80)
    print("Optimizing Prompts")
    print("-" * 80)

    result = mlflow.genai.optimize_prompts(
        predict_fn=predict_fn,
        train_data=train_data,
        prompt_uris=[prompt.uri],
        optimizer=MetaPromptOptimizer(
            reflection_model=f"openai:/{reflection_model}",  # use the function parameter instead of a hardcoded model
            lm_kwargs={"temperature": 1.0, "max_tokens": 4096},
        ),
        scorers=[Correctness(model="openai:/gpt-5-mini")],
    )

    # result = mlflow.genai.optimize_prompts(
    #     predict_fn=predict_fn,
    #     train_data=[],
    #     prompt_uris=[prompt.uri],
    #     optimizer=MetaPromptOptimizer(
    #         reflection_model="openai:/gpt-5-mini",
    #         lm_kwargs={"temperature": 1.0, "max_tokens": 4096},
    #     ),
    #     scorers=[],
    # )

    print(f"\n{'=' * 80}")
    print("Optimization Results")
    print("=" * 80)
    print(f"Initial prompt:\n{prompt.template}\n")
    print(f"Optimized prompt:\n{result.optimized_prompts[0].template}\n")
    print(
        f"Training score improvement: {result.initial_eval_score:.4f} → {result.final_eval_score:.4f}"
    )

    # Check if prompt changed
    if result.optimized_prompts[0].template == prompt.template:
        print("\n⚠️  Note: The optimized prompt is identical to the initial prompt.")
        print("No improvement was found during optimization.")


def run_evaluation(prompt_version: str, data, eval_model: str):
    """
    Evaluate the performance of a prompt on a dataset.
    """

    def predict_fn(question: str) -> str:
        prompt = mlflow.genai.load_prompt(f"prompts:/aime_solver/{prompt_version}")
        formatted_prompt = prompt.format(question=question)
        response = litellm.completion(
            model=f"openai/{eval_model}",
            messages=[{"role": "user", "content": formatted_prompt}],
            temperature=1.0,
        )
        return response.choices[0].message.content

    return mlflow.genai.evaluate(
        data=data,
        predict_fn=predict_fn,
        scorers=[Correctness(model=f"openai:/{eval_model}")],  # match the "<provider>:/<model>" format used above
    )


def main():
    # Configure MLflow tracking
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("metaprompt-optimizer-aime")

    print("Starting benchmark...")

    run_benchmark(
        reflection_model="gpt-5-mini",
        eval_model="gpt-5-nano",
        num_train_examples=12,
        num_test_examples=24,
    )


if __name__ == "__main__":
    main()

A sample output is shown below:

================================================================================
Optimization Results
================================================================================
Initial prompt:
Solve the following math problem. Provide only the numerical answer:

{{question}}

Optimized prompt:
You are an expert contest mathematician. Solve the following math problem and output only the final numerical answer. Do not show any solution steps, reasoning, or explanations — perform all reasoning internally. Before answering, verify your result by recomputing or checking algebra/arithmetic.

Rules for the output (must be followed exactly):
- If the answer is an integer, output only the digits of the integer (no commas, no words, no punctuation).
- If the answer is a simplified fraction, output it as numerator/denominator in lowest terms (e.g., 3/7).
- If the problem asks for a specific expression (such as m+n), output that final integer only.
- Do not output any extra text, labels, or whitespace — only the exact answer characters.

Screenshot of the associated MLflow run:

[screenshot]

Screenshot of the trace of metaprompting with few-shot data:

[screenshot]

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Copilot AI review requested due to automatic review settings January 6, 2026 04:46
@chenmoneygithub chenmoneygithub marked this pull request as draft January 6, 2026 04:46
@github-actions
Contributor

github-actions bot commented Jan 6, 2026

@chenmoneygithub Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces metaprompting support to MLflow's prompt optimization capabilities by adding a new MetaPromptOptimizer class. The optimizer uses LLMs to iteratively improve prompts through either zero-shot mode (applying general best practices without evaluation data) or few-shot mode (learning from evaluation feedback on training examples).

Key changes:

  • New MetaPromptOptimizer class with automatic mode detection based on training data availability
  • Support for custom guidelines to guide the optimization process
  • Comprehensive test suite covering initialization, template validation, sampling, and integration scenarios
  • Support for separate validation datasets to prevent overfitting
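The template-variable validation mentioned above can be illustrated with a minimal sketch (a hypothetical check, not MLflow's actual implementation; the function names are made up): an improved prompt must preserve every {{variable}} placeholder present in the original.

```python
import re


def template_variables(template: str) -> set[str]:
    # Extract {{variable}} placeholders from a double-brace template
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))


def variables_preserved(original: str, improved: str) -> bool:
    # The improved prompt must keep every placeholder the original had
    return template_variables(original) <= template_variables(improved)


print(variables_preserved("Solve: {{question}}", "Answer only: {{question}}"))  # True
print(variables_preserved("Solve: {{question}}", "Answer only."))               # False
```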

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

File Description
mlflow/genai/optimize/optimizers/metaprompt_optimizer.py New optimizer implementation with zero-shot and few-shot metaprompting modes, template variable validation, and MLflow tracking integration
tests/genai/optimize/optimizers/test_metaprompt_optimizer.py Comprehensive test suite covering initialization, template variables, sampling, meta-prompt building, LLM invocation, and integration scenarios
mlflow/genai/optimize/optimizers/__init__.py Exports the new MetaPromptOptimizer class
mlflow/genai/optimize/optimize.py Minor formatting improvements for better code readability (line breaks in function signatures)


@github-actions
Contributor

github-actions bot commented Jan 6, 2026

Documentation preview for 9c3284f is available at:


@chenmoneygithub chenmoneygithub marked this pull request as ready for review January 7, 2026 04:15
@github-actions
Contributor

github-actions bot commented Jan 7, 2026

@chenmoneygithub Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

@github-actions github-actions bot added area/prompts MLflow Prompt Registry and Optimization rn/feature Mention under Features in Changelogs. labels Jan 7, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



    )
    # Check if train_data is empty (for zero-shot optimization)
    if len(train_data) == 0:
        # Zero-shot mode: no training data provided
Contributor Author

Zero-shot is less useful on the SDK side, since people have easier ways to do zero-shot metaprompting/optimization, but this will be the backbone for the UI solution.

@chenmoneygithub
Contributor Author

@copilot redo the review from beginning, please cover all commits not just the commits since your last review.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.



@chenmoneygithub chenmoneygithub changed the title [WIP] Support metaprompting in mlflow.genai.optimize_prompts() Support metaprompting in mlflow.genai.optimize_prompts() Jan 8, 2026

Args:
reflection_model: Name of the model to use for prompt optimization.
Format: "<provider>:/<model>" (e.g., "openai:/gpt-4o",
Collaborator

can we use newer models?

Contributor Author

for sure, done!

registered regardless of performance improvement.

Args:
reflection_model: Name of the model to use for prompt optimization.
Collaborator

Shall we call it prompt_model or optimizer_model? Metaprompting does not reflect eval results.

Contributor Author

I also thought about this. Few-shot metaprompting does use some "reflection" while zero-shot does not, so prompt_model fits better here semantically. However, I also want to keep some consistency with the GepaPromptOptimizer so that users don't need to learn two concepts when picking up optimizers, so I decided to keep it as reflection_model. Please let me know if this makes sense, and I'm happy to make changes!

Collaborator

Since the algorithm is totally different, I think it's fine not to keep the same naming. Not a blocker though.

# Validate prompt names match
self._validate_prompt_names(target_prompts, improved_prompts)

# Validate template variables are preserved in improved prompts
Collaborator

ditto

Contributor Author

done!

"[python]": {
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.formatOnSave": true,
"editor.formatOnSave": false,
Member

Let's revert this. This is unrelated to this PR.

Member

btw why do we need this change?

@chenmoneygithub (Contributor Author) Jan 9, 2026

oh geez, I meant to edit it locally. command + shift + P put .vscode/settings.json as the first option.

@harupy (Member) left a comment

Left a few more comments, otherwise LGTM


Automatically detects optimization mode based on training data:
- Zero-shot: No evaluation data - applies general prompt engineering best practices
- Few-shot: Has evaluation data - learns from feedback on examples
@TomeHirata (Collaborator) Jan 9, 2026

Is feedback necessary, or can users pass just inputs/outputs?

Contributor Author

For this implementation, feedback is always present. But "feedback" is a bit misleading; it should be "evaluation results". Changed.

"""
_logger.info("Applying zero-shot prompt optimization with best practices")

# Build meta-prompt
Collaborator

nit: I think we don't need this comment

Contributor Author

done!


content = None # Initialize to avoid NameError in exception handler

with mlflow.start_span(name="metaprompt_reflection", span_type=SpanType.LLM) as span:
Collaborator

Do you think we should always enable tracing? Or should we conditionally enable it when enable_tracking=True?

Contributor Author

Good call. It makes sense to me to skip tracing the metaprompting call as well when enable_tracking=False. Changed!

@TomeHirata (Collaborator) left a comment

Left some comments, otherwise LGTM

@chenmoneygithub chenmoneygithub added this pull request to the merge queue Jan 12, 2026
Merged via the queue into mlflow:master with commit 07ee84f Jan 12, 2026
49 of 50 checks passed
@chenmoneygithub chenmoneygithub deleted the metaprompting branch January 12, 2026 21:20
debu-sinha pushed a commit to debu-sinha/mlflow that referenced this pull request Jan 15, 2026

Labels

area/prompts MLflow Prompt Registry and Optimization rn/feature Mention under Features in Changelogs. v3.9.0


4 participants