Prompt Optimization backend PR 3: Add Get, Search, and Delete prompt optimization job APIs (#20197)
Conversation
Pull request overview
This PR adds backend APIs for prompt optimization job management including Get, Search, Delete, and Cancel operations. It introduces new proto definitions for jobs and prompt optimization, handler implementations for all endpoints, and comprehensive unit tests.
Changes:
- New proto definitions for JobStatus, JobState, and PromptOptimizationJob messages
- Handler implementations for get, search, delete, and cancel prompt optimization job endpoints
- Unit tests covering pending, succeeded, failed, and edge case scenarios
- Support utilities including job error logging and prompt optimization run tagging
Reviewed changes
Copilot reviewed 16 out of 19 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| mlflow/protos/jobs.proto | Defines generic JobStatus enum and JobState message for all job types |
| mlflow/protos/prompt_optimization.proto | Defines PromptOptimizationJob message and OptimizerType enum |
| mlflow/protos/service.proto | Adds 5 new RPC endpoints for prompt optimization job operations |
| mlflow/server/handlers.py | Implements handlers for get, search, delete, cancel, and create operations |
| tests/server/test_handlers.py | Adds 11 unit tests covering various job scenarios and edge cases |
| mlflow/utils/mlflow_tags.py | Adds MLFLOW_RUN_IS_PROMPT_OPTIMIZATION tag |
| mlflow/server/jobs/utils.py | Adds error logging and disables job re-enqueueing temporarily |
| mlflow/server/jobs/_job_subproc_entry.py | Adds detailed error logging with traceback |
| mlflow/genai/optimize/util.py | Tags optimization runs for UI filtering |
| dev/generate_protos.py | Updates proto generation to include new proto files |
Comments suppressed due to low confidence (1)
mlflow/server/jobs/utils.py:567
- The function has unreachable code after the early return statement. All code from line 545 onwards will never execute. If the TODO is temporary, consider using a feature flag or configuration instead of commenting out the code with an early return. If the code should remain disabled, remove the unreachable code below the return statement.
```python
def _enqueue_unfinished_jobs(server_launching_timestamp: int) -> None:
    # TODO: Job re-enqueueing is temporarily disabled. The current implementation
    # has issues with job state management that can cause duplicate execution.
    # This will be re-enabled once the job persistence layer is stabilized.
    return None

    from mlflow.server.handlers import _get_job_store

    job_store = _get_job_store()
    unfinished_jobs = job_store.list_jobs(
        statuses=[JobStatus.PENDING, JobStatus.RUNNING],
        # Filter out jobs created after the server was launched.
        end_timestamp=server_launching_timestamp,
    )
    for job in unfinished_jobs:
        if job.status == JobStatus.RUNNING:
            job_store.reset_job(job.job_id)  # reset the job status to PENDING
        params = json.loads(job.params)
        timeout = job.timeout
        # Look up the exclusive flag from the function metadata.
        fn_fullname = get_job_fn_fullname(job.job_name)
        fn_metadata = _load_function(fn_fullname)._job_fn_metadata
        _get_or_init_huey_instance(job.job_name).submit_task(
            job.job_id, job.job_name, params, timeout, fn_metadata.exclusive
        )
```
This PR adds three new backend APIs for the prompt optimization feature:
- getPromptOptimizationJob: Retrieve details of a single optimization job
- searchPromptOptimizationJobs: List all optimization jobs for an experiment
- deletePromptOptimizationJob: Delete an optimization job and its associated run

Also includes:
- New proto definitions for jobs and prompt optimization messages
- Handler implementations for all three APIs
- Unit tests for all new endpoints
- Fix: Remove invalid target_prompt_uri field reference in handler

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
mlflow/server/handlers.py (outdated)
```python
job_store.delete_jobs(job_ids=[job_id])

# Delete the associated MLflow run if it exists.
# Ignore errors (e.g., run already deleted) to ensure job deletion succeeds.
```
why do we return success response even when the deletion actually failed?
Yes, it's a bit confusing: there is a chance that users delete the MLflow run associated with the optimization job from the UI/client, and then we want DeletePromptOptimizationJob to just delete the job entity and skip the run. On second thought, it is better to skip deleting the run if the run doesn't exist.
Changed!
```python
elif metric_name.startswith("final_eval_score."):
    scorer_name = metric_name[len("final_eval_score.") :]
    optimization_job.final_eval_scores[scorer_name] = metric_value
```
Q: Do we have a separate PR for adding APIs for intermediate results?
This is actually tricky. My current plan is to not include intermediate results in the GetPromptOptimizationJob response, because they only exist in the GEPA optimization workflow, and we want some flexibility to cache earlier-fetched intermediate prompts/evaluation results for performance. Here is the workflow I have in mind:
- When users open the prompt optimization detail view for a given optimization job, the UI calls GetPromptOptimizationJob and fetches the associated run_id.
- The UI pulls intermediate prompts/evaluation results via the artifact API.
- The UI caches earlier-fetched prompts/evaluation results to avoid duplicate retrieval.
Let me know if this makes sense to you!
> UI pulls intermediate prompts/evaluation results via the artifact API

Does this mean you use the artifact API directly instead of adding a separate RPC method for fetching intermediate prompts?
Yes, I am thinking about letting the frontend decide what to pull and what to cache, for flexibility.
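The client-side caching discussed above could be sketched like this. The fetch function is injected so the frontend can back it with the artifact API (e.g. `mlflow.artifacts.download_artifacts`); the class name and artifact paths are illustrative, not part of the PR:

```python
class IntermediateResultCache:
    """Caches intermediate prompts/eval results per (run_id, artifact_path)."""

    def __init__(self, fetch_fn):
        # fetch_fn(run_id, artifact_path) -> result; e.g. a wrapper around
        # the MLflow artifact API.
        self._fetch_fn = fetch_fn
        self._cache = {}

    def get(self, run_id, artifact_path):
        key = (run_id, artifact_path)
        if key not in self._cache:
            # Only hit the artifact API for results we have not fetched yet.
            self._cache[key] = self._fetch_fn(run_id, artifact_path)
        return self._cache[key]
```

This keeps the backend response small while letting the UI avoid duplicate retrieval across re-renders.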
```python
    optimization_job.initial_eval_scores["aggregate"] = metric_value
elif metric_name == "final_eval_score":
    optimization_job.final_eval_scores["aggregate"] = metric_value
elif metric_name.startswith("initial_eval_score."):
```
As I commented on another PR, the response of a Scorer is not always numerical; we may need to update the type in the proto to allow strings or other response types supported by Scorer.
This is a bit complex, let's discuss it. Originally I was thinking that we should only allow numeric scorers, since non-numeric scorers are not valid for the DSPy/GEPA optimizers. There is one situation where a non-numeric scorer can add value IIUC, which is few-shot metaprompting. But even for few-shot metaprompting, the initial and final scores run into trouble because we don't have a reliable way to aggregate non-numeric scorer outputs over the validation dataset. With that, I am thinking that for prompt optimization jobs we only allow scorers that output numeric values, for simplicity. Otherwise, users may hit this situation:
- They quickly try metaprompting in the optimization UI, with some non-numeric scorers selected.
- They then want to try GEPA, but hit an error that the non-numeric scorer cannot be used.
In that situation we are exposing unnecessary internal logic to the user.
This may be too verbose, please let me know if this explanation makes sense!
I think we should at least support the conversion from YES/NO to 1/0, since all the built-in scorers currently return "YES" or "NO" (CategoricalRating). This aligns with what we do in the optimize_prompts method (please see the create_metric_from_scorers method). We can reject other types (e.g., dict[str, str], list[str]) in the initial version.
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Based on #20115; will send out for review after #20115 is merged.
This PR adds three new backend APIs for the prompt optimization feature:
To test out the PR, you can use the following script, which has some incremental changes based on #20115:
Also includes:
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.