[Online Scoring][6.3/x] Add exclusive job flag and trace scorer job definition#19714
Merged
dbczumar merged 135 commits into mlflow:master on Jan 9, 2026
Conversation
This was referenced Jan 1, 2026
Merged
The second handler for IS NOT NULL (checking for separate NOT and NULL tokens) is unreachable with sqlparse >= 0.5.3, which always parses "NOT NULL" as a single keyword token. Testing confirms this pattern holds across various whitespace configurations. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Add OnlineTraceScoringProcessor that orchestrates online scoring:
- Fetches traces since the last checkpoint
- Applies sampling to select scorers per trace
- Runs scoring in parallel using ThreadPoolExecutor
- Logs assessments back to traces
- Updates the checkpoint after processing

Signed-off-by: dbczumar <corey.zumar@databricks.com>
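The five steps in the commit message above can be sketched as a single processing loop. This is a hypothetical illustration, not the actual `OnlineTraceScoringProcessor` implementation: the `fetch_traces`, `log_assessment`, and `checkpoint` arguments are stand-ins for the real MLflow client calls, and traces are modeled as plain dicts.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def process_traces(fetch_traces, scorers, sample_rates, log_assessment, checkpoint):
    """Hypothetical sketch of the online scoring loop described above."""
    # 1. Fetch traces created since the last checkpoint.
    traces = fetch_traces(since=checkpoint["last_timestamp"])
    # 2. Sample which scorers apply to each trace.
    work = [
        (trace, scorer)
        for trace in traces
        for scorer in scorers
        if random.random() < sample_rates[scorer]
    ]
    # 3. Score (trace, scorer) pairs in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda pair: (pair[0], pair[1], pair[1](pair[0])), work))
    # 4. Log assessments back to the traces.
    for trace, scorer, value in results:
        log_assessment(trace, scorer, value)
    # 5. Advance the checkpoint only after every fetched trace is processed,
    #    so a crash mid-batch re-scores rather than silently skips traces.
    if traces:
        checkpoint["last_timestamp"] = max(t["timestamp"] for t in traces)
    return results
```

Updating the checkpoint last is the design choice that makes the loop at-least-once rather than at-most-once.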
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Verify that metadata.mlflow.sourceRun IS NULL filter is applied to exclude traces generated during evaluation runs. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Move the EvalItem import inside _execute_scoring() to avoid pulling in pandas at module load time. This allows OnlineTraceScoringProcessor to be exported from __init__.py without breaking the skinny client. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Move evaluation module imports inside _execute_scoring method to prevent pulling in pandas at module load time, which would break the skinny client. Signed-off-by: dbczumar <corey.zumar@databricks.com>
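The lazy-import pattern described in the two commits above can be sketched as follows. This is an illustrative stand-in, not MLflow's actual code: `statistics` plays the role of the heavy dependency (pandas in the real change), and the class body is invented for the example.

```python
class OnlineTraceScoringProcessor:
    """Sketch: heavy dependencies are imported inside the method, not at module top."""

    def _execute_scoring(self, traces):
        # The heavy import lives inside the method so that merely importing
        # this module (e.g. from the skinny client's __init__.py) never
        # triggers it; the cost is paid only when scoring actually runs.
        # Here `statistics` stands in for pandas.
        import statistics
        return statistics.mean(t["score"] for t in traces)
```

Because the import is deferred, test patches must also target the module where the name is ultimately resolved, which is why the follow-up commit retargets the patches.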
Update test patches to target mlflow.genai.evaluation.harness module directly instead of trace_processor module, since imports are now lazy. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
When traces from multiple filters have overlapping timestamps, they are sorted alphabetically by trace_id at each timestamp. Updated test to expect correct interleaved ordering rather than sequential blocks. Signed-off-by: dbczumar <corey.zumar@databricks.com>
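The interleaved ordering described above comes down to sorting on a `(timestamp, trace_id)` key after merging filter results. A minimal sketch, assuming traces are dicts and duplicates across filters are deduplicated by `trace_id` (both assumptions for illustration):

```python
def merge_traces(*filter_results):
    # Traces from all filters are merged and ordered by timestamp; ties are
    # broken alphabetically by trace_id, so traces from different filters
    # interleave rather than appearing in sequential per-filter blocks.
    seen = {}
    for traces in filter_results:
        for trace in traces:
            seen[trace["trace_id"]] = trace  # dedupe overlapping results
    return sorted(seen.values(), key=lambda t: (t["timestamp"], t["trace_id"]))
```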
Calculate overlap pairs and checkpoint values based on MAX_TRACES_PER_JOB constant rather than hardcoding values. The test now works correctly regardless of the MAX_TRACES_PER_JOB value. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Replace hardcoded timestamp ranges with dynamic calculations:
- Staging filter starts at 0 and covers the MAX_TRACES_PER_JOB range
- Prod filter starts at 0.8 * MAX_TRACES_PER_JOB, creating a 20% overlap
- All expected values are computed from these fractions

The test now works correctly regardless of the MAX_TRACES_PER_JOB value.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Changed loop to iterate over tasks.items() instead of tasks.values() to access trace_id for better debugging when a task has no trace. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Add exclusive flag support to the job decorator to prevent duplicate job execution for the same parameters. Add run_online_trace_scorer_job that uses OnlineTraceScoringProcessor to score individual traces. Signed-off-by: dbczumar <corey.zumar@databricks.com>
…ng job Signed-off-by: dbczumar <corey.zumar@databricks.com>
…ent module Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Add unit test for _compute_exclusive_lock_key in test_utils.py
- Add integration tests for exclusive job behavior in test_jobs.py

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Verify the job correctly instantiates OnlineTraceScoringProcessor and calls process_traces. Signed-off-by: dbczumar <corey.zumar@databricks.com>
The custom JSON serializer was failing when Huey tried to serialize lock values (plain strings) because it expected all data to be Message objects with _asdict(). Updated serializer to handle both Message objects and plain data, and improved deserializer to reconstruct Messages only when appropriate. Signed-off-by: dbczumar <corey.zumar@databricks.com>
…lain data" This reverts commit 11f8272. Signed-off-by: dbczumar <corey.zumar@databricks.com>
The JsonSerializer needs to handle two types of data:
1. Message objects (task data) with a ._asdict() method
2. Plain data like lock values (e.g., '1' for exclusive locks)

Without this fix, exclusive jobs fail when Huey tries to serialize lock values during lock acquisition.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
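The two-branch handling described above can be sketched as below. This is a simplified illustration of the idea, not MLflow's serializer: the `Message` class is a stand-in for Huey's task message object (reduced to the `._asdict()` method the fix relies on), and the `serialize`/`deserialize` method names follow Huey's serializer convention but the wrapper format is invented here.

```python
import json

class Message:
    """Stand-in for Huey's task message: any object exposing ._asdict()."""
    def __init__(self, **fields):
        self._fields = fields
    def _asdict(self):
        return dict(self._fields)

class JsonSerializer:
    # Task payloads are Message objects; lock values (e.g. "1") are plain data.
    # Tagging the payload lets deserialize() reconstruct a Message only when
    # the original input actually was one.
    def serialize(self, data):
        if hasattr(data, "_asdict"):
            payload = {"__message__": True, "data": data._asdict()}
        else:
            payload = {"__message__": False, "data": data}
        return json.dumps(payload).encode("utf-8")

    def deserialize(self, raw):
        payload = json.loads(raw.decode("utf-8"))
        if payload["__message__"]:
            return Message(**payload["data"])
        return payload["data"]
```

The earlier, reverted version assumed every input had `._asdict()`, which is exactly what broke when Huey passed a bare lock-value string through the same serializer.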
serena-ruan
reviewed
Jan 9, 2026
    ) as mock_create,
):
    online_scorers = [make_online_scorer_dict(Completeness())]
    run_online_trace_scorer_job(experiment_id="exp1", online_scorers=online_scorers)
Collaborator
Could we verify the exclusiveness on the scorer job directly as well, since it contains objects as params?
serena-ruan
reviewed
Jan 9, 2026
Comment on lines +16 to +21
| "name": scorer.name, | ||
| "experiment_id": "exp1", | ||
| "serialized_scorer": json.dumps(scorer.model_dump()), | ||
| "sample_rate": sample_rate, | ||
| "filter_string": None, | ||
| } |
Collaborator
asdict(OnlineScorer) should convert to {"name": ..., "serialized_scorer": ..., "online_config": ...} instead?
Collaborator
Author
Yeah, this works / is better! thanks! :D
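The suggestion accepted in this thread — deriving the job parameters via `asdict` instead of hand-building the dict — can be sketched as below. The `OnlineScorer` dataclass shape here is hypothetical, inferred from the field names mentioned in the review comment; the real class lives in MLflow and may differ.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class OnlineScorer:
    """Hypothetical shape of the OnlineScorer record discussed above."""
    name: str
    serialized_scorer: str
    online_config: dict

scorer = OnlineScorer(
    name="completeness",
    serialized_scorer=json.dumps({"type": "Completeness"}),
    online_config={"sample_rate": 0.5, "filter_string": None},
)
# asdict() yields exactly the dataclass's fields, so the test helper no
# longer has to spell out (and keep in sync) the dictionary by hand.
record = asdict(scorer)
```

Any field later added to `OnlineScorer` flows through automatically, which is why the reviewer's version is less brittle than the literal dict.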
serena-ruan
reviewed
Jan 9, 2026
)
def run_online_trace_scorer_job(
    experiment_id: str,
    online_scorers: list[dict[str, Any]],
Collaborator
Just to clarify, we expect the input to be a list of OnlineScorer dictionaries produced with asdict?
Collaborator
Author
Absolutely (added some information about that to the param docstring)
Allow the exclusive parameter in the @job decorator to accept a list of parameter names that determine exclusivity, rather than just a boolean. This enables fine-grained control over which parameters are considered when determining if a job should be exclusive.

- Change exclusive from bool to bool | list[str] in the job decorator
- Update _compute_exclusive_lock_key to accept exclusive_params
- When exclusive is a list, only hash those specific parameters
- Update run_online_trace_scorer_job to use exclusive=['experiment_id']
- This prevents simultaneous scoring jobs for the same experiment while allowing different scorer configurations to queue properly
- Add comprehensive tests for parameter-specific exclusivity

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Add integration test that verifies run_online_trace_scorer_job uses exclusive=["experiment_id"] correctly. Test submits two jobs with same experiment_id but different scorers and verifies only one runs while the other is canceled due to the exclusive lock. Also simplify test helper by inlining OnlineScorer dict creation using asdict() instead of maintaining a helper function. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Move duplicated test utilities (_setup_job_runner, wait_job_finalize, etc.) from test_jobs.py and test_online_scoring_jobs.py into a shared helpers.py module for better code reuse across the test suite. Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
🥞 Stacked PR
Use this link to review incremental changes.
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Introduce server job for online trace scoring
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.