
[Online Scoring][6.3/x] Add exclusive job flag and trace scorer job definition#19714

Merged
dbczumar merged 135 commits into mlflow:master from dbczumar:stack/pr6c-exclusive-flag-job
Jan 9, 2026

Conversation


@dbczumar dbczumar commented Jan 1, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Introduce server job for online trace scoring

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@dbczumar force-pushed the stack/pr6c-exclusive-flag-job branch from 7decb5f to a1c1303 on January 1, 2026 at 18:39
@dbczumar force-pushed the stack/pr6c-exclusive-flag-job branch from a1c1303 to 2e6a8d7 on January 1, 2026 at 19:00
@dbczumar force-pushed the stack/pr6c-exclusive-flag-job branch 16 times, most recently from 7d932d7 to 85c26d4, on January 2, 2026 at 18:53
The second handler for IS NOT NULL (checking for separate NOT and NULL
tokens) is unreachable with sqlparse >= 0.5.3, which always parses
"NOT NULL" as a single keyword token. Testing confirms this pattern
holds across various whitespace configurations.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Add OnlineTraceScoringProcessor that orchestrates online scoring:
- Fetches traces since last checkpoint
- Applies sampling to select scorers per trace
- Runs scoring in parallel using ThreadPoolExecutor
- Logs assessments back to traces
- Updates checkpoint after processing

Signed-off-by: dbczumar <corey.zumar@databricks.com>
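Pieced together from the bullets above, the control flow might look like this minimal sketch (every name here is an illustrative stand-in, not MLflow's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_traces(fetch_traces_since, select_scorers, log_assessment,
                   save_checkpoint, checkpoint, max_workers=8):
    """Hypothetical outline of the OnlineTraceScoringProcessor steps."""
    traces = fetch_traces_since(checkpoint)                       # 1. fetch since last checkpoint
    work = [(t, s) for t in traces for s in select_scorers(t)]    # 2. sampling picks scorers per trace
    with ThreadPoolExecutor(max_workers=max_workers) as pool:     # 3. score in parallel
        futures = {pool.submit(scorer, trace): (trace, scorer)
                   for trace, scorer in work}
        for fut in futures:
            trace, _scorer = futures[fut]
            log_assessment(trace, fut.result())                   # 4. log assessments back
    if traces:
        save_checkpoint(max(t["timestamp_ms"] for t in traces))   # 5. advance the checkpoint
```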
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Verify that metadata.mlflow.sourceRun IS NULL filter is applied
to exclude traces generated during evaluation runs.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Move the EvalItem import inside _execute_scoring() to avoid pulling
in pandas at module load time. This allows OnlineTraceScoringProcessor
to be exported from __init__.py without breaking the skinny client.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Move evaluation module imports inside _execute_scoring method to
prevent pulling in pandas at module load time, which would break
the skinny client.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Update test patches to target mlflow.genai.evaluation.harness module
directly instead of trace_processor module, since imports are now lazy.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
When traces from multiple filters have overlapping timestamps, they are
sorted alphabetically by trace_id at each timestamp. Updated test to
expect correct interleaved ordering rather than sequential blocks.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
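A toy example of the ordering described above: ties on timestamp break alphabetically by trace_id, so overlapping ranges interleave rather than forming sequential blocks.

```python
# (trace_id, timestamp) pairs from two hypothetical filters
staging = [("tr-a", 100), ("tr-c", 200)]
prod = [("tr-b", 100), ("tr-d", 200)]

# Sort by (timestamp, trace_id): equal timestamps fall back to trace_id order.
merged = sorted(staging + prod, key=lambda t: (t[1], t[0]))
# -> [("tr-a", 100), ("tr-b", 100), ("tr-c", 200), ("tr-d", 200)]
```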
Calculate overlap pairs and checkpoint values based on MAX_TRACES_PER_JOB
constant rather than hardcoding values. The test now works correctly
regardless of the MAX_TRACES_PER_JOB value.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Replace hardcoded timestamp ranges with dynamic calculations:
- Staging filter starts at 0, covers MAX_TRACES_PER_JOB range
- Prod filter starts at 0.8*MAX_TRACES_PER_JOB, creating 20% overlap
- All expected values computed from these fractions

Test now works correctly regardless of MAX_TRACES_PER_JOB value.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Changed loop to iterate over tasks.items() instead of tasks.values()
to access trace_id for better debugging when a task has no trace.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Add exclusive flag support to the job decorator to prevent duplicate
job execution for the same parameters. Add run_online_trace_scorer_job
that uses OnlineTraceScoringProcessor to score individual traces.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
…ng job

Signed-off-by: dbczumar <corey.zumar@databricks.com>
…ent module

Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Add unit test for _compute_exclusive_lock_key in test_utils.py
- Add integration tests for exclusive job behavior in test_jobs.py

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Verify the job correctly instantiates OnlineTraceScoringProcessor
and calls process_traces.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
The custom JSON serializer was failing when Huey tried to serialize lock
values (plain strings) because it expected all data to be Message objects
with _asdict(). Updated serializer to handle both Message objects and
plain data, and improved deserializer to reconstruct Messages only when
appropriate.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
…lain data"

This reverts commit 11f8272.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
The JsonSerializer needs to handle two types of data:
1. Message objects (task data) with ._asdict() method
2. Plain data like lock values (e.g., '1' for exclusive locks)

Without this fix, exclusive jobs fail when Huey tries to serialize
lock values during lock acquisition.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
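A standalone sketch of the branching described above, with a namedtuple standing in for Huey's Message (the real serializer plugs into Huey; this only illustrates the two-case logic):

```python
import json
from collections import namedtuple

# Stand-in for huey's Message, which is namedtuple-based (hence ._asdict()).
Message = namedtuple("Message", ["id", "name", "args"])

class JsonSerializer:
    def serialize(self, data):
        if hasattr(data, "_asdict"):
            # Case 1: a Message-like task object
            payload = {"__message__": data._asdict()}
        else:
            # Case 2: plain data, e.g. the '1' value Huey stores for a lock
            payload = {"__plain__": data}
        return json.dumps(payload).encode()

    def deserialize(self, raw):
        payload = json.loads(raw.decode())
        if "__message__" in payload:
            # Reconstruct a Message only when the envelope says it was one
            return Message(**payload["__message__"])
        return payload["__plain__"]
```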
) as mock_create,
):
online_scorers = [make_online_scorer_dict(Completeness())]
run_online_trace_scorer_job(experiment_id="exp1", online_scorers=online_scorers)
Collaborator

Could we verify the exclusiveness on the scorer job directly as well, since it contains objects as params?

Collaborator Author

Totally - done!

Comment on lines +16 to +21
"name": scorer.name,
"experiment_id": "exp1",
"serialized_scorer": json.dumps(scorer.model_dump()),
"sample_rate": sample_rate,
"filter_string": None,
}
Collaborator

asdict(OnlineScorer) should convert to {"name": ..., "serialized_scorer": ..., "online_config": ...} instead?

Collaborator Author

Yeah, this works / is better! thanks! :D

)
def run_online_trace_scorer_job(
experiment_id: str,
online_scorers: list[dict[str, Any]],
Collaborator

Just to clarify, we expect the input to be list of OnlineScorer's dictionary format with asdict?

Collaborator Author

Absolutely (added some information about that to the param docstring)

Collaborator

@serena-ruan serena-ruan left a comment

Overall LGTM!

Allow the exclusive parameter in @job decorator to accept a list of
parameter names that determine exclusivity, rather than just a boolean.
This enables fine-grained control over which parameters are considered
when determining if a job should be exclusive.

- Change exclusive from bool to bool | list[str] in job decorator
- Update _compute_exclusive_lock_key to accept exclusive_params
- When exclusive is a list, only hash those specific parameters
- Update run_online_trace_scorer_job to use exclusive=['experiment_id']
- This prevents simultaneous scoring jobs for the same experiment while
  allowing different scorer configurations to queue properly
- Add comprehensive tests for parameter-specific exclusivity

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Add integration test that verifies run_online_trace_scorer_job uses
exclusive=["experiment_id"] correctly. Test submits two jobs with same
experiment_id but different scorers and verifies only one runs while the
other is canceled due to the exclusive lock.

Also simplify test helper by inlining OnlineScorer dict creation using
asdict() instead of maintaining a helper function.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Move duplicated test utilities (_setup_job_runner, wait_job_finalize, etc.)
from test_jobs.py and test_online_scoring_jobs.py into a shared helpers.py
module for better code reuse across the test suite.

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>

Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs.
