[ML-58760] Introduce Summarization Builtin Judge#19225
Conversation
```python
UserFrustration(),
ConversationCompleteness(),
Completeness(),
Summarization(),
```
I'm not sure whether this should be added to the get_all_scorer list here, because unlike the other generic builtin judges, Summarization's input and output are very specific (document and summary). It might be confusing if users who currently evaluate with get_all_scorer now see the Summarization judge fail on all of their generic agent requests/responses.
Agreed - let's not add it in here.
```python
from mlflow.genai.scorers import Summarization

assessment = Summarization(name="my_summarization_check")(
    inputs={"text": "MLflow is an open-source platform for managing ML workflows..."},
```
qq: Is there a requirement on which keys we allow in the inputs?
smoorjani
left a comment
LGTM!
I think there's a broader question about the precedent we're setting with this metric: we generally think of our scorers as orthogonal, but this one mixes several metrics. One alternative is to introduce separate scorers for each of the factors (e.g., faithfulness/groundedness, conciseness, coverage) and then a function like get_summarization_scorers(); the open question would be how to aggregate those scores. However, since this is an experimental API, we don't need to block on this.
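The per-factor alternative mentioned above could be sketched roughly as follows. All names here (`FactorScorer`, `get_summarization_scorers`, `aggregate_verdicts`) are hypothetical, and the trivial heuristics stand in for the LLM judges only so the sketch is runnable; this is not MLflow's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FactorScorer:
    """One orthogonal summarization factor (hypothetical type)."""
    name: str
    judge: Callable[[str, str], bool]  # (document, summary) -> pass/fail

def get_summarization_scorers() -> list[FactorScorer]:
    # In practice each factor would be backed by its own LLM judge;
    # trivial string heuristics stand in here for illustration.
    return [
        FactorScorer("faithfulness", lambda doc, s: all(w in doc for w in s.split())),
        FactorScorer("conciseness", lambda doc, s: len(s) <= len(doc)),
        FactorScorer("coverage", lambda doc, s: bool(s)),
    ]

def aggregate_verdicts(doc: str, summary: str) -> bool:
    # One possible aggregation: every factor must pass ("all-of" semantics).
    # A weighted mean over factor scores would be another option.
    return all(sc.judge(doc, summary) for sc in get_summarization_scorers())
```

The aggregation policy is the crux of the question raised above: "all-of" gives a strict pass/fail, while a mean over factor scores would preserve partial credit.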
```
Summarization evaluates whether a summarization output is factually correct, grounded in
the input, and does not make any assumptions not present in the input.

This scorer focuses on three key aspects:
```
nit: let's not discuss implementation details in the docstring, as these can change and we may forget to update it.
Force-pushed a06ea21 to 4523e9c (compare)
Rebased and addressed @smoorjani's comments.
Force-pushed d4871a7 to 6941d5e (compare)
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Force-pushed 6941d5e to 2892bf6 (compare)
Documentation preview for 2892bf6 is available at: More info
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Adds a new single-turn built-in judge that measures the quality of a summary of a document. The new judge checks the following aspects of the summarization: faithfulness, coverage, conciseness, and coherence.
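As a rough sketch of the intended shape (based on the snippet in the review diff above, a `"text"` input producing a yes/no assessment), a stand-in judge might look like this. The stub below is purely illustrative: the real builtin judge calls an LLM to score the four factors, and the exact field names are assumptions, not MLflow's implementation.

```python
def summarization_judge(inputs: dict) -> dict:
    """Illustrative stub showing the expected input/output shape of a
    single-turn summarization judge (not the real implementation)."""
    text = inputs["text"]
    summary = inputs.get("summary", "")
    # The real judge evaluates faithfulness, coverage, conciseness and
    # coherence via an LLM prompt; a trivial length check stands in here.
    passed = bool(summary) and len(summary) < len(text)
    return {
        "name": "summarization",
        "value": "yes" if passed else "no",
        "rationale": "stub verdict; a real judge would explain each factor",
    }
```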
How is this PR tested?
Manual Testing
Tested with a small set of 20 document/summary pairs:
https://e2-dogfood.staging.cloud.databricks.com/editor/notebooks/659976464234660?o=6051921418418893
The default model achieved the following metrics:
Accuracy: 0.8889
Precision: 0.8182
Recall: 1.0000
F1 Score: 0.9000
Tested with a competitor's prompt using the same model:
Accuracy: 0.6500
Precision: 0.6000
Recall: 0.9000
F1 Score: 0.7200
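For reference, the four metrics above follow from the confusion counts of the judge's pass/fail verdicts in the standard way. The counts in the test below are back-derived to reproduce the first set of reported metrics; the notebook's actual counts are not stated in this PR.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics for judge evaluation."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```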
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
"Yes" should be selected for bug fixes, documentation updates, and other small changes. "No" should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.