[ML-59305] Create new session-level builtin judge UserFrustration #18966
xsh310 merged 2 commits into mlflow:master
def _get_judge(self) -> InstructionsJudge:
    """Return a session-level InstructionsJudge with the user frustration prompt."""
    if self._judge is None:
        self._judge = InstructionsJudge(
I'm instantiating an InstructionsJudge here directly instead of calling make_judge to avoid logging a make_judge event. This should be fine since make_judge is only a thin wrapper around InstructionsJudge with some output type validation checks.
can we check if this still works with the align API?
@smoorjani
I think the align API works for the new single-turn Completeness judge. But for multi-turn, I think we will need some API changes to align as well. Currently it only takes a list of Trace:
def align(self, judge: Judge, traces: list[Trace]) -> Judge:
We probably need a way to provide a list of lists of Trace for multi-turn.
I see, can we explicitly disable it? Just so users don't get confused or waste resources thinking it's possible
Updated the PR to throw NotImplementedError explicitly
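The two decisions in this thread (building the underlying judge directly and lazily, and rejecting alignment for session-level judges) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `StubInstructionsJudge`, `SessionJudgeSketch`, the `align` signature, and the error message are all hypothetical stand-ins, and no MLflow classes are imported.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubInstructionsJudge:
    """Hypothetical stand-in for InstructionsJudge (not the MLflow class)."""

    name: str
    instructions: str


class SessionJudgeSketch:
    """Hypothetical session-level judge wrapper used only for illustration."""

    def __init__(self, name: str, instructions: str) -> None:
        self.name = name
        self.instructions = instructions
        # The underlying judge is created on first use, not at construction time.
        self._judge: Optional[StubInstructionsJudge] = None

    def _get_judge(self) -> StubInstructionsJudge:
        # Build the judge directly (no factory call) and cache it for reuse.
        if self._judge is None:
            self._judge = StubInstructionsJudge(
                name=self.name, instructions=self.instructions
            )
        return self._judge

    def align(self, traces: list) -> "SessionJudgeSketch":
        # A flat list of traces cannot describe several multi-turn sessions,
        # so fail loudly instead of silently doing the wrong thing.
        raise NotImplementedError(
            f"align() is not supported for session-level judge '{self.name}'."
        )
```

Raising immediately keeps users from spending time or tokens on an alignment run that cannot work with the current single-list signature.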
- "Just answer directly."
- "Keep it simple."

If the user displays at least two of these signals, and they were triggered by AI errors -> frustrated.
Why do we require at least 2 of these signals, rather than just 1?
Count as frustration if you see two or more of the following AI-caused signals:
- Repeated Corrections
  - User must correct the AI's mistake or misunderstanding more than once.
  - ("That's not what I asked," "No, I meant X," "Not that, the other thing.")
The format of examples in each of these signals is a little different, can we be consistent to avoid confusion?
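For readers following the threshold discussion, the "two or more AI-caused signals" rule from the prompt excerpt above can be written out as a tiny predicate. This is only an illustration of the rule as worded in that draft: the actual judge is an LLM following the prompt, and the function name, the `threshold` parameter, and every signal label other than repeated corrections are hypothetical.

```python
def is_frustrated(ai_caused_signals: set[str], threshold: int = 2) -> bool:
    """Return True when at least `threshold` distinct AI-caused signals were seen.

    `ai_caused_signals` should contain only signals attributed to AI errors;
    signals with other causes are assumed to be filtered out beforehand.
    """
    return len(ai_caused_signals) >= threshold


# Hypothetical labels: only "repeated_corrections" is named in the excerpt.
observed = {"repeated_corrections", "demands_brevity"}
print(is_frustrated(observed))
```

Making the threshold a parameter shows what the reviewer's question amounts to: with `threshold=1`, a single signal would already count as frustration.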
Also, for testing, I would recommend using a smaller, less powerful model. gpt-5 is likely to match our expectations, but it is simply too expensive to run as a judge. Models like gpt-4.1-mini or gpt-5-nano are more suitable.
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
cc26d9c to f003910
Updated the PR to output 3 classes and make the prompt more concise. Tested with gpt-4.1-mini, which reached 97% overall accuracy on my example test set.
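An overall-accuracy figure like the one quoted above is typically the fraction of test conversations whose predicted class matches the expected one. A minimal sketch, assuming plain string labels; the function name and the label values are placeholders, not the judge's actual class names or the author's test harness:

```python
def overall_accuracy(expected: list[str], predicted: list[str]) -> float:
    """Fraction of examples where the predicted label equals the expected label."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


# Placeholder labels standing in for the judge's three output classes.
expected = ["class_a", "class_b", "class_c", "class_a"]
predicted = ["class_a", "class_b", "class_c", "class_b"]
print(overall_accuracy(expected, predicted))  # 0.75
```

With a three-class output, a per-class breakdown alongside the overall number can help show which class a smaller model tends to confuse.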
smoorjani left a comment
left some small nits/comments, but looks great! thanks for iterating!
    *,
    session: list[Trace] | None = None,
) -> Feedback:
    """
nit/dumb q: do we need this docstring if we have the class-level one?
serena-ruan left a comment
LGTM! https://github.com/mlflow/mlflow/pull/18966/files#r2559136130 is not blocking
e5023b2 to 8a1fb91
… align API and resolved a few nits Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
8a1fb91 to 0169f8d
What changes are proposed in this pull request?
This PR creates a new session-level built-in judge, UserFrustration. Users can instantiate a UserFrustration judge in one line, just like the existing built-in judges, and can invoke it directly (or, soon, pass it into genai.evaluation) to evaluate whether the user is frustrated with the AI assistant given a conversation history. The judge returns one of the following:
Note that this built-in judge differs from the previous built-in judges in the following aspects:
How is this PR tested?
Manual Testing
Manually tested with the following small synthetic conversations. openai:/gpt-5 classifies all of the conversations correctly based on the judge prompt, and openai:/gpt4.1-mini reaches an overall accuracy of 97%:
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Adds a new built-in LLM-as-a-judge that users can instantiate and invoke to evaluate whether the user is frustrated by the AI assistant, given a conversation history.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes
Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.
What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.