[ML-59678] Create new session-level built-in judge ConversationCompleteness #18967
model: str | None = None,
description: str | None = None,
feedback_value_type: Any = str,
generate_rationale_first: bool = False,
Adding support for rationale-first evaluation here. The ConversationCompleteness task benefits from some chain-of-thought, since we need to list the user requests and then check whether each one is addressed by the AI.
When testing label-first with gpt-4.1-mini, the model typically gets 4-5 of the 10 ground-truth incomplete conversations in the example test set wrong. In several cases the label was predicted wrong even though the rationale reached the correct "incomplete" answer (see the example attached below). After switching to rationale-first, gpt-4.1-mini gets only 1 of the 10 wrong. I think this is a significant enough boost to justify supporting rationale-first for this judge.
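The rationale-first flow could be parsed along these lines (a minimal sketch; the JSON schema and function name here are assumptions for illustration, not the actual judge implementation):

```python
import json


def parse_judge_output(raw: str) -> tuple[str, str]:
    """Parse a rationale-first judge response shaped like
    {"rationale": "...", "label": "complete" | "incomplete"}.

    Serializing the rationale field before the label means the model
    generates its reasoning before committing to a verdict.
    """
    payload = json.loads(raw)
    rationale = payload["rationale"]
    label = payload["label"]
    if label not in ("complete", "incomplete"):
        raise ValueError(f"unexpected label: {label!r}")
    return rationale, label


# Example: a rationale-first response from the judge model
raw = json.dumps({
    "rationale": "Request 2 (table structure) was never addressed.",
    "label": "incomplete",
})
rationale, label = parse_judge_output(raw)
print(label)  # incomplete
```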
Example where the rationale and the label prediction don't match:
################################################################################
MISCLASSIFICATION 4/4
################################################################################
Conversation: headcount_plan_edge_case
Type: bad
Ground Truth: incomplete
Prediction: complete
Number of Traces: 3
--------------------------------------------------------------------------------
CONVERSATION HISTORY:
--------------------------------------------------------------------------------
Turn 1:
👤 User: I need to present a headcount plan to leadership; can you write a short narrative summary and also suggest a simple table structure I can use to show headcount by team and quarter?
🤖 Bot: You can frame the narrative around growth, alignment to strategy, and efficiency. For example: "Next year's headcount plan focuses on strengthening our core product teams while maintaining a disciplined approach to support and overhead functions. We're increasing capacity in engineering and customer success to support our product roadmap and expansion targets, while holding G&A mostly flat. This plan balances our revenue growth goals with careful cost management, and preserves enough flexibility to respond to market changes." Keeping it to one paragraph like this helps leadership grasp the story quickly.
Turn 2:
👤 User: This narrative works well. Can you make it a bit more concise and keep the emphasis on growth in engineering and customer success?
🤖 Bot: Here's a more concise version: "Next year's headcount plan increases capacity in engineering and customer success to support our product roadmap and expansion targets, while keeping G&A roughly flat. This approach focuses investment on revenue-impacting teams, maintains operational support, and preserves flexibility to adjust if market conditions change." This keeps the emphasis on where you're growing and why, in a form that's easy to drop into a slide or document.
Turn 3:
👤 User: Great, I'll paste this into my deck and finalize the rest.
🤖 Bot: Sounds good. If you later want help refining talking points for the leadership meeting or drafting speaker notes, I can help with that too.
--------------------------------------------------------------------------------
JUDGE'S RATIONALE:
--------------------------------------------------------------------------------
The user made three explicit requests: 1) write a short narrative summary for a headcount plan presentation, 2) suggest a simple table structure to show headcount by team and quarter, and 3) make a more concise version of the narrative focusing on growth in engineering and customer success. The assistant addressed the first request in the first response by providing a sample narrative. The second request, suggesting a simple table structure, is not addressed in the conversation; the user only acknowledges the narrative and requests a concise version, not the table. However, given the instruction to enumerate explicit user requests, the user only explicitly asked in the first message for both summary and table structure; the assistant responded only with the summary. The user then only requested a concise version of the narrative. The assistant delivered this in the next reply. Since the initial request also included the table structure and the assistant never responded to that, that request remains unaddressed. Therefore, because the assistant did not provide the suggested table structure anywhere, the conversation is incomplete.
Is this change related to this PR? If not can we split it into a different PR?
Yes, as my comment above suggests, without this change the new judge won't be able to reach the bar for weaker models.
smoorjani left a comment
Looks great! Left a few comments/questions to address
@@ -0,0 +1,46 @@
# NB: User-facing name for the completeness assessment.
COMPLETENESS_ASSESSMENT_NAME = "completeness"
is this meant to be in this PR? seems like this is old/for trace-level
+1, let's make sure only relevant changes are in this PR
yes, seems to be a git stack issue. Let me fix this
CONVERSATION_COMPLETENESS_PROMPT = """\
Consider the following conversation history between a user and an assistant.
Your task is to output exactly one label: "complete" or "incomplete" based on the criteria below.
should this map to yes/no, similar to the other built-in judges?
First, list all explicit user requests made throughout the conversation in the rationale section.
Second, for each request, determine whether it was addressed by the assistant by the end of the conversation,\
and **quote** the assistant's explicit response in the rationale section if you judge the request as addressed.
does quoting lead to a large number of output tokens? this may become expensive
Yes, potentially. I'm adding this because weaker models tend to infer and imagine responses. So it's a tradeoff between judge accuracy and cost; let me know if you'd prefer not to add this.
Not sure whether we have existing telemetry to track evaluation output tokens. I think that could be quite helpful to have.
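In the meantime, output-token cost can be tracked on the caller side. A minimal sketch of an accumulator, assuming each LLM response carries an OpenAI-style `usage` dict; the class name and wiring are illustrative, not existing MLflow telemetry:

```python
from dataclasses import dataclass


@dataclass
class JudgeUsageTracker:
    """Accumulate token usage across judge calls in an evaluation run."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    calls: int = 0

    def record(self, usage: dict) -> None:
        # `usage` is an OpenAI-style dict, e.g. response["usage"].
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.calls += 1

    @property
    def avg_completion_tokens(self) -> float:
        return self.completion_tokens / self.calls if self.calls else 0.0


tracker = JudgeUsageTracker()
tracker.record({"prompt_tokens": 812, "completion_tokens": 240})
tracker.record({"prompt_tokens": 790, "completion_tokens": 305})
print(tracker.completion_tokens)  # 545
```

Comparing `avg_completion_tokens` with and without the quoting instruction would quantify the accuracy-vs-cost tradeoff directly.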
Documentation preview for e2176b6 is available.
Commits (Signed-off-by: Xiang Shen <xshen.shc@gmail.com>; messages truncated):
- … align API and resolved a few nits
- …teness
- … support rationale first
- … lign API and address nits
🥞 Stacked PR
Use this link to review incremental changes.
Related Issues/PRs
What changes are proposed in this pull request?
This PR creates a new session-level built-in judge, ConversationCompleteness. Users can instantiate a ConversationCompleteness judge in one line, just like the existing built-in judges, and can invoke it directly (or, soon, pass it into genai.evaluation) to evaluate whether the AI assistant addresses all user questions and requests throughout the entire conversation.
How is this PR tested?
Manual Testing
Manually tested with the following synthetic conversations:
- openai:/gpt-5 classifies all of the conversations correctly based on the ConversationCompleteness judge prompt.
- gpt-4.1-mini classifies 19/20 of the conversations correctly when generating the rationale before the label.

Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.