Introduce conversational guidelines scorer (#19729)
Conversation
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Documentation preview for e84cc1b is available at: Changed Pages (2)
```python
guidelines = self.guidelines
if isinstance(guidelines, str):
    guidelines = [guidelines]
formatted_guidelines = "\n".join(f"<guideline>{g}</guideline>" for g in guidelines)
return CONVERSATIONAL_GUIDELINES_PROMPT.replace("{{ guidelines }}", formatted_guidelines)
```
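The normalization in that snippet can be exercised on its own. A minimal sketch, assuming a simple stand-in template for `CONVERSATIONAL_GUIDELINES_PROMPT` (the real prompt constant lives in the MLflow source and is longer):

```python
# Sketch of the str -> list normalization and <guideline> tag formatting
# from the diff above. The prompt template is a stand-in, not the real one.
CONVERSATIONAL_GUIDELINES_PROMPT = "Guidelines:\n{{ guidelines }}"

def format_guidelines_prompt(guidelines):
    # A single string is wrapped into a one-element list before formatting.
    if isinstance(guidelines, str):
        guidelines = [guidelines]
    formatted = "\n".join(f"<guideline>{g}</guideline>" for g in guidelines)
    return CONVERSATIONAL_GUIDELINES_PROMPT.replace("{{ guidelines }}", formatted)

print(format_guidelines_prompt("Be polite"))
# Guidelines:
# <guideline>Be polite</guideline>
```

Both a bare string and a list of strings produce the same tagged layout, which is why the scorer can accept either type.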
will this throw if guidelines is not a string or a list of strings?
Pydantic will throw an error in this case when initializing the scorer:

```
scorer = ConversationalGuidelines(
  File "/Users/samraj.moorjani/personal_repos/gwt-mlflow-ml-60372-sdk/.venv/lib/python3.10/site-packages/pydantic/main.py", line 250, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ConversationalGuidelines
guidelines.str
  Input should be a valid string [type=string_type, input_value=0.1234, input_type=float]
    For further information visit https://errors.pydantic.dev/2.12/v/string_type
guidelines.list[str]
  Input should be a valid list [type=list_type, input_value=0.1234, input_type=float]
    For further information visit https://errors.pydantic.dev/2.12/v/list_type
```
```
Evaluation criteria:
- Assess whether EVERY assistant response in the conversation follows ALL the provided guidelines.
- Focus only on the assistant's responses, not the user's messages.
```
qq: is it possible for the user to provide a guideline like "Whether the assistant addressed all requests by users", where the user messages would be needed?
I can make this clearer:

> Focus on judging only the assistant's responses, not the user's messages.
```
- Focus only on the assistant's responses, not the user's messages.
- Only focus on the provided guidelines and not the correctness, relevance, or effectiveness of the responses.
- A guideline violation at ANY point in the conversation means the entire conversation fails.
- If none of the guidelines apply to the given conversation, the result must be "yes".
```
Why should this be "yes"? What does "apply to" mean here?
If none of the guidelines are relevant (e.g., "Don't discuss Snowflake" for a conversation about cats and dogs), the assessment should not fail. We could alternatively add another class, but that would likely decrease accuracy and would have the same impact as a passing assessment (the user doesn't really look at it).
xsh310 left a comment:
Overall LGTM, left 2 quick comments on the prompt.
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
As titled.
How is this PR tested?
output:
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Introduce the conversational guidelines scorer.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
"Yes" should be selected for bug fixes, documentation updates, and other small changes. "No" should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.