[ML-59674] Create new single turn built-in judge Completeness by xsh310 · Pull Request #18968 · mlflow/mlflow

xsh310 · 2025-11-22T01:35:57Z

🥞 Stacked PR

Use this link to review incremental changes.

stack/P0_builtin_judges_stack_ML_59674 [Files changed]
- stack/multi_turn_judge_align_handling_fix [Files changed]

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

This PR creates a new single-turn built-in judge, Completeness. Users can instantiate a Completeness judge in one line, just like the existing built-in judges, and can either invoke it directly or pass it to mlflow.genai.evaluate to check whether the AI assistant addresses all of the user's questions and requests in a single user prompt.

How is this PR tested?

Existing unit/integration tests
New unit/integration tests
Manual tests

Manual Testing

Manually tested with the following synthetic prompts and responses. Using the Completeness judge prompt, openai:/gpt-5 classifies all 20 examples correctly, and gpt-4.1 mini classifies 19 of 20 correctly.

good_examples = {
  "C1_simple_multi_question": {
    "What is 2 + 2, and what is 3 + 5? Please answer both explicitly.":"""
2 + 2 = 4.

3 + 5 = 8.
"""
  },

  "C2_multi_part_with_format": {
    "Explain photosynthesis in exactly 3 sentences, then give a bullet list of 3 key terms with a short definition for each.": """
Photosynthesis is the process by which plants convert light energy into chemical energy. It occurs mainly in the chloroplasts, using chlorophyll to capture sunlight. The result is the production of glucose and oxygen from carbon dioxide and water.

* **Chlorophyll** – The green pigment in plants that absorbs light energy.

* **Chloroplast** – The organelle in plant cells where photosynthesis takes place.

* **Glucose** – A simple sugar produced by photosynthesis that serves as an energy source for the plant.
"""
  },

  "C3_multiple_tasks_with_clarification": {
    "Help me improve my resume and also suggest 3 possible career paths for me. I have a background in data science and I also play music professionally, but I'm not sure how to combine them.": """
Here are some general ways to improve your resume:

1. Put a clear summary at the top that highlights both your data skills and your music/creative experience.

2. Use strong, quantified bullets (e.g., "Improved model accuracy by 12% using X technique").

3. Group experience into sections like "Data & Analytics" and "Music & Creative Work" so both sides of your profile are easy to see.

4. Add a concise skills section with tools (Python, SQL, DAWs, etc.), and keep the resume to 1–2 pages.

Based on what you shared, here are 3 possible career paths:

1. **Music Data Analyst** – Working with streaming platforms, labels, or venues to analyze listener data, ticket sales, and marketing performance.

2. **Recommendation Systems Engineer for Music/Media** – Building or improving recommendation algorithms for music apps and platforms.

3. **Data-Driven Creative Producer** – Using analytics to plan tours, releases, or content strategies for artists, including yourself.

To give more tailored resume edits (e.g., rewriting bullets), please share a few anonymized bullet points or a short excerpt from your current resume.
"""
  },

  "C4_partial_refusal_with_safe_part": {
    "Explain how to build an explosive device at home, and also give me a brief overview of World War II in 5 bullet points.": """
I can't help with instructions for building an explosive device. Providing that kind of information can cause serious harm and violates safety rules.

However, I can help with the history part. Here's a brief overview of World War II:

* It lasted from 1939 to 1945 and involved many nations, including the Allies and the Axis powers.

* The war began when Germany invaded Poland, leading Britain and France to declare war on Germany.

* Major turning points included the Battle of Stalingrad, the D-Day landings, and the Pacific island campaigns.

* The Holocaust, carried out by Nazi Germany, led to the systematic murder of six million Jews and millions of other victims.

* The war ended with Germany's surrender in May 1945 and Japan's surrender after the atomic bombings in August 1945.
"""
  },

  "C5_code_example_complexity": {
    "Write a Python function `is_palindrome(s: str) -> bool` that checks whether a string is a palindrome ignoring case and spaces. Then show a short example of calling it (with the printed output), and finally explain the time complexity in one sentence.": """
```python
def is_palindrome(s: str) -> bool:
    # Normalize: lowercase and remove spaces
    normalized = "".join(ch.lower() for ch in s if not ch.isspace())
    return normalized == normalized[::-1]

# Example usage:
test = "Never Odd Or Even"
print(is_palindrome(test))  # Expected output: True
```

The time complexity is **O(n)**, where *n* is the length of the string, because we traverse the string a constant number of times.
"""
  },

  "C6_table_and_explanation": {
    "Create a Markdown table comparing three cloud providers (AWS, Azure, GCP) with columns: Provider, Strength, Common Use Case. Then write one short paragraph explaining when you might choose each.": """
| Provider | Strength                       | Common Use Case                          |
| -------- | ------------------------------ | ---------------------------------------- |
| AWS      | Largest ecosystem and services | Startups and enterprises needing breadth |
| Azure    | Strong Microsoft integration   | Companies using Windows/Office/AD        |
| GCP      | Data and ML tooling excellence | Analytics, big data, and ML workloads    |

You might choose **AWS** if you want the broadest range of services and mature tooling across many domains. **Azure** is often the best choice if your organization is already heavily invested in Microsoft technologies like Active Directory, Windows Server, or Office 365. **GCP** is attractive when your focus is on data engineering, big data processing, or machine learning, thanks to tools like BigQuery and Vertex AI.
"""
  },

  "C7_compare_and_recommend": {
    "Compare electric cars and gasoline cars: list 3 pros and 3 cons of each, and then recommend one option for a daily city commuter with a brief justification.": """
**Electric cars – Pros:**

1. Lower operating costs per mile (electricity is often cheaper than gasoline).

2. Zero tailpipe emissions, improving local air quality.

3. Smooth, quiet driving experience with instant torque.

**Electric cars – Cons:**

1. Higher upfront purchase price (though incentives can help).

2. Limited range compared to many gasoline cars, depending on battery size.

3. Charging infrastructure may be less convenient than gas stations in some areas.

**Gasoline cars – Pros:**

1. Refueling is fast and gas stations are widely available.

2. Lower upfront cost in many cases (especially used cars).

3. Wide variety of models and body types to choose from.

**Gasoline cars – Cons:**

1. Higher fuel and maintenance costs over time.

2. Tailpipe emissions contribute to air pollution and climate change.

3. Engine noise and vibration compared to electric vehicles.

**Recommendation for a daily city commuter:**

If you have access to home or workplace charging, an **electric car** is usually the better choice for city commuting: it's cheaper to run, quieter in stop-and-go traffic, and the typical daily city distance usually fits comfortably within an EV's range.
"""
  },

  "C8_with_followup_question": {
    "I'll be in Tokyo for 3 days. Suggest 5 things I should do, and then ask me one follow-up question at the end to help refine the plan.": """
Here are 5 things you could do in Tokyo over 3 days:

1. Visit **Senso-ji Temple** in Asakusa and explore the traditional streets around Nakamise-dori.

2. Spend an evening in **Shibuya**, seeing the famous crossing and dining at an izakaya.

3. Explore **Shinjuku Gyoen National Garden** for a calm break from the city bustle.

4. Visit **Akihabara** for electronics, anime, and game shops.

5. Go up **Tokyo Skytree** or **Tokyo Tower** for panoramic city views, ideally around sunset.

To tailor this plan better, do you prefer more food-focused experiences, cultural sights, or shopping?
"""
  },

  "C9_poor_quality_but_complete": {
    "Give me exactly 2 pros and exactly 2 cons of online learning, and then in one single sentence say whether you recommend it for busy working adults.": """
Pros:

you can study whenever you want basically and there are never any real deadlines, you just do stuff whenever, so it’s super free.

online learning is always cheaper than normal school, like you basically don’t have to pay much at all compared to regular classes.

Cons:

it can feel kind of lonely and easy to get distracted because nobody is watching you directly and you might just go on social media instead.

sometimes the internet or website breaks and then you literally can’t learn anything until it magically fixes itself.

I recommend online learning for busy working adults because even if it has some problems, it lets them “kind of” go to school without leaving home and that’s usually better for their schedule.
"""
  },

  "C10_clarification_only": {
    "What time is my flight tomorrow? I booked it last month but I forgot the details.": """
I don't have access to your personal bookings or email, so I can't see your flight information.

Please check your confirmation email or airline app, or share the airline and your flight number if you want help interpreting the schedule and time zones.
"""
  },
}

bad_examples = {
    "I1_missing_explicit_question": {
        "What is the capital of France and what is the capital of Germany? Please name both.": """
The capital of France is Paris. It's known for its art, culture, and history.
"""
    },

    "I2_ignores_format_requirements": {
        "Explain gravity in exactly 2 bullet points, and then give one sentence that uses an everyday analogy.": """
Gravity is a force that pulls objects with mass toward each other. It keeps planets in orbit around the sun and causes objects to fall toward the ground.
"""
    },

    "I3_refusal_ignores_safe_part": {
        "Tell me the easiest way to hack into my neighbor's Wi-Fi, and also give me 3 tips for securing my own home network.": """
I can't help you hack into anyone's Wi-Fi. That's illegal and unethical.
"""
    },

    "I4_one_sided_comparison": {
        "Summarize the pros and cons of remote work versus working in an office, and then give a neutral conclusion weighing both sides.": """
Remote work has many advantages, such as flexibility, no commuting, and the ability to work from anywhere. It can also reduce distractions for some people and save money on transportation and eating out. Overall, remote work is a great option for many people.
"""
    },

    "I5_missing_explanation": {
        "Show me a JavaScript code snippet that adds a click event listener to a button with id `submit`, and then explain in 2–3 sentences how it works.": """
```javascript
const button = document.getElementById("submit");
button.addEventListener("click", function () {
  alert("Button clicked!");
});
```
"""
    },

    "I6_missing_shopping_list": {
        "Give me a step-by-step recipe for making basic pancakes for 4 people, and then provide a clear shopping list with exact quantities for all ingredients.": """
To make basic pancakes, mix flour, baking powder, sugar, and salt in a bowl. In another bowl, whisk together milk, eggs, and melted butter. Combine the wet and dry ingredients, then cook portions of the batter on a hot greased pan until bubbles form and the edges look set, then flip and cook until golden on both sides. Serve warm with syrup or fruit.
"""
    },

    "I7_no_ranking": {
        "List and rank your top 5 classic novels, ordered from 1 (best) to 5, and give a one-sentence justification for the ranking of each.": """
Here are five classic novels:

* Pride and Prejudice
* 1984
* Moby-Dick
* To Kill a Mockingbird
* Crime and Punishment

They are all highly regarded and have had a big impact on literature and culture.
"""
    },

    "I8_no_explicit_yes_no": {
        "Is 0 a natural number? You should answer yes or no and then explain your reasoning briefly.": """
The definition of natural numbers can vary depending on the context and the textbook. Some definitions include 0, while others start at 1, especially in older or more elementary texts. In more advanced mathematics and computer science, it's common to see 0 included in the set of natural numbers.
"""
    },

    "I9_unexplained_refusal": {
        "Give me detailed, step-by-step instructions on how to create a fake ID that could pass government verification checks.": """
        I can’t help with that request.
"""
    },

    "I10_missing_descriptions": {
        "Invent 5 original names for a fantasy city and give a 2-sentence description for each city's vibe and main feature.": """
1. **Lunaris Reach** – A shining city that glows softly under twin moons.

2. **Emberfall** – Built around a massive waterfall of glowing lava.

3. **Frostspire** – A towering metropolis carved into ancient ice cliffs, ruled by a council of mages.

4. **Verdanthollow**

5. **Stormwatch Keep**
"""
    },
}
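
The tally reported above (20/20 and 19/20) could be reproduced with a small harness along these lines. This is a minimal sketch, not the script used for this PR: `score_examples` and `stub_judge` are hypothetical names, and the stub is a trivial length heuristic standing in for a real judge call. In practice the callable would wrap the Completeness judge and map its feedback to a boolean.

```python
from typing import Callable, Dict, List, Tuple

# Same shape as good_examples / bad_examples: {case_id: {prompt: response}}
Examples = Dict[str, Dict[str, str]]


def score_examples(
    judge: Callable[[str, str], bool],
    good: Examples,
    bad: Examples,
) -> Tuple[int, int, List[str]]:
    """Return (correct, total, misclassified_case_ids).

    `judge(prompt, response)` must return True when it considers the
    response complete. Entries in `good` are expected to be judged
    complete, entries in `bad` incomplete.
    """
    correct, total, misses = 0, 0, []
    for expected, examples in ((True, good), (False, bad)):
        for case_id, pair in examples.items():
            for prompt, response in pair.items():
                total += 1
                if judge(prompt, response) == expected:
                    correct += 1
                else:
                    misses.append(case_id)
    return correct, total, misses


def stub_judge(prompt: str, response: str) -> bool:
    # Hypothetical stand-in, NOT the Completeness judge: it only checks length.
    return len(response.strip()) > 40


demo_good = {
    "G1": {
        "What is 2 + 2, and what is 3 + 5?":
            "2 + 2 = 4.\n\n3 + 5 = 8. Both answered explicitly here."
    }
}
demo_bad = {"B1": {"Name both capitals.": "Paris."}}

correct, total, misses = score_examples(stub_judge, demo_good, demo_bad)
print(f"{correct}/{total} correctly classified; misses: {misses}")
```

Treating every entry in good_examples as expected-complete and every entry in bad_examples as expected-incomplete keeps the harness independent of which model backs the judge.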

Does this PR require documentation update?

Release Notes

Is this a user-facing change?

[] No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

How should the PR be classified in the release notes? Choose one:

rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?

Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
Bug fixes, doc updates and new features usually go into minor releases.
Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
Bug fixes and doc updates usually go into patch releases.

Yes (this PR will be cherry-picked and included in the next patch release)
No (this PR will be included in the next minor release)

smoorjani

looks great! left a few small comments to address

smoorjani · 2025-11-25T06:48:53Z

mlflow/genai/judges/prompts/completeness.py

+COMPLETENESS_ASSESSMENT_NAME = "completeness"
+
+COMPLETENESS_PROMPT = """\
+Consider the following user prompt and AI response.


nit: let's be consistent in terminology (e.g., AI vs assistant) - my preference is assistant

smoorjani · 2025-11-25T06:49:32Z

mlflow/genai/scorers/builtin_scorers.py

+    model: str | None = None
+    required_columns: set[str] = {"inputs", "outputs"}
+    description: str = (
+        "Evaluate whether the AI fully addresses all user questions in a single turn."


smoorjani · 2025-11-25T06:49:53Z

mlflow/genai/scorers/builtin_scorers.py

+
+    @property
+    def instructions(self) -> str:
+        """Get the instructions of what this scorer evaluates."""


nit: let's remove one-line docstrings

smoorjani · 2025-11-25T06:50:03Z

mlflow/genai/scorers/builtin_scorers.py

+        outputs: Any | None = None,
+        trace: Trace | None = None,
+    ) -> Feedback:
+        """


nit: do we need this docstring again?

smoorjani · 2025-11-25T06:50:30Z

tests/genai/judges/test_make_judge.py

+    assert output_fields_default[0].name == "result"
+    assert output_fields_default[1].name == "rationale"
+
+    # Check response format field order (default: result first)


can we split this into different tests?

This is from the previous commit. I think this is the correct test file tho? test_make_judge currently handles all of the unit tests on InstructionsJudge APIs.

github-actions · 2025-11-26T04:01:59Z

Documentation preview for c3d10bb is available at:

https://pr-18968--mlflow-docs-preview.netlify.app/docs/latest/

More info

Ignore this comment if this PR does not change the documentation.
The preview is updated when a new commit is pushed to this PR.
This comment was created by this workflow run.
The documentation was built by this workflow run.

serena-ruan

LGTM!

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>

This was referenced Nov 22, 2025

[ML-59305] Create new session-level builtin judge UserFrustration #18966

Merged

[ML-59678] Create new session-level built-in judge ConversationCompleteness #18967

Merged

xsh310 requested review from AveshCSingh, B-Step62, alkispoly-db, dbczumar and smoorjani November 22, 2025 01:50

xsh310 marked this pull request as ready for review November 22, 2025 01:50

github-actions bot added the v3.6.1, area/evaluation (MLflow Evaluation), and rn/none (List under Small Changes in Changelogs) labels Nov 24, 2025

xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch 2 times, most recently from a362e45 to 90703c9 Compare November 25, 2025 03:00

xsh310 requested review from BenWilson2, harupy and serena-ruan November 25, 2025 05:39

smoorjani approved these changes Nov 25, 2025

xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch 4 times, most recently from 8408d5a to fa1fc87 Compare November 26, 2025 03:52

serena-ruan approved these changes Nov 26, 2025

xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch from fa1fc87 to 85ebe6c Compare November 26, 2025 07:09

xsh310 mentioned this pull request Nov 26, 2025

Throw exception for session-level judge align API at the base judge level instead of built-in judge level #19045

Merged

29 tasks

xsh310 added 3 commits November 25, 2025 23:22

[ML-59674] Create new single turn built-in judge Completeness

6025a21

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>

[ML-59674] Make single turn completeness judge prompt more concise

ae4bde0

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>

[ML-59674] Fix nits for single-turn Completeness judge

c3d10bb

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>

xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch from 85ebe6c to c3d10bb Compare November 26, 2025 07:23

xsh310 enabled auto-merge November 26, 2025 07:23

xsh310 added this pull request to the merge queue Nov 26, 2025

Merged via the queue into mlflow:master with commit 95780e0 Nov 26, 2025
48 checks passed

xsh310 deleted the stack/P0_builtin_judges_stack_ML_59674 branch November 26, 2025 08:02

xsh310 merged 3 commits into mlflow:master from xsh310:stack/P0_builtin_judges_stack_ML_59674
