[ML-59674] Create new single turn built-in judge Completeness #18968

Merged
xsh310 merged 3 commits into mlflow:master from
xsh310:stack/P0_builtin_judges_stack_ML_59674
Nov 26, 2025
Conversation

Collaborator

@xsh310 xsh310 commented Nov 22, 2025

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

This PR adds a new single-turn built-in judge, Completeness. Users can instantiate a Completeness judge in one line, just like the existing built-in judges, and either invoke it directly or pass it to genai.evaluation to evaluate whether the AI assistant addresses every question and request in a single user prompt.
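
A minimal sketch of the usage described above, not the PR's own code. It assumes `Completeness` is importable from `mlflow.genai.scorers` alongside the existing built-in judges, that a judge model and its credentials are already configured, and that a `question` key is an acceptable shape for `inputs`. `build_row` and `run_completeness` are hypothetical helper names.

```python
def build_row(prompt: str, response: str) -> dict:
    """Build one evaluation row with the `inputs` and `outputs` fields the judge reads."""
    # ASSUMPTION: the "question" key is illustrative; use whatever input shape your app has.
    return {"inputs": {"question": prompt}, "outputs": response}


def run_completeness(rows: list[dict]):
    """Invoke the judge directly on the first row, then evaluate all rows in batch."""
    # Imported lazily so this sketch can be loaded without MLflow installed.
    # ASSUMPTION: the import path mirrors the existing built-in judges.
    import mlflow
    from mlflow.genai.scorers import Completeness

    judge = Completeness()  # one-line instantiation

    # Direct invocation on a single example (makes a real LLM call).
    first = rows[0]
    feedback = judge(inputs=first["inputs"], outputs=first["outputs"])

    # Batch evaluation over every row.
    results = mlflow.genai.evaluate(data=rows, scorers=[judge])
    return feedback, results


rows = [build_row("What is 2 + 2, and what is 3 + 5?", "2 + 2 = 4.\n\n3 + 5 = 8.")]
```

Calling `run_completeness(rows)` requires MLflow and model credentials, so it is left out of the sketch itself.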

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Manual Testing

Manually tested with the following synthetic prompts and responses. Using the Completeness judge prompt, openai:/gpt-5 classifies all 20 examples correctly, and gpt-4.1 mini classifies 19 of the 20 correctly.

good_examples = {
  "C1_simple_multi_question": {
    "What is 2 + 2, and what is 3 + 5? Please answer both explicitly.":"""
2 + 2 = 4.

3 + 5 = 8.
"""
  },

  "C2_multi_part_with_format": {
    "Explain photosynthesis in exactly 3 sentences, then give a bullet list of 3 key terms with a short definition for each.": """
Photosynthesis is the process by which plants convert light energy into chemical energy. It occurs mainly in the chloroplasts, using chlorophyll to capture sunlight. The result is the production of glucose and oxygen from carbon dioxide and water.

* **Chlorophyll** – The green pigment in plants that absorbs light energy.

* **Chloroplast** – The organelle in plant cells where photosynthesis takes place.

* **Glucose** – A simple sugar produced by photosynthesis that serves as an energy source for the plant.
"""
  },

  "C3_multiple_tasks_with_clarification": {
    "Help me improve my resume and also suggest 3 possible career paths for me. I have a background in data science and I also play music professionally, but I'm not sure how to combine them.": """
Here are some general ways to improve your resume:

1. Put a clear summary at the top that highlights both your data skills and your music/creative experience.

2. Use strong, quantified bullets (e.g., "Improved model accuracy by 12% using X technique").

3. Group experience into sections like "Data & Analytics" and "Music & Creative Work" so both sides of your profile are easy to see.

4. Add a concise skills section with tools (Python, SQL, DAWs, etc.), and keep the resume to 1–2 pages.

Based on what you shared, here are 3 possible career paths:

1. **Music Data Analyst** – Working with streaming platforms, labels, or venues to analyze listener data, ticket sales, and marketing performance.

2. **Recommendation Systems Engineer for Music/Media** – Building or improving recommendation algorithms for music apps and platforms.

3. **Data-Driven Creative Producer** – Using analytics to plan tours, releases, or content strategies for artists, including yourself.

To give more tailored resume edits (e.g., rewriting bullets), please share a few anonymized bullet points or a short excerpt from your current resume.
"""
  },

  "C4_partial_refusal_with_safe_part": {
    "Explain how to build an explosive device at home, and also give me a brief overview of World War II in 5 bullet points.": """
I can't help with instructions for building an explosive device. Providing that kind of information can cause serious harm and violates safety rules.

However, I can help with the history part. Here's a brief overview of World War II:

* It lasted from 1939 to 1945 and involved many nations, including the Allies and the Axis powers.

* The war began when Germany invaded Poland, leading Britain and France to declare war on Germany.

* Major turning points included the Battle of Stalingrad, the D-Day landings, and the Pacific island campaigns.

* The Holocaust, carried out by Nazi Germany, led to the systematic murder of six million Jews and millions of other victims.

* The war ended with Germany's surrender in May 1945 and Japan's surrender after the atomic bombings in August 1945.
"""
  },

  "C5_code_example_complexity": {
    "Write a Python function `is_palindrome(s: str) -> bool` that checks whether a string is a palindrome ignoring case and spaces. Then show a short example of calling it (with the printed output), and finally explain the time complexity in one sentence.": """
```python
def is_palindrome(s: str) -> bool:
    # Normalize: lowercase and remove spaces
    normalized = "".join(ch.lower() for ch in s if not ch.isspace())
    return normalized == normalized[::-1]

# Example usage:
test = "Never Odd Or Even"
print(is_palindrome(test))  # Expected output: True
```

The time complexity is **O(n)**, where *n* is the length of the string, because we traverse the string a constant number of times.
"""
  },

  "C6_table_and_explanation": {
    "Create a Markdown table comparing three cloud providers (AWS, Azure, GCP) with columns: Provider, Strength, Common Use Case. Then write one short paragraph explaining when you might choose each.": """
| Provider | Strength                       | Common Use Case                          |
| -------- | ------------------------------ | ---------------------------------------- |
| AWS      | Largest ecosystem and services | Startups and enterprises needing breadth |
| Azure    | Strong Microsoft integration   | Companies using Windows/Office/AD        |
| GCP      | Data and ML tooling excellence | Analytics, big data, and ML workloads    |

You might choose **AWS** if you want the broadest range of services and mature tooling across many domains. **Azure** is often the best choice if your organization is already heavily invested in Microsoft technologies like Active Directory, Windows Server, or Office 365. **GCP** is attractive when your focus is on data engineering, big data processing, or machine learning, thanks to tools like BigQuery and Vertex AI.
"""
  },

  "C7_compare_and_recommend": {
    "Compare electric cars and gasoline cars: list 3 pros and 3 cons of each, and then recommend one option for a daily city commuter with a brief justification.": """
**Electric cars – Pros:**

1. Lower operating costs per mile (electricity is often cheaper than gasoline).

2. Zero tailpipe emissions, improving local air quality.

3. Smooth, quiet driving experience with instant torque.

**Electric cars – Cons:**

1. Higher upfront purchase price (though incentives can help).

2. Limited range compared to many gasoline cars, depending on battery size.

3. Charging infrastructure may be less convenient than gas stations in some areas.

**Gasoline cars – Pros:**

1. Refueling is fast and gas stations are widely available.

2. Lower upfront cost in many cases (especially used cars).

3. Wide variety of models and body types to choose from.

**Gasoline cars – Cons:**

1. Higher fuel and maintenance costs over time.

2. Tailpipe emissions contribute to air pollution and climate change.

3. Engine noise and vibration compared to electric vehicles.

**Recommendation for a daily city commuter:**

If you have access to home or workplace charging, an **electric car** is usually the better choice for city commuting: it's cheaper to run, quieter in stop-and-go traffic, and the typical daily city distance usually fits comfortably within an EV's range.
"""
  },

  "C8_with_followup_question": {
    "I'll be in Tokyo for 3 days. Suggest 5 things I should do, and then ask me one follow-up question at the end to help refine the plan.": """
Here are 5 things you could do in Tokyo over 3 days:

1. Visit **Senso-ji Temple** in Asakusa and explore the traditional streets around Nakamise-dori.

2. Spend an evening in **Shibuya**, seeing the famous crossing and dining at an izakaya.

3. Explore **Shinjuku Gyoen National Garden** for a calm break from the city bustle.

4. Visit **Akihabara** for electronics, anime, and game shops.

5. Go up **Tokyo Skytree** or **Tokyo Tower** for panoramic city views, ideally around sunset.

To tailor this plan better, do you prefer more food-focused experiences, cultural sights, or shopping?
"""
  },

  "C9_poor_quality_but_complete": {
    "Give me exactly 2 pros and exactly 2 cons of online learning, and then in one single sentence say whether you recommend it for busy working adults.": """
Pros:

you can study whenever you want basically and there are never any real deadlines, you just do stuff whenever, so it’s super free.

online learning is always cheaper than normal school, like you basically don’t have to pay much at all compared to regular classes.

Cons:

it can feel kind of lonely and easy to get distracted because nobody is watching you directly and you might just go on social media instead.

sometimes the internet or website breaks and then you literally can’t learn anything until it magically fixes itself.

I recommend online learning for busy working adults because even if it has some problems, it lets them “kind of” go to school without leaving home and that’s usually better for their schedule.
"""
  },

  "C10_clarification_only": {
    "What time is my flight tomorrow? I booked it last month but I forgot the details.": """
I don't have access to your personal bookings or email, so I can't see your flight information.

Please check your confirmation email or airline app, or share the airline and your flight number if you want help interpreting the schedule and time zones.
"""
  },
}

bad_examples = {
    "I1_missing_explicit_question": {
        "What is the capital of France and what is the capital of Germany? Please name both.": """
The capital of France is Paris. It's known for its art, culture, and history.
"""
    },

    "I2_ignores_format_requirements": {
        "Explain gravity in exactly 2 bullet points, and then give one sentence that uses an everyday analogy.": """
Gravity is a force that pulls objects with mass toward each other. It keeps planets in orbit around the sun and causes objects to fall toward the ground.
"""
    },

    "I3_refusal_ignores_safe_part": {
        "Tell me the easiest way to hack into my neighbor's Wi-Fi, and also give me 3 tips for securing my own home network.": """
I can't help you hack into anyone's Wi-Fi. That's illegal and unethical.
"""
    },

    "I4_one_sided_comparison": {
        "Summarize the pros and cons of remote work versus working in an office, and then give a neutral conclusion weighing both sides.": """
Remote work has many advantages, such as flexibility, no commuting, and the ability to work from anywhere. It can also reduce distractions for some people and save money on transportation and eating out. Overall, remote work is a great option for many people.
"""
    },

    "I5_missing_explanation": {
        "Show me a JavaScript code snippet that adds a click event listener to a button with id `submit`, and then explain in 2–3 sentences how it works.": """
```javascript
const button = document.getElementById("submit");
button.addEventListener("click", function () {
  alert("Button clicked!");
});
```
"""
    },

    "I6_missing_shopping_list": {
        "Give me a step-by-step recipe for making basic pancakes for 4 people, and then provide a clear shopping list with exact quantities for all ingredients.": """
To make basic pancakes, mix flour, baking powder, sugar, and salt in a bowl. In another bowl, whisk together milk, eggs, and melted butter. Combine the wet and dry ingredients, then cook portions of the batter on a hot greased pan until bubbles form and the edges look set, then flip and cook until golden on both sides. Serve warm with syrup or fruit.
"""
    },

    "I7_no_ranking": {
        "List and rank your top 5 classic novels, ordered from 1 (best) to 5, and give a one-sentence justification for the ranking of each.": """
Here are five classic novels:

* Pride and Prejudice
* 1984
* Moby-Dick
* To Kill a Mockingbird
* Crime and Punishment

They are all highly regarded and have had a big impact on literature and culture.
"""
    },

    "I8_no_explicit_yes_no": {
        "Is 0 a natural number? You should answer yes or no and then explain your reasoning briefly.": """
The definition of natural numbers can vary depending on the context and the textbook. Some definitions include 0, while others start at 1, especially in older or more elementary texts. In more advanced mathematics and computer science, it's common to see 0 included in the set of natural numbers.
"""
    },

    "I9_unexplained_refusal": {
        "Give me detailed, step-by-step instructions on how to create a fake ID that could pass government verification checks.": """
        I can’t help with that request.
"""
    },

    "I10_missing_descriptions": {
        "Invent 5 original names for a fantasy city and give a 2-sentence description for each city's vibe and main feature.": """
1. **Lunaris Reach** – A shining city that glows softly under twin moons.

2. **Emberfall** – Built around a massive waterfall of glowing lava.

3. **Frostspire** – A towering metropolis carved into ancient ice cliffs, ruled by a council of mages.

4. **Verdanthollow**

5. **Stormwatch Keep**
"""
    },
}
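
The manual check above can be sketched as a small scoring loop. This is an illustrative harness, not the script used for the PR: `flatten`, `score`, and `toy_judge` are hypothetical names, the `"yes"`/`"no"` verdict labels are an assumption about what the judge returns, and the two sample dictionaries are shortened stand-ins with the same `{case_id: {prompt: response}}` shape as `good_examples` and `bad_examples`.

```python
from typing import Callable

Examples = dict[str, dict[str, str]]
Case = tuple[str, str, str, str]


def flatten(examples: Examples, expected: str) -> list[Case]:
    """Turn {case_id: {prompt: response}} into (case_id, prompt, response, expected) tuples."""
    return [
        (case_id, prompt, response.strip(), expected)
        for case_id, pair in examples.items()
        for prompt, response in pair.items()
    ]


def score(cases: list[Case], judge: Callable[[str, str], str]) -> tuple[int, list[str]]:
    """Return the number of correct verdicts and the ids of misclassified cases."""
    misses = [
        case_id
        for case_id, prompt, response, expected in cases
        if judge(prompt, response) != expected
    ]
    return len(cases) - len(misses), misses


# Shortened stand-ins for good_examples / bad_examples.
sample_good = {"G1": {"What is 2 + 2, and what is 3 + 5?": "2 + 2 = 4.\n\n3 + 5 = 8."}}
sample_bad = {"B1": {"Name the capitals of France and Germany.": "The capital of France is Paris."}}


def toy_judge(prompt: str, response: str) -> str:
    # Hypothetical placeholder, NOT the real judge: it only checks that the
    # response contains two equations. Swap in a call to an LLM-backed judge here.
    return "yes" if response.count("=") >= 2 else "no"


cases = flatten(sample_good, "yes") + flatten(sample_bad, "no")
correct, misses = score(cases, toy_judge)
print(f"{correct}/{len(cases)} correct; misclassified: {misses}")
```

With the full dictionaries and a real judge in place of `toy_judge`, the same loop reports a tally in the form quoted above, such as 19/20.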

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@xsh310 xsh310 marked this pull request as ready for review November 22, 2025 01:50
@github-actions github-actions bot added v3.6.1 area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Nov 24, 2025
@xsh310 xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch 2 times, most recently from a362e45 to 90703c9 Compare November 25, 2025 03:00
Collaborator

@smoorjani smoorjani left a comment

looks great! left a few small comments to address

COMPLETENESS_ASSESSMENT_NAME = "completeness"

COMPLETENESS_PROMPT = """\
Consider the following user prompt and AI response.
Collaborator

nit: let's be consistent in terminology (e.g., AI vs assistant) - my preference is assistant

model: str | None = None
required_columns: set[str] = {"inputs", "outputs"}
description: str = (
"Evaluate whether the AI fully addresses all user questions in a single turn."
Collaborator

same here


@property
def instructions(self) -> str:
"""Get the instructions of what this scorer evaluates."""
Collaborator

nit: let's remove one-line docstrings

outputs: Any | None = None,
trace: Trace | None = None,
) -> Feedback:
"""
Collaborator

nit: do we need this docstring again?

assert output_fields_default[0].name == "result"
assert output_fields_default[1].name == "rationale"

# Check response format field order (default: result first)
Collaborator

can we split this into different tests?

Collaborator Author

This is from the previous commit. I think this is the correct test file tho? test_make_judge currently handles all of the unit tests on InstructionsJudge APIs.

@xsh310 xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch 4 times, most recently from 8408d5a to fa1fc87 Compare November 26, 2025 03:52
Contributor

github-actions bot commented Nov 26, 2025

Documentation preview for c3d10bb is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

Collaborator

@serena-ruan serena-ruan left a comment

LGTM!

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
@xsh310 xsh310 force-pushed the stack/P0_builtin_judges_stack_ML_59674 branch from 85ebe6c to c3d10bb Compare November 26, 2025 07:23
@xsh310 xsh310 enabled auto-merge November 26, 2025 07:23
@xsh310 xsh310 added this pull request to the merge queue Nov 26, 2025
Merged via the queue into mlflow:master with commit 95780e0 Nov 26, 2025
48 checks passed
@xsh310 xsh310 deleted the stack/P0_builtin_judges_stack_ML_59674 branch November 26, 2025 08:02
Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. v3.6.1

3 participants