Prevent nested scorer calls from recording duplicate telemetry by alkispoly-db · Pull Request #19493 · mlflow/mlflow

alkispoly-db · 2025-12-18T19:05:43Z

What changes are proposed in this pull request?

This PR prevents nested scorer calls from recording duplicate telemetry events. When a scorer calls another scorer internally (e.g., KnowledgeRetention calling last_turn_scorer), only the top-level call now records a ScorerCallEvent, eliminating noise in telemetry data.

Key changes:

Scorer telemetry enhancement: Added thread-safe context tracking using ContextVar to detect nested scorer calls and skip telemetry for them
Judge evaluation simplification: Removed InstructionsJudge._evaluate_impl() method (now unnecessary with context-aware telemetry)
Type system improvements:
- Generalized SessionLevelScorer judge types from InstructionsJudge to Judge
- Updated Fluency, Completeness, and Summarization scorers to use InstructionsJudge() directly
- Selectively updated session-level scorers (UserFrustration, _LastTurnKnowledgeRetention) to use InstructionsJudge() where compatible

How is this PR tested?

New test file: tests/genai/scorers/test_scorer_telemetry.py with 10 comprehensive tests:
- Nested scorer calls (parent → child)
- Multi-level nesting (grandparent → parent → child)
- Recursive scorer calls
- Thread safety with 10 concurrent scorers
- Error handling in nested calls
- Async scorer rejection
- Telemetry globally disabled
- Decorator scorer with nested calls
Existing tests: All 96 tests in test_builtin_scorers.py pass (92 passed, 4 skipped)
Code formatting: Verified with ruff

Does this PR change the documentation?

No. You can skip the rest of this section.
Yes. Make sure the website builds locally and documentation renders correctly with the changes.

Documentation not changed, but code comments and docstrings were updated.

Release Notes

Evaluators

[Improvement] Nested scorer calls now skip telemetry recording to prevent duplicate events

Component(s)

area/evaluation: Evaluation tracking and datasets

Release Classification

rn/none: No mention in release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?

Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
Bug fixes, doc updates and new features usually go into minor releases.
Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
Bug fixes and doc updates usually go into patch releases.

Yes (this PR will be cherry-picked and included in the next patch release)
No (this PR will be included in the next minor release)

Related Issue: ML-54280

Implementation Details:

Uses Python's ContextVar for thread-safe nested call detection
Token-based cleanup pattern ensures proper context restoration
Scorers that use InstructionsJudge-specific parameters (generate_rationale_first, include_tool_calls_in_conversation) continue using direct instantiation
Scorers without special parameters now use InstructionsJudge() for consistency and telemetry tracking

Testing Highlights:

Thread-safe: 10 concurrent scorers each record exactly one event
Nested calls: Only top-level call records telemetry
Error handling: Context properly cleaned up even when nested scorers fail
Async safety: Async scorers explicitly rejected with clear error message

- Rename MultiLevelGrandparentScorer to GrandparentScorer for clarity - Remove unnecessary one-line comments from test functions - Update thread safety test to use 10 threads instead of 2 for better concurrency testing - Update telemetry disabled test to verify no events recorded instead of mock call count These changes improve test readability and robustness while maintaining full test coverage. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

This commit addresses two main improvements: 1. **Scorer Telemetry Enhancement (ML-54280):** - Add context-aware telemetry tracking to prevent nested scorer calls from recording duplicate telemetry events - Implement thread-safe ContextVar-based tracking in scorers/base.py - Only record telemetry for top-level scorer calls, skip nested calls - Add comprehensive test suite in tests/genai/scorers/test_scorer_telemetry.py - Add test fixtures in tests/genai/scorers/conftest.py 2. **Judge Evaluation Logic Simplification:** - Inline _evaluate_impl method into __call__ in InstructionsJudge - Remove the intermediate _evaluate_impl method as requested by ALKIS comment - Update all callers in builtin_scorers.py to call judge directly via __call__ - Simplifies the call chain and reduces unnecessary indirection All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

…udge Changed the return type of _create_judge() from InstructionsJudge to Judge in the SessionLevelScorer class and all its subclasses. This makes the type system more flexible and accurate: - Updated SessionLevelScorer._judge attribute type: InstructionsJudge -> Judge - Updated SessionLevelScorer._create_judge() return type: InstructionsJudge -> Judge - Updated SessionLevelScorer._get_judge() return type: InstructionsJudge -> Judge - Updated all subclass implementations of _create_judge() (7 classes total): * UserFrustration * ConversationCompleteness * ConversationalSafety * ConversationalToolCallEfficiency * ConversationalRoleAdherence * _KnowledgeRetentionLastTurnScorer * KnowledgeRetention Since InstructionsJudge is a subclass of Judge, all existing implementations continue to work without changes. This change better reflects the intent that these scorers work with any Judge implementation, not just InstructionsJudge. All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). Addresses ALKIS comment in builtin_scorers.py:1871. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

…zation scorers Updated three built-in scorers to use make_judge() instead of directly instantiating InstructionsJudge, and changed their type annotations from InstructionsJudge to Judge: - **Fluency**: Changed _judge attribute and _get_judge() return type to Judge - **Completeness**: Changed _judge attribute and _get_judge() return type to Judge - **Summarization**: Changed _judge attribute and _get_judge() return type to Judge Using make_judge() provides: - Consistent judge creation across the codebase - Telemetry tracking via MakeJudgeEvent - Validation of feedback_value_type - Better alignment with public API patterns The Judge type is more flexible and accurate since InstructionsJudge is a subclass of Judge. All existing functionality is preserved. All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). Addresses ALKIS comment in builtin_scorers.py:1638. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

…pecific parameters Updated two session-level scorers to use make_judge() instead of directly instantiating InstructionsJudge, but only for scorers that don't use InstructionsJudge-specific parameters: **Changed to make_judge():** - UserFrustration: Uses only standard parameters - _LastTurnKnowledgeRetention: Uses only standard parameters **Kept InstructionsJudge (due to special parameters):** - ConversationCompleteness: Uses generate_rationale_first=True - ConversationalSafety: Uses generate_rationale_first=True - ConversationalToolCallEfficiency: Uses generate_rationale_first=True and include_tool_calls_in_conversation=True - ConversationalRoleAdherence: Uses generate_rationale_first=True Using make_judge() provides consistent judge creation and telemetry tracking, but only where it's compatible with the required parameters. Also removed obsolete docstring note about avoiding make_judge in the abstract SessionLevelScorer._create_judge() method. All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). Addresses ALKIS comment in builtin_scorers.py:2002. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

github-actions · 2025-12-18T19:05:58Z

@alkispoly-db Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template

This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

github-actions · 2025-12-18T19:14:13Z

Documentation preview for 50ebe62 is available at:

https://pr-19493--mlflow-docs-preview.netlify.app/docs/latest/

More info

Ignore this comment if this PR does not change the documentation.
The preview is updated when a new commit is pushed to this PR.
This comment was created by this workflow run.
The documentation was built by this workflow run.

Replace hacky sys.path manipulation with proper module import. This is cleaner and more maintainable: - No sys.path side effects - Clear module path shows where fixtures come from - Follows standard Python import conventions All tests still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

**builtin_scorers.py changes:** - Switch all make_judge() calls to InstructionsJudge() for consistency - Updated 5 scorers: Fluency, Completeness, Summarization, UserFrustration, _LastTurnKnowledgeRetention - Removed unused make_judge import **test_scorer_telemetry.py changes:** - Add explanatory comments for assertion constants - Refactor thread safety test to start all threads in separate loop (better concurrency testing) All tests pass (10 telemetry tests + 12 affected scorer tests). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

The bypass_env_check fixture is not used in any scorer tests and can be removed. Also clarified the comment to explain which fixtures are autouse and why they need to be imported even though they're not directly referenced. All 10 telemetry tests still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

xsh310 · 2025-12-19T17:53:47Z

tests/genai/scorers/conftest.py

+
+@pytest.fixture(autouse=True)
+def mock_get_telemetry_client(mock_telemetry_client):
+    """Auto-patch get_telemetry_client to return the mock client."""


nit: one line docstring

weirdly it looks like the linter rule for this disappeared, not sure why

Will remove.

xsh310 · 2025-12-19T18:04:39Z

tests/genai/scorers/test_scorer_telemetry.py

+    assert event_params[0]["scorer_class"] == "UserDefinedScorer"
+    assert event_params[0]["scorer_kind"] == "class"


I think these are probably not a good indicator of which scorer got logged since both parent and child have these attributes?

Good point -- switched to a more precise assertion.

xsh310 · 2025-12-19T18:11:27Z

tests/genai/scorers/conftest.py

+# Autouse fixtures (terminate_telemetry_client, mock_requests_get, is_mlflow_testing)
+# are imported for their side effects and run automatically
+from tests.telemetry.conftest import (  # noqa: F401
+    is_mlflow_testing,


It seems like this is_mlflow_testing will be autoused for all tests using this conftest after we import? I think we should only add fixture for the tests that require telemetry enabled.

Good catch -- switching to a scheme that enables it only as part of the mock_telemetry_client

xsh310 · 2025-12-19T18:14:02Z

tests/genai/scorers/test_scorer_telemetry.py

+
+    event_params = get_event_params(mock_requests)
+    assert len(event_params) == 1
+    assert event_params[0]["scorer_kind"] == "decorator"


I think this test also doesn't tell us whether it's the parent or the child class logging the event

True. The recorded events do not give us this detail anyways, however. Here we are just checking that the recorded events are compatible with a single scorer being recorded.

xsh310 · 2025-12-19T18:26:20Z

tests/genai/scorers/test_scorer_telemetry.py

+
+    result = parent(outputs="test output")
+    # Expected: len("test output") * 2 = 22
+    assert result.value == 22


It seems like asserting the result.value is tangential to the actual goal of the test?

I actually think that this is a useful assertion to check that all the scorers in the chain get called. Otherwise we can get the test to pass if the parent scorer just skips the call to the nested scorer.

smoorjani

LGTM, left one tiny comment on docstring

smoorjani · 2025-12-18T20:43:53Z

mlflow/genai/judges/instructions_judge/__init__.py

+        **Note on Trace Behavior**:
+        - If template uses {{ trace }}: The trace metadata is used by an agent-based judge that uses
+          tools to fetch aspects of the trace's span data. If inputs/outputs/expectations are also
+          provided, they can augment the agent's context if the template has corresponding


I don't think this is the case - I think we throw an error if you pass {{trace}} and {{inputs}}

I will adjust the docstring. BTW, for {{inputs}}/{{outputs}}/{{expectations}} -- is it correct that we replace the placeholders? I thought that we inject them but they are not replaced in the exact spot they are mentioned.

xsh310

Overall LGTM! I manually tested with UserFrustration and verified no duplicate is logged.

Can we address this comment before merge? #19493 (comment)

This commit addresses review comments from PR mlflow#19493: 1. Fix autouse fixture scope issue in scorer conftest - Remove is_mlflow_testing import (was affecting all scorer tests) - Enable telemetry explicitly in mock_get_telemetry_client fixture - Update comment to reflect removed fixture 2. Improve test verification of which scorer logs telemetry - Convert ChildScorer class to decorator-based child_scorer_func - Use decorator pattern (kind="decorator") for child scorers - Keep class pattern (kind="class") for parent scorers - Verify telemetry shows kind="class" proving parent logged - Add clarifying comments explaining verification strategy 3. Clarify Instructions Judge docstring - Explain that {{trace}} enables tool calling (not replaced in prompt) - Clarify that {{inputs}}/{{outputs}}/{{expectations}} can be used alongside {{trace}} and will be replaced in the prompt - Remove ambiguous wording about "augmenting context" All tests pass and formatting checks pass (ruff, clint, typos). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

This commit simplifies the test code without changing functionality: 1. Remove redundant docstring from mock_get_telemetry_client fixture - The function name and implementation are self-explanatory 2. Add parent_scorer_func helper for decorator-based parent scorers - Provides alternative to ParentScorer class for testing - Used in test_multi_level_nesting_skips_telemetry 3. Simplify test helper classes - Remove verbose docstrings where class names are clear - Rename ErrorNestedScorer to ErrorScorer (simpler) - Refactor ParentWithErrorChild to create ErrorScorer internally 4. Remove redundant inline comments - Keep only essential comments that explain non-obvious logic All tests continue to pass with these simplifications. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

Further cleanup of scorer telemetry tests: - Remove redundant GrandparentScorer docstring - Remove redundant "nested calls were skipped" comment - Remove redundant "Use decorator function" inline comments All removed text was redundant - the code is self-explanatory. Tests continue to pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

…w#19493) Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

alkispoly-db and others added 5 commits December 18, 2025 18:22

github-actions bot added area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 18, 2025

alkispoly-db and others added 2 commits December 18, 2025 19:14

alkispoly-db force-pushed the ML-54280 branch from 3f33bd8 to e70ae52 Compare December 18, 2025 19:39

alkispoly-db requested review from smoorjani and xsh310 December 18, 2025 20:34

github-actions bot added the v3.8.0 label Dec 18, 2025

B-Step62 removed the v3.8.0 label Dec 19, 2025

xsh310 reviewed Dec 19, 2025

View reviewed changes

github-actions bot assigned xsh310 Dec 19, 2025

xsh310 reviewed Dec 19, 2025

View reviewed changes

smoorjani approved these changes Dec 19, 2025

View reviewed changes

github-actions bot added the v3.8.0 label Dec 19, 2025

xsh310 approved these changes Dec 19, 2025

View reviewed changes

github-actions bot assigned smoorjani Dec 19, 2025

alkispoly-db and others added 3 commits December 19, 2025 21:40

alkispoly-db enabled auto-merge December 19, 2025 22:04

alkispoly-db added this pull request to the merge queue Dec 19, 2025

Merged via the queue into mlflow:master with commit c52b1c2 Dec 19, 2025
47 checks passed

alkispoly-db deleted the ML-54280 branch December 19, 2025 22:36

github-actions bot added v3.8.1 and removed v3.8.0 labels Dec 22, 2025

WeichenXu123 pushed a commit that referenced this pull request Dec 26, 2025

Prevent nested scorer calls from recording duplicate telemetry (#19493)

fab349c

Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

		assert event_params[0]["scorer_class"] == "UserDefinedScorer"
		assert event_params[0]["scorer_kind"] == "class"

Conversation

alkispoly-db commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

How is this PR tested?

Does this PR change the documentation?

Release Notes

Evaluators

Component(s)

Release Classification

Should this PR be included in the next patch release?

Uh oh!

github-actions bot commented Dec 18, 2025

⚠ Invalid PR template

Uh oh!

github-actions bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

smoorjani left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xsh310 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alkispoly-db commented Dec 18, 2025 •

edited

Loading

github-actions bot commented Dec 18, 2025 •

edited

Loading