Skip to content

Prevent nested scorer calls from recording duplicate telemetry#19493

Merged
alkispoly-db merged 11 commits intomlflow:masterfrom
alkispoly-db:ML-54280
Dec 19, 2025
Merged

Prevent nested scorer calls from recording duplicate telemetry#19493
alkispoly-db merged 11 commits intomlflow:masterfrom
alkispoly-db:ML-54280

Conversation

@alkispoly-db
Copy link
Collaborator

@alkispoly-db alkispoly-db commented Dec 18, 2025

What changes are proposed in this pull request?

This PR prevents nested scorer calls from recording duplicate telemetry events. When a scorer calls another scorer internally (e.g., KnowledgeRetention calling last_turn_scorer), only the top-level call now records a ScorerCallEvent, eliminating noise in telemetry data.

Key changes:

  1. Scorer telemetry enhancement: Added thread-safe context tracking using ContextVar to detect nested scorer calls and skip telemetry for them
  2. Judge evaluation simplification: Removed InstructionsJudge._evaluate_impl() method (now unnecessary with context-aware telemetry)
  3. Type system improvements:
    • Generalized SessionLevelScorer judge types from InstructionsJudge to Judge
    • Updated Fluency, Completeness, and Summarization scorers to use InstructionsJudge() directly
    • Selectively updated session-level scorers (UserFrustration, _LastTurnKnowledgeRetention) to use InstructionsJudge() where compatible

How is this PR tested?

  • New test file: tests/genai/scorers/test_scorer_telemetry.py with 10 comprehensive tests:
    • Nested scorer calls (parent → child)
    • Multi-level nesting (grandparent → parent → child)
    • Recursive scorer calls
    • Thread safety with 10 concurrent scorers
    • Error handling in nested calls
    • Async scorer rejection
    • Telemetry globally disabled
    • Decorator scorer with nested calls
  • Existing tests: All 96 tests in test_builtin_scorers.py pass (92 passed, 4 skipped)
  • Code formatting: Verified with ruff

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the website builds locally and documentation renders correctly with the changes.

Documentation not changed, but code comments and docstrings were updated.

Release Notes

Evaluators

  • [Improvement] Nested scorer calls now skip telemetry recording to prevent duplicate events

Component(s)

  • area/evaluation: Evaluation tracking and datasets

Release Classification

  • rn/none: No mention in release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Related Issue: ML-54280

Implementation Details:

  • Uses Python's ContextVar for thread-safe nested call detection
  • Token-based cleanup pattern ensures proper context restoration
  • Scorers that use InstructionsJudge-specific parameters (generate_rationale_first, include_tool_calls_in_conversation) continue using direct instantiation
  • Scorers without special parameters now use InstructionsJudge() for consistency and telemetry tracking

Testing Highlights:

  • Thread-safe: 10 concurrent scorers each record exactly one event
  • Nested calls: Only top-level call records telemetry
  • Error handling: Context properly cleaned up even when nested scorers fail
  • Async safety: Async scorers explicitly rejected with clear error message

alkispoly-db and others added 5 commits December 18, 2025 18:22
- Rename MultiLevelGrandparentScorer to GrandparentScorer for clarity
- Remove unnecessary one-line comments from test functions
- Update thread safety test to use 10 threads instead of 2 for better concurrency testing
- Update telemetry disabled test to verify no events recorded instead of mock call count

These changes improve test readability and robustness while maintaining full test coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
This commit addresses two main improvements:

1. **Scorer Telemetry Enhancement (ML-54280):**
   - Add context-aware telemetry tracking to prevent nested scorer calls from recording duplicate telemetry events
   - Implement thread-safe ContextVar-based tracking in scorers/base.py
   - Only record telemetry for top-level scorer calls, skip nested calls
   - Add comprehensive test suite in tests/genai/scorers/test_scorer_telemetry.py
   - Add test fixtures in tests/genai/scorers/conftest.py

2. **Judge Evaluation Logic Simplification:**
   - Inline _evaluate_impl method into __call__ in InstructionsJudge
   - Remove the intermediate _evaluate_impl method as requested by ALKIS comment
   - Update all callers in builtin_scorers.py to call judge directly via __call__
   - Simplifies the call chain and reduces unnecessary indirection

All tests pass (92 passed, 4 skipped in test_builtin_scorers.py).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…udge

Changed the return type of _create_judge() from InstructionsJudge to Judge
in the SessionLevelScorer class and all its subclasses. This makes the type
system more flexible and accurate:

- Updated SessionLevelScorer._judge attribute type: InstructionsJudge -> Judge
- Updated SessionLevelScorer._create_judge() return type: InstructionsJudge -> Judge
- Updated SessionLevelScorer._get_judge() return type: InstructionsJudge -> Judge
- Updated all subclass implementations of _create_judge() (7 classes total):
  * UserFrustration
  * ConversationCompleteness
  * ConversationalSafety
  * ConversationalToolCallEfficiency
  * ConversationalRoleAdherence
  * _KnowledgeRetentionLastTurnScorer
  * KnowledgeRetention

Since InstructionsJudge is a subclass of Judge, all existing implementations
continue to work without changes. This change better reflects the intent that
these scorers work with any Judge implementation, not just InstructionsJudge.

All tests pass (92 passed, 4 skipped in test_builtin_scorers.py).

Addresses ALKIS comment in builtin_scorers.py:1871.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…zation scorers

Updated three built-in scorers to use make_judge() instead of directly instantiating
InstructionsJudge, and changed their type annotations from InstructionsJudge to Judge:

- **Fluency**: Changed _judge attribute and _get_judge() return type to Judge
- **Completeness**: Changed _judge attribute and _get_judge() return type to Judge
- **Summarization**: Changed _judge attribute and _get_judge() return type to Judge

Using make_judge() provides:
- Consistent judge creation across the codebase
- Telemetry tracking via MakeJudgeEvent
- Validation of feedback_value_type
- Better alignment with public API patterns

The Judge type is more flexible and accurate since InstructionsJudge is a subclass
of Judge. All existing functionality is preserved.

All tests pass (92 passed, 4 skipped in test_builtin_scorers.py).

Addresses ALKIS comment in builtin_scorers.py:1638.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…pecific parameters

Updated two session-level scorers to use make_judge() instead of directly instantiating
InstructionsJudge, but only for scorers that don't use InstructionsJudge-specific parameters:

**Changed to make_judge():**
- UserFrustration: Uses only standard parameters
- _LastTurnKnowledgeRetention: Uses only standard parameters

**Kept InstructionsJudge (due to special parameters):**
- ConversationCompleteness: Uses generate_rationale_first=True
- ConversationalSafety: Uses generate_rationale_first=True
- ConversationalToolCallEfficiency: Uses generate_rationale_first=True and include_tool_calls_in_conversation=True
- ConversationalRoleAdherence: Uses generate_rationale_first=True

Using make_judge() provides consistent judge creation and telemetry tracking,
but only where it's compatible with the required parameters.

Also removed obsolete docstring note about avoiding make_judge in the abstract
SessionLevelScorer._create_judge() method.

All tests pass (92 passed, 4 skipped in test_builtin_scorers.py).

Addresses ALKIS comment in builtin_scorers.py:2002.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@github-actions
Copy link
Contributor

@alkispoly-db Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template

This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.

@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 18, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Documentation preview for 50ebe62 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

alkispoly-db and others added 2 commits December 18, 2025 19:14
Replace hacky sys.path manipulation with proper module import.
This is cleaner and more maintainable:
- No sys.path side effects
- Clear module path shows where fixtures come from
- Follows standard Python import conventions

All tests still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
**builtin_scorers.py changes:**
- Switch all make_judge() calls to InstructionsJudge() for consistency
- Updated 5 scorers: Fluency, Completeness, Summarization, UserFrustration, _LastTurnKnowledgeRetention
- Removed unused make_judge import

**test_scorer_telemetry.py changes:**
- Add explanatory comments for assertion constants
- Refactor thread safety test to start all threads in separate loop (better concurrency testing)

All tests pass (10 telemetry tests + 12 affected scorer tests).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
The bypass_env_check fixture is not used in any scorer tests and can be removed.

Also clarified the comment to explain which fixtures are autouse and why
they need to be imported even though they're not directly referenced.

All 10 telemetry tests still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>

@pytest.fixture(autouse=True)
def mock_get_telemetry_client(mock_telemetry_client):
"""Auto-patch get_telemetry_client to return the mock client."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: one line docstring

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weirdly it looks like the linter rule for this disappeared, not sure why

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove.

Comment on lines +124 to +125
assert event_params[0]["scorer_class"] == "UserDefinedScorer"
assert event_params[0]["scorer_kind"] == "class"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are probably not a good indicator of which scorer got logged since both parent and child have these attributes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point -- switched to a more precise assertion.

# Autouse fixtures (terminate_telemetry_client, mock_requests_get, is_mlflow_testing)
# are imported for their side effects and run automatically
from tests.telemetry.conftest import ( # noqa: F401
is_mlflow_testing,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is_mlflow_testing will be autoused for all tests using this conftest after we import? I think we should only add fixture for the tests that require telemetry enabled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- switching to a scheme that enables it only as part of the mock_telemetry_client


event_params = get_event_params(mock_requests)
assert len(event_params) == 1
assert event_params[0]["scorer_kind"] == "decorator"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test also doesn't tell us whether it's the parent or the child class logging the event

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. The recorded events do not give us this detail anyways, however. Here we are just checking that the recorded events are compatible with a single scorer being recorded.


result = parent(outputs="test output")
# Expected: len("test output") * 2 = 22
assert result.value == 22
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like asserting the result.value is tangential to the actual goal of the test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think that this is a useful assertion to check that all the scorers in the chain get called. Otherwise we can get the test to pass if the parent scorer just skips the call to the nested scorer.

Copy link
Collaborator

@smoorjani smoorjani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left one tiny comment on docstring

**Note on Trace Behavior**:
- If template uses {{ trace }}: The trace metadata is used by an agent-based judge that uses
tools to fetch aspects of the trace's span data. If inputs/outputs/expectations are also
provided, they can augment the agent's context if the template has corresponding
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the case - I think we throw an error if you pass {{trace}} and {{inputs}}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will adjust the docstring. BTW, for {{inputs}}/{{outputs}}/{{expectations}} -- is it correct that we replace the placeholders? I thought that we inject them but they are not replaced in the exact spot they are mentioned.

Copy link
Collaborator

@xsh310 xsh310 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! I manually tested with UserFrustration and verified no duplicate is logged.

Can we address this comment before merge? #19493 (comment)

alkispoly-db and others added 3 commits December 19, 2025 21:40
This commit addresses review comments from PR mlflow#19493:

1. Fix autouse fixture scope issue in scorer conftest
   - Remove is_mlflow_testing import (was affecting all scorer tests)
   - Enable telemetry explicitly in mock_get_telemetry_client fixture
   - Update comment to reflect removed fixture

2. Improve test verification of which scorer logs telemetry
   - Convert ChildScorer class to decorator-based child_scorer_func
   - Use decorator pattern (kind="decorator") for child scorers
   - Keep class pattern (kind="class") for parent scorers
   - Verify telemetry shows kind="class" proving parent logged
   - Add clarifying comments explaining verification strategy

3. Clarify Instructions Judge docstring
   - Explain that {{trace}} enables tool calling (not replaced in prompt)
   - Clarify that {{inputs}}/{{outputs}}/{{expectations}} can be used
     alongside {{trace}} and will be replaced in the prompt
   - Remove ambiguous wording about "augmenting context"

All tests pass and formatting checks pass (ruff, clint, typos).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
This commit simplifies the test code without changing functionality:

1. Remove redundant docstring from mock_get_telemetry_client fixture
   - The function name and implementation are self-explanatory

2. Add parent_scorer_func helper for decorator-based parent scorers
   - Provides alternative to ParentScorer class for testing
   - Used in test_multi_level_nesting_skips_telemetry

3. Simplify test helper classes
   - Remove verbose docstrings where class names are clear
   - Rename ErrorNestedScorer to ErrorScorer (simpler)
   - Refactor ParentWithErrorChild to create ErrorScorer internally

4. Remove redundant inline comments
   - Keep only essential comments that explain non-obvious logic

All tests continue to pass with these simplifications.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Further cleanup of scorer telemetry tests:
- Remove redundant GrandparentScorer docstring
- Remove redundant "nested calls were skipped" comment
- Remove redundant "Use decorator function" inline comments

All removed text was redundant - the code is self-explanatory.
Tests continue to pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@alkispoly-db alkispoly-db added this pull request to the merge queue Dec 19, 2025
Merged via the queue into mlflow:master with commit c52b1c2 Dec 19, 2025
47 checks passed
@alkispoly-db alkispoly-db deleted the ML-54280 branch December 19, 2025 22:36
@github-actions github-actions bot added v3.8.1 and removed v3.8.0 labels Dec 22, 2025
WeichenXu123 pushed a commit to WeichenXu123/mlflow that referenced this pull request Dec 26, 2025
…w#19493)

Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
WeichenXu123 pushed a commit that referenced this pull request Dec 26, 2025
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. v3.8.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants