Prevent nested scorer calls from recording duplicate telemetry#19493
Prevent nested scorer calls from recording duplicate telemetry#19493alkispoly-db merged 11 commits intomlflow:masterfrom
Conversation
- Rename MultiLevelGrandparentScorer to GrandparentScorer for clarity - Remove unnecessary one-line comments from test functions - Update thread safety test to use 10 threads instead of 2 for better concurrency testing - Update telemetry disabled test to verify no events recorded instead of mock call count These changes improve test readability and robustness while maintaining full test coverage. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
This commit addresses two main improvements: 1. **Scorer Telemetry Enhancement (ML-54280):** - Add context-aware telemetry tracking to prevent nested scorer calls from recording duplicate telemetry events - Implement thread-safe ContextVar-based tracking in scorers/base.py - Only record telemetry for top-level scorer calls, skip nested calls - Add comprehensive test suite in tests/genai/scorers/test_scorer_telemetry.py - Add test fixtures in tests/genai/scorers/conftest.py 2. **Judge Evaluation Logic Simplification:** - Inline _evaluate_impl method into __call__ in InstructionsJudge - Remove the intermediate _evaluate_impl method as requested by ALKIS comment - Update all callers in builtin_scorers.py to call judge directly via __call__ - Simplifies the call chain and reduces unnecessary indirection All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…udge Changed the return type of _create_judge() from InstructionsJudge to Judge in the SessionLevelScorer class and all its subclasses. This makes the type system more flexible and accurate: - Updated SessionLevelScorer._judge attribute type: InstructionsJudge -> Judge - Updated SessionLevelScorer._create_judge() return type: InstructionsJudge -> Judge - Updated SessionLevelScorer._get_judge() return type: InstructionsJudge -> Judge - Updated all subclass implementations of _create_judge() (7 classes total): * UserFrustration * ConversationCompleteness * ConversationalSafety * ConversationalToolCallEfficiency * ConversationalRoleAdherence * _KnowledgeRetentionLastTurnScorer * KnowledgeRetention Since InstructionsJudge is a subclass of Judge, all existing implementations continue to work without changes. This change better reflects the intent that these scorers work with any Judge implementation, not just InstructionsJudge. All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). Addresses ALKIS comment in builtin_scorers.py:1871. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…zation scorers Updated three built-in scorers to use make_judge() instead of directly instantiating InstructionsJudge, and changed their type annotations from InstructionsJudge to Judge: - **Fluency**: Changed _judge attribute and _get_judge() return type to Judge - **Completeness**: Changed _judge attribute and _get_judge() return type to Judge - **Summarization**: Changed _judge attribute and _get_judge() return type to Judge Using make_judge() provides: - Consistent judge creation across the codebase - Telemetry tracking via MakeJudgeEvent - Validation of feedback_value_type - Better alignment with public API patterns The Judge type is more flexible and accurate since InstructionsJudge is a subclass of Judge. All existing functionality is preserved. All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). Addresses ALKIS comment in builtin_scorers.py:1638. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…pecific parameters Updated two session-level scorers to use make_judge() instead of directly instantiating InstructionsJudge, but only for scorers that don't use InstructionsJudge-specific parameters: **Changed to make_judge():** - UserFrustration: Uses only standard parameters - _LastTurnKnowledgeRetention: Uses only standard parameters **Kept InstructionsJudge (due to special parameters):** - ConversationCompleteness: Uses generate_rationale_first=True - ConversationalSafety: Uses generate_rationale_first=True - ConversationalToolCallEfficiency: Uses generate_rationale_first=True and include_tool_calls_in_conversation=True - ConversationalRoleAdherence: Uses generate_rationale_first=True Using make_judge() provides consistent judge creation and telemetry tracking, but only where it's compatible with the required parameters. Also removed obsolete docstring note about avoiding make_judge in the abstract SessionLevelScorer._create_judge() method. All tests pass (92 passed, 4 skipped in test_builtin_scorers.py). Addresses ALKIS comment in builtin_scorers.py:2002. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
|
@alkispoly-db Thank you for the contribution! Could you fix the following issue(s)? ⚠ Invalid PR templateThis PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out. |
|
Documentation preview for 50ebe62 is available at: More info
|
Replace hacky sys.path manipulation with proper module import. This is cleaner and more maintainable: - No sys.path side effects - Clear module path shows where fixtures come from - Follows standard Python import conventions All tests still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
**builtin_scorers.py changes:** - Switch all make_judge() calls to InstructionsJudge() for consistency - Updated 5 scorers: Fluency, Completeness, Summarization, UserFrustration, _LastTurnKnowledgeRetention - Removed unused make_judge import **test_scorer_telemetry.py changes:** - Add explanatory comments for assertion constants - Refactor thread safety test to start all threads in separate loop (better concurrency testing) All tests pass (10 telemetry tests + 12 affected scorer tests). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
3f33bd8 to
e70ae52
Compare
The bypass_env_check fixture is not used in any scorer tests and can be removed. Also clarified the comment to explain which fixtures are autouse and why they need to be imported even though they're not directly referenced. All 10 telemetry tests still pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
tests/genai/scorers/conftest.py
Outdated
|
|
||
| @pytest.fixture(autouse=True) | ||
| def mock_get_telemetry_client(mock_telemetry_client): | ||
| """Auto-patch get_telemetry_client to return the mock client.""" |
There was a problem hiding this comment.
weirdly it looks like the linter rule for this disappeared, not sure why
| assert event_params[0]["scorer_class"] == "UserDefinedScorer" | ||
| assert event_params[0]["scorer_kind"] == "class" |
There was a problem hiding this comment.
I think these are probably not a good indicator of which scorer got logged since both parent and child have these attributes?
There was a problem hiding this comment.
Good point -- switched to a more precise assertion.
tests/genai/scorers/conftest.py
Outdated
| # Autouse fixtures (terminate_telemetry_client, mock_requests_get, is_mlflow_testing) | ||
| # are imported for their side effects and run automatically | ||
| from tests.telemetry.conftest import ( # noqa: F401 | ||
| is_mlflow_testing, |
There was a problem hiding this comment.
It seems like this is_mlflow_testing will be autoused for all tests using this conftest after we import? I think we should only add fixture for the tests that require telemetry enabled.
There was a problem hiding this comment.
Good catch -- switching to a scheme that enables it only as part of the mock_telemetry_client
|
|
||
| event_params = get_event_params(mock_requests) | ||
| assert len(event_params) == 1 | ||
| assert event_params[0]["scorer_kind"] == "decorator" |
There was a problem hiding this comment.
I think this test also doesn't tell us whether it's the parent or the child class logging the event
There was a problem hiding this comment.
True. The recorded events do not give us this detail anyways, however. Here we are just checking that the recorded events are compatible with a single scorer being recorded.
|
|
||
| result = parent(outputs="test output") | ||
| # Expected: len("test output") * 2 = 22 | ||
| assert result.value == 22 |
There was a problem hiding this comment.
It seems like asserting the result.value is tangential to the actual goal of the test?
There was a problem hiding this comment.
I actually think that this is a useful assertion to check that all the scorers in the chain get called. Otherwise we can get the test to pass if the parent scorer just skips the call to the nested scorer.
smoorjani
left a comment
There was a problem hiding this comment.
LGTM, left one tiny comment on docstring
| **Note on Trace Behavior**: | ||
| - If template uses {{ trace }}: The trace metadata is used by an agent-based judge that uses | ||
| tools to fetch aspects of the trace's span data. If inputs/outputs/expectations are also | ||
| provided, they can augment the agent's context if the template has corresponding |
There was a problem hiding this comment.
I don't think this is the case - I think we throw an error if you pass {{trace}} and {{inputs}}
There was a problem hiding this comment.
I will adjust the docstring. BTW, for {{inputs}}/{{outputs}}/{{expectations}} -- is it correct that we replace the placeholders? I thought that we inject them but they are not replaced in the exact spot they are mentioned.
xsh310
left a comment
There was a problem hiding this comment.
Overall LGTM! I manually tested with UserFrustration and verified no duplicate is logged.
Can we address this comment before merge? #19493 (comment)
This commit addresses review comments from PR mlflow#19493: 1. Fix autouse fixture scope issue in scorer conftest - Remove is_mlflow_testing import (was affecting all scorer tests) - Enable telemetry explicitly in mock_get_telemetry_client fixture - Update comment to reflect removed fixture 2. Improve test verification of which scorer logs telemetry - Convert ChildScorer class to decorator-based child_scorer_func - Use decorator pattern (kind="decorator") for child scorers - Keep class pattern (kind="class") for parent scorers - Verify telemetry shows kind="class" proving parent logged - Add clarifying comments explaining verification strategy 3. Clarify Instructions Judge docstring - Explain that {{trace}} enables tool calling (not replaced in prompt) - Clarify that {{inputs}}/{{outputs}}/{{expectations}} can be used alongside {{trace}} and will be replaced in the prompt - Remove ambiguous wording about "augmenting context" All tests pass and formatting checks pass (ruff, clint, typos). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
This commit simplifies the test code without changing functionality: 1. Remove redundant docstring from mock_get_telemetry_client fixture - The function name and implementation are self-explanatory 2. Add parent_scorer_func helper for decorator-based parent scorers - Provides alternative to ParentScorer class for testing - Used in test_multi_level_nesting_skips_telemetry 3. Simplify test helper classes - Remove verbose docstrings where class names are clear - Rename ErrorNestedScorer to ErrorScorer (simpler) - Refactor ParentWithErrorChild to create ErrorScorer internally 4. Remove redundant inline comments - Keep only essential comments that explain non-obvious logic All tests continue to pass with these simplifications. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Further cleanup of scorer telemetry tests: - Remove redundant GrandparentScorer docstring - Remove redundant "nested calls were skipped" comment - Remove redundant "Use decorator function" inline comments All removed text was redundant - the code is self-explanatory. Tests continue to pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
…w#19493) Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
What changes are proposed in this pull request?
This PR prevents nested scorer calls from recording duplicate telemetry events. When a scorer calls another scorer internally (e.g.,
KnowledgeRetentioncallinglast_turn_scorer), only the top-level call now records aScorerCallEvent, eliminating noise in telemetry data.Key changes:
ContextVarto detect nested scorer calls and skip telemetry for themInstructionsJudge._evaluate_impl()method (now unnecessary with context-aware telemetry)SessionLevelScorerjudge types fromInstructionsJudgetoJudgeFluency,Completeness, andSummarizationscorers to useInstructionsJudge()directlyUserFrustration,_LastTurnKnowledgeRetention) to useInstructionsJudge()where compatibleHow is this PR tested?
tests/genai/scorers/test_scorer_telemetry.pywith 10 comprehensive tests:test_builtin_scorers.pypass (92 passed, 4 skipped)Does this PR change the documentation?
Documentation not changed, but code comments and docstrings were updated.
Release Notes
Evaluators
Component(s)
area/evaluation: Evaluation tracking and datasetsRelease Classification
rn/none: No mention in release notesShould this PR be included in the next patch release?
Yesshould be selected for bug fixes, documentation updates, and other small changes.Noshould be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.
Related Issue: ML-54280
Implementation Details:
ContextVarfor thread-safe nested call detectionInstructionsJudge-specific parameters (generate_rationale_first,include_tool_calls_in_conversation) continue using direct instantiationInstructionsJudge()for consistency and telemetry trackingTesting Highlights: