Skip to content

Add KnowledgeRetention built-in scorer#19436

Merged
alkispoly-db merged 20 commits intomlflow:masterfrom
alkispoly-db:ML-60455
Dec 18, 2025
Merged

Add KnowledgeRetention built-in scorer#19436
alkispoly-db merged 20 commits intomlflow:masterfrom
alkispoly-db:ML-60455

Conversation

@alkispoly-db
Copy link
Collaborator

@alkispoly-db alkispoly-db commented Dec 16, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19436/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19436/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19436/merge

Related Issues/PRs

#60455

What changes are proposed in this pull request?

This PR adds a new built-in scorer for evaluating knowledge retention in conversational AI applications. The implementation uses two scorers:

  1. LastTurnKnowledgeRetention: Evaluates the last turn of a conversation to check if the AI response correctly retains information provided by users in earlier turns
  2. KnowledgeRetention: Orchestrates per-turn evaluation across an entire conversation using composition pattern

Key features:

  • Binary "yes"/"no" feedback with detailed per-turn rationale
  • Worst-case aggregation: any turn failure results in overall failure
  • Supports single-turn evaluation (evaluates from turn 0)
  • Session slicing: each turn receives conversation history up to that point
  • Flexible composition: users can customize per-turn evaluation via single_turn_scorer parameter

Implementation details:

  • Uses InstructionsJudge directly (following MLflow conventions)

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

New tests added:

  • test_last_turn_knowledge_retention_success: Single turn scorer success case
  • test_knowledge_retention_success: 2-turn success case with fake scorer pattern
  • test_knowledge_retention_failure: 2-turn failure case with rationale verification
  • test_knowledge_retention_single_turn: Single-turn evaluation edge case
  • test_knowledge_retention_empty_session: Empty session edge case

All tests use fake scorer pattern for better testability and avoid deep mocking of internal LLM calls.

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references (docstrings in code)
    • Instructions

Comprehensive docstrings are included in both scorer classes with usage examples.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Added two new built-in scorers for evaluating knowledge retention in conversational AI:

  • LastTurnKnowledgeRetention: Evaluates if the last AI response correctly retains information from earlier user inputs
  • KnowledgeRetention: Orchestrates per-turn knowledge retention evaluation across entire conversations with worst-case aggregation

These scorers help ensure AI assistants accurately remember and use information provided by users throughout a conversation without contradicting, distorting, or forgetting critical details.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

🤖 Generated with Claude Code

alkispoly-db and others added 5 commits December 16, 2025 17:59
Implements two new built-in scorers for evaluating knowledge retention
in conversational AI applications:

- LastTurnKnowledgeRetention: Evaluates whether the last AI response
  correctly retains information provided by users in earlier turns
- KnowledgeRetention: Orchestrates per-turn evaluation across entire
  conversations using composition pattern

Key features:
- Binary yes/no feedback with detailed per-turn rationale
- Worst-case aggregation (any turn failure = overall failure)
- Session slicing to provide correct conversation history to each turn
- Customizable via single_turn_scorer parameter
- Comprehensive test coverage (6 tests including multi-turn scenarios)

Implementation details:
- Uses InstructionsJudge directly (not make_judge) per conventions
- Follows BuiltInSessionLevelScorer pattern with lazy imports
- Clear prompt instructions distinguishing relevant vs irrelevant info
- Edge case handling for empty sessions and single-turn conversations

Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Replace mock patching approach with dependency injection pattern in
test_knowledge_retention_success and test_knowledge_retention_failure.

Changes:
- Create fake LastTurnKnowledgeRetention scorer using Mock(spec=Scorer)
- Pass fake scorer to KnowledgeRetention via single_turn_scorer parameter
- Add assertion to verify rationale includes specific turn rationale
- Add imports: Mock, AssessmentSource, AssessmentSourceType

Benefits:
- Tests composition pattern directly instead of mocking internal LLM calls
- More accurately tests the public API (single_turn_scorer parameter)
- Verifies session slicing behavior (fake scorer receives correct traces)
- Cleaner test structure without deep patching

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Verify that the judge receives the complete conversation with both
turns when evaluating knowledge retention.

Changes:
- Add assertion to check prompt includes both user questions
- Verify conversation includes "My name is Alice and I love Python"
- Verify conversation includes "What programming language do I like?"

This ensures the LastTurnKnowledgeRetention scorer correctly passes
the full session context to the underlying judge for evaluation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Allow KnowledgeRetention scorer to evaluate single-turn conversations
by starting evaluation from turn 0 instead of turn 1.

Changes:
- Remove 2-turn minimum requirement (only check for empty sessions)
- Update evaluation loop to start from turn 0: range(len(sorted_traces))
- Update comment to reflect evaluation starts from turn 0
- Update test_knowledge_retention_single_turn to expect actual evaluation
- Update test_knowledge_retention_empty_session to check "empty session"
- Update test_knowledge_retention_success to expect 2 calls (turns 0 and 1)
- Update test_knowledge_retention_multi_turn to include turn 0 evaluation

Behavior changes:
- Single-turn sessions: Now evaluated (was: "too short" message)
- Two-turn sessions: Both turns evaluated (was: only turn 1)
- Empty sessions: Still return "yes" with "empty session" message

The LLM judge now evaluates turn 0, checking if the response makes
unfounded assumptions or references non-existent prior information.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Remove test_knowledge_retention_multi_turn as requested in code review.
The remaining 5 tests provide sufficient coverage for knowledge retention:
- test_last_turn_knowledge_retention_success: Single turn scorer
- test_knowledge_retention_success: 2-turn success case
- test_knowledge_retention_failure: 2-turn failure case
- test_knowledge_retention_single_turn: 1-turn edge case
- test_knowledge_retention_empty_session: Empty session edge case

The multi-turn test was redundant as it tested the same aggregation
logic already covered by the 2-turn tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@github-actions github-actions bot added area/evaluation MLflow Evaluation area/tracing MLflow Tracing and its integrations rn/feature Mention under Features in Changelogs. labels Dec 16, 2025
alkispoly-db and others added 2 commits December 16, 2025 20:00
Remove explanatory comments from KnowledgeRetention implementation and
tests as they add unnecessary noise without providing value.

Changes in builtin_scorers.py:
- Removed inline comments explaining obvious operations
- Code is self-documenting through clear variable names and structure

Changes in test_builtin_scorers.py:
- Removed "Turn 0/Turn 1" comments from test setup
- Removed single-line assertion explanations
- Tests remain clear through descriptive test names and assertions

All 5 knowledge retention tests pass with no functionality changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Change docstring from "starting from turn 2" to accurately reflect that
all turns (including turn 0) are now evaluated after recent changes.

This aligns the documentation with the actual implementation behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@alkispoly-db alkispoly-db changed the title Add KnowledgeRetention and LastTurnKnowledgeRetention built-in scorers Add KnowledgeRetention built-in scorer Dec 16, 2025
alkispoly-db and others added 2 commits December 16, 2025 20:14
Update expected scorer count after adding KnowledgeRetention to
get_all_scorers():
- 12 scorers for Databricks (was 11)
- 10 scorers for OSS (was 9)

This fixes the failing test that was checking for the wrong count.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Add new KnowledgeRetention and LastTurnKnowledgeRetention scorers to
the API documentation inventory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@github-actions
Copy link
Contributor

github-actions bot commented Dec 16, 2025

Documentation preview for 8071aa3 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

alkispoly-db and others added 3 commits December 16, 2025 20:49
Creates a new SessionLevelScorer base class and makes LastTurnKnowledgeRetention
internal (_LastTurnKnowledgeRetention) to ensure BuiltInScorer and
BuiltInSessionLevelScorer are reserved for truly public scorers.

## Changes

### New SessionLevelScorer Base Class (builtin_scorers.py)
- Created SessionLevelScorer base class inheriting from Judge
- Provides common session-level functionality: judge caching, __call__ method,
  get_input_fields, is_session_level_scorer property
- Designed for use by both public and internal session-level scorers

### Updated BuiltInSessionLevelScorer (builtin_scorers.py)
- Now inherits from both BuiltInScorer and SessionLevelScorer using multiple inheritance
- BuiltInScorer provides special serialization for public API
- SessionLevelScorer provides session-level evaluation logic
- Reserved explicitly for public API scorers only per docstring

### Renamed LastTurnKnowledgeRetention (builtin_scorers.py)
- Renamed to _LastTurnKnowledgeRetention with underscore prefix
- Changed base class from BuiltInSessionLevelScorer to SessionLevelScorer
- Removed @experimental and @format_docstring decorators (internal class)
- Updated docstring to emphasize internal status and removed example
- KnowledgeRetention updated to use _LastTurnKnowledgeRetention as default

### Removed from Public API (__init__.py)
- Removed LastTurnKnowledgeRetention from _LAZY_IMPORTS set
- Removed from TYPE_CHECKING imports
- Removed from __all__ list
- No longer accessible via `from mlflow.genai.scorers import LastTurnKnowledgeRetention`

### Updated Tests (test_builtin_scorers.py)
- Removed LastTurnKnowledgeRetention from imports
- Renamed test_last_turn_knowledge_retention_success to
  test_knowledge_retention_uses_default_single_turn_scorer
- Refactored test to use KnowledgeRetention instead of direct instantiation
- Test now verifies integration with default _LastTurnKnowledgeRetention scorer
- Validates judge is called for each turn (2 turns = 2 calls)

### Documentation (api_inventory.txt)
- Removed all references to LastTurnKnowledgeRetention
- Cleaned up 3 entries from API inventory

## Architectural Benefits

1. **Clear Separation**: BuiltInScorer/BuiltInSessionLevelScorer explicitly for public API
2. **Reusable Base**: SessionLevelScorer available for future internal scorers
3. **No Duplication**: Common session-level logic extracted to one place
4. **Type Safety**: Multiple inheritance preserves both serialization and session logic
5. **Clean API**: Internal classes don't pollute public API exports

## Testing

- All existing session-level scorer tests pass (UserFrustration, ConversationCompleteness, etc.)
- Knowledge retention tests pass with new integration test approach
- Verified LastTurnKnowledgeRetention no longer accessible from public API
- Verified KnowledgeRetention uses _LastTurnKnowledgeRetention by default

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Add blank line after BuiltInSessionLevelScorer docstring per ruff format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Address ALKIS comment requesting parameter rename for clarity.

Changes:
- Rename `single_turn_scorer` field to `last_turn_scorer` in KnowledgeRetention class
- Update `__init__` parameter name and docstring
- Update all references in error messages and method docstrings
- Update all test references to use new parameter name
- Rename test function to match: test_knowledge_retention_uses_default_last_turn_scorer

The new name better reflects that this scorer evaluates the last turn of a
conversation, making the API more intuitive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
alkispoly-db and others added 3 commits December 17, 2025 18:11
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Change turn numbering in rationale from 0-based to 1-based to match
the UI display. Turns now appear as "Turn 1", "Turn 2", etc. instead
of "Turn 0", "Turn 1", etc.

This makes the rationale more intuitive and consistent with how the
UI presents conversation turns to users.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Resolve conflict in test_get_all_scorers by including all scorers:
- KnowledgeRetention (from ML-60455 branch)
- ToolCallEfficiency (from upstream master)
- ToolCallCorrectness (from upstream master)

All tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Copy link
Collaborator

@smoorjani smoorjani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some initial comments, I think mostly just using existing stuff that other session level scorers use and ensuring the behavior is the same in edge cases (e.g., nothing passed in/empty session, etc).

),
)

sorted_traces = sorted(session, key=lambda t: t.info.timestamp_ms)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with resolve_conversation_from_session

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to call resolve_conversation_from_session? It seems that we would need to convert it back to Trace anyways in order to forward to the underlying scorer (?)

Implements 6 of 9 review comments from smoorjani:

**API Simplification (Comment #1):**
- Remove last_turn_scorer parameter from public API
- Simplify constructor to use default Pydantic initialization
- Parameter kept as internal field for composition pattern

**Code Quality (Comments #3, mlflow#6, mlflow#8):**
- Remove duplicated kwargs validation (already in base class)
- Remove unnecessary one-line docstring from _evaluate_turn
- Use CategoricalRating enum instead of string literals ("yes"/"no")

**Logic Improvements (Comments mlflow#7, mlflow#9):**
- Move empty results check from _compute_aggregate to __call__
- Add per-turn details for success case (symmetric with failure case)

**Code Reuse Analysis (Comments mlflow#4, mlflow#5):**
- Researched _validate_session: serves different purpose (session ID validation)
- Researched resolve_conversation_from_session: returns different type (list[dict])
- Keeping inline implementations as they serve different purposes

**Deferred (Comment #2):**
- _create_judge() architecture issue deferred per reviewer (non-blocking)

All 5 KnowledgeRetention tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
alkispoly-db and others added 2 commits December 17, 2025 22:38
Implements 5 ALKIS review comments:

**Code Reusability:**
- Refactor _validate_session from InstructionsJudge to trace_utils.py
- Add validate_session() utility function for session validation
- Update InstructionsJudge to call refactored utility
- Add validate_session() call in KnowledgeRetention.__call__

**Code Simplification:**
- Remove redundant empty per_turn_results check (session already validated)
- Refactor per-turn rationale assembly into _format_per_turn_rationale()
- Eliminate duplicate code between success/failure branches

**User Experience:**
- Remove truncation from turn summaries (show full rationale)
- Improves readability of evaluation feedback

All 5 KnowledgeRetention tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Addresses ALKIS review comment to extract kwargs validation logic.

**Changes:**
- Add _validate_kwargs() utility method to SessionLevelScorer base class
- Update SessionLevelScorer.__call__() to use the utility method
- Add _validate_kwargs() call to KnowledgeRetention.__call__()

**Benefits:**
- Eliminates code duplication
- Ensures consistent validation across all session-level scorers
- Makes validation behavior explicit and testable

All 5 KnowledgeRetention tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
alkispoly-db added a commit to alkispoly-db/mlflow that referenced this pull request Dec 18, 2025
Update conversational scorer documentation to include the new
KnowledgeRetention scorer introduced in PR mlflow#19436.

Changes:
- Add KnowledgeRetention to Multi-Turn Scorers table in predefined.mdx
- Add KnowledgeRetention to Built-in Scorers list in multi-turn.mdx
- Position alphabetically between ConversationalToolCallEfficiency and UserFrustration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Copy link
Collaborator

@smoorjani smoorjani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM, left a few small comments to address before merging, but nothing critical

- Forgets or ignores information that is directly relevant to answering the current user question

IMPORTANT GUIDELINES:
1. Only evaluate information explicitly provided by the USER in prior turns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this judge also evaluate whether it forgets results it generates in previous turns? maybe this is too much to judge/conflating multiple judges together.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I think it conflates things. I think that tool efficiency would catch this.

Implementation improvements per review feedback:

1. Move imports to top-level per MLflow conventions
   - Add validate_session to top-level imports in all files
   - Add INVALID_PARAMETER_VALUE to top-level imports
   - Remove inline imports from functions
   - Remove TraceMetadataKey from __all__ exports

2. Replace empty session feedback with exception
   - Raise MlflowException for empty session instead of returning feedback
   - Consistent with InstructionsJudge error handling pattern

3. Refactor to use list[Feedback] directly
   - Update _compute_aggregate() to accept list[Feedback]
   - Update _format_per_turn_rationale() to accept list[Feedback]
   - Remove intermediate dict creation in __call__()
   - Cleaner type signatures and more maintainable code

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Copy link
Collaborator Author

@alkispoly-db alkispoly-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions!

- Forgets or ignores information that is directly relevant to answering the current user question

IMPORTANT GUIDELINES:
1. Only evaluate information explicitly provided by the USER in prior turns
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I think it conflates things. I think that tool efficiency would catch this.

Test improvements per review feedback:

1. Extract test helper function
   - Add _create_test_trace_with_session() helper
   - Reduces code duplication across 4 tests
   - Makes tests more readable and maintainable

2. Update test for exception handling
   - Add MlflowException import at top-level
   - Update test_knowledge_retention_empty_session to expect exception
   - Consistent with new empty session error handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@alkispoly-db alkispoly-db added this pull request to the merge queue Dec 18, 2025
Merged via the queue into mlflow:master with commit 8f4649c Dec 18, 2025
65 of 67 checks passed
@alkispoly-db alkispoly-db deleted the ML-60455 branch December 18, 2025 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/evaluation MLflow Evaluation area/tracing MLflow Tracing and its integrations rn/feature Mention under Features in Changelogs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants