[ML-59816][1/n] Add new fields to the genai_evaluation event telemetry #19018
xsh310 merged 2 commits into mlflow:master
Conversation
    # If the input dataset is a managed dataset, we pass the original dataset
    # to the evaluate function to preserve metadata like dataset name.
    data = data if is_managed_dataset else df

Moving this logic into _run_harness so that we can log the original data type passed to the genai.evaluate call.

It's not efficient to convert data to df twice (line 246 and inside _run_harness); can you refactor?
mlflow/genai/evaluation/utils.py
Outdated
    # Check if records are loaded to avoid triggering expensive load
    if data.has_records():
        df = data.to_df()
        eval_data_size = len(df)

Let me know if there's a better way to get the dataset size.

Computing the dataset size by converting the data to a dataframe is too heavy. Reading the code, this data is converted to a pandas dataframe when evaluating anyway; can we calculate the dataset size after that?
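The reviewer's suggestion can be sketched as follows; `evaluate_with_telemetry` and the sample records are hypothetical stand-ins for the mlflow harness, shown only to illustrate converting once and measuring afterwards:

```python
import pandas as pd

def evaluate_with_telemetry(data):
    # Hypothetical sketch (not the mlflow implementation): convert the
    # input to a pandas DataFrame exactly once, then derive the
    # telemetry size from the frame evaluation already needs.
    df = data if isinstance(data, pd.DataFrame) else pd.DataFrame(data)
    eval_data_size = len(df)  # cheap once the frame is materialized
    return df, eval_data_size

df, size = evaluate_with_telemetry([{"inputs": "q1"}, {"inputs": "q2"}])
```

This avoids paying for `to_df()` twice while still populating the size field.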
    CLASS = "class"
    BUILTIN = "builtin"
    DECORATOR = "decorator"
    INSTRUCTIONS = "instructions"

Expand the kind property of scorer to distinguish different types of scorer more easily.
Let me know if there's a better kind name for InstructionsJudge (i.e. the judge created by make_judge).
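The expanded enum can be sketched like this; the first four values come from the diff above, while GUIDELINES is an assumption based on the "guidelines" kind mentioned later in this thread:

```python
from enum import Enum

class ScorerKind(Enum):
    # Sketch of the expanded scorer-kind enum; values mirror the
    # constants shown in the diff. GUIDELINES is an assumption drawn
    # from the "guidelines" kind discussed elsewhere in this review.
    CLASS = "class"
    BUILTIN = "builtin"
    DECORATOR = "decorator"
    INSTRUCTIONS = "instructions"
    GUIDELINES = "guidelines"
```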
mlflow/genai/evaluation/utils.py
Outdated
    from mlflow.entities.evaluation_dataset import EvaluationDataset as EntityEvaluationDataset
    from mlflow.genai.datasets import EvaluationDataset as ManagedEvaluationDataset

Are these imported here to avoid circular imports? If so, can you add comments; otherwise let's move them to the top level.
mlflow/genai/evaluation/utils.py
Outdated
        else:
            # Records not loaded, don't trigger load for telemetry
            eval_data_size = None
    elif isinstance(data, EntityEvaluationDataset):

This is basically the same as above except for the data type; can we unify them as if isinstance(data, (xxx, xxx))?
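The unification can look like this sketch; both dataset classes are stubbed here rather than imported from mlflow:

```python
class ManagedEvaluationDataset:
    """Stub for mlflow.genai.datasets.EvaluationDataset (illustration only)."""

class EntityEvaluationDataset:
    """Stub for mlflow.entities.evaluation_dataset.EvaluationDataset (illustration only)."""

def is_evaluation_dataset(data):
    # A single isinstance call with a tuple of types replaces the two
    # near-identical branches.
    return isinstance(data, (ManagedEvaluationDataset, EntityEvaluationDataset))
```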
mlflow/genai/evaluation/utils.py
Outdated
    if data and isinstance(data[0], Trace):
        eval_data_type = "list[Trace]"
    elif data and isinstance(data[0], dict):
        eval_data_type = "list[dict]"
    else:
        eval_data_type = "list"

Suggested change:

    eval_data_type = f"list[{type(data[0]).__name__}]" if data else ...

Is it possible data is an empty list, though?
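As a sketch, the reviewer's ellipsis can be resolved with an explicit "list" fallback, which answers the empty-list question (`Trace` is stubbed here for illustration):

```python
class Trace:
    """Stub for mlflow.entities.Trace, for illustration only."""

def eval_data_type_of(data):
    # Name the element type dynamically; an empty list falls back to
    # plain "list", covering the reviewer's empty-list concern.
    return f"list[{type(data[0]).__name__}]" if data else "list"
```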
mlflow/genai/evaluation/utils.py
Outdated
        eval_data_type = "pyspark.DataFrame"
    else:
        eval_data_type = "pandas.DataFrame"
    eval_data_size = len(data) if hasattr(data, "__len__") else data.count()

data.count() will trigger a Spark job to calculate the total count, which is expensive; do we really need to know the count for a Spark dataframe?
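One way to avoid the Spark job, assuming it is acceptable to omit the size for Spark input: report a size only when the object exposes a cheap `__len__` (the `FakeSparkDF` stub below is illustrative):

```python
def safe_eval_data_size(data):
    # len() is O(1) for pandas DataFrames and lists; objects without
    # __len__ (e.g. a pyspark DataFrame, where .count() launches a
    # Spark job) report None instead of paying for a full count.
    return len(data) if hasattr(data, "__len__") else None

class FakeSparkDF:
    """Stub standing in for pyspark.sql.DataFrame (defines no __len__)."""
    def count(self):
        raise RuntimeError("expensive Spark job -- should not be called")
```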
mlflow/genai/evaluation/utils.py
Outdated
    if isinstance(data, (ManagedEvaluationDataset, EntityEvaluationDataset)):
        try:
            if data.has_records():
                data = data.to_df()

Similar to the above: let's avoid multiple conversions of the user's data to a pandas dataframe.
mlflow/genai/evaluation/utils.py
Outdated
    # Check pandas DataFrame
    try:
        import pandas as pd

When would this fail? I thought all datasets are converted to pandas dataframes?
mlflow/genai/evaluation/utils.py
Outdated
    try:
        if not data[field].isna().all():
            provided_fields.add(field)
    except Exception:
mlflow/genai/evaluation/utils.py
Outdated
    # Check which columns are present in the DataFrame
    for field in target_fields:
        if field in data.columns:
            # Check if the column has any non-null values

Actually, why does this matter? What signal do we get from knowing whether inputs/outputs are None or not? I assume the scorers have expectations on the data anyway.
mlflow/genai/evaluation/utils.py
Outdated
    if not data:
        return provided_fields
    # List of Trace objects
    if hasattr(data[0], "__class__") and data[0].__class__.__name__ == "Trace":

Suggested change:

    if isinstance(data[0], Trace):
mlflow/genai/evaluation/utils.py
Outdated
    for field in target_fields:
        if field in data[0]:
            provided_fields.add(field)

Suggested change:

    provided_fields = data[0].keys() & target_fields
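The suggested one-liner works because `dict.keys()` returns a set-like view; a minimal runnable version:

```python
def provided_fields_of(record, target_fields):
    # dict.keys() is a set-like view, so the per-field membership loop
    # collapses to a single set intersection.
    return record.keys() & set(target_fields)
```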
    @property
    def kind(self) -> ScorerKind:
        """Get the kind of this scorer."""

Suggested change: remove the docstring line """Get the kind of this scorer.""".
mlflow/telemetry/events.py
Outdated
    scorer_kind_count = Counter()
    for scorer in scorers:
        if isinstance(scorer, Scorer):
            try:
                scorer_kind_count[scorer.kind.value] += 1
            except Exception:
                pass

Does this work?

    scorer_kind_count = Counter(
        scorer.kind.value
        for scorer in scorers
        if isinstance(scorer, Scorer)
    )

Why do we need try/except for the assignment?
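The reviewer's generator-based version, made runnable with stub `Scorer` and `ScorerKind` classes standing in for the mlflow ones:

```python
from collections import Counter
from enum import Enum

class ScorerKind(Enum):
    """Stub for the mlflow scorer-kind enum (illustration only)."""
    BUILTIN = "builtin"
    DECORATOR = "decorator"

class Scorer:
    """Stub for mlflow.genai.scorers.Scorer (illustration only)."""
    def __init__(self, kind):
        self.kind = kind

scorers = [Scorer(ScorerKind.BUILTIN), Scorer(ScorerKind.BUILTIN),
           Scorer(ScorerKind.DECORATOR), "not a scorer"]

# Counter consumes the generator directly, replacing the manual loop;
# non-Scorer entries are filtered out by the isinstance guard.
scorer_kind_count = Counter(
    scorer.kind.value for scorer in scorers if isinstance(scorer, Scorer)
)
```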
| """Test that built-in scorers have the correct kind property. | ||
|
|
||
| Most built-in scorers should return ScorerKind.BUILTIN, except: | ||
| - Guidelines scorer which returns ScorerKind.GUIDELINES | ||
| """ |
There was a problem hiding this comment.
| """Test that built-in scorers have the correct kind property. | |
| Most built-in scorers should return ScorerKind.BUILTIN, except: | |
| - Guidelines scorer which returns ScorerKind.GUIDELINES | |
| """ |
    from mlflow.genai.judges import make_judge
    from mlflow.genai.scorers import Guidelines, RelevanceToQuery

Let's avoid importing within the test.
serena-ruan left a comment:

Overall LGTM! Left some nits.
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
…_evaluate event logging Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Documentation preview for d605f37 is available at: More info
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
We would like to add a few new fields to the genai_evaluate event logging. This logging change has been reviewed and approved in http://go/.c441888. We hope that these new fields will help us better understand the distribution of eval dataset size and the distribution of usage across different scorer types. This information will help us better prioritize features that add the most user value.

Logging Spec (Part One Covers Bold Items):
- scorer kind: one of the str values "builtin" | "decorator" | "instructions" | "guidelines"
- eval_data_size: int
How is this PR tested?
Manual Test Plan
Tested the telemetry by running the following Python notebook:
Record | Parameter
 | {'builtin_scorers': ['Safety', 'RelevanceToQuery', 'Guidelines']}
 | {'predict_fn_provided': True, 'builtin_scorers': ['Guidelines', 'RelevanceToQuery', 'Safety'], 'scorer_kind_count': {'builtin': 2, 'decorator': 1, 'guidelines': 1}}

Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.