Proposal summary
We like to extend the existing LLM-as-a-judge evaluation metrics to include a new judge metric called "Sycophancy". Full paper with the methodology and prompts can be found here: https://arxiv.org/pdf/2502.08177
Example of an existing judge metric (Hallucination) is defined here:
Expectation is the new judge is added to the frontend for using LLM-as-a-judge from the UI (Online Evaluation tab) as well as in the Python SDK. The appropriate docs needs to be updated and a video attached of the metric working.
Motivation
I would like to see more robust set of metrics and evaluations based on recent research
Proposal summary
We like to extend the existing LLM-as-a-judge evaluation metrics to include a new judge metric called "Sycophancy". Full paper with the methodology and prompts can be found here: https://arxiv.org/pdf/2502.08177
Example of an existing judge metric (Hallucination) is defined here:
Expectation is the new judge is added to the frontend for using LLM-as-a-judge from the UI (Online Evaluation tab) as well as in the Python SDK. The appropriate docs needs to be updated and a video attached of the metric working.
Motivation
I would like to see more robust set of metrics and evaluations based on recent research