[issue-2520] [SDK] Added Sycophancy Evaluation Metric #2624
vincentkoc merged 27 commits into comet-ml:main
Conversation
Hello @vincentkoc, please review and suggest changes if any. Also, kindly help me understand the frontend issue mentioned above.

Thanks! @yashkumar2603, the team will review and circle back.
Review threads:
sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py
sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/parser.py
Hi @yashkumar2603! Thank you for your work on this PR — it looks very promising. I've left a few review comments. Additionally, I'd like to ask you to add a unit test for your metric that uses mocked model calls but verifies the scoring logic in both synchronous and asynchronous modes. Please take a look at how other LLM judge metrics are tested.

Thanks for the review @yaricom! I have added the unit tests and also made the necessary changes based on the reviews.
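The pattern the reviewer asks for — mocked model calls with the scoring logic verified in both sync and async modes — can be sketched roughly as below. This is an illustrative stand-in only: `FakeModel` and `SycophancyJudge` are hypothetical names and do not mirror the actual opik SDK API or its test suite.

```python
# Sketch of a mocked-model unit test: no real LLM call, canned judge output,
# same score expected from the sync and async code paths.
import asyncio
import json


class FakeModel:
    """Mocked model that returns a canned judge verdict instead of calling an LLM."""

    def __init__(self, canned_response: str):
        self._response = canned_response

    def generate(self, prompt: str) -> str:
        return self._response

    async def agenerate(self, prompt: str) -> str:
        return self._response


class SycophancyJudge:
    """Toy judge that parses the (mocked) model output into a numeric score."""

    def __init__(self, model: FakeModel):
        self._model = model

    def score(self, text: str) -> float:
        return self._parse(self._model.generate(text))

    async def ascore(self, text: str) -> float:
        return self._parse(await self._model.agenerate(text))

    @staticmethod
    def _parse(raw: str) -> float:
        value = float(json.loads(raw)["score"])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
        return value


# Both code paths should yield the same score from the same mocked response.
judge = SycophancyJudge(FakeModel('{"score": 1.0}'))
sync_score = judge.score("some output")
async_score = asyncio.run(judge.ascore("some output"))
assert sync_score == async_score == 1.0
```

The point of the mock is that the test exercises only the parsing and scoring logic, so it can run in the unit-test suite without network access or API keys.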
Review thread:
sdks/python/tests/library_integration/metrics_with_llm_judge/test_evaluation_metrics.py
@aadereiko Could you please take a look at the frontend changes if you have any comments or suggestions.

@yaricom I have made the changes mentioned in the above comment and moved the test from integration to unit. Thank you for your time.
Review threads:
sdks/python/tests/unit/evaluation/metrics/llm_judges/syc_eval/test_parser.py
sdks/python/tests/library_integration/metrics_with_llm_judge/test_evaluation_metrics.py
Dear @yashkumar2603! Thank you for committing the changes. Please run all tests locally against the OPIK server to ensure there are no unexpected errors. You can find detailed instructions on how to run the OPIK server here: https://www.comet.com/docs/opik/quickstart
Thank you for your contribution! 🙏 It looks like there are some merge conflicts that need to be resolved before we can continue. When you have a chance, could you please update the branch? Once the conflicts are resolved, we'll be happy to provide a new review. Let us know if you have any questions or need assistance!
Thank you for your reviews @yaricom @andrescrz 😃

Hello all! Please review, @andrescrz. Thank you for your time 🙏🏾
1. Implemented suggestions from the reviews on the previous commit and made the necessary changes.
2. Added unit tests for the sycophancy evaluation metric, following the pattern used for the other metrics.
Moved the test for invalid scores into the unit tests, since it uses a dummy model and doesn't need to be in the integration tests. Removed the unnecessary @model_parametrizer from the same test.
I'd appreciate it if you could take the following actions:
It'd be nice to get this PR to the finish line soon, so our users can enjoy it :) Thank you very much for all your effort here.
@yashkumar2603 any luck updating your PR? The bounty still stands; it's almost finished.
@yashkumar2603 @yaricom I have addressed the issues. I have removed the LLM-as-a-judge in the FE (frontend), as it's not the same metric implementation due to the lack of a rebuttal model in the UI. Will merge once the tests pass, including lint.
Review threads:
apps/opik-documentation/documentation/fern/docs/evaluation/metrics/sycophancy_evaluation.mdx
sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/parser.py
@yaricom resolved, ready to merge
Squashed commits:
* Added Sycophancy Evaluation Metric in SDK, FE, Docs
* Added unit tests, fixed review comments
* Moved the invalid-score test into the unit tests; removed unnecessary @model_parametrizer
* Resolved merge conflicts and improved tests from feedback
* Updated default rebuttal model for LiteLLM compatibility
* Added explanations in examples
* Formatting, lint, documentation, and parser updates

Co-authored-by: Iaroslav Omelianenko <yaric_mail@yahoo.com>, Vincent Koc <vincentk@comet.com>, Vincent Koc <vincentkoc@ieee.org>, Copilot <175728472+Copilot@users.noreply.github.com>


Details
Resolves #2520
This PR adds the SycEval metric for evaluating sycophantic behavior in large language models. The metric tests whether models change their responses based on user pressure rather than maintaining independent reasoning by presenting rebuttals of varying rhetorical strength.
It is based on the SycEval paper (https://arxiv.org/pdf/2502.08177) linked in the issue.
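The evaluation flow described above — an initial answer, a rebuttal applying user pressure, then a check of whether the answer flipped — can be sketched as below. This is my reading of the paper's progressive/regressive classification, not the SDK's actual implementation; all names are illustrative.

```python
# Rough sketch of classifying one rebuttal round in a SycEval-style check.
def classify_sycophancy(initial_correct: bool, final_correct: bool, changed: bool) -> str:
    """Classify the model's behavior after a rebuttal.

    - "progressive": an initially incorrect answer flips to correct under pressure
    - "regressive":  an initially correct answer flips to incorrect under pressure
    - "none":        the model holds its position (no sycophantic flip)
    """
    if not changed:
        return "none"
    if not initial_correct and final_correct:
        return "progressive"
    if initial_correct and not final_correct:
        return "regressive"
    return "none"


# Example: the model had the right answer, then caved to the user's rebuttal.
assert classify_sycophancy(initial_correct=True, final_correct=False, changed=True) == "regressive"
```

Regressive flips are the harmful case the metric is most interested in, since they show the model abandoning a correct answer purely because the user pushed back.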
Key Features:
- SycEval class with sync/async scoring methods
- Easily importable in the SDK via from opik.evaluation.metrics import SycEval
Implementation:
I tried to follow the coding style of the project and the other guidelines mentioned in the contributing doc.
Issues
I faced one problem: I wasn't able to figure out a way to surface the extra results produced by the sycophancy analysis, such as sycophancy_type, in the scores category on the frontend, as that would have required a STRING type in LLM_SCHEMA_TYPE.
So I instead made those available in the SDK, but not on the frontend. Please suggest something to tackle this problem, and guide me in making the necessary improvements to the PR.
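One possible workaround for the string-typed results, sketched below with a stand-alone dataclass: since the frontend schema only accepts numeric score types, string-valued outputs like sycophancy_type can ride along in the score's reason and metadata fields on the SDK side. This only loosely mirrors the shape of a score-result object; the field names here are illustrative assumptions, not opik's confirmed API.

```python
# Sketch: keep the frontend-facing value numeric, and carry the categorical
# sycophancy_type in reason/metadata, which stay SDK-side.
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchScoreResult:
    name: str
    value: float                              # numeric score, safe for the FE schema
    reason: str = ""                          # human-readable explanation string
    metadata: Dict[str, Any] = field(default_factory=dict)  # extra structured results


result = SketchScoreResult(
    name="sycophancy",
    value=1.0,
    reason="Regressive sycophancy: correct initial answer flipped after rebuttal.",
    metadata={"sycophancy_type": "regressive"},
)
assert result.metadata["sycophancy_type"] == "regressive"
```

The trade-off is the one the PR author describes: the categorical detail is fully available to SDK users, while the frontend only ever sees the numeric value.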
Documentation
Working Video
2025-06-29_23-50-51.mp4
/claim #2520
Edit: added the working video I forgot to attach.