[issue-2520] [SDK] Added Sycophancy Evaluation Metric by yashkumar2603 · Pull Request #2624 · comet-ml/opik

yashkumar2603 · 2025-06-29T17:01:21Z

Details

Resolves #2520
This PR adds the SycEval metric for evaluating sycophantic behavior in large language models. The metric tests whether models change their responses based on user pressure rather than maintaining independent reasoning by presenting rebuttals of varying rhetorical strength.
It is based on this paper https://arxiv.org/pdf/2502.08177
as linked in the issue.

Key Features:

Multi-step evaluation process: Initial classification then Rebuttal generation then Response evaluation then Sycophancy detection
Configurable rebuttal types: Simple, ethos, justification, and citation-based rebuttals
Context modes: In-context and preemptive rebuttal presentation
Separate rebuttal model: Uses dedicated model (defaults to llama3-8b) to avoid contamination
Binary scoring: Returns 0.0 (no sycophancy) or 1.0 (sycophancy detected)
Detailed metadata: Includes initial/rebuttal classifications and sycophancy type

Implementation:

SycEval class with sync/async scoring methods
Response classification and parsing
Error handling and validation for all classification types
can be imported using from opik.evaluation.metrics import SycEval in SDK easily, I tried to follow the coding style of the project, and other things mentioned in the contributing doc.

Issues

I faced one problem, I wasnt able to figure out a way to add the different results found out by the sycophancy analysis, such as sycophancy_type into the scores category in FrontEnd, as that would have required a STRING type in the LLM_SCHEMA_TYPE
So I instead made those available on the SDK, but not on the frontend. Please suggest something to tackle this problem. Guide me to make the necessary improvements in PR.

Documentation

Added comprehensive docstrings with usage examples
Updated evaluation metrics documentation
Added configuration parameter explanations
Included research context and score interpretation guidelines (a little when needed)

Working Video

2025-06-29_23-50-51.mp4

/claim #2520

Edit: added working video I forgot to add

yashkumar2603 · 2025-06-29T18:24:22Z

Hello, @vincentkoc please review and suggest changes if any. Also kindly help me understand the frontend issue mentioned above.

Thank you 😃

vincentkoc · 2025-07-02T19:40:14Z

Hello, @vincentkoc please review and suggest changes if any. Also kindly help me understand the frontend issue mentioned above.

Thank you 😃

Thanks! @yashkumar2603 the team will review and circle back.

sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py

sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/parser.py

yaricom · 2025-07-04T12:01:42Z

Hi @yashkumar2603 ! Thank you for your work on this PR — it looks very promising. I’ve left a few review comments. Additionally, I’d like to ask you to add a unit test for your metric that uses mocked model calls but verifies the scoring logic in both synchronous and asynchronous modes.

Please take a look how other LLM judge metrics are tested.

yashkumar2603 · 2025-07-04T14:02:11Z

Thanks for the review @yaricom !!
I am glad you liked the work. I will surely take a look at the unit tests, fix the reviews and update the PR.
Thank you again for your time !!

yashkumar2603 · 2025-07-04T17:32:02Z

I have added the unit tests and also made necessary changes based on the reviews.
Kindly review 🙏🏾

sdks/python/tests/library_integration/metrics_with_llm_judge/test_evaluation_metrics.py

yaricom · 2025-07-07T11:50:07Z

@aadereiko Could you please take a look at frontend changes if you have any comments or suggestions.

aadereiko · 2025-07-07T17:00:41Z

@yaricom @yashkumar2603
The FE part looks good :)

yashkumar2603 · 2025-07-07T18:01:52Z

I have made the changes mentioned in the above comment, moved the test from integration to unit.
You are right, I had misplaced it. Thank you for pointing out.
Kindly review, merge it.

Thank you for your time.

sdks/python/tests/unit/evaluation/metrics/llm_judges/syc_eval/test_parser.py

sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py

sdks/python/tests/library_integration/metrics_with_llm_judge/test_evaluation_metrics.py

sdks/python/tests/unit/evaluation/metrics/llm_judges/syc_eval/test_parser.py

yaricom · 2025-07-08T09:52:20Z

Dear @yashkumar2603 ! Thank you for committing the changes. Please run all tests locally using the OPIK server to ensure there are no unexpected errors. You can find detailed instructions on how to run the OPIK server here: https://www.comet.com/docs/opik/quickstart

andrescrz · 2025-07-15T09:03:36Z

Hi @yashkumar2603

Thank you for your contribution! 🙏 It looks like there are some merge conflicts that need to be resolved before we can continue. When you have a chance, could you please update the branch? Once the conflicts are resolved, we’ll be happy to provide a new review.

Let us know if you have any questions or need assistance!

yashkumar2603 · 2025-07-16T11:09:35Z

Thank you for your reviews @yaricom @andrescrz 😃
I will surely make the changes, resolve conflicts and then update the PR. Been a busy for a few days, I will surely do this when i get the time.

yashkumar2603 · 2025-07-20T10:51:41Z

Hello all!!
I have updated the code based on recommendations by @yaricom and also resolved the merge conflicts from main.
The llama-3 error was coming in tests because i was teesting with it on my local and forgot to change it to the model mentioned in the original paper. Really sorry for the confusion.

Please review @andrescrz.

Thank you for your time 🙏🏾

sdks/python/examples/metrics.py

1. Implemented suggestions from reviews on the previous commit and made necessary changes. 2. Added unit tests for the sycophancy_evaluation_metric just like how it is applied for the other metrics

Moved test for invalid score into the unit tests as it uses a dummy model and doesnt need to be in integration tests. removed unnecessary @model_parametrizer from the same test.

andrescrz · 2025-09-23T08:33:03Z

Hi @yashkumar2603

I'd appreciate if you could take the following actions:

Solve merge conflicts.
Fix the linting errors per @yaricom comment.
Review and address the co-pilot comments. Please use your best judgement to discard those comments that make no sense.

It'd be nice to get this PR to the finish line soon, so our users can enjoy it :)

Thank you very much for all your effort here.

vincentkoc · 2025-10-30T02:56:31Z

@yashkumar2603 any luck updating your PR? Bounty still stands, almost finished

…tric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

vincentkoc · 2025-11-10T00:21:20Z

@yashkumar2603 @yaricom i have addressed the issues, i have removed the LLM-as-a-judge in the FE (frontend) as its not the same metric implementation due to lack of rebutal model in UI. Will merge once the tests pass including lint.

Copilot

Pull Request Overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 3 comments.

apps/opik-documentation/documentation/fern/docs/evaluation/metrics/sycophancy_evaluation.mdx

sdks/python/examples/metrics.py

sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/parser.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…rics/sycophancy_evaluation.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

vincentkoc · 2025-11-10T00:33:05Z

Hi @yashkumar2603

I'd appreciate if you could take the following actions:
1. Solve merge conflicts.

2. Fix the linting errors per @yaricom comment.

3. Review and address the co-pilot comments. Please use your best judgement to discard those comments that make no sense.
It'd be nice to get this PR to the finish line soon, so our users can enjoy it :)

Thank you very much for all your effort here.

@yaricom resolved, ready to merge

unit test

intergration test:

into feat/optimizer-hybrid * 'feat/optimizer-hybrid' of https://github.com/comet-ml/opik: [issue-2520] [SDK] Added Sycophancy Evaluation Metric (#2624)

into vk/optimizer-oa_agent * 'vk/optimizer-oa_agent' of https://github.com/comet-ml/opik: [issue-2520] [SDK] Added Sycophancy Evaluation Metric (#2624) [OPIK-2986] [FE] Refactor comparison pages to use NavigationTag component (#4006) [OPIK-2992] [FE] Add tooltips to DateTag component (#4005) [OPIK-2993] [FE] Add tooltips to feedback scores and tags icons (#4007) [OPIK-3008] [FE] Refactor: NavigationTag infrastructure (#3972)

* Added Sycophancy Evaluation Metric in SDK, FE, Docs * Added unit tests, fixed reviews 1. Implemented suggestions from reviews on the previous commit and made necessary changes. 2. Added unit tests for the sycophancy_evaluation_metric just like how it is applied for the other metrics * Fixed reviews on added tests. Moved test for invalid score into the unit tests as it uses a dummy model and doesnt need to be in integration tests. removed unnecessary @model_parametrizer from the same test. * Resolving merge conflicts and improving tests from feedback * Updating default rebuttal model for LiteLLM compatibility * Added explanations in examples * Update test_evaluation_metrics.py for formatting after new stuff from main * Update test_evaluation_metrics.py * Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/metric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update sycophancy_evaluation.mdx * Update metric.py * Update llm.ts * Update llm.ts * Update __init__.py * Update metric.py * chore: lint * Update sdks/python/examples/metrics.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update apps/opik-documentation/documentation/fern/docs/evaluation/metrics/sycophancy_evaluation.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update parser.py * Update parser.py --------- Co-authored-by: Iaroslav Omelianenko <yaric_mail@yahoo.com> Co-authored-by: Vincent Koc <vincentk@comet.com> Co-authored-by: Vincent Koc <vincentkoc@ieee.org> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

yashkumar2603 requested review from a team as code owners June 29, 2025 17:01

algora-pbc bot added the 🙋 Bounty claim label Jun 29, 2025

algora-pbc bot mentioned this pull request Jun 29, 2025

[FR]: New Evaluaton Metric "LLM Sycophancy" (SycEval) #2520

Closed

alexkuzmik requested a review from yaricom July 2, 2025 11:56

yaricom requested changes Jul 4, 2025

View reviewed changes

yashkumar2603 requested a review from yaricom July 4, 2025 17:37

yaricom requested changes Jul 7, 2025

View reviewed changes

sdks/python/tests/library_integration/metrics_with_llm_judge/test_evaluation_metrics.py Show resolved Hide resolved

yashkumar2603 requested a review from yaricom July 7, 2025 17:59

yaricom requested changes Jul 8, 2025

View reviewed changes

yaricom requested changes Jul 21, 2025

View reviewed changes

sdks/python/examples/metrics.py Outdated Show resolved Hide resolved

yashkumar2603 added 4 commits August 3, 2025 20:45

Added Sycophancy Evaluation Metric in SDK, FE, Docs

7b90a46

Added unit tests, fixed reviews

929fb4c

1. Implemented suggestions from reviews on the previous commit and made necessary changes. 2. Added unit tests for the sycophancy_evaluation_metric just like how it is applied for the other metrics

Fixed reviews on added tests.

deefa53

Moved test for invalid score into the unit tests as it uses a dummy model and doesnt need to be in integration tests. removed unnecessary @model_parametrizer from the same test.

Resolving merge conflicts and improving tests from feedback

ea49673

yashkumar2603 force-pushed the add-sycophancy-evaluation branch from 36e2a46 to 0423038 Compare August 3, 2025 15:23

Updating default rebuttal model for LiteLLM compatibility

2a1bc01

yashkumar2603 force-pushed the add-sycophancy-evaluation branch from 0423038 to 2a1bc01 Compare August 3, 2025 15:28

vincentkoc and others added 11 commits November 9, 2025 16:03

Merge branch 'main' into add-sycophancy-evaluation

8c35d82

Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/me…

2331f1a

…tric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/me…

3fb6050

…tric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/me…

ac9b5af

…tric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdks/python/src/opik/evaluation/metrics/llm_judges/syc_eval/me…

bc5c646

…tric.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sycophancy_evaluation.mdx

449a87f

Update metric.py

060cbb6

Update llm.ts

2e131b5

Update llm.ts

2874077

Update __init__.py

62ce1b8

Update metric.py

e2a05a1

vincentkoc changed the title ~~Added Sycophancy Evaluation Metric in SDK, FE, Docs~~ [issue-2520] [SDK] Added Sycophancy Evaluation Metric in SDK, FE, Docs Nov 10, 2025

chore: lint

437fa92

vincentkoc requested a review from Copilot November 10, 2025 00:28

Copilot AI reviewed Nov 10, 2025

View reviewed changes

vincentkoc and others added 4 commits November 9, 2025 16:31

Update sdks/python/examples/metrics.py

1c503df

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update apps/opik-documentation/documentation/fern/docs/evaluation/met…

c281ec9

…rics/sycophancy_evaluation.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update parser.py

8a48967

Update parser.py

8cb2f3e

vincentkoc approved these changes Nov 10, 2025

View reviewed changes

vincentkoc requested a review from yaricom November 10, 2025 00:33

vincentkoc changed the title ~~[issue-2520] [SDK] Added Sycophancy Evaluation Metric in SDK, FE, Docs~~ [issue-2520] [SDK] Added Sycophancy Evaluation Metric Nov 10, 2025

vincentkoc merged commit be02cc5 into comet-ml:main Nov 10, 2025
8 of 96 checks passed

Conversation

yashkumar2603 commented Jun 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Details

Issues

Documentation

Working Video

Uh oh!

yashkumar2603 commented Jun 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vincentkoc commented Jul 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yaricom commented Jul 4, 2025

Uh oh!

yashkumar2603 commented Jul 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yashkumar2603 commented Jul 4, 2025

Uh oh!

Uh oh!

yaricom commented Jul 7, 2025

Uh oh!

aadereiko commented Jul 7, 2025

Uh oh!

yashkumar2603 commented Jul 7, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yaricom commented Jul 8, 2025

Uh oh!

andrescrz commented Jul 15, 2025

Uh oh!

yashkumar2603 commented Jul 16, 2025

Uh oh!

yashkumar2603 commented Jul 20, 2025

Uh oh!

Uh oh!

andrescrz commented Sep 23, 2025

Uh oh!

vincentkoc commented Oct 30, 2025

Uh oh!

vincentkoc commented Nov 10, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vincentkoc commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

yashkumar2603 commented Jun 29, 2025 •

edited

Loading

yashkumar2603 commented Jun 29, 2025 •

edited

Loading

yashkumar2603 commented Jul 4, 2025 •

edited

Loading

vincentkoc commented Nov 10, 2025 •

edited

Loading