[issue-2202] [BE] [SDK] feat: Support OpenAI TTS models tracking (audio.speech) #5010

Merged
andrescrz merged 8 commits into comet-ml:main from Samoppakiks:feature/openai-tts-tracking on Feb 23, 2026

@Samoppakiks
Contributor

@Samoppakiks Samoppakiks commented Jan 31, 2026

Details

Adds support for tracking OpenAI TTS (audio.speech.create) calls in both the Java backend and Python SDK.

Backend (Java):

  • inputCostPerCharacter field in ModelCostData
  • audioSpeechCost() calculator in SpanCostCalculator
  • AUDIO_SPEECH mode wired in CostService

Python SDK:

  • tts_create_decorator.py — tracks audio.speech.create (sync)
  • tts_streaming_response_decorator.py — tracks streaming variant
  • Character-based usage tracking (input_characters → prompt_tokens, completion_tokens=0)
  • Patching wired into opik_tracker.py
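For illustration, the character-based usage described in the list above could be derived roughly as follows. This is a minimal sketch, not the SDK's code: `build_tts_usage` is a hypothetical helper, counting characters with `len()` is an assumption, and mirroring the count into `prompt_tokens` is an assumption about how the usage dict is shaped.

```python
from typing import Dict


def build_tts_usage(input_text: str) -> Dict[str, int]:
    """Hypothetical helper: derive a usage dict for a TTS call from its input text."""
    # Assumption: characters are counted with len() on the input string.
    input_characters = len(input_text)
    return {
        "input_characters": input_characters,
        # Assumption: the character count is mirrored into prompt_tokens.
        "prompt_tokens": input_characters,
        # The PR sets completion_tokens to 0 for TTS calls.
        "completion_tokens": 0,
    }


usage = build_tts_usage("Hello, world!")
print(usage)
```

Keeping `completion_tokens` present, even as zero, matches the PR's later fix that lets the dict pass through the OpenAI completions usage pipeline.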

Change checklist

  • Java backend: cost calculation for TTS models
  • Python SDK: sync + streaming decorators
  • Tests: unit tests for cost calculation + integration tests for TTS tracking
  • Linting: pre-commit run --all-files passes
  • Addressed review feedback from @andrescrz (parameterized tests) and @petrotiurin (usage key alignment)

Issues

Closes #2202

Documentation

No documentation changes needed — TTS tracking works automatically when track_openai() is called, same as other OpenAI integrations.
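To show why no user-facing change is needed, here is a rough sketch of how wrapping a client method can record a call transparently. It is not Opik's implementation: `FakeSpeech`, `track_speech_create`, and `recorded_spans` are hypothetical stand-ins, and the recorded field names are assumptions chosen to mirror what the PR says is tracked (input parameters, usage, model, provider).

```python
import functools
from typing import Any, Callable, Dict, List

# Hypothetical sink for recorded calls; a real tracker would send spans elsewhere.
recorded_spans: List[Dict[str, Any]] = []


class FakeSpeech:
    """Stand-in for a client's audio.speech resource; returns fake audio bytes."""

    def create(self, *, model: str, voice: str, input: str) -> bytes:
        return b"fake-audio"


def track_speech_create(create: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap create() so every call records its inputs, model, provider and usage."""

    @functools.wraps(create)
    def wrapper(**kwargs: Any) -> Any:
        result = create(**kwargs)
        recorded_spans.append(
            {
                "name": "audio.speech.create",
                "input": kwargs,
                "model": kwargs.get("model"),
                "provider": "openai",
                "usage": {"input_characters": len(kwargs.get("input", ""))},
            }
        )
        return result

    return wrapper


speech = FakeSpeech()
# Patch the bound method on the instance; callers keep using the same call site.
speech.create = track_speech_create(speech.create)
audio = speech.create(model="tts-1", voice="alloy", input="Hi there")
```

The caller's code is unchanged after patching, which is the property the description relies on.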

- Create audio/ module with TTSCreateTrackDecorator and
  TTSStreamingResponseCreateTrackDecorator
- Patch audio.speech.create and
  audio.speech.with_streaming_response.create in opik_tracker.py
- Track input parameters, character-based usage, model and provider
- Follow existing patterns from videos integration
- Add inputCostPerCharacter field to ModelCostData
- Add audioInputCharacterPrice field to ModelPrice
- Add audioSpeechCost calculator to SpanCostCalculator
- Wire AUDIO_SPEECH mode in CostService.resolveCalculator()
- Cost = inputCostPerCharacter * input_characters from usage

Python SDK tests:
- test_openai_audio_speech_create__happyflow
- test_openai_audio_speech_with_streaming_response__happyflow
- test_openai_audio_speech_create__tts_1_hd (model name tracking)
- test_openai_audio_speech_create__with_optional_params
- test_openai_audio_speech_create__character_count_usage

Backend tests:
- audioSpeechCost calculator tests (zero price, zero chars, validation)
- audioSpeechCost for tts-1 and tts-1-hd pricing
- CostService integration tests for tts-1 and tts-1-hd
- Updated existing video tests for new ModelPrice constructor
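The cost rule listed above (cost = inputCostPerCharacter * input_characters) is implemented in the Java backend. As a language-neutral illustration, a Python sketch of the same arithmetic might look like this. `audio_speech_cost` here is a hypothetical Python function, not the backend method, and the price is an illustrative number rather than an actual model price.

```python
from decimal import Decimal


def audio_speech_cost(input_cost_per_character: Decimal, input_characters: int) -> Decimal:
    """Cost = price per character * number of input characters."""
    return input_cost_per_character * Decimal(input_characters)


# Illustrative price only; not taken from any real price list.
price = Decimal("0.000015")
cost = audio_speech_cost(price, 1000)
print(cost)
```

Using `Decimal` avoids binary floating-point rounding when multiplying very small per-character prices. A zero price or zero characters yields a zero cost, which lines up with the zero-price and zero-chars cases named in the backend tests.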
@Samoppakiks
Contributor Author

Hi team! 👋 Friendly nudge on this PR — it's been about a week since submission. Just checking if there's anything I can address or clarify to move this forward. Happy to make any adjustments needed!

Thanks for your time! 🙏

Member

@andrescrz andrescrz left a comment

Hi @Samoppakiks

Thank you so much for your contribution!

I've reviewed the backend part and it mostly looks good to me (LGTM). I just left some minor comments about parameterizing similar tests to reduce duplication.

Someone from the team will review the Python SDK part soon.

We really appreciate your work on this!

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the python, java, Backend, tests, and Python SDK labels Feb 19, 2026
@Samoppakiks
Contributor Author

Hey @andrescrz, thanks for the review! Parameterized the tests and cleaned up the unused imports in 41150a8. Let me know if anything else needs fixing.

@Samoppakiks
Contributor Author

Hi team! 👋 Just a heads-up — I noticed another attempt on this issue (#5325) that covers only the Python SDK side. Our PR here includes both the Java backend (cost calculation, inputCostPerCharacter, AUDIO_SPEECH mode) and the Python SDK (sync + streaming decorators with character-based usage tracking), so it's the more complete implementation.

Backend is already LGTM from @andrescrz, and I've addressed his parameterization feedback in 41150a8. Would love to get the SDK portion reviewed when someone gets a chance — happy to iterate on any feedback! 🙏

@andrescrz
Member

andrescrz commented Feb 20, 2026

Hi @Samoppakiks ! Thanks for the heads-up — and appreciate you calling that out.

Totally agree this PR is the more complete implementation; we will prioritize it over other similar attempts.

Backend side is already LGTM from my end, and thanks for addressing the previous feedback!

We’ll prioritize getting someone to review the SDK portion next. @alexkuzmik @petrotiurin @yaricom Thanks again for the thorough work here, and we’ll follow up soon with any feedback.

Contributor

@petrotiurin petrotiurin left a comment

Apart from the comments below, this looks good from the SDK side. Thank you for making the effort and putting this change together!

Also please run the linters with pre-commit run --all-files.

Add completion_tokens:0 to TTS usage dicts so they correctly pass
through the OpenAI completions usage pipeline instead of falling
through to unknown provider format. Update test expectations to
match the actual backend-compatible usage format (original_usage.*
prefixed keys). Update backend audioSpeechCost to read
original_usage.input_characters with fallback for backward compat.
Fix ruff formatting issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
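The backward-compatible lookup described in this commit message could be sketched as below. The real change lives in the Java backend; `read_input_characters` is a hypothetical Python equivalent, and returning 0 when neither key is present is an assumption.

```python
from typing import Mapping

PREFIXED_KEY = "original_usage.input_characters"
LEGACY_KEY = "input_characters"


def read_input_characters(usage: Mapping[str, int]) -> int:
    """Prefer the original_usage.* key; fall back to the unprefixed key."""
    if PREFIXED_KEY in usage:
        return usage[PREFIXED_KEY]
    # Assumption: a missing count is treated as zero characters.
    return usage.get(LEGACY_KEY, 0)


new_style = {"original_usage.input_characters": 42, "completion_tokens": 0}
old_style = {"input_characters": 17}
print(read_input_characters(new_style), read_input_characters(old_style))
```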
@Samoppakiks Samoppakiks changed the title feat: Support OpenAI TTS models tracking (audio.speech) [issue-2202] [BE] [SDK] feat: Support OpenAI TTS models tracking (audio.speech) Feb 21, 2026
@Samoppakiks
Contributor Author

Hey @petrotiurin, thanks for the review! Addressed your feedback in f1ee1e8 — aligned the usage keys with the OpikUsage pipeline and ran the linters. Tests passing. Let me know if anything else needs attention!

@andrescrz
Member

Hi @Samoppakiks,

CI is currently running. If everything passes, I’ll go ahead and approve and merge the PR.

There’s nothing else needed from you at the moment — we’ll reach out if anything comes up.

Thanks again for your contribution. We really appreciate it!

Contributor

@petrotiurin petrotiurin left a comment

Looks good to me, thanks again for the contribution!

@andrescrz andrescrz merged commit ab1a0cc into comet-ml:main Feb 23, 2026
40 of 44 checks passed
@vincentkoc
Member

/tip $200 Samoppakiks

@algora-pbc

algora-pbc bot commented Feb 23, 2026

@Samoppakiks: You've been awarded a $200 tip by Comet! 👉 Complete your Algora onboarding to collect the tip.

@comet-ml comet-ml deleted a comment from algora-pbc bot Feb 23, 2026
anastasiapyzhik pushed a commit that referenced this pull request Feb 23, 2026
…io.speech) (#5010)

* feat(sdk): add TTS create decorator and audio patching

- Create audio/ module with TTSCreateTrackDecorator and
  TTSStreamingResponseCreateTrackDecorator
- Patch audio.speech.create and
  audio.speech.with_streaming_response.create in opik_tracker.py
- Track input parameters, character-based usage, model and provider
- Follow existing patterns from videos integration

* feat(backend): add audio_speech cost calculation

- Add inputCostPerCharacter field to ModelCostData
- Add audioInputCharacterPrice field to ModelPrice
- Add audioSpeechCost calculator to SpanCostCalculator
- Wire AUDIO_SPEECH mode in CostService.resolveCalculator()
- Cost = inputCostPerCharacter * input_characters from usage

* test: add unit tests for TTS integration and audio speech cost

Python SDK tests:
- test_openai_audio_speech_create__happyflow
- test_openai_audio_speech_with_streaming_response__happyflow
- test_openai_audio_speech_create__tts_1_hd (model name tracking)
- test_openai_audio_speech_create__with_optional_params
- test_openai_audio_speech_create__character_count_usage

Backend tests:
- audioSpeechCost calculator tests (zero price, zero chars, validation)
- audioSpeechCost for tts-1 and tts-1-hd pricing
- CostService integration tests for tts-1 and tts-1-hd
- Updated existing video tests for new ModelPrice constructor

* fix: address PR review - parameterize tests and remove unused imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align TTS usage keys with OpikUsage pipeline and fix linting

Add completion_tokens:0 to TTS usage dicts so they correctly pass
through the OpenAI completions usage pipeline instead of falling
through to unknown provider format. Update test expectations to
match the actual backend-compatible usage format (original_usage.*
prefixed keys). Update backend audioSpeechCost to read
original_usage.input_characters with fallback for backward compat.
Fix ruff formatting issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix TTS streaming

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
Co-authored-by: Petro Tiurin <2856640+petrotiurin@users.noreply.github.com>
@algora-pbc

algora-pbc bot commented Feb 23, 2026

🎉🎈 @Samoppakiks has been awarded $200 by Comet! 🎈🎊

itamargolan pushed a commit that referenced this pull request Feb 25, 2026
…io.speech) (#5010)

anastasiapyzhik added a commit that referenced this pull request Feb 26, 2026
* [OPIK-4359] refactor permissions helpers

* [OPIK-4359] conditionally display Experiments tab + refactor for scalability

* Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4359] prevent permissions request if no org id or user name

* [OPIK-4359] Hide/disable experiments view

* [OPIK-4359] Hide/disable experiments widgets in a dashboard

* [OPIK-4359] Fix eslint and jest

* [OPIK-4359] Rename StartPreference

* [OPIK-4359] Move StartPreference to plugin

* [OPIK-4359] remove plugin usage from WidgetConfigPreview

* [OPIK-4359] Move WidgetConfigDialogAddStep to plugin

* [OPIK-4359] rename StartPreference files to preserve git history

* [OPIK-4359] remove plugin usage from WidgetConfigDialog

* [OPIK-4359] Move DashboardWidgetGrid to plugin

* [OPIK-4359] Move PromptPage to plugin

* [OPIK-4359] Move EvaluationSection to plugin

* [OPIK-4359] Move ExperimentsGuard to plugin

* [OPIK-4359] do not disable api calls

* [OPIK-4359] Move sidebar menu items to plugin

* Revert "[OPIK-4359] rename StartPreference files to preserve git history"

This reverts commit 80462bccb7cf8cc8839645e93e979b1542c90b34.

* Revert "[OPIK-4359] Move StartPreference to plugin"

This reverts commit ac400f46edb2a32fd8968afa6d5dfcea0a64c398.

* Revert "[OPIK-4359] Rename StartPreference"

This reverts commit ef63fba155df560234c1da63832e7a5bc59d2abc.

* [OPIK-4359] Move start preference experiments link to plugin

* Revert "[OPIK-4359] Move PromptPage to plugin"

This reverts commit e772d5fc1c10971cabfcb0332f3fbd24bfcd2298.

* [OPIK-4359] Move prompt experiments tab to plugin

* [OPIK-4359] Restructure EvaluationSection files

* [OPIK-4359] Restructure DashboardWidgetGrid files

* [OPIK-4359] Restructure WidgetConfigDialogAddStep files

* [OPIK-4359] Fix typo

* [OPIK-4359] Fix eslint

* [OPIK-4359] Fix eslint

* [OPIK-4357] Hide/disable dashboards if no permission

* [OPIK-4357] Guard dashboards page & refactor

* [OPIK-4357] Dashboard view guard

* [OPIK-4357] Rename guard files

* [OPIK-4357] Disable sidebar dashboard and experiments requests if no permissions

* [OPIK-4357] Move view selector to plugin

* [OPIK-4357] Fix dependency validation issue

* [OPIK-4359] Refactor page guard

* [OPIK-4359] Rename guard files

* [OPIK-4359] Hide Experiments link in the playground

* [OPIK-4359] Move GetStartedSection component to folder

* [OPIK-4359] Hide Get started experiments button

* [OPIK-4359] Disable experiments dashboard template option

* [OPIK-4359] Update known violations list

* [OPIK-4359] Capitalize compare breadcrumb

* [OPIK-4359] Update page guard

* [OPIK-4359] Rename component

* [OPIK-4359] Fix lint errors

* [OPIK-4357] Add permission to the hook

* [OPIK-4359] Update text

* [OPIK-4359] Fix loading

* [OPIK-4359] Fix lint errors

* [OPIK-4359] Refactor plugins with HOC

* [OPIK-4357] Pass permission through HOC

* [OPIK-4357] Address review comment

* [OPIK-4359] Make HOC generic

* [OPIK-4357] Make permissions props optional

* [OPIK-4357] Fix linting

* [OPIK-4357] Rewrite from plugins to context

* [OPIK-4359] Do not return nullish menu items

* [OPIK-4357] Handle loading, hide experiments card on new home page, separate widget resolver concerns

* [issue-5007] [SDK] Fix metadata merging in trace and span updates (#5335)

* fix(ts-sdk): merge metadata in trace.update() and span.update() instead of replacing

When trace.update({ metadata: {...} }) or span.update({ metadata: {...} })
is called, the new metadata now merges with existing metadata instead of
silently replacing it. This fixes a bug where metadata set via update()
was lost, and only metadata from the initial creation call was persisted.

Also fixes Span.update() to use processed updates when syncing local data,
matching the behavior already present in Trace.update().

Closes #5007
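The difference between merging and replacing metadata can be illustrated with a small sketch. The fix itself is in the TypeScript SDK; this Python version uses a hypothetical `merge_metadata` helper and assumes a shallow merge, which the commit message does not specify.

```python
from typing import Any, Dict, Optional


def merge_metadata(
    existing: Optional[Dict[str, Any]], update: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Shallow merge: keys in update win, keys not mentioned survive."""
    return {**(existing or {}), **(update or {})}


created = {"env": "prod", "version": 1}
merged = merge_metadata(created, {"version": 2, "user": "a"})
print(merged)
```

With replacement semantics the `env` key would be dropped by the update; with merging it is kept.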

* [5007] [SDK] Fix metadata merging in trace and span updates

- Consolidate duplicate metadata merging logic in UpdateService
- Extract generic processUpdate method for shared implementation
- Ensure trace.update() and span.update() both properly merge metadata
- Reduces code duplication while maintaining consistent update behavior

---------

Co-authored-by: hzt <3061613175@qq.com>

* [OPIK-4303] [SDK] Add EvaluationSuite and LLMJudge for regression testing (#5205)

* [OPIK-4303] [SDK] Add suite_evaluators namespace with AssertionEvaluator

- Add BaseSuiteEvaluator protocol with to_config/from_config methods
- Add AssertionEvaluator for LLM-based assertion evaluation
- Support flexible input/output types (str, dict, list, etc.)
- Include serialization compatible with backend's LLM-as-Judge format
- Add comprehensive unit tests for evaluator, parser, and template

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Draft: LLMJudge evaluator for evaluation suites

Work in progress implementation of LLMJudge evaluator with:
- LLMJudge class with score/ascore methods returning boolean results
- Serialization support via to_config/from_config for backend integration
- EvaluationSuite API with add_item and run methods (run not yet implemented)
- Configuration models in opik_llm_judge_config module for backend JSON format
- Unit tests for LLMJudge attributes and config serialization
- Library integration tests for LLMJudge scoring

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Implement EvaluationSuite.run() on top of Dataset and evaluate

Temporary implementation of EvaluationSuite execution:
- EvaluationSuite now wraps a Dataset (created via opik_client factory)
- Items are immediately inserted into dataset when add_item() is called
- run() uses opik.evaluate() with suite-level evaluators as scoring_metrics
- Task output provides input/output for LLMJudge evaluation
- Evaluators stored in instance, will sync with backend DB table later

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Refactor EvaluationSuite and EvaluationEngine for per-item evaluators

- Add per-item evaluator support: evaluators stored in dataset item content
  under __evaluators__ key are extracted and run during evaluation
- Refactor EvaluationEngine to build MetricsEvaluator per item, combining
  suite-level and item-level metrics
- Add split_into_regular_and_task_span_metrics() helper function
- Improve type hints and change imports to use modules instead of names
- Add name field to LLMJudgeConfig for preserving evaluator names
- Simplify EvaluationSuite by removing unused types module

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Simplify LLMJudge to use assertion text directly as name

- Remove Assertion TypedDict and _generate_assertion_name function
- Store assertions as plain strings instead of structured objects
- Use assertion text directly as the name in ScoreResult
- Add dynamic response format generation in LLM prompt
- Add EvaluationSuiteResult with pass/fail status based on execution_policy
- Add validation to ensure only LLMJudge evaluators are used in suites
- Add E2E tests for evaluation suites
- Add unit tests for evaluator validation

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix lint errors

* [OPIK-4303] [SDK] Align LLMJudge config with backend schema and add dynamic response format

- Update LLMJudgeConfig to match backend's LlmAsJudgeCode structure:
  - Rename expected_behavior to description in schema items
  - Change FLOAT to DOUBLE in type enum (backend compatibility)
  - Make temperature required with default 0.0
  - Add optional custom_parameters and metadata fields
- Implement dynamic structured output (response_format) for LLM calls
  instead of embedding JSON example in prompt text
- Add AssertionResult model matching ScoreResult structure (name, value,
  reason, metadata with confidence)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Consolidate evaluation config and improve e2e tests

- Consolidate evaluators and execution_policy under single
  __evaluation_config__ key in dataset item content
- Rewrite e2e tests to focus on main flow (item-level evaluators)
- Add proper feedback score name validation in all LLM tests
- Add test for combined suite and item level evaluators
- Add missing suite_result assertions to all tests
- Use gpt-5-nano model in e2e tests

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Update LLMJudge library integration tests

Update tests to use the new string-based assertion format instead of
dict-based format. The assertion text is now used directly as the
feedback score name.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Refactor evaluation suite module and LLMJudge model handling

- Refactor evaluation suite module structure with types.py and suite_result_constructor.py
- Make EvaluationSuiteResult a class with read-only properties (all_items_passed, pass_rate, etc.)
- Make model name optional in LLMJudge evaluator schema
- LLMJudge always uses DEFAULT_MODEL_NAME (gpt-5-nano), no public model parameter
- to_config() doesn't save model name, from_config() uses default
- Add DEFAULT_MODEL_NAME constant exported from suite_evaluators
- Update all tests to reflect new model handling

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Store suite-level config in dataset and add get_evaluation_suite

- Store suite-level evaluators and execution_policy in a special dataset item
- Add SUITE_CONFIG_ITEM_ID constant for the config item identifier
- Add get_evaluation_suite() method to Opik client to retrieve existing suites
- Suite config is loaded from dataset when _load_from_dataset=True
- Filter out config item when running evaluations
- Add e2e test for get_evaluation_suite flow
- Delete evaluation_suite_example.py (local-only file)

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Refactor evaluation suite to store config in Dataset

- Move ExecutionPolicy type to evaluation_suite.types module
- Add get_evaluators/set_evaluators and get_execution_policy/set_execution_policy
  methods to Dataset and DatasetVersion classes
- Simplify EvaluationSuite to be a thin wrapper around Dataset
- Engine now reads suite config directly from dataset methods
- Remove get_evaluation_suite from opik_client (not persistable without backend)
- Extract validators to separate module in evaluation_suite namespace
- Update __init__.py to only export EvaluationSuite

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Fix trial_count priority and evaluation config filtering

- Prioritize explicit trial_count parameter over dataset execution policy
- Filter __evaluation_config__ from scoring inputs to prevent leaking to metrics
- Keep __evaluation_config__ in dataset_item_content for suite result construction
- Update unit tests with mock dataset methods for get_execution_policy/get_evaluators

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Add evaluate_suite function and evaluator_model parameter

- Add evaluate_suite() function for evaluation suites with dedicated parameters
- Add evaluator_model parameter to specify LLM model for LLMJudge evaluators
- Add model parameter to LLMJudge.__init__ and from_config methods
- Update EvaluationSuite.run() to use evaluate_suite and accept evaluator_model

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Use direct API fields for evaluators and execution_policy on DatasetItem

Remove EVALUATION_CONFIG_KEY workaround and use explicit evaluators and
execution_policy fields on DatasetItem objects. This aligns the SDK with
the backend API that now supports these fields directly.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Include evaluators and execution_policy in DatasetItem content hash

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Refactor LLMJudge into llm_judge package with improved response parsing

- Move llm_judge.py -> llm_judge/metric.py
- Move opik_llm_judge_config.py -> llm_judge/config.py
- Extract parsers to llm_judge/parsers.py with unit tests
- Use assertion text as field name in response schema for reliable parsing
- Add comprehensive unit tests for build_response_format_model and parse_model_output

Co-authored-by: Cursor <cursoragent@cursor.com>

* [OPIK-4303] [SDK] Replace model param with generic init_kwargs dict in from_config

- Change BaseSuiteEvaluator.from_config signature to use init_kwargs dict
- Update LLMJudge.from_config to extract model from init_kwargs
- More flexible API for future evaluators with different init parameters

Co-authored-by: Cursor <cursoragent@cursor.com>

* Improved terminal logging during the evaluation. (TBD - EvaluationEngine refactoring)

* [OPIK-4303] [SDK] Unify evaluation execution with single execution policy method

Replace separate trial-based and execution-policy-based execution paths
with a single _compute_test_results_with_execution_policy method that uses
full parallelism via ThreadPoolExecutor and item-based progress tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix lint errors

* Update unit tests for evaluation.

* Move metric related attributes from the EvaluationEngine state to the methods inputs

* Fix lint errors

* Refactor the EvaluationEngine API, unload it from datasets details

* Fix the bug with progress bar missing details

* Fix the bug with progress bar not showing live metric averages for regular evaluations

* Rename progress bar parameter for showing live metrics

* Move progress bar update responsibility to future callbacks from get_results() to avoid a dead time span with no updates for big datasets

* Address PR review feedback: fail-fast evaluator config, by_alias serialization, parse error logging

- Re-raise exceptions in _extract_item_evaluators instead of silently
  swallowing malformed evaluator configs
- Add by_alias=True to model_dump() in evaluation_suite.py so Pydantic
  aliases (schema_, customParameters) match the backend API contract
- Log parse failures in LLMJudge parsers at ERROR level and use 0.0
  instead of False for scoring_failed ScoreResult values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Cap the guardrails-ai version for integration tests (temporary) due to broken API

* Update the LLMJudge json config schema

* Add config version to JSON

* [OPIK-4303] [SDK] Align LLMJudge response format with backend and add confidence field

Align AssertionResultItem fields with the backend's OnlineScoringEngine
(score/reason instead of value/metadata) and add a forward-compatible
confidence field that the SDK uses but the backend ignores.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4303] [SDK] Restructure evaluation_suite under dataset namespace and enforce import discipline

Move evaluation_suite package under api_objects/dataset/ since it's tightly
coupled to Dataset. Extract ExecutionPolicy into its own module, move validators
to dataset level, and relocate create_suite_version to dataset/rest_operations.

Key changes:
- Move api_objects/evaluation_suite/ → api_objects/dataset/evaluation_suite/
- Create dataset/execution_policy.py with ExecutionPolicy TypedDict
- Create dataset/validators.py with validate_evaluators()
- Move create_suite_version from evaluation_suite/rest_helpers to dataset/rest_operations
- Remove description param from EvaluationSuite.__init__ (delegates to dataset)
- Remove return self from add_item() (chaining not used)
- Add get_items() to EvaluationSuite returning dicts with data/evaluators/execution_policy
- Enforce import discipline: only opik.evaluation imports deferred inside functions
- Prefer module imports over name imports throughout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4303] [SDK] Add EvaluationSuite helper methods and e2e tests

Add delete_items(), get_execution_policy(), get_evaluators(evaluator_model)
to EvaluationSuite. Add evaluator_model param to get_items(). Rename
create_suite_version to create_evaluation_suite_dataset in rest_operations.

New e2e tests cover delete_items, get_evaluators, get_execution_policy,
get_items, and the full create→get→verify→run flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4303] [SDK] Use proper LLMJudge type in create_evaluation_suite_dataset

Replace List[Any] with List[llm_judge.LLMJudge] via TYPE_CHECKING import
since inputs are validated before reaching this function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4303] [SDK] Add EvaluationSuite.update() for changing evaluators and execution policy

Add update(*, execution_policy, evaluators) method that creates a new
dataset version based on the current latest version (override=False,
base_version passed in request). Both arguments are mandatory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4303] [SDK] Break circular import chain so evaluation_suite can be imported directly in opik_client

Move opik.evaluation imports to TYPE_CHECKING in types.py and evaluation_suite.py,
allowing opik_client to import evaluation_suite at the top level instead of deferring it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor e2e tests

* Add missing testlib module

* [OPIK-4303] [SDK] Include item ID in EvaluationSuite.get_items() return dicts

get_items() now includes "id" key in each returned dict, making it
possible to reference items by ID without reaching into the underlying
dataset. Simplified test_delete_items to use this directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Bump anthropic model

* Relax adk test

* Fix lint errors

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4418] [SDK] Include parent_span_id in start span messages (#5313)

* [OPIK-4418] [SDK] Include parent_span_id in start span messages

When log_start_trace_span is enabled, the @track decorator sends two
CreateSpanMessages per span: a START message at function entry and an
END message at function exit. The START message was missing parent_span_id,
so if the two messages ended up in different batches, ClickHouse would
create rows with different sorting keys that FINAL cannot merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [OPIK-4418] [SDK] Fix emulator span dedup for child spans with parent_span_id in start message

The _save_span dedup logic called _span_trees.remove(existing_span)
unconditionally, but child spans (parent_span_id != None) are never
added to _span_trees. This caused a ValueError that silently prevented
the END message from overwriting the START message in _span_observations.
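
The failure mode above can be reproduced in isolation. A minimal sketch, assuming hypothetical stand-ins (`Span`, `SpanStore`, `save_span`) rather than the emulator's actual classes: the existing entry is removed from the root list only when it is a root span, because `list.remove()` raises `ValueError` for an element that was never added.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    id: str
    parent_span_id: Optional[str] = None
    end_time: Optional[float] = None


@dataclass
class SpanStore:
    # Hypothetical stand-in for the emulator's internal state.
    span_trees: List[Span] = field(default_factory=list)  # root spans only
    span_observations: Dict[str, Span] = field(default_factory=dict)

    def save_span(self, span: Span) -> None:
        existing = self.span_observations.get(span.id)
        if existing is not None and existing.parent_span_id is None:
            # Only root spans were ever appended to span_trees, so only
            # they can be removed safely.
            self.span_trees.remove(existing)
        if span.parent_span_id is None:
            self.span_trees.append(span)
        # The END message always overwrites the START message here.
        self.span_observations[span.id] = span
```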

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Aliaksandr Kuzmik <98702584+alexkuzmik@users.noreply.github.com>

* Update versions to 1.10.18 and bump base version to 1.10.19

* fix: make OpikExporter compatible with @vercel/otel (#5328)

@vercel/otel uses OpenTelemetry v1 which provides spans with
`instrumentationLibrary` instead of `instrumentationScope` (used in
OTel v2). The direct property access on `instrumentationScope` caused
a TypeError when the property was undefined.

Add a helper function `getInstrumentationScopeName` that safely checks
both `instrumentationScope` (OTel v2) and `instrumentationLibrary`
(OTel v1) with optional chaining, ensuring backward compatibility.

Fixes #3361

* [OPIK-3511] Implement replay manager framework (#5031)

* [OPIK-3837] Add Connection Probe and Monitor with Unit Tests

- Introduced `ConnectionProbe` to evaluate server health via lightweight probes.
- Added `OpikConnectionMonitor` for monitoring server connectivity, handling state transitions (e.g., disconnect, reconnect).
- Implemented comprehensive unit tests for both modules, covering edge cases and integration scenarios.

* [OPIK-3511] Refactor unit test names for improved clarity and consistency

- Updated test function names in `test_connection_monitor.py` and `test_connection_probe.py` to follow consistent and descriptive naming conventions.
- Consolidated repetitive test cases in `test_connection_probe.py` using parameterization to reduce redundancy.

* [OPIK-3511] Refactor `OpikConnectionMonitor` to simplify state management

- Replaced `_last_beat` with `last_beat` as a public instance variable.
- Updated unit tests in `test_connection_monitor.py` to reflect the changes.
- Improved naming consistency across `ConnectionMonitor` implementation and tests.

* [OPIK-3837] Refactor `ConnectionProbe` error handling and update related tests

- Consolidated `httpx.ConnectError` and `httpx.TimeoutException` into a single handler with updated error messaging for clarity.
- Simplified test cases by reusing error variables and ensuring consistency in assertions for error messages.
- Added log verification for unexpected exceptions in `test_check_connection__unexpected_exception`.

* [OPIK-3837] Refactor `message_processing` module and add `ReplayManager`

- Added `ReplayManager` for handling message lifecycle (registering, updating, replaying failed messages).
- Introduced `message_type` attribute to all message classes for consistent identification.
- Added `from_db_message_dict` for easier conversion of dictionary data to message objects.
- Refactored `BaseMessage` serialization methods to support additional attributes.

* [OPIK-3837] Enhance `OpikConnectionMonitor` connection handling and update `tick` docstring

- Added `_has_server_connection` reset logic in `reset` method to ensure clean state initialization.
- Updated `tick` method docstring for improved clarity around connection monitoring behavior and return values.

* [OPIK-3837] Add comprehensive unit tests and enhance serialization for `message_processing` module

- Added extensive unit tests covering serialization/deserialization for all core message types, including nested objects.
- Improved `from_db_message_dict` to handle fields with `init=False`.
- Enhanced message classes (`AddFeedbackScoresBatchMessage`, `CreateTraceBatchMessage`, etc.) with `_deserialize` and `_serialize` methods for batch processing.
- Updated `ReplayManager` to register additional message types.

* [OPIK-3837] Add JSON-based serialization/deserialization for messages and update tests

- Refactored serialization logic to serialize messages to JSON strings and deserialize them back via `message_serialization` module.
- Replaced `AttachmentSupportingMessage` with `CreateAttachmentMessage` for streamlined handling.
- Updated `ReplayManager` to utilize `serialize_message` and `deserialize_message` functions.
- Renamed and expanded test functions for detailed coverage of JSON-based round-trip serialization.

* [OPIK-3837] Enhance `ReplayManager` with batch processing and improve error handling

- Introduced a `batch_size` parameter with a default value of 1000 for efficient batch processing of messages.
- Added `_fetch_failed_messages_batched` for cursor-based pagination of failed messages to avoid OOM issues.
- Updated `register_messages` and `replay_failed_messages` to support batch operations.
- Improved logging by including exception details for better debugging.
- Ensured database schema creation uses `IF NOT EXISTS` for idempotency.
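
The batching idea in the bullets above can be sketched with the standard `sqlite3` module. This is an illustration only: the table name, column names, and `fetch_failed_batched` helper are assumptions, not the SDK's schema or API. Each query resumes after the last seen id, so only one batch is held in memory at a time.

```python
import sqlite3
from typing import Iterator, List, Tuple


def fetch_failed_batched(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Tuple[int, str]]]:
    """Yield failed rows in batches, paging by the last seen id."""
    last_id = -1
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM messages "
            "WHERE status = 'failed' AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
# IF NOT EXISTS keeps schema creation idempotent across restarts.
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages "
    "(id INTEGER PRIMARY KEY, status TEXT NOT NULL, payload TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO messages (id, status, payload) VALUES (?, ?, ?)",
    [(i, "failed" if i % 2 else "registered", f"msg-{i}") for i in range(10)],
)
batches = list(fetch_failed_batched(conn, batch_size=2))
```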

* [OPIK-3837] Add `batch_replay_delay` to `ReplayManager` and improve batch processing

- Introduced `batch_replay_delay` parameter to control delays between batch replays for better memory management.
- Renamed `ReplayManager._fetch_failed_messages_batched` to `fetch_failed_messages_batched` for public access.
- Fixed SQL typo in `CREATE TABLE` statement.
- Added tests for `ReplayManager` to validate new batch replay logic, including delay handling and batch processing behavior.
- Enhanced error handling for file cleanup in `_clean_message_leftovers`.

* [OPIK-3837] Add tests for `datetime_object_hook` and improve datetime handling

- Implemented unit tests for `datetime_object_hook` to ensure proper conversion of known datetime fields while preserving non-datetime fields.
- Introduced `DATETIME_FIELD_NAMES` to restrict datetime conversions to specific fields, preventing accidental conversions of ISO-like strings.
- Updated related deserialization logic to use the new `datetime_object_hook`.
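
A hook of this kind might look like the sketch below. The contents of `DATETIME_FIELD_NAMES` are assumed for illustration; the actual set is not listed in this commit message.

```python
import datetime
import json
from typing import Any, Dict

# Assumed field names; the real set is not shown in the source.
DATETIME_FIELD_NAMES = frozenset({"start_time", "end_time"})


def datetime_object_hook(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Convert only known datetime fields; leave other ISO-like strings untouched."""
    for key in DATETIME_FIELD_NAMES & obj.keys():
        value = obj[key]
        if isinstance(value, str):
            try:
                obj[key] = datetime.datetime.fromisoformat(value)
            except ValueError:
                pass  # keep the raw string if it is not a valid ISO timestamp
    return obj


decoded = json.loads(
    '{"start_time": "2026-01-31T12:00:00+00:00",'
    ' "name": "2026-01-31T12:00:00+00:00"}',
    object_hook=datetime_object_hook,
)
```

Here `start_time` becomes a timezone-aware `datetime`, while `name` stays a string even though it looks like a timestamp.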

* [OPIK-3837] Add detailed docstrings to `ReplayManager` methods for clarity

- Included comprehensive docstrings for all public and critical methods in `ReplayManager` to improve readability and maintainability.
- Documented arguments, return values, and method functionality, ensuring clear usage guidelines for developers.

* [OPIK-3837] Refactor file upload flow and remove `FileUploadPreprocessor`

- Replaced `FileUploadPreprocessor` with `FileUploadManager` for improved attachment handling.
- Updated `OpikClient` and `Streamer` to directly utilize `FileUploadManager` without the intermediary preprocessor.
- Refactored related unit tests to mock and test `FileUploadManager`.
- Simplified message processing by removing redundant file upload preprocessor logic.

* [OPIK-3837] Updated design docs

* [OPIK-3837] Fixed linter errors

* [OPIK-3838] Implement integration of replay manager with Opik message processing (#5195)

* [OPIK-3838] Refactor `ReplayManager` to `DBManager` and update dependencies

- Renamed `ReplayManager` to `DBManager` for better alignment with its database management responsibilities.
- Updated all references, imports, and tests to reflect the new class name.
- Standardized usage of `DBManagerStatus` instead of `ManagerStatus` for state tracking.

* [OPIK-3838] Add `ReplayManager` for handling offline message replay

- Introduced `ReplayManager` to manage the replay of failed messages when the connection is restored.
- Added `ReplayCallback` type definition for defining replay logic.
- Enhanced `DBManager` to integrate with `ReplayManager` via callback handling.

* [OPIK-3838] Refactor `ReplayManager` initialization and optimize tick loop

- Updated `ReplayManager` to accept `DBManager` as a constructor parameter, improving dependency injection.
- Refined `_loop` method to include sleep logic, reducing CPU usage during idle periods.
- Enhanced `DBManager` to correctly handle `MessageStatus` when processing messages.

* [OPIK-3895] Refactor `ReplayManager` for improved synchronization and reliability

- Added new defaults for `batch_size`, `batch_replay_delay`, and `tick_interval_seconds` in `ReplayManager`.
- Introduced `threading.RLock` for synchronized message replay and integrated it with `DBManager`.
- Enhanced `_loop` method with interruptible sleep via `threading.Event` for clean shutdown.
- Improved error handling and message replay logic, ensuring failed messages are properly updated.
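
The interruptible sleep mentioned above is a standard `threading.Event` pattern. A minimal sketch, using a hypothetical `TickLoop` class rather than the actual `ReplayManager`:

```python
import threading
import time


class TickLoop:
    """Hypothetical background loop; names are illustrative, not the SDK's API."""

    def __init__(self, tick_interval_seconds: float) -> None:
        self._tick_interval_seconds = tick_interval_seconds
        self._stop_event = threading.Event()
        self.ticks = 0
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _loop(self) -> None:
        while not self._stop_event.is_set():
            self.ticks += 1
            # wait() returns as soon as close() sets the event, unlike
            # time.sleep(), which always blocks for the full interval.
            self._stop_event.wait(self._tick_interval_seconds)

    def close(self) -> None:
        self._stop_event.set()
        self._thread.join()


loop = TickLoop(tick_interval_seconds=60.0)
loop.start()
started = time.monotonic()
loop.close()  # does not wait out the 60-second interval
elapsed = time.monotonic() - started
```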

* [OPIK-4321] Add unit tests for `ReplayManager`

- Introduced comprehensive test suite for `ReplayManager` covering initialization, message registration, reconciliation, and lifecycle management.
- Validated functionality across scenarios including failed messages, connection restoration, and concurrency handling.

* [OPIK-3838] Add `db_manager` property to `ReplayManager` and update tests

- Introduced `db_manager` property for cleaner access to the database manager.
- Refactored test cases to use the new property, improving code readability.

* [OPIK-3838] Refine test to validate loop behavior with replay callback errors

- Updated test name for better clarity.
- Adjusted test to simulate connection restoration and ensure thread resilience to replay callback errors.

* [OPIK-3838] Add `failed_messages_count` method to `DBManager` and integrate with `Streamer`

- Implemented `failed_messages_count` in `DBManager` to retrieve the count of failed messages.
- Added unit tests to verify behavior across various scenarios, including initialization, mixed statuses, and database errors.
- Integrated `ReplayManager` with `Streamer`, enabling replay callback handling and lifecycle management.
- Updated `Streamer` to close `ReplayManager` gracefully and handle failed message replay during shutdown.

* [OPIK-3838] Integrate `ReplayManager` into `Streamer` and add lifecycle unit tests

- Connected `ReplayManager` to `Streamer` for fallback processing of failed messages.
- Added unit and integration tests to validate `ReplayManager` interaction with `Streamer` lifecycle events (init, flush, close).
- Introduced configuration options for replay-related parameters such as batch size, delay, and tick interval.
- Refactored constructors and fixtures to support `ReplayManager` injection.

* [OPIK-3838] Add unit tests for `OpikMessageProcessor` with `ReplayManager` integration

- Added test coverage for `OpikMessageProcessor` methods interacting with `ReplayManager`, including message registration, unregistration, and error handling.
- Enhanced error handling in `ReplayManager` and `DBManager` to log failures and raise appropriate exceptions.
- Updated process lifecycle for better resilience under connection failures and server errors.
- Adjusted `ReplayManager` to handle failed messages gracefully, ensuring unprocessed messages remain registered for retries.

* [OPIK-3838] Fixed typo

* [OPIK-3838] Add message ID validation in `ReplayManager` and update references

- Introduced `_check_message_id` method in `ReplayManager` to validate message IDs in critical methods.
- Updated `OpikMessageProcessor` to include type ignores for MyPy compliance.
- Ensured robust error handling by verifying message ID presence at registration, unregistration, and failure points.

* [OPIK-3838] Automatically assign message IDs in `ReplayManager` if missing

- Added logic to auto-assign unique message IDs when registering messages with missing IDs.
- Extended unit tests to confirm correct ID assignment and database registration behavior.

* [OPIK-3838] Add offline fallback handling and refactor `ReplayManager` interactions

- Introduced offline handling in `OpikMessageProcessor` to register messages as failed when there is no server connection.
- Added unit tests for offline scenarios to validate message registration and ensure handlers are not invoked without connectivity.
- Renamed `db_manager` to `database_manager` in `ReplayManager` for improved clarity.
- Updated related methods and test cases to reflect the renamed property.
- Enhanced `ReplayManager` to support message registration with specific statuses, including `MessageStatus.failed`.

* [OPIK-3838] Add upload success and failure callbacks across file upload flow

- Enhanced file upload system to support `on_upload_success` and `on_upload_failed` callbacks.
- Updated related classes and methods, including `BaseFileUploadManager`, `DBManager`, and `OpikMessageProcessor`.
- Introduced callback types and modified `upload` method signatures.
- Updated S3 error handling to distinguish connection errors.
- Added unit tests for callback integration and error scenarios.

* [OPIK-3838] Add error handling and unit tests for message replay and attachment uploads

- Enhanced `OpikMessageProcessor` with additional error handling for API errors, validation errors, retry errors, and generic exceptions during message processing.
- Added robust unit tests to validate `ReplayManager` callbacks and message state changes across various error scenarios.
- Introduced support for attachment upload callbacks (`on_upload_success` and `on_upload_failed`), validating their interactions with the replay lifecycle.
- Updated file uploader methods to accept upload callbacks and modified related test cases to align with the changes.

* [OPIK-3838] Reduced unnecessary large sleep times to reduce test execution time

* [OPIK-3838] Refactor message registration and processing logic in `OpikMessageProcessor`

- Simplified `ReplayManager` interaction for message registration based on server connection status.
- Consolidated logic to ensure proper handling of `CreateAttachmentMessage` and other message types.
- Updated file upload callbacks with type ignores for MyPy compliance.

* [OPIK-3838] Remove unnecessary sleep in UUID generation for unit tests

- Replaced `time.sleep` with timestamp generation using `datetime.fromtimestamp` to improve test execution speed.

* [OPIK-3511] Remove default parameter values from `ReplayManager` constructor

- Eliminated default values for `batch_size`, `batch_replay_delay`, and `tick_interval_seconds` in `ReplayManager` to enforce explicit configuration.

* [OPIK-3511] Add upsert behavior for message registration in `DBManager`

- Modified `register_message` and `register_messages` to use `ON CONFLICT` for updating existing records.
- Enhanced unit tests to validate upsert functionality for single and batch message operations.
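
For reference, an `ON CONFLICT` upsert in SQLite looks roughly like the sketch below. The table and column names are assumptions, and the syntax requires SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages "
    "(id INTEGER PRIMARY KEY, status TEXT NOT NULL, payload TEXT NOT NULL)"
)

# Insert a new row, or update the existing one when the id already exists.
UPSERT_SQL = (
    "INSERT INTO messages (id, status, payload) VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET "
    "status = excluded.status, payload = excluded.payload"
)

conn.execute(UPSERT_SQL, (1, "registered", "v1"))
conn.execute(UPSERT_SQL, (1, "failed", "v2"))  # same id: updates in place
rows = conn.execute("SELECT id, status, payload FROM messages").fetchall()
```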

* [OPIK-3511] Add ignored message types handling in `OpikMessageProcessor`

- Introduced `_ignored_message_types_for_replay` to bypass replay registration for specific message types.
- Added `_should_ignore_replay_for_message_type` helper method for streamlined type checking.
- Updated processing logic to skip replay manager interaction for ignored types while ensuring handler execution.
- Enhanced unit tests to verify correct behavior for ignored types under online and offline conditions.
- Fixed typo in replay log message ("was" to "were").
- Added explicit configuration options to `ReplayManager` in tests for `batch_size`, `batch_replay_delay`, and `tick_interval_seconds`.

* [OPIK-3511] Ensure `DBManager` is properly closed after assertion in unit test

- Added a `finally` block to close `DBManager` after the `assert` statement to ensure resource cleanup.

* [OPIK-4132] Enhance thread-safety and concurrency handling in `DBManager`

- Introduced `_replay_mutex` to prevent concurrent `replay_failed_messages` calls from duplicating message fetches.
- Modified lock logic to ensure `self.__lock__` is only held for critical sections, avoiding blockage of producers during replay callbacks or delays.
- Added unit tests to verify concurrency behavior, ensuring producers remain unblocked during sleep and callbacks.
- Improved documentation in `replay_failed_messages` to clarify locking strategy and concurrency guarantees.
- Updated unit tests to validate lock behavior during multi-threaded operations.
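
The locking strategy described above can be sketched with two locks: one that serializes replays and one that guards the data only while a batch is taken. `FailedQueue` and its methods are hypothetical names, not the `DBManager` implementation.

```python
import threading
from typing import Callable, List


class FailedQueue:
    """Hypothetical in-memory stand-in for a store of failed messages."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # guards _failed
        self._replay_mutex = threading.Lock()  # one replay at a time
        self._failed: List[str] = []

    def register(self, message: str) -> None:
        with self._lock:
            self._failed.append(message)

    def replay(self, callback: Callable[[str], None]) -> int:
        with self._replay_mutex:
            with self._lock:
                # Critical section: take the batch and release immediately.
                batch, self._failed = self._failed, []
            # The data lock is not held here, so producers can keep
            # registering messages while callbacks run.
            for message in batch:
                callback(message)
            return len(batch)
```

Because `threading.Lock` is not reentrant, a callback that calls `register()` would deadlock if the data lock were still held during replay.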

* [OPIK-3511] Refactor locking and synchronization in `ReplayManager`

- Replaced `_replay_lock` with `_message_id_lock` for improved thread-safety during message registration.
- Removed unnecessary locking around `_replay_failed_messages` to simplify synchronization and improve efficiency.
- Updated `DBManager` initialization to remove dependency on shared locks.

* [OPIK-3511] Improve message replay handling and add E2E test coverage

- Fixed `strip()` logic in feedback reason handling to avoid errors when `reason` is None.
- Adjusted replay log message for proper singular/plural grammar.
- Added comprehensive E2E tests for offline fallback and failed message replay functionalities.

* [OPIK-3511] Improve thread-safety in `fetch_failed_messages_batched` and fix test fixture docstrings

- Wrapped `fetch_failed_messages_batched` logic with `self.__lock__` to ensure proper synchronization during database operations.
- Fixed incorrect docstrings in test fixtures to refer to `DBManager` instead of `ReplayManager`.

* [OPIK-3511] Fixed docstring for batch_size

* [OPIK-3511] Fixed docstring for DBManager.closed property

* [OPIK-3511] Fixed docstring for DBManager

* [OPIK-3511] Update e2e tests to include batching and ensure correct replay behavior

- Refactored `non_batching_opik_client` to `not_batching_opik_client` for naming consistency.
- Added comprehensive e2e test cases for batching mode: replaying `CreateTraceBatchMessage`, `CreateSpansBatchMessage`, feedback score batches, and simultaneous offline operations.
- Verified attachment-related operations in both batching and non-batching modes.
- Updated and clarified docstrings across tests.

* [OPIK-3511] Add E2E test for `CreateExperimentItemsBatchMessage` replay and refactor attachment usage

- Introduced a new E2E test to verify the successful replay of `CreateExperimentItemsBatchMessage` stored during offline mode.
- Refactored `Attachment` references to use `attachment.Attachment` for consistency and improved readability.
- Updated imports to accommodate the new test case enhancements.

* [OPIK-3511] Added additional debug logging for DBManager for clarity

* [OPIK-3839] Write documentation about offline fallback (#5316)

* [OPIK-3839] Add documentation for offline fallback and message replay feature

- Introduced new section on offline fallback and message replay in `docs.yml`.
- Added detailed documentation outlining the offline fallback mechanism, supported message types, configuration options, and troubleshooting steps.

* [OPIK-3839] Update tracing documentation to include `log_threads_feedback_scores`

- Added `client.log_threads_feedback_scores()` mapping to `AddThreadsFeedbackScoresBatchMessage` in supported message types table.

* [OPIK-3838] Added commentary to explain the need for OpikMessageProcessor._ignored_message_types_for_replay field

* [OPIK-3838] Ensure UTC timezone is explicitly set for datetime fields in serialization tests and update JSON encoding logic

- Updated unit tests to include `tzinfo=datetime.timezone.utc` for datetime fields to ensure consistent handling of timezone information.
- Replaced `MessageJSONEncoder` with `jsonable_encoder.encode` for improved JSON serialization logic.

* [OPIK-3838] Add tests for non-JSON-serializable types and cyclic references in message serialization

- Introduced `TestNonJsonSerializableTypesInMessageFields` with cases for handling non-serializable fields like bytes, sets, tuples, and numpy arrays.
- Verified proper serialization of datetime, custom classes, and cyclic references without crashes.
- Updated `serialize_message` logic to leverage `jsonable_encoder.encode` for robust encoding of special field types.

* [OPIK-3838] Ensure datetime fields in DBManager tests explicitly include UTC timezone

- Updated `test_db_manager.py` to set `tzinfo=datetime.timezone.utc` for `start_time` and `end_time` datetime fields in unit tests.

* [OPIK-4508] [BE] Fix OTEL GenAI semantic conventions mapping gap (#5336)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [SDK] [FE] [BE] Fix deprecated Anthropic Haiku models in tests and infra code (#5322)

* [OPIK-4501] [BE] Add type filter to datasets list endpoint (#5312)

* [OPIK-4501] [BE] Add type filter to datasets list endpoint

Add `type` query parameter to GET /v1/private/datasets to filter by
dataset type (dataset or evaluation_suite). This enables the frontend
to separate evaluation suites from regular datasets in the table view.

Changes:
- DatasetsResource: Add @QueryParam("type") DatasetType type
- DatasetCriteria: Add DatasetType type field
- DatasetDAO: Add type filter to all 6 SQL queries
- DatasetService: Pass type through all code paths
- DatasetsResourceTest: Add 4 integration tests for type filtering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revision 2: Consolidate type filter tests into parameterized test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revision 3: Add DatasetTypeParamConverter for lowercase query param binding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revision 4: Extract AbstractParamConverterProvider to share converter boilerplate

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revision 5: Derive allowed values from DatasetType.values() in error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revision 6: Use isBlank instead of isEmpty in AbstractParamConverterProvider

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revision 7: Address PR nits — text blocks, @NonNull, SQL skill

- Convert 6 string-concatenated SQL queries to text blocks in DatasetDAO
- Add @NonNull to targetType in AbstractParamConverterProvider
- Add SQL text blocks gotcha to backend SKILL.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [NA] [DOCS] Update docs announcement banner (#5345)

* [NA] [BE] Update model prices file (#5359)

* Bump com.jayway.jsonpath:json-path in /apps/opik-backend (#5360)

Bumps [com.jayway.jsonpath:json-path](https://github.com/jayway/JsonPath) from 2.10.0 to 3.0.0.
- [Release notes](https://github.com/jayway/JsonPath/releases)
- [Changelog](https://github.com/json-path/JsonPath/blob/master/changelog.md)
- [Commits](https://github.com/jayway/JsonPath/compare/json-path-2.10.0...json-path-3.0.0)

---
updated-dependencies:
- dependency-name: com.jayway.jsonpath:json-path
  dependency-version: 3.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Andres Cruz <andresc@comet.com>

* [NA] [SDK] [DOCS] Update automatically OpenAPI spec and Fern code (#5358)

Co-authored-by: Andres Cruz <andresc@comet.com>

* [OPIK-4512] [FE] Fix section header labels hidden in dark mode (#5319)

* Revert "[OPIK-4437] [BE] Workspace permissions generic solution (#5207)" (#5362)

This reverts commit 1f607a05f54a6c5ad4c7d003ef850614cbf37d9b.

* Update versions to 1.10.19 and bump base version to 1.10.20

* [OPIK-4547][HELM] Add support for additional profiles and users for clickhouse (#5356)

* OPIK-4547: add support for additional profiles and users configuration for clickhouse

* keep old clickhouse monitoring configuration for backward compatibility

* add option of inline password

* [issue-2202] [BE] [SDK] feat: Support OpenAI TTS models tracking (audio.speech) (#5010)

* feat(sdk): add TTS create decorator and audio patching

- Create audio/ module with TTSCreateTrackDecorator and
  TTSStreamingResponseCreateTrackDecorator
- Patch audio.speech.create and
  audio.speech.with_streaming_response.create in opik_tracker.py
- Track input parameters, character-based usage, model and provider
- Follow existing patterns from videos integration

* feat(backend): add audio_speech cost calculation

- Add inputCostPerCharacter field to ModelCostData
- Add audioInputCharacterPrice field to ModelPrice
- Add audioSpeechCost calculator to SpanCostCalculator
- Wire AUDIO_SPEECH mode in CostService.resolveCalculator()
- Cost = inputCostPerCharacter * input_characters from usage

* test: add unit tests for TTS integration and audio speech cost

Python SDK tests:
- test_openai_audio_speech_create__happyflow
- test_openai_audio_speech_with_streaming_response__happyflow
- test_openai_audio_speech_create__tts_1_hd (model name tracking)
- test_openai_audio_speech_create__with_optional_params
- test_openai_audio_speech_create__character_count_usage

Backend tests:
- audioSpeechCost calculator tests (zero price, zero chars, validation)
- audioSpeechCost for tts-1 and tts-1-hd pricing
- CostService integration tests for tts-1 and tts-1-hd
- Updated existing video tests for new ModelPrice constructor

* fix: address PR review - parameterize tests and remove unused imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align TTS usage keys with OpikUsage pipeline and fix linting

Add completion_tokens:0 to TTS usage dicts so they correctly pass
through the OpenAI completions usage pipeline instead of falling
through to unknown provider format. Update test expectations to
match the actual backend-compatible usage format (original_usage.*
prefixed keys). Update backend audioSpeechCost to read
original_usage.input_characters with fallback for backward compat.
Fix ruff formatting issues.
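
The usage shape and the cost lookup described above can be sketched as follows. This is a Python illustration of logic that lives in the Java backend; the helper names, the unprefixed fallback key, and the price are assumptions.

```python
from decimal import Decimal
from typing import Any, Dict


def build_tts_usage(input_text: str) -> Dict[str, int]:
    """Hypothetical usage dict: characters are reported in the prompt slot."""
    characters = len(input_text)
    return {
        "input_characters": characters,
        "prompt_tokens": characters,
        "completion_tokens": 0,  # keeps the dict on the OpenAI completions path
    }


def audio_speech_cost(
    usage: Dict[str, Any], input_cost_per_character: Decimal
) -> Decimal:
    """Cost = inputCostPerCharacter * input_characters, preferring the prefixed key."""
    characters = usage.get(
        "original_usage.input_characters",
        usage.get("input_characters", 0),  # assumed fallback key for older payloads
    )
    return input_cost_per_character * Decimal(characters)


PRICE = Decimal("0.000015")  # illustrative per-character price, not from the source
new_style = {"original_usage.input_characters": 1000}
old_style = build_tts_usage("x" * 1000)
```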

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix TTS streaming

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
Co-authored-by: Petro Tiurin <2856640+petrotiurin@users.noreply.github.com>

* [OPIK-4380] [BE] Fix dataset deadlock issue by wrapping version creation in reactive retry (#5364)

Wrap all `versionService.createVersionFromDelta()` calls inside a canonical
private `createVersionFromDelta()` helper that uses `Mono.fromCallable()` +
`retryWhen(RetryUtils.handleOnDeadLocks())` + `subscribeOn(Schedulers.boundedElastic())`.

This ensures MySQL deadlock exceptions are caught and retried with exponential
backoff (up to 5 attempts, 250ms base, 2s max, 0.5 jitter) instead of
propagating to the caller and causing request failures.

Also adds `RetryUtils.handleOnDeadLocks()` — a reusable `RetryBackoffSpec`
that matches `MySQLTransactionRollbackException` with "Deadlock found" in the
message (checked recursively through causes).
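
The retry policy above is implemented with Reactor in the Java backend. The Python sketch below only illustrates the same idea (cause-chain matching plus capped exponential backoff with jitter); `DeadlockError`, `is_deadlock`, and `retry_on_deadlock` are hypothetical names, and treating "5 attempts" as five retries is an assumption.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class DeadlockError(Exception):
    """Hypothetical stand-in for MySQLTransactionRollbackException."""


def is_deadlock(error: Optional[BaseException]) -> bool:
    """Walk the cause chain looking for a deadlock rollback."""
    while error is not None:
        if isinstance(error, DeadlockError) and "Deadlock found" in str(error):
            return True
        error = error.__cause__
    return False


def retry_on_deadlock(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.25,
    max_delay: float = 2.0,
    jitter: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    retries = 0
    while True:
        try:
            return operation()
        except Exception as error:
            retries += 1
            if not is_deadlock(error) or retries > max_retries:
                raise
            delay = min(base_delay * 2 ** (retries - 1), max_delay)
            # Spread the delay so concurrent writers do not retry in lockstep.
            sleep(delay * (1 + random.uniform(-jitter, jitter)))
```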

Implements OPIK-4380: fix dataset deadlock issue

* [OPIK-3137][OPIK-3161] [BE][FE] Add bulk tag adding and removing operations (#5122)

* [OPIK-3137][OPIK-3161] [BE][FE] Add bulk tag adding and removing operations

Implement single-call batch operations for adding and removing tags across multiple entities:

Backend:
- Add tagsToAdd/tagsToRemove fields to API models (Experiments, Traces, Spans, Threads, Dataset Items)
- Update ClickHouse DAOs with array operations (arrayConcat, arrayFilter, arrayDistinct)
- Add backwards compatibility with existing tags/mergeTags parameters
- Add validation (100 char max per tag, 50 max tags for experiments)
- Add integration tests for batch tag operations

Frontend:
- Update mutation hooks to support tagsToAdd/tagsToRemove
- Simplify AddTagDialog components to use single API call
- Add shared ManageTagsDialog component with client-side validation
- Add maxEntities limit (1000) with improved UX
- Add comprehensive unit tests for ManageTagsDialog

Implements OPIK-3137: Allow multiple tags to be added in bulk
Implements OPIK-3161: Allow tag removal from multiple items at once
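
The intended semantics of a single-call tag update can be sketched in Python. The real logic runs as ClickHouse array operations; the order of operations, the `apply_tag_update` helper, and the error type below are assumptions for illustration.

```python
from typing import Iterable, List

MAX_TAGS = 50  # mirrors the per-entity limit named above
MAX_TAG_LENGTH = 100  # mirrors the per-tag character limit named above


def apply_tag_update(
    existing: Iterable[str],
    tags_to_add: Iterable[str],
    tags_to_remove: Iterable[str],
) -> List[str]:
    """Concatenate, deduplicate, then filter out removed tags."""
    to_add = list(tags_to_add)
    if any(len(tag) > MAX_TAG_LENGTH for tag in to_add):
        raise ValueError("tag exceeds maximum length")
    removed = set(tags_to_remove)
    # dict.fromkeys deduplicates while preserving first-seen order.
    merged = list(dict.fromkeys([*existing, *to_add]))
    result = [tag for tag in merged if tag not in removed]
    if len(result) > MAX_TAGS:
        raise ValueError("tag limit exceeded")
    return result
```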

* Revision 1: Address PR feedback - code quality improvements

Backend:
- Extract duplicated SQL tags logic into SqlFragments.tagUpdateFragment()
- Refactor test naming from test* to batchUpdateWhen* for clarity
- Parameterize duplicate validation tests using @MethodSource
- Fix import order in TraceThreadDAO

Frontend:
- Optimize tag operations from O(n²) to O(n) using Set data structures
- Fix React state update bug by moving setOpen to useEffect
- Prevent adding tags that already exist on all selected entities
- Add deduplication in draft mode tag additions
- Fix import ordering violations

* Revision 2: Cleanup tags/mergeTags approach and improve tag limits

Backend:
- Add comprehensive documentation explaining dual tag update strategy
  (NEW: tagsToAdd/tagsToRemove for frontend vs OLD: tags+mergeTags for SDK)
- Clarify backwards compatibility requirements in all DAOs
- Document mutual exclusivity and precedence rules

Frontend:
- Remove dead code: eliminate tags/mergeTags from mutation hooks
  (experiments, traces, spans, threads, dataset items)
- Add comment to usePromptVersionsUpdateMutation explaining why it still
  uses the legacy approach (SQL complexity for bulk removals)
- Convert maxTags, maxEntities, maxTagLength from props to constants
  (never overridden in production code)
- Update tests to reflect validation on submit instead of on add

* Revision 3: Add validation limits for tag fields in Update DTOs

- Add @Valid and @Size annotations to tags, tagsToAdd, and tagsToRemove fields
- Limit tags to max 50 per operation and 100 characters per tag
- Applied to TraceUpdate, SpanUpdate, ExperimentUpdate, DatasetItemUpdate, TraceThreadUpdate
- Provides per-operation validation at API boundary level

* Revision 4: Enforce 50-tag limit via SQL throwIf and refactor tag operations

- Rename SqlFragments to TagOperations; add TagUpdatable interface, shared
  configureTagTemplate/bindTagParams helpers, and mapTagLimitError for 422 mapping
- Add ClickHouse throwIf validation in tagUpdateFragment with short_circuit_function_evaluation
- Add @Valid @Size annotations on all Update DTOs (tags, tagsToAdd, tagsToRemove)
- Add server-side tag limit check in DatasetItemService.applyDeltaChanges for versioned inserts
- Wire mapTagLimitError into all service reactive chains (experiments, traces, spans, threads, dataset items)
- Update 6 frontend mutation hooks to parse ErrorMessage.errors[0] with message fallback
- Add integration test for sequential tag limit enforcement

* Revision 5: Redesign ManageTagsDialog with a simpler UI that shows only common tags

- Rename to "Manage shared tags", show only tags common to all items
- Replace accordion layout with flat inline tag list
- Add inline "+ Add tag" editable tag (click to input, Enter to add)
- Removed tags shown as strikethrough with faded original color
- New tags distinguished with outline border
- RemovableTag: overlay close icon with gradient mask blend
- RemovableTag: tooltip on truncated tags, max-w-40 constraint
- Scrollable tag area for many tags

* Revision 6: Extract shared TagUpdateFields type and buildTagUpdatePayload helper

* Revision 7: Improve ManageTagsDialog UX - inline save/discard, Enter shortcut, Escape handling

* Revision 8: Improve error handling - join multi-error responses and extract to shared utility

- Add extractErrorMessage utility to lib/tags.ts that handles multi-error responses
- Join multiple backend errors with comma separator instead of only showing first error
- Add fallback chain: errors array → message → error.message → generic message
- Update 6 mutation hooks to use shared extractErrorMessage utility
- Remove lodash/get imports where no longer needed
- Improve toast message in ManageTagsDialog to show item count
- Update tagUpdateFragment in TagOperations.java: avoid running throwIf when not needed

* Revision 9: Fix operator precedence issue in tagUpdateFragment

The .replace() method was binding to the last text block instead of the
concatenated result due to operator precedence (. has higher precedence than +).

Added parentheses to force concatenation before calling .replace(), ensuring
TAGS_COL placeholder is properly replaced throughout the SQL fragment.

Fixes: BatchTagOperationsTest
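
A minimal sketch of the precedence issue described in Revision 9. It is not the actual tagUpdateFragment code: the class name, the `<TAGS_COL>` placeholder syntax, the fragment contents, and the `tags` column name are all hypothetical and chosen only for illustration. It assumes Java 15 or later for text blocks.

```java
public class ReplacePrecedenceDemo {

    // Hypothetical placeholder syntax; the real fragment may spell it differently.
    static final String PLACEHOLDER = "<TAGS_COL>";

    // Buggy form: a method call binds tighter than +, so replace() is applied
    // only to the second text block. The first block keeps its placeholder.
    static String buggy() {
        return """
                first: <TAGS_COL>
                """ + """
                second: <TAGS_COL>
                """.replace(PLACEHOLDER, "tags");
    }

    // Fixed form: parentheses force the concatenation to happen first, so
    // replace() sees the whole fragment and rewrites every placeholder.
    static String fixed() {
        return ("""
                first: <TAGS_COL>
                """ + """
                second: <TAGS_COL>
                """).replace(PLACEHOLDER, "tags");
    }

    public static void main(String[] args) {
        System.out.print(buggy());
        System.out.print(fixed());
    }
}
```

In this sketch `buggy()` still contains `<TAGS_COL>` in its first line, while `fixed()` contains no placeholder at all, which mirrors the failure mode the commit describes.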

* Revision 10: Add more batch tag tests, simplify RemovableTag tooltip, fix removed tag truncation

[BE] Add 3 tests to BatchTagOperationsTest: tags with mergeTags=false, tagsToAdd
with tags simultaneously, and no tag fields provided. Format SQL in all DAOs for
consistent tagUpdateFragment concatenation.

[FE] Simplify RemovableTag to always show tooltip instead of conditional truncation
detection. Fix long tags not truncating when marked for removal in ManageTagsDialog.

* Revision 11: Reuse TAG_LIMIT_ERROR constant in DatasetItemService

* [OPIK-4359] Revert redundant changes

* [OPIK-4359] Fix test

* [OPIK-4357] Fixes after merge

* [OPIK-4357] Remove ViewSelector from plugin

* [OPIK-4357] Remove redundant changes

* [OPIK-4357] Move type from component

* [OPIK-4357] Hide dashboard view contents when no permission

* [OPIK-4357] Simplify view handling

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Petro Tiurin <2856640+petrotiurin@users.noreply.github.com>
Co-authored-by: hzt <3061613175@qq.com>
Co-authored-by: Aliaksandr Kuzmik <98702584+alexkuzmik@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: andrii.dudar <andriid@comet.com>
Co-authored-by: github-actions <github-actions@comet.com>
Co-authored-by: Iaroslav Omelianenko <yaric_mail@yahoo.com>
Co-authored-by: Vincent Koc <vincentk@comet.com>
Co-authored-by: Daniel Dimenshtein <danield@comet.com>
Co-authored-by: 518miker92 <73313815+518miker92@users.noreply.github.com>
Co-authored-by: CometActions <126667691+CometActions@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andres Cruz <andresc@comet.com>
Co-authored-by: Miguel García <miguelg@comet.com>
Co-authored-by: avinahradau <a.l.vinogradov1986@gmail.com>
Co-authored-by: Liya Katz <liyak@comet.com>
Co-authored-by: Samoppakiks <87269875+Samoppakiks@users.noreply.github.com>
Co-authored-by: Thiago dos Santos Hora <thiagoh@comet.com>

Labels

- Backend
- 🙋 Bounty claim
- java: Pull requests that update Java code
- Python SDK
- python: Pull requests that update Python code
- 💰 Rewarded
- tests: Including test files, or tests related like configuration.

Development

Successfully merging this pull request may close these issues.

[FR]: Support Openai TTS models tracking

4 participants