
Conversation

Contributor

@minimAluminiumalism minimAluminiumalism commented Jul 16, 2025

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Addresses #3139. Since I don't have an official OpenAI API key, I used the DeepSeek API for the unit tests.


Important

Prioritize API-provided token usage data over local calculations in OpenAI instrumentation, with tests for verification.

  • Behavior:
    • Prioritize API-provided token usage data over tiktoken calculations in _set_streaming_token_metrics() in chat_wrappers.py.
    • Handle cases where token usage data is misplaced in choices[0].
  • Tests:
    • Add test_streaming_with_api_usage_capture() and test_streaming_with_api_usage_and_events() in test_streaming_with_api_usage.py to verify token usage capture and event logging.
  • Misc:
    • Use deepseek API for testing due to lack of official OpenAI API access.

This description was created by Ellipsis for d580aba.
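
For illustration, a minimal sketch of the fallback order described above (the helper name _resolve_stream_token_usage and its callback parameters are hypothetical; the actual change lives in _set_streaming_token_metrics() in chat_wrappers.py and calls should_record_stream_token_usage() and get_token_count_from_string() directly):

def _resolve_stream_token_usage(complete_response, request_kwargs, model_name,
                                count_tokens, should_count_locally):
    """Prefer API-reported usage; fall back to local counting only for missing values."""
    prompt_usage = completion_usage = -1  # -1 means "not known yet"

    usage = complete_response.get("usage") or {}
    # Explicit None checks keep a legitimate count of 0 (see the Cursor note further down).
    if usage.get("prompt_tokens") is not None:
        prompt_usage = usage["prompt_tokens"]
    if usage.get("completion_tokens") is not None:
        completion_usage = usage["completion_tokens"]

    # Only fall back to tiktoken-style counting for values the API did not provide.
    if prompt_usage == -1 and request_kwargs.get("messages") and should_count_locally():
        prompt_text = "".join(m.get("content") or "" for m in request_kwargs["messages"])
        prompt_usage = count_tokens(prompt_text, model_name)

    if completion_usage == -1 and complete_response.get("choices") and should_count_locally():
        completion_text = "".join(
            (c.get("message") or {}).get("content") or ""
            for c in complete_response["choices"]
        )
        completion_usage = count_tokens(completion_text, model_name)

    return prompt_usage, completion_usage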

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling when retrieving token encodings, ensuring more robust behavior if unexpected issues occur with the tokenizer.
  • New Features

    • Enhanced accuracy of token usage metrics during streaming chat completions by preferring API-provided usage data and calculating token counts only when necessary.
  • Tests

    • Added new tests and test data to verify correct handling and logging of token usage metrics in streaming chat completions, including scenarios with API-provided usage information.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to d580aba in 2 minutes and 20 seconds.
  • Reviewed 1138 lines of code in 5 files
  • Skipped 1 file when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:525
  • Draft comment:
    Good use of the API 'usage' field. Consider adding a comment to clarify what happens if only one of prompt_tokens or completion_tokens is provided, so the fallback logic is explicit.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The code is actually quite clear about what happens - it checks for each token count separately and falls back individually if needed. The fallback logic is straightforward and follows naturally from the code structure. Adding a comment would be redundant since the code is self-documenting through its clear if-statement structure. The comment raises a valid point about documenting edge cases. Documentation about partial API responses could help future maintainers. While documentation is good, in this case the code structure itself clearly shows the fallback behavior through separate if-statements for each token type. The logic is simple enough that it doesn't require additional explanation. The comment should be deleted because the code is already clear and self-documenting about its fallback behavior. Adding a comment would be redundant.
2. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:540
  • Draft comment:
    When concatenating message contents for token calculation, adding a delimiter (e.g., a space) between messages could prevent merging words and ensure accurate token counts.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:977
  • Draft comment:
    Usage info from streaming chunks is set from the last chunk (or choices[0]). Ensure this overwriting behavior is intended; consider accumulating usage metrics over chunks if applicable.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment is speculative - it asks "ensure this is intended" without providing evidence that the current behavior is wrong. The code comment explains this is intentionally handling different provider implementations. The suggestion to accumulate metrics is not backed by evidence that this would be better. The current implementation seems reasonable for handling usage stats from different providers. Maybe there are valid use cases for accumulating usage across chunks that I'm not considering. The comment could be pointing to a real issue with metric accuracy. While accumulation could be valid, without clear evidence of issues with the current approach, this is a speculative suggestion. The code's approach of using the final usage stats appears intentional and documented. Delete the comment as it is speculative and asks for confirmation without providing evidence of actual issues. The code's behavior appears intentional and documented.
4. packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py:38
  • Draft comment:
    Test validates API usage metrics are captured when provided. Consider adding a test case where the API response lacks usage info so that the fallback (tiktoken calculation) path is exercised.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a test file specifically focused on testing API providers that include usage information (as stated in the fixture docstring). The suggestion to test the fallback path seems reasonable from a testing perspective, but would likely belong in a separate test file focused on that scenario. The current file's scope is intentionally limited to testing the "with API usage" case. The comment raises a valid point about test coverage. However, I may be wrong about whether this belongs in a separate file - perhaps this file should test both paths since they're related functionality. While testing both paths could make sense, the file name and docstrings clearly indicate this file's focused scope. Adding the fallback test here would muddy that focus. The comment should be deleted as it suggests expanding beyond the intentionally focused scope of this test file. The fallback path test would be better placed in a separate test file.
5. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py:374
  • Draft comment:
    Typo in comment: 'tiktok' should be changed to 'tiktoken' to match usage later in the code.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% While the comment is technically correct (there is indeed a typo), comments about typos in comments are generally very low value. The typo doesn't affect functionality at all. The code works perfectly fine regardless of this typo. This kind of nitpick creates noise in the review process. The typo could potentially confuse future readers who might think tiktok is the actual library name. Documentation accuracy has value. While documentation accuracy has some value, this is an internal comment in an error handling block. The correct library name is used prominently in the import statement and throughout the code. The impact of this typo is minimal. This comment should be deleted. While technically correct, comments about typos in comments are too minor to be worth the review noise.

Workflow ID: wflow_zQUnzXkPfW5cXaDE

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@minimAluminiumalism
Contributor Author

@nirga Some questions here:

  1. Given how commonly LLM SDK responses already include token usage data, is it redundant to gate its capture behind the enrich_token_usage param?
  2. There also appears to be an inconsistency in the default behavior: Traceloop.init() sets should_enrich_metrics=True (implying enrich_token_usage is enabled), so token usage is included in metrics. However, instrumentors such as OpenAIInstrumentor/AnthropicInstrumentor/GroqInstrumentor/... default enrich_token_usage to False, which prevents token data from being captured when packages like opentelemetry-instrumentation-openai are used directly. Isn't that a design inconsistency?
def __init__(
        self,
        enrich_assistant: bool = False,
        enrich_token_usage: bool = False,
        exception_logger=None,
        get_common_metrics_attributes: Callable[[], dict] = lambda: {},
        upload_base64_image: Optional[
            Callable[[str, str, str, str], Coroutine[None, None, str]]
        ] = lambda *args: "",
        enable_trace_context_propagation: bool = True,
        use_legacy_attributes: bool = True,
    ):

To simplify and improve usability, we should consider either (1) deprecating the enrich_token_usage param entirely, or (2) defaulting enrich_token_usage to True. In either case, we should allow configuration via environment variables, similar to opentelemetry-instrument, so users can control the behavior without code changes.
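
As a rough illustration of the environment-variable idea (TRACELOOP_ENRICH_TOKEN_USAGE is a hypothetical variable name, not an existing setting):

import os

def _env_flag(name: str, default: bool) -> bool:
    # Hypothetical helper: read a boolean flag from the environment.
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

# Instrumentors could then default to the environment value instead of a hard-coded False.
enrich_token_usage_default = _env_flag("TRACELOOP_ENRICH_TOKEN_USAGE", False)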

@nirga nirga changed the title feat(openai): prioritize api-provided token over tiktoken calculation fix(openai): prioritize api-provided token over tiktoken calculation Jul 20, 2025
@coderabbitai
Copy link

coderabbitai bot commented Jul 20, 2025

Walkthrough

The changes improve token usage metric handling in streaming chat completions by prioritizing API-provided usage data and only falling back to tiktoken-based calculations when necessary. Exception handling for tiktoken encoding retrieval is broadened. New tests and a cassette are added to verify correct span and log recording with streaming API usage data.

Changes

| File(s) | Change Summary |
| --- | --- |
| .../openai/shared/__init__.py | Broadened exception handling in get_token_count_from_string to catch all exceptions from tiktoken encoding retrieval. |
| .../openai/shared/chat_wrappers.py | Enhanced logic in streaming token metric extraction to prefer API-provided usage data and only compute tokens if usage is missing. |
| .../tests/traces/test_streaming_with_api_usage.py | Added new tests for streaming chat completions with API usage data, verifying span and log recording. |
| .../tests/traces/cassettes/test_streaming_with_api_usage/test_streaming_with_api_usage_capture.yaml | Added a new test cassette capturing a streaming chat completion with usage data from the DeepSeek API. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Instrumentation
    participant API

    Client->>API: Send streaming chat completion request
    API-->>Client: Stream chunks (some may include usage data)
    Client->>Instrumentation: Pass each chunk
    loop For each chunk
        Instrumentation->>Instrumentation: Accumulate content
        alt Chunk contains usage data
            Instrumentation->>Instrumentation: Extract usage data
        end
    end
    alt Usage data present in API response
        Instrumentation->>Instrumentation: Set token metrics from API usage
    else
        Instrumentation->>Instrumentation: Compute token counts using tiktoken
    end
    Instrumentation->>Client: Provide accumulated response and metrics

Poem

In streaming chat, the tokens flow,
Now counted true, as metrics grow.
API usage leads the way,
tiktoken helps if stats go astray.
With tests and logs, we celebrate—
Each span and joke, we annotate!
🐰✨


cursor[bot]

This comment was marked as outdated.


@cursor cursor bot left a comment


Bug: Token Count Validation Fails

The if conditions checking for prompt_tokens and completion_tokens from the API response incorrectly treat 0 as a falsy value. This causes the code to bypass valid zero token counts and unnecessarily fall back to tiktoken calculation. The conditions should check is not None to properly handle legitimate zero token counts.

packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py#L527-L531

usage = complete_response["usage"]
if usage.get("prompt_tokens"):
    prompt_usage = usage["prompt_tokens"]
if usage.get("completion_tokens"):
    completion_usage = usage["completion_tokens"]
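
The suggested correction would look roughly like this (same snippet, with explicit None checks so a legitimate count of 0 is kept):

usage = complete_response["usage"]
if usage.get("prompt_tokens") is not None:
    prompt_usage = usage["prompt_tokens"]
if usage.get("completion_tokens") is not None:
    completion_usage = usage["completion_tokens"]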





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (1)

525-557: Excellent implementation of API usage prioritization!

The logic correctly prioritizes API-provided token usage data and only falls back to tiktoken calculations when necessary. The approach is robust and maintains backward compatibility.

Consider a minor optimization to avoid redundant should_record_stream_token_usage() calls:

 # Calculate prompt tokens if not available from API
-if prompt_usage == -1 and request_kwargs and request_kwargs.get("messages"):
+if prompt_usage == -1 and request_kwargs and request_kwargs.get("messages") and should_record_stream_token_usage():
     prompt_content = ""
     for msg in request_kwargs.get("messages"):
         if msg.get("content"):
             prompt_content += msg.get("content")
-    if model_name and should_record_stream_token_usage():
+    if model_name:
         prompt_usage = get_token_count_from_string(prompt_content, model_name)

 # Calculate completion tokens if not available from API
-if completion_usage == -1 and complete_response.get("choices"):
+if completion_usage == -1 and complete_response.get("choices") and should_record_stream_token_usage():
     completion_content = ""
     for choice in complete_response.get("choices"):
         if choice.get("message") and choice.get("message").get("content"):
             completion_content += choice["message"]["content"]
-    if model_name and should_record_stream_token_usage():
+    if model_name:
         completion_usage = get_token_count_from_string(
             completion_content, model_name
         )
packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (2)

58-59: Address unused loop variable.

The loop variable chunk is not used within the loop body. Following the static analysis suggestion, rename it to _chunk to indicate it's intentionally unused.

-    for chunk in response:
-        pass  # Just consume the stream
+    for _chunk in response:
+        pass  # Just consume the stream

71-72: Consider making the log assertion more specific.

The assertion len(logs) >= 2 could be more specific about what logs are expected to improve test reliability and clarity.

Consider asserting the exact number of expected logs or checking for specific log types:

-    # Check event logs
-    logs = log_exporter.get_finished_logs()
-    assert len(logs) >= 2  # At least user message and assistant choice
+    # Check event logs (user message + assistant response)
+    logs = log_exporter.get_finished_logs()
+    assert len(logs) == 2, f"Expected 2 logs (user + assistant), got {len(logs)}"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca529e4 and 6cb10d5.

⛔ Files ignored due to path filters (1)
  • packages/opentelemetry-instrumentation-openai/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1 hunks)
  • packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (2 hunks)
  • packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_streaming_with_api_usage/test_streaming_with_api_usage_capture.yaml (1 hunks)
  • packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (2)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/utils.py (1)
  • should_record_stream_token_usage (34-35)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1)
  • get_token_count_from_string (358-385)
packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (2)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (36-229)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
🪛 Ruff (0.12.2)
packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py

58-58: Loop control variable chunk not used within loop body

Rename unused chunk to _chunk

(B007)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.9)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Cursor BugBot
🔇 Additional comments (4)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1)

373-378: LGTM! Improved error handling for tiktoken encoding retrieval.

The broadened exception handling ensures that any unexpected errors from tiktoken.encoding_for_model() are caught and logged gracefully, making the token counting fallback more robust. This aligns well with the PR's objective to prioritize API-provided token data.
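
A minimal sketch of what the broadened handling amounts to (simplified; the real get_token_count_from_string in shared/__init__.py may differ in its signature and fallback value):

import logging
import tiktoken

logger = logging.getLogger(__name__)

def get_token_count_from_string(text: str, model_name: str):
    try:
        encoding = tiktoken.encoding_for_model(model_name)
    except Exception:  # broadened: unknown model names, cache/network errors, etc.
        logger.warning("Failed to get tiktoken encoding for model %s", model_name)
        return None  # the real code may use a different fallback value
    return len(encoding.encode(text))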

packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_streaming_with_api_usage/test_streaming_with_api_usage_capture.yaml (1)

1-165: LGTM! Well-structured test cassette for streaming API usage validation.

The cassette properly captures a streaming chat completion interaction with the Deepseek API, including the crucial usage data in the final chunk (prompt_tokens, completion_tokens, total_tokens). This provides excellent test coverage for the new API usage prioritization functionality.
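
For context, the usage-bearing final chunk of such a stream looks roughly like this (illustrative values, shown as a Python dict rather than the raw cassette YAML):

final_chunk = {
    "id": "chatcmpl-xxx",
    "object": "chat.completion.chunk",
    "model": "deepseek-chat",
    "choices": [],  # may be empty or carry the final finish_reason, depending on the provider
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21,
    },
}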

packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (1)

976-981: Great handling of usage data capture from streaming chunks.

The implementation correctly captures usage information from the last streaming chunks and includes a thoughtful fallback for providers like Moonshot that mistakenly place usage data within choices[0]. This ensures comprehensive coverage across different API providers.
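
A simplified sketch of that extraction (attribute access via getattr is illustrative; the actual wrapper works on the SDK chunk objects and may read the fields differently):

def _extract_usage_from_chunk(chunk):
    # Standard location: a top-level `usage` field, usually on the final chunk.
    usage = getattr(chunk, "usage", None)
    # Fallback for providers (e.g. Moonshot) that nest usage inside choices[0].
    if usage is None and getattr(chunk, "choices", None):
        usage = getattr(chunk.choices[0], "usage", None)
    return usage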

packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (1)

1-45: Excellent test coverage for streaming API usage functionality.

The tests comprehensively validate the new streaming token usage capture functionality, including proper span attribute verification and content accumulation. Using the DeepSeek API as a testing substitute is pragmatic.
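
As a rough illustration of what such a span assertion looks like (the attribute names come from the semconv package referenced above; the actual test may structure this differently):

from opentelemetry.semconv_ai import SpanAttributes

def assert_api_usage_recorded(span):
    # The values should come from the API-reported usage, not a tiktoken estimate.
    attrs = span.attributes
    assert attrs[SpanAttributes.LLM_USAGE_PROMPT_TOKENS] > 0
    assert attrs[SpanAttributes.LLM_USAGE_COMPLETION_TOKENS] > 0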
