fix(openai): prioritize api-provided token over tiktoken calculation #3142
Conversation
Important
Looks good to me! 👍
Reviewed everything up to d580aba in 2 minutes and 20 seconds.
- Reviewed 1138 lines of code in 5 files
- Skipped 1 file when reviewing
- Skipped posting 5 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:525
Draft comment: Good use of the API 'usage' field. Consider adding a comment to clarify what happens if only one of prompt_tokens or completion_tokens is provided, so the fallback logic is explicit.
Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 0% vs. threshold = 50%). The code checks each token count separately and falls back individually, so the fallback behavior follows directly from the if-statement structure and is self-documenting. An additional comment would be redundant, so the draft should be deleted.
2. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:540
Draft comment: When concatenating message contents for token calculation, adding a delimiter (e.g., a space) between messages could prevent merging words and ensure accurate token counts.
Reason this comment was not posted: Comment was on unchanged code.
3. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:977
Draft comment: Usage info from streaming chunks is set from the last chunk (or choices[0]). Ensure this overwriting behavior is intended; consider accumulating usage metrics over chunks if applicable.
Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). The comment is speculative: it asks for confirmation without evidence that the current behavior is wrong, and the code comment already explains that using the final usage stats is intentional to handle different provider implementations. Without evidence that accumulating usage across chunks would be better, the draft should be deleted.
4. packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py:38
Draft comment: Test validates API usage metrics are captured when provided. Consider adding a test case where the API response lacks usage info so that the fallback (tiktoken calculation) path is exercised.
Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). This file is intentionally scoped to API providers that include usage information, as its name and docstrings indicate; a test for the tiktoken fallback path would belong in a separate test file, so the draft should be deleted.
5. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py:374
Draft comment: Typo in comment: 'tiktok' should be changed to 'tiktoken' to match usage later in the code.
Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). The typo is real, but it sits in an internal comment, does not affect functionality, and the correct library name appears in the import and throughout the code, so flagging it would only add review noise. The draft should be deleted.
Workflow ID: wflow_zQUnzXkPfW5cXaDE
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
@nirga Some questions here:
To simplify and improve usability, we should consider either: (1) deprecating the param …
Walkthrough
The changes improve token usage metric handling in streaming chat completions by prioritizing API-provided usage data and only falling back to tiktoken-based calculations when necessary. Exception handling for tiktoken encoding retrieval is broadened. New tests and a cassette are added to verify correct span and log recording with streaming API usage data.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Instrumentation
participant API
Client->>API: Send streaming chat completion request
API-->>Client: Stream chunks (some may include usage data)
Client->>Instrumentation: Pass each chunk
loop For each chunk
Instrumentation->>Instrumentation: Accumulate content
alt Chunk contains usage data
Instrumentation->>Instrumentation: Extract usage data
end
end
alt Usage data present in API response
Instrumentation->>Instrumentation: Set token metrics from API usage
else
Instrumentation->>Instrumentation: Compute token counts using tiktoken
end
Instrumentation->>Client: Provide accumulated response and metrics
```
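Reading the walkthrough and the diagram together, the control flow amounts to roughly the following. This is a simplified sketch for illustration only: the function name, the -1 sentinel, the stand-in token counter, and the dict shapes are assumptions based on the snippets quoted later in this review, not the exact merged code.

```python
# Illustrative sketch only; mirrors the walkthrough, not the merged implementation.
import tiktoken


def _count_tokens(text: str, model_name: str) -> int:
    """Fallback counter (stand-in for the package's get_token_count_from_string)."""
    try:
        encoding = tiktoken.encoding_for_model(model_name)
    except Exception:
        return -1  # unknown model / encoding failure: leave the metric unset
    return len(encoding.encode(text))


def set_streaming_token_metrics(complete_response, request_kwargs, model_name):
    prompt_usage = -1       # -1 means "not reported by the API"
    completion_usage = -1

    # 1) Prefer usage reported by the API in the final streaming chunk.
    usage = complete_response.get("usage") or {}
    if usage.get("prompt_tokens") is not None:
        prompt_usage = usage["prompt_tokens"]
    if usage.get("completion_tokens") is not None:
        completion_usage = usage["completion_tokens"]

    # 2) Fall back to local tiktoken counting only for values the API did not report.
    if prompt_usage == -1 and request_kwargs.get("messages"):
        prompt_text = "".join(
            m.get("content") or "" for m in request_kwargs["messages"]
        )
        prompt_usage = _count_tokens(prompt_text, model_name)

    if completion_usage == -1 and complete_response.get("choices"):
        completion_text = "".join(
            c.get("message", {}).get("content") or ""
            for c in complete_response["choices"]
        )
        completion_usage = _count_tokens(completion_text, model_name)

    return prompt_usage, completion_usage
```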
Bug: Token Count Validation Fails
The if conditions checking for prompt_tokens and completion_tokens from the API response use truthiness checks, so a legitimate count of 0 is treated as missing. This causes the code to bypass valid zero token counts and unnecessarily fall back to tiktoken calculation. The conditions should check is not None to properly handle legitimate zero token counts.
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py#L527-L531
Lines 527 to 531 in 6cb10d5
```python
usage = complete_response["usage"]
if usage.get("prompt_tokens"):
    prompt_usage = usage["prompt_tokens"]
if usage.get("completion_tokens"):
    completion_usage = usage["completion_tokens"]
```
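For reference, a minimal sketch of the change BugBot is suggesting, assuming the surrounding variables keep the same names as the quoted lines:

```python
usage = complete_response["usage"]
# Explicit None checks: a reported count of 0 is valid and must not trigger the fallback.
if usage.get("prompt_tokens") is not None:
    prompt_usage = usage["prompt_tokens"]
if usage.get("completion_tokens") is not None:
    completion_usage = usage["completion_tokens"]
```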
Was this report helpful? Give feedback by reacting with 👍 or 👎
Actionable comments posted: 0
🧹 Nitpick comments (3)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (1)
525-557: Excellent implementation of API usage prioritization!
The logic correctly prioritizes API-provided token usage data and only falls back to tiktoken calculations when necessary. The approach is robust and maintains backward compatibility.
Consider a minor optimization to avoid redundant should_record_stream_token_usage() calls:

```diff
 # Calculate prompt tokens if not available from API
-if prompt_usage == -1 and request_kwargs and request_kwargs.get("messages"):
+if prompt_usage == -1 and request_kwargs and request_kwargs.get("messages") and should_record_stream_token_usage():
     prompt_content = ""
     for msg in request_kwargs.get("messages"):
         if msg.get("content"):
             prompt_content += msg.get("content")
-    if model_name and should_record_stream_token_usage():
+    if model_name:
         prompt_usage = get_token_count_from_string(prompt_content, model_name)

 # Calculate completion tokens if not available from API
-if completion_usage == -1 and complete_response.get("choices"):
+if completion_usage == -1 and complete_response.get("choices") and should_record_stream_token_usage():
     completion_content = ""
     for choice in complete_response.get("choices"):
         if choice.get("message") and choice.get("message").get("content"):
             completion_content += choice["message"]["content"]
-    if model_name and should_record_stream_token_usage():
+    if model_name:
         completion_usage = get_token_count_from_string(
             completion_content, model_name
         )
```

packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (2)
58-59: Address unused loop variable.
The loop variable chunk is not used within the loop body. Following the static analysis suggestion, rename it to _chunk to indicate it's intentionally unused.

```diff
-    for chunk in response:
-        pass  # Just consume the stream
+    for _chunk in response:
+        pass  # Just consume the stream
```
71-72: Consider making the log assertion more specific.
The assertion len(logs) >= 2 could be more specific about what logs are expected, to improve test reliability and clarity. Consider asserting the exact number of expected logs or checking for specific log types:

```diff
-    # Check event logs
-    logs = log_exporter.get_finished_logs()
-    assert len(logs) >= 2  # At least user message and assistant choice
+    # Check event logs (user message + assistant response)
+    logs = log_exporter.get_finished_logs()
+    assert len(logs) == 2, f"Expected 2 logs (user + assistant), got {len(logs)}"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
packages/opentelemetry-instrumentation-openai/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1 hunks)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (2 hunks)
- packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_streaming_with_api_usage/test_streaming_with_api_usage_capture.yaml (1 hunks)
- packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (2)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/utils.py (1)
- should_record_stream_token_usage (34-35)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1)
- get_token_count_from_string (358-385)
packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (2)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
- SpanAttributes (36-229)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
- get_finished_spans (40-43)
🪛 Ruff (0.12.2)
packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py
58-58: Loop control variable chunk not used within loop body
Rename unused chunk to _chunk
(B007)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.9)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
- GitHub Check: Cursor BugBot
🔇 Additional comments (4)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1)
373-378: LGTM! Improved error handling for tiktoken encoding retrieval.
The broadened exception handling ensures that any unexpected errors from tiktoken.encoding_for_model() are caught and logged gracefully, making the token counting fallback more robust. This aligns well with the PR's objective to prioritize API-provided token data.
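As a rough illustration of the pattern being praised here (a sketch, not the module's actual code; the body, logging message, and -1 return value are assumptions):

```python
import logging

import tiktoken

logger = logging.getLogger(__name__)


def get_token_count_from_string(string: str, model_name: str) -> int:
    # Broad except: any failure in the encoding lookup (unknown model name,
    # registry issues, etc.) is logged and treated as "cannot count locally".
    try:
        encoding = tiktoken.encoding_for_model(model_name)
    except Exception as ex:
        logger.warning(
            "Failed to get tiktoken encoding for model %s: %s", model_name, ex
        )
        return -1  # assumed sentinel; callers can skip recording the metric
    return len(encoding.encode(string))
```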
packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_streaming_with_api_usage/test_streaming_with_api_usage_capture.yaml (1)
1-165: LGTM! Well-structured test cassette for streaming API usage validation.
The cassette properly captures a streaming chat completion interaction with the Deepseek API, including the crucial usage data in the final chunk (prompt_tokens, completion_tokens, total_tokens). This provides excellent test coverage for the new API usage prioritization functionality.
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (1)
976-981: Great handling of usage data capture from streaming chunks.
The implementation correctly captures usage information from the last streaming chunks and includes a thoughtful fallback for providers like Moonshot that mistakenly place usage data within choices[0]. This ensures comprehensive coverage across different API providers.
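A rough sketch of the chunk-handling idea described above (attribute access and the dict shape are illustrative; the actual chunk objects come from the OpenAI SDK, and the choices[0] fallback is only hinted at by this comment):

```python
def accumulate_usage(chunk, complete_response):
    # Usage normally arrives on the final chunk(s); some providers (e.g. Moonshot)
    # put it inside choices[0] instead, so check there as a fallback.
    usage = getattr(chunk, "usage", None)
    if usage is None and getattr(chunk, "choices", None):
        usage = getattr(chunk.choices[0], "usage", None)
    if usage is not None:
        complete_response["usage"] = {
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
        }
```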
packages/opentelemetry-instrumentation-openai/tests/traces/test_streaming_with_api_usage.py (1)
1-45: Excellent test coverage for streaming API usage functionality.
The tests comprehensively validate the new streaming token usage capture functionality, including proper span attribute verification and content accumulation. The use of the Deepseek API as a substitute for testing is pragmatic.
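The test file itself is not reproduced in this review; a hypothetical sketch of what such a test might look like (the fixture names, model, and exact assertions are assumptions):

```python
from opentelemetry.semconv_ai import SpanAttributes


def test_streaming_with_api_usage_capture(openai_client, span_exporter):
    # stream_options asks the API to report usage in the final streamed chunk.
    response = openai_client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": "Say hello"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    for _chunk in response:
        pass  # just consume the stream

    span = span_exporter.get_finished_spans()[0]
    assert span.attributes[SpanAttributes.LLM_USAGE_PROMPT_TOKENS] > 0
    assert span.attributes[SpanAttributes.LLM_USAGE_COMPLETION_TOKENS] > 0
```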
fix(openai): prioritize api-provided token over tiktoken calculation (#3142)
Co-authored-by: Nir Gazit <nirga@users.noreply.github.com>
feat(instrumentation): ... or fix(instrumentation): ...
Feat #3139. Since I don't have an official OpenAI API key, I use the Deepseek API for the unit tests.
Important
Prioritize API-provided token usage data over local calculations in OpenAI instrumentation, with tests for verification.
- Prioritize API-provided token usage over tiktoken calculations in _set_streaming_token_metrics() in chat_wrappers.py, with a fallback for usage data reported in choices[0].
- Add test_streaming_with_api_usage_capture() and test_streaming_with_api_usage_and_events() in test_streaming_with_api_usage.py to verify token usage capture and event logging.
- Use the deepseek API for testing due to lack of official OpenAI API access.

This description was created by Ellipsis for d580aba. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
Bug Fixes
- Token usage metrics for streaming chat completions now prioritize API-provided usage data, falling back to tiktoken-based calculation only when necessary.
New Features
- Broadened exception handling when retrieving the tiktoken encoding, making the token-count fallback more robust.
Tests
- Added tests and a cassette verifying span and log recording for streaming responses that include API usage data.