fix(openai-http): propagate token usage in /v1/chat/completions response#38893
fix(openai-http): propagate token usage in /v1/chat/completions response#38893extrasmall0 wants to merge 3 commits into
Conversation
Greptile SummaryThis PR fixes a real issue where the non-streaming
Confidence Score: 3/5
Last reviewed commit: 566f864 |
| @@ -503,7 +508,11 @@ export async function handleOpenAiHttpRequest( | |||
| finish_reason: "stop", | |||
| }, | |||
| ], | |||
| usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 }, | |||
| usage: { | |||
| prompt_tokens: promptTokens, | |||
| completion_tokens: completionTokens, | |||
| total_tokens: promptTokens + completionTokens, | |||
There was a problem hiding this comment.
total_tokens omits cache tokens and ignores pre-computed total
The EmbeddedPiAgentMeta.usage shape includes cacheRead, cacheWrite, and a pre-computed total field (see src/agents/pi-embedded-runner/types.ts). This PR only reads input and output, so total_tokens will be understated whenever cache tokens are present.
The sibling handler openresponses-http.ts already solves this correctly — its toUsage helper (lines 170–181) uses:
const total = value.total ?? input + output + cacheRead + cacheWrite;The implementation here should follow the same pattern:
| const agentUsage = ( | |
| result as { | |
| meta?: { | |
| agentMeta?: { | |
| usage?: { | |
| input?: number; | |
| output?: number; | |
| cacheRead?: number; | |
| cacheWrite?: number; | |
| total?: number; | |
| }; | |
| }; | |
| }; | |
| } | |
| )?.meta?.agentMeta?.usage; | |
| const promptTokens = agentUsage?.input ?? 0; | |
| const completionTokens = agentUsage?.output ?? 0; | |
| const cacheReadTokens = agentUsage?.cacheRead ?? 0; | |
| const cacheWriteTokens = agentUsage?.cacheWrite ?? 0; | |
| const totalTokens = | |
| agentUsage?.total ?? promptTokens + completionTokens + cacheReadTokens + cacheWriteTokens; | |
| sendJson(res, 200, { | |
| id: runId, | |
| object: "chat.completion", | |
| created: Math.floor(Date.now() / 1000), | |
| model, | |
| choices: [ | |
| { | |
| index: 0, | |
| message: { role: "assistant", content }, | |
| finish_reason: "stop", | |
| }, | |
| ], | |
| usage: { | |
| prompt_tokens: promptTokens, | |
| completion_tokens: completionTokens, | |
| total_tokens: totalTokens, | |
| }, | |
| }); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/openai-http.ts
Line: 493-514
Comment:
**`total_tokens` omits cache tokens and ignores pre-computed `total`**
The `EmbeddedPiAgentMeta.usage` shape includes `cacheRead`, `cacheWrite`, and a pre-computed `total` field (see `src/agents/pi-embedded-runner/types.ts`). This PR only reads `input` and `output`, so `total_tokens` will be understated whenever cache tokens are present.
The sibling handler `openresponses-http.ts` already solves this correctly — its `toUsage` helper (lines 170–181) uses:
```ts
const total = value.total ?? input + output + cacheRead + cacheWrite;
```
The implementation here should follow the same pattern:
```suggestion
const agentUsage = (
result as {
meta?: {
agentMeta?: {
usage?: {
input?: number;
output?: number;
cacheRead?: number;
cacheWrite?: number;
total?: number;
};
};
};
}
)?.meta?.agentMeta?.usage;
const promptTokens = agentUsage?.input ?? 0;
const completionTokens = agentUsage?.output ?? 0;
const cacheReadTokens = agentUsage?.cacheRead ?? 0;
const cacheWriteTokens = agentUsage?.cacheWrite ?? 0;
const totalTokens =
agentUsage?.total ?? promptTokens + completionTokens + cacheReadTokens + cacheWriteTokens;
sendJson(res, 200, {
id: runId,
object: "chat.completion",
created: Math.floor(Date.now() / 1000),
model,
choices: [
{
index: 0,
message: { role: "assistant", content },
finish_reason: "stop",
},
],
usage: {
prompt_tokens: promptTokens,
completion_tokens: completionTokens,
total_tokens: totalTokens,
},
});
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 566f8643f1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const promptTokens = agentUsage?.input ?? 0; | ||
| const completionTokens = agentUsage?.output ?? 0; |
There was a problem hiding this comment.
Include cached prompt tokens in usage mapping
When result.meta.agentMeta.usage includes cacheRead/cacheWrite (or a precomputed total), mapping prompt_tokens to only usage.input undercounts prompt usage and makes total_tokens (promptTokens + completionTokens) smaller than the agent’s own total. This path is expected because run metadata computes total usage from input + cache + output in src/agents/pi-embedded-runner/run.ts:195-201, so /v1/chat/completions will still report inaccurate token accounting for providers with prompt caching enabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ebcd06f966
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }; | ||
| } | ||
| )?.meta?.agentMeta?.usage; | ||
| const promptTokens = agentUsage?.input ?? 0; |
There was a problem hiding this comment.
Include cached tokens in prompt_tokens for chat usage
When agentMeta.usage has cacheRead/cacheWrite (common with prompt caching), this maps prompt_tokens to only usage.input, so prompt usage is underreported while total_tokens still includes cached tokens via the fallback sum. That produces inconsistent accounting for OpenAI-compatible clients that assume total_tokens = prompt_tokens + completion_tokens and use prompt tokens for billing/cost analytics. Fresh evidence: the newly added cache test in src/gateway/openai-http.test.ts now explicitly expects this mismatch (prompt_tokens: 10 with total_tokens: 38).
Useful? React with 👍 / 👎.
gambletan
left a comment
There was a problem hiding this comment.
Review: propagate token usage in /v1/chat/completions response
Overall: Good fix, well-tested. ✅
What works well
- The usage propagation follows OpenAI's response format correctly (
prompt_tokens,completion_tokens,total_tokens). - Smart
total_tokenslogic: prefersagentMeta.usage.totalwhen available, falls back to summinginput + output + cacheRead + cacheWrite. - Four test cases cover the key scenarios: basic usage, cache tokens in total, precomputed total, and missing usage (zeros fallback).
- Safe defaults (
?? 0) prevent NaN propagation.
Suggestions
-
Type assertion could be cleaner: The inline type cast at lines 493-506 is quite verbose. Consider extracting a helper type or using a shared
AgentMetainterface if one exists in the codebase. Something like:interface AgentUsage { input?: number; output?: number; cacheRead?: number; cacheWrite?: number; total?: number; }
This would be more maintainable than the inline cast.
-
The
as nevercasts in tests (e.g.,} as never) bypass type safety. Consider usingsatisfiesor a proper mock builder if the test infrastructure supports it. -
Streaming endpoint: This PR only covers the non-streaming (
/v1/chat/completions) path. Does the streaming SSE path also need usage propagation? OpenAI's streaming API includes usage in the final[DONE]chunk whenstream_options.include_usageis set. Worth noting if it's out of scope.
LGTM — meaningful improvement for API consumers. 👍
|
Thanks for the thorough review! Good point on the type assertion — I kept it inline to avoid adding a new interface for a single use site, but if usage propagation gets added to the streaming path too it would make sense to extract it then. Re: streaming — yeah, intentionally scoped this to the non-streaming path. The SSE handler has a different flow for the final chunk so figured that can be a follow-up if needed. |
fbd4b0c to
977a7f9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 977a7f94ce
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }; | ||
| } | ||
| )?.meta?.agentMeta?.usage; | ||
| const promptTokens = agentUsage?.input ?? 0; |
There was a problem hiding this comment.
Count cache tokens in prompt_tokens to keep usage consistent
When agentMeta.usage includes cacheRead/cacheWrite, this sets prompt_tokens from usage.input only, but total_tokens is computed with cache tokens included. In cached-prompt runs this underreports prompt usage and breaks the OpenAI-compatible invariant many clients rely on (total_tokens = prompt_tokens + completion_tokens), which can skew billing/analytics for consumers of /v1/chat/completions.
Useful? React with 👍 / 👎.
977a7f9 to
3bd93c4
Compare
…mpletions response The non-streaming /v1/chat/completions handler returned hardcoded zeros for usage (prompt_tokens, completion_tokens, total_tokens). The actual token counts were available in result.meta.agentMeta.usage but were never extracted. Extract input/output token counts from the agent run result and map them to the OpenAI-compatible usage fields. Fixes openclaw#38735
3bd93c4 to
17c14f3
Compare
|
Closing stale PR — will revisit if still relevant. Thanks! |
Problem
The non-streaming
/v1/chat/completionsendpoint always returns hardcoded zero token usage:This makes it impossible for consumers of the OpenAI-compatible API to track token usage for cost monitoring.
Root Cause
The handler discards the full
agentCommandFromIngress()result and only extracts text viaresolveAgentResponseText(). The usage data is available inresult.meta.agentMeta.usage(with.inputand.outputfields) but was never read.Fix
Extract the
inputandoutputtoken counts from the agent run result and map them to the OpenAI-compatibleprompt_tokens/completion_tokens/total_tokensfields. Falls back to 0 when usage data is unavailable.Tests
Added two test cases:
agentMeta.usageis presentAll existing tests continue to pass.
Fixes #38735