feat: API proxy token usage tracking and conversation cost analysis #1536
Description
Summary
Add token usage tracking to the api-proxy sidecar so that every LLM API call records input/output token counts. This data enables correlation with agent conversation turns and, ultimately, an agentic workflow that identifies opportunities to move expensive agentic work into deterministic pre-processing steps or skills.
Problem
There is currently no visibility into token consumption during agentic workflow runs. The api-proxy streams responses directly to the agent via proxyRes.pipe(res) (server.js:428) without inspecting response bodies. While request_bytes and response_bytes are logged, actual token usage from provider responses is never captured.
Without token data we cannot:
- Measure the cost of individual workflow runs
- Identify which conversation turns consume the most tokens
- Find patterns where deterministic tooling (gh CLI, Python scripts, skills) could replace expensive agentic reasoning
- Set token budgets or detect runaway consumption
Investigation Findings
Current api-proxy architecture
| Aspect | Status |
|---|---|
| Response handling | proxyRes.pipe(res) — direct stream, no buffering |
| Token capture | ❌ Not implemented |
| Logging (logging.js) | Structured JSON to stdout: request_id, provider, status, duration_ms, bytes |
| Metrics (metrics.js) | Counters (requests, errors, bytes), histograms (duration), gauges (active) |
| Log volume | /var/log/api-proxy (writable, persisted to ${workDir}/api-proxy-logs) |
| Rate limiting | RPM/RPH/bytes — no token-based limits |
Token usage in provider responses
Anthropic (`/v1/messages`):

```json
{ "usage": { "input_tokens": 150, "output_tokens": 45, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 120 } }
```

OpenAI / Copilot (`/v1/chat/completions`):

```json
{ "usage": { "prompt_tokens": 150, "completion_tokens": 45, "total_tokens": 195 } }
```

Streaming (SSE): Token usage appears in the final event before `[DONE]`:
- Anthropic: `data: {"type":"message_delta","usage":{"output_tokens":45}}`
- OpenAI: `data: {"usage":{"prompt_tokens":150,"completion_tokens":45,"total_tokens":195}}`
- Some providers include `usage` in the `message_start` event (input) and the `message_delta` event (output)
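Since the two providers name their usage fields differently, a single normalization step keeps downstream logging simple. A minimal sketch, assuming the Anthropic and OpenAI field names shown above; `normalizeUsage` and the normalized key names are illustrative, not existing code:

```javascript
// Sketch: map provider-specific usage objects onto one schema.
// Input field names follow the Anthropic / OpenAI responses quoted
// above; the output keys mirror the proposed token-usage.jsonl schema.
function normalizeUsage(provider, usage) {
  if (!usage) return null;
  if (provider === "anthropic") {
    return {
      input_tokens: usage.input_tokens ?? 0,
      output_tokens: usage.output_tokens ?? 0,
      cache_read_tokens: usage.cache_read_input_tokens ?? 0,
      cache_write_tokens: usage.cache_creation_input_tokens ?? 0,
    };
  }
  // OpenAI / Copilot shape; these responses carry no cache counters.
  return {
    input_tokens: usage.prompt_tokens ?? 0,
    output_tokens: usage.completion_tokens ?? 0,
    cache_read_tokens: 0,
    cache_write_tokens: 0,
  };
}
```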
Existing data that can be correlated
| Source | Location | Correlation Key |
|---|---|---|
| API-proxy request log | stdout (JSONL) | `request_id` (UUID), timestamp |
| Squid access log | `${workDir}/squid-logs/access.log` | timestamp + client IP (172.30.0.20) |
| Squid audit JSONL | `${workDir}/squid-logs/audit.jsonl` | timestamp + client IP |
| Agent execution log | `/tmp/gh-aw/sandbox/agent/logs/` | Copilot CLI JSONL — contains turn structure |
| MCP Gateway log | `/tmp/gh-aw/mcp-logs/` | MCP tool call correlation |
| Safe outputs | `/tmp/gh-aw/safeoutputs.jsonl` | Output items with timestamps |
Workflow artifact pipeline (already in place)
All lock.yml workflows upload these artifacts via actions/upload-artifact:
- `agent/` — prompt, agent logs, firewall logs, MCP logs, safe outputs, stdio log
- `activation/` — engine info, compiled prompt
- `detection/` — threat detection log
- `safe-output-items/` — safe output manifest
A new token-usage.jsonl file written by the api-proxy would automatically be included in the agent/ artifact (it lives under the api-proxy log volume which is already collected).
Proposed Implementation
Phase 1: Token usage capture in api-proxy
Approach: Use a Node.js Transform stream instead of proxyRes.pipe(res) to intercept response chunks without full buffering.
proxyRes → TokenUsageTransform → res (client)
↓
token-usage.jsonl
For non-streaming responses: Buffer the response body, parse JSON, extract usage, write to log, then send body to client.
For streaming (SSE) responses: Pass each data: chunk through immediately. Accumulate usage fields from message_start and message_delta events. Write aggregated usage to log after [DONE].
Output schema (`/var/log/api-proxy/token-usage.jsonl`):

```json
{
  "timestamp": "2026-04-01T00:30:00.123Z",
  "request_id": "uuid-v4",
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "path": "/v1/messages",
  "status": 200,
  "streaming": true,
  "input_tokens": 4200,
  "output_tokens": 850,
  "cache_read_tokens": 3800,
  "cache_write_tokens": 0,
  "duration_ms": 2340,
  "request_bytes": 12500,
  "response_bytes": 45000
}
```

Files to modify:
- `containers/api-proxy/server.js` — Add Transform stream in response handler (~lines 355-430)
- `containers/api-proxy/metrics.js` — Add token counters (`input_tokens_total`, `output_tokens_total` by provider)
- `containers/api-proxy/logging.js` — Add `token_usage` event type
New files:
- `containers/api-proxy/token-tracker.js` — Transform stream + provider-specific usage extraction
Phase 2: Model extraction from requests
Extract the model field from request bodies (already buffered for auth injection at lines 287-316) to correlate token usage with specific models.
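Because the request body is already held in memory at that point, model extraction reduces to parsing the buffer. A minimal sketch, assuming the buffered body is the raw JSON request; `extractModel` is an illustrative name:

```javascript
// Sketch: read the "model" field from an already-buffered request body
// (the proxy buffers bodies for auth injection, per server.js).
function extractModel(bodyBuffer) {
  try {
    const parsed = JSON.parse(bodyBuffer.toString("utf8"));
    return typeof parsed.model === "string" ? parsed.model : null;
  } catch {
    return null; // non-JSON or truncated body: record no model
  }
}
```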
Phase 3: Conversation turn correlation
The Copilot CLI agent log (/tmp/gh-aw/sandbox/agent/logs/) contains structured JSONL with conversation turns. Each turn generates one or more API calls through the proxy. Correlation approach:
- Timestamp windowing: Group api-proxy token-usage entries by time windows matching agent turn boundaries
- Request counting: Each conversation turn typically produces 1 API call (unless tool use triggers follow-ups)
- Cumulative tracking: Running total of tokens consumed, with per-turn deltas
Output: token-usage-by-turn.jsonl (generated by a post-processing script)
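The timestamp-windowing step above can be sketched as follows. This is an illustrative post-processing sketch, not existing code: the turn objects (`start`/`end` ISO timestamps) are an assumed shape, since real turn boundaries would be parsed out of the Copilot CLI JSONL.

```javascript
// Sketch: assign each token-usage entry to the agent turn whose time
// window contains it, producing per-turn token totals and call counts.
function groupUsageByTurn(turns, usageEntries) {
  return turns.map((turn, i) => {
    const inWindow = usageEntries.filter(e => {
      const t = Date.parse(e.timestamp);
      return t >= Date.parse(turn.start) && t < Date.parse(turn.end);
    });
    return {
      turn: i + 1,
      api_calls: inWindow.length, // >1 when tool use triggers follow-ups
      input_tokens: inWindow.reduce((s, e) => s + e.input_tokens, 0),
      output_tokens: inWindow.reduce((s, e) => s + e.output_tokens, 0),
    };
  });
}
```

Each resulting object would become one line of `token-usage-by-turn.jsonl`; a running cumulative total can then be computed with a simple scan over the turns.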
Phase 4: Analysis agentic workflow
Create an agentic workflow (token-usage-analyzer.md) that:
- Downloads token-usage artifacts from recent workflow runs
- Aggregates token consumption by workflow, turn, and model
- Identifies the most expensive conversation patterns:
  - Large context windows (high input tokens → candidate for summarization)
  - Repeated tool calls (high turn count → candidate for batching)
  - Simple data retrieval (low output/high input → candidate for `gh` CLI pre-fetching)
- Generates recommendations:
  - "Issue triage workflow spends 40% of tokens on fetching issue metadata → move to a deterministic `gh issue view` pre-step"
  - "PR review workflow re-reads file contents 3x → add file content to prompt context"
  - "Security scan workflow uses 8 tool calls to list packages → replace with an `npm audit --json` skill"
- Posts findings as a GitHub issue or PR comment
Phase 5: CLI integration
- `awf logs token-usage` — Show per-run token consumption summary
- `awf logs token-usage --format markdown` — For `$GITHUB_STEP_SUMMARY`
- `awf logs token-usage --format json` — For programmatic consumption
- Integrate into existing `awf logs summary` output
Key Design Decisions
- Transform stream vs full buffering: Transform stream preserves streaming latency while capturing usage. Full buffering would add TTFB latency for streaming responses.
- JSONL file output: Consistent with existing log formats (Squid audit, safe outputs). Automatically persisted via existing volume mount.
- Provider-specific parsing: Each provider has a different `usage` schema — centralize normalization in `token-tracker.js`.
- No breaking changes: Token tracking is additive. The proxy continues to work identically; the Transform stream is transparent to the agent.
Out of Scope (Future)
- Token-based rate limiting (use existing RPM/bytes limits for now)
- Real-time cost dashboards
- Cross-run token budget enforcement
- Billing integration