Summary
Add token usage tracking to the api-proxy sidecar so that every LLM API call records input/output token counts. This data enables correlation with agent conversation turns and, ultimately, an agentic workflow that identifies opportunities to move expensive agentic work into deterministic pre-processing steps or skills.
Problem
There is currently no visibility into token consumption during agentic workflow runs. The api-proxy streams responses directly to the agent via `proxyRes.pipe(res)` (server.js:428) without inspecting response bodies. While `request_bytes` and `response_bytes` are logged, actual token usage from provider responses is never captured.
Without token data we cannot:
- Measure the cost of individual workflow runs
- Identify which conversation turns consume the most tokens
- Find patterns where deterministic tooling (gh CLI, Python scripts, skills) could replace expensive agentic reasoning
- Set token budgets or detect runaway consumption
Investigation Findings
Current api-proxy architecture
| Aspect | Status |
| --- | --- |
| Response handling | `proxyRes.pipe(res)` — direct stream, no buffering |
| Token capture | ❌ Not implemented |
| Logging (`logging.js`) | Structured JSON to stdout: `request_id`, provider, status, `duration_ms`, bytes |
| Metrics (`metrics.js`) | Counters (requests, errors, bytes), histograms (duration), gauges (active) |
| Log volume | `/var/log/api-proxy` (writable, persisted to `${workDir}/api-proxy-logs`) |
| Rate limiting | RPM/RPH/bytes — no token-based limits |
Token usage in provider responses
Anthropic (`/v1/messages`):

```json
{ "usage": { "input_tokens": 150, "output_tokens": 45, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 120 } }
```

OpenAI / Copilot (`/v1/chat/completions`):

```json
{ "usage": { "prompt_tokens": 150, "completion_tokens": 45, "total_tokens": 195 } }
```
Streaming (SSE): Token usage appears in the final event before `[DONE]`:
- Anthropic: `data: {"type":"message_delta","usage":{"output_tokens":45}}`
- OpenAI: `data: {"usage":{"prompt_tokens":150,"completion_tokens":45,"total_tokens":195}}`
- Some providers include `usage` in the `message_start` event (input) and the `message_delta` event (output)
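Since the two providers report usage under different field names, a single normalized record is useful downstream. A minimal sketch of that normalization (the `normalizeUsage` helper and its output field names are illustrative, chosen to mirror the proposed schema, not existing code):

```javascript
// Normalize provider-specific usage objects into one shape whose field
// names mirror the proposed token-usage.jsonl schema.
function normalizeUsage(provider, usage) {
  if (!usage) return null;
  if (provider === "anthropic") {
    return {
      input_tokens: usage.input_tokens ?? 0,
      output_tokens: usage.output_tokens ?? 0,
      cache_read_tokens: usage.cache_read_input_tokens ?? 0,
      cache_write_tokens: usage.cache_creation_input_tokens ?? 0,
    };
  }
  // OpenAI / Copilot chat-completions shape (no cache fields)
  return {
    input_tokens: usage.prompt_tokens ?? 0,
    output_tokens: usage.completion_tokens ?? 0,
    cache_read_tokens: 0,
    cache_write_tokens: 0,
  };
}
```

Centralizing this in one function keeps the rest of the proxy provider-agnostic.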
Existing data that can be correlated
| Source | Location | Correlation Key |
| --- | --- | --- |
| api-proxy request log | stdout (JSONL) | `request_id` (UUID), timestamp |
| Squid access log | `${workDir}/squid-logs/access.log` | timestamp + client IP (172.30.0.20) |
| Squid audit JSONL | `${workDir}/squid-logs/audit.jsonl` | timestamp + client IP |
| Agent execution log | `/tmp/gh-aw/sandbox/agent/logs/` | Copilot CLI JSONL — contains turn structure |
| MCP Gateway log | `/tmp/gh-aw/mcp-logs/` | MCP tool call correlation |
| Safe outputs | `/tmp/gh-aw/safeoutputs.jsonl` | Output items with timestamps |
Workflow artifact pipeline (already in place)
All lock.yml workflows upload these artifacts via `actions/upload-artifact`:
- `agent/` — prompt, agent logs, firewall logs, MCP logs, safe outputs, stdio log
- `activation/` — engine info, compiled prompt
- `detection/` — threat detection log
- `safe-output-items/` — safe output manifest

A new `token-usage.jsonl` file written by the api-proxy would automatically be included in the `agent/` artifact (it lives under the api-proxy log volume, which is already collected).
Proposed Implementation
Phase 1: Token usage capture in api-proxy
Approach: Use a Node.js `Transform` stream instead of `proxyRes.pipe(res)` to intercept response chunks without full buffering.

```
proxyRes → TokenUsageTransform → res (client)
                   ↓
           token-usage.jsonl
```
For non-streaming responses: Buffer the response body, parse the JSON, extract `usage`, write it to the log, then send the body to the client.
For streaming (SSE) responses: Pass each `data:` chunk through immediately. Accumulate `usage` fields from `message_start` and `message_delta` events. Write the aggregated usage to the log after `[DONE]`.
Output schema (`/var/log/api-proxy/token-usage.jsonl`):

```json
{
  "timestamp": "2026-04-01T00:30:00.123Z",
  "request_id": "uuid-v4",
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "path": "/v1/messages",
  "status": 200,
  "streaming": true,
  "input_tokens": 4200,
  "output_tokens": 850,
  "cache_read_tokens": 3800,
  "cache_write_tokens": 0,
  "duration_ms": 2340,
  "request_bytes": 12500,
  "response_bytes": 45000
}
```
Files to modify:
- `containers/api-proxy/server.js` — Add a Transform stream in the response handler (~lines 355-430)
- `containers/api-proxy/metrics.js` — Add token counters (`input_tokens_total`, `output_tokens_total` by provider)
- `containers/api-proxy/logging.js` — Add a `token_usage` event type

New files:
- `containers/api-proxy/token-tracker.js` — Transform stream + provider-specific usage extraction
Phase 2: Model extraction from requests
Extract the `model` field from request bodies (already buffered for auth injection at lines 287-316) to correlate token usage with specific models.
Phase 3: Conversation turn correlation
The Copilot CLI agent log (`/tmp/gh-aw/sandbox/agent/logs/`) contains structured JSONL with conversation turns. Each turn generates one or more API calls through the proxy. Correlation approach:
- Timestamp windowing: Group api-proxy token-usage entries by time windows matching agent turn boundaries
- Request counting: Each conversation turn typically produces 1 API call (unless tool use triggers follow-ups)
- Cumulative tracking: Running total of tokens consumed, with per-turn deltas
Output: `token-usage-by-turn.jsonl` (generated by a post-processing script)
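The timestamp-windowing step could look roughly like this. The `groupByTurn` helper and the turn-log shape (`id`, `started_at`) are assumptions about the agent log's JSONL structure, not a confirmed format; each turn's window runs from its start time to the next turn's start:

```javascript
// Assign each token-usage entry to the agent turn whose time window
// contains it. `turns` is assumed to be sorted by start time.
function groupByTurn(turns, usageEntries) {
  const result = turns.map((t) => ({
    turn: t.id,
    input_tokens: 0,
    output_tokens: 0,
    calls: 0,
  }));
  for (const entry of usageEntries) {
    const ts = Date.parse(entry.timestamp);
    // The matching turn is the last one that started at or before the entry.
    let idx = -1;
    for (let i = 0; i < turns.length; i++) {
      if (Date.parse(turns[i].started_at) <= ts) idx = i;
    }
    if (idx === -1) continue; // entry precedes the first turn
    result[idx].input_tokens += entry.input_tokens ?? 0;
    result[idx].output_tokens += entry.output_tokens ?? 0;
    result[idx].calls += 1; // per-turn API call count
  }
  return result;
}
```

This also yields the per-turn call count, which feeds the "repeated tool calls" pattern detection in Phase 4.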
Phase 4: Analysis agentic workflow
Create an agentic workflow (`token-usage-analyzer.md`) that:
- Downloads token-usage artifacts from recent workflow runs
- Aggregates token consumption by workflow, turn, and model
- Identifies the most expensive conversation patterns:
  - Large context windows (high input tokens → candidate for summarization)
  - Repeated tool calls (high turn count → candidate for batching)
  - Simple data retrieval (low output/high input → candidate for `gh` CLI pre-fetching)
- Generates recommendations:
  - "Issue triage workflow spends 40% of tokens on fetching issue metadata → move to a deterministic `gh issue view` pre-step"
  - "PR review workflow re-reads file contents 3x → add file content to prompt context"
  - "Security scan workflow uses 8 tool calls to list packages → replace with an `npm audit --json` skill"
- Posts findings as a GitHub issue or PR comment
Phase 5: CLI integration
- `awf logs token-usage` — Show a per-run token consumption summary
- `awf logs token-usage --format markdown` — For `$GITHUB_STEP_SUMMARY`
- `awf logs token-usage --format json` — For programmatic consumption
- Integrate into the existing `awf logs summary` output
Key Design Decisions
- Transform stream vs full buffering: A Transform stream preserves streaming latency while capturing usage. Full buffering would add TTFB latency for streaming responses.
- JSONL file output: Consistent with existing log formats (Squid audit, safe outputs). Automatically persisted via the existing volume mount.
- Provider-specific parsing: Each provider has a different `usage` schema — centralize normalization in `token-tracker.js`.
- No breaking changes: Token tracking is additive. The proxy continues to work identically; the Transform stream is transparent to the agent.
Out of Scope (Future)
- Token-based rate limiting (use existing RPM/bytes limits for now)
- Real-time cost dashboards
- Cross-run token budget enforcement
- Billing integration