
feat: API proxy token usage tracking and conversation cost analysis #1536

@lpcox

Description

Summary

Add token usage tracking to the api-proxy sidecar so that every LLM API call records input/output token counts. This data enables correlation with agent conversation turns and, ultimately, an agentic workflow that identifies opportunities to move expensive agentic work into deterministic pre-processing steps or skills.

Problem

There is currently no visibility into token consumption during agentic workflow runs. The api-proxy streams responses directly to the agent via proxyRes.pipe(res) (server.js:428) without inspecting response bodies. While request_bytes and response_bytes are logged, actual token usage from provider responses is never captured.

Without token data we cannot:

  • Measure the cost of individual workflow runs
  • Identify which conversation turns consume the most tokens
  • Find patterns where deterministic tooling (gh CLI, Python scripts, skills) could replace expensive agentic reasoning
  • Set token budgets or detect runaway consumption

Investigation Findings

Current api-proxy architecture

  • Response handling — proxyRes.pipe(res): direct stream, no buffering
  • Token capture — ❌ not implemented
  • Logging (logging.js) — structured JSON to stdout: request_id, provider, status, duration_ms, bytes
  • Metrics (metrics.js) — counters (requests, errors, bytes), histograms (duration), gauges (active)
  • Log volume — /var/log/api-proxy (writable, persisted to ${workDir}/api-proxy-logs)
  • Rate limiting — RPM/RPH/bytes only; no token-based limits

Token usage in provider responses

Anthropic (/v1/messages):

{ "usage": { "input_tokens": 150, "output_tokens": 45, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 120 } }

OpenAI / Copilot (/v1/chat/completions):

{ "usage": { "prompt_tokens": 150, "completion_tokens": 45, "total_tokens": 195 } }
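Because the two providers use different field names, the tracker would normalize both into one shape. A minimal sketch (the function name and output shape are illustrative, not existing proxy code; field names match the samples above):

```javascript
// Sketch: normalize provider-specific usage objects into one shape.
function normalizeUsage(provider, usage) {
  if (provider === "anthropic") {
    return {
      input_tokens: usage.input_tokens ?? 0,
      output_tokens: usage.output_tokens ?? 0,
      cache_read_tokens: usage.cache_read_input_tokens ?? 0,
      cache_write_tokens: usage.cache_creation_input_tokens ?? 0,
    };
  }
  // OpenAI / Copilot chat completions schema
  return {
    input_tokens: usage.prompt_tokens ?? 0,
    output_tokens: usage.completion_tokens ?? 0,
    cache_read_tokens: 0,
    cache_write_tokens: 0,
  };
}
```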

Streaming (SSE): Token usage appears in the final event before [DONE]:

  • Anthropic: data: {"type":"message_delta","usage":{"output_tokens":45}}
  • OpenAI: data: {"usage":{"prompt_tokens":150,"completion_tokens":45,"total_tokens":195}}
  • Some providers include usage in the message_start event (input) and message_delta event (output)
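Since usage can be split across events (Anthropic puts input tokens in message_start and output tokens in message_delta), the streaming path has to fold events into a running total. A sketch over already-parsed SSE events (function name is illustrative):

```javascript
// Sketch: fold usage out of a sequence of parsed SSE event objects.
// Later events overwrite earlier values, so the last report wins.
function accumulateStreamUsage(events) {
  const totals = { input_tokens: 0, output_tokens: 0 };
  for (const ev of events) {
    // Anthropic nests usage under message in message_start events
    const u = ev.usage || (ev.message && ev.message.usage);
    if (!u) continue;
    if (u.input_tokens != null || u.prompt_tokens != null)
      totals.input_tokens = u.input_tokens ?? u.prompt_tokens;
    if (u.output_tokens != null || u.completion_tokens != null)
      totals.output_tokens = u.output_tokens ?? u.completion_tokens;
  }
  return totals;
}
```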

Existing data that can be correlated

  • api-proxy request log — stdout (JSONL); correlation key: request_id (UUID), timestamp
  • Squid access log — ${workDir}/squid-logs/access.log; correlation key: timestamp + client IP (172.30.0.20)
  • Squid audit JSONL — ${workDir}/squid-logs/audit.jsonl; correlation key: timestamp + client IP
  • Agent execution log — /tmp/gh-aw/sandbox/agent/logs/; Copilot CLI JSONL, contains turn structure
  • MCP Gateway log — /tmp/gh-aw/mcp-logs/; MCP tool call correlation
  • Safe outputs — /tmp/gh-aw/safeoutputs.jsonl; output items with timestamps

Workflow artifact pipeline (already in place)

All lock.yml workflows upload these artifacts via actions/upload-artifact:

  • agent/ — prompt, agent logs, firewall logs, MCP logs, safe outputs, stdio log
  • activation/ — engine info, compiled prompt
  • detection/ — threat detection log
  • safe-output-items/ — safe output manifest

A new token-usage.jsonl file written by the api-proxy would automatically be included in the agent/ artifact (it lives under the api-proxy log volume which is already collected).

Proposed Implementation

Phase 1: Token usage capture in api-proxy

Approach: Use a Node.js Transform stream instead of proxyRes.pipe(res) to intercept response chunks without full buffering.

proxyRes → TokenUsageTransform → res (client)
                ↓
        token-usage.jsonl

For non-streaming responses: Buffer the response body, parse JSON, extract usage, write to log, then send body to client.

For streaming (SSE) responses: Pass each data: chunk through immediately. Accumulate usage fields from message_start and message_delta events. Write aggregated usage to log after [DONE].

Output schema (/var/log/api-proxy/token-usage.jsonl):

{
  "timestamp": "2026-04-01T00:30:00.123Z",
  "request_id": "uuid-v4",
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "path": "/v1/messages",
  "status": 200,
  "streaming": true,
  "input_tokens": 4200,
  "output_tokens": 850,
  "cache_read_tokens": 3800,
  "cache_write_tokens": 0,
  "duration_ms": 2340,
  "request_bytes": 12500,
  "response_bytes": 45000
}

Files to modify:

  • containers/api-proxy/server.js — Add Transform stream in response handler (~lines 355-430)
  • containers/api-proxy/metrics.js — Add token counters (input_tokens_total, output_tokens_total by provider)
  • containers/api-proxy/logging.js — Add token_usage event type

New files:

  • containers/api-proxy/token-tracker.js — Transform stream + provider-specific usage extraction

Phase 2: Model extraction from requests

Extract the model field from request bodies (already buffered for auth injection at lines 287-316) to correlate token usage with specific models.
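Since the body is already held in memory at that point, the extraction is a small, defensive helper. A sketch (function name is illustrative):

```javascript
// Sketch: pull the model field out of an already-buffered request body.
// Returns null rather than throwing on non-JSON or missing fields.
function extractModel(bodyBuffer) {
  try {
    const body = JSON.parse(bodyBuffer.toString("utf8"));
    return typeof body.model === "string" ? body.model : null;
  } catch {
    return null;
  }
}
```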

Phase 3: Conversation turn correlation

The Copilot CLI agent log (/tmp/gh-aw/sandbox/agent/logs/) contains structured JSONL with conversation turns. Each turn generates one or more API calls through the proxy. Correlation approach:

  1. Timestamp windowing: Group api-proxy token-usage entries by time windows matching agent turn boundaries
  2. Request counting: Each conversation turn typically produces 1 API call (unless tool use triggers follow-ups)
  3. Cumulative tracking: Running total of tokens consumed, with per-turn deltas

Output: token-usage-by-turn.jsonl (generated by a post-processing script)
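The timestamp-windowing step could be sketched as follows, assuming turn boundaries have already been parsed out of the agent log; the turn-object shape and function name are illustrative:

```javascript
// Sketch: assign each token-usage entry to the latest turn whose
// start time precedes the entry's timestamp.
function groupUsageByTurn(turns, usageEntries) {
  // turns: [{ turn: 1, start: "<ISO timestamp>" }, ...] sorted by start
  const starts = turns.map(t => Date.parse(t.start));
  const byTurn = new Map(turns.map(t => [t.turn, []]));
  for (const entry of usageEntries) {
    const ts = Date.parse(entry.timestamp);
    let assigned = null;
    for (let i = 0; i < starts.length; i++) {
      if (ts >= starts[i]) assigned = turns[i].turn;
    }
    if (assigned != null) byTurn.get(assigned).push(entry);
  }
  return byTurn;
}
```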

Phase 4: Analysis agentic workflow

Create an agentic workflow (token-usage-analyzer.md) that:

  1. Downloads token-usage artifacts from recent workflow runs
  2. Aggregates token consumption by workflow, turn, and model
  3. Identifies the most expensive conversation patterns:
    • Large context windows (high input tokens → candidate for summarization)
    • Repeated tool calls (high turn count → candidate for batching)
    • Simple data retrieval (low output/high input → candidate for gh CLI pre-fetching)
  4. Generates recommendations:
    • "Issue triage workflow spends 40% of tokens on fetching issue metadata → move to deterministic gh issue view pre-step"
    • "PR review workflow re-reads file contents 3x → add file content to prompt context"
    • "Security scan workflow uses 8 tool calls to list packages → replace with npm audit --json skill"
  5. Posts findings as a GitHub issue or PR comment
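The aggregation in step 2 reduces token-usage.jsonl entries into per-model totals; a sketch, using the entry shape from the proposed output schema (function name is illustrative):

```javascript
// Sketch: aggregate token-usage entries by model.
function aggregateByModel(entries) {
  const totals = {};
  for (const e of entries) {
    if (!totals[e.model]) {
      totals[e.model] = { input_tokens: 0, output_tokens: 0, calls: 0 };
    }
    const t = totals[e.model];
    t.input_tokens += e.input_tokens;
    t.output_tokens += e.output_tokens;
    t.calls += 1;
  }
  return totals;
}
```

The same reduce pattern extends to grouping by workflow or turn once those keys are attached to each entry.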

Phase 5: CLI integration

  • awf logs token-usage — Show per-run token consumption summary
  • awf logs token-usage --format markdown — For $GITHUB_STEP_SUMMARY
  • awf logs token-usage --format json — For programmatic consumption
  • Integrate into existing awf logs summary output

Key Design Decisions

  1. Transform stream vs full buffering: Transform stream preserves streaming latency while capturing usage. Full buffering would add TTFB latency for streaming responses.
  2. JSONL file output: Consistent with existing log formats (Squid audit, safe outputs). Automatically persisted via existing volume mount.
  3. Provider-specific parsing: Each provider has a different usage schema — centralize normalization in token-tracker.js.
  4. No breaking changes: Token tracking is additive. The proxy continues to work identically; the Transform stream is transparent to the agent.

Out of Scope (Future)

  • Token-based rate limiting (use existing RPM/bytes limits for now)
  • Real-time cost dashboards
  • Cross-run token budget enforcement
  • Billing integration

Labels: enhancement (New feature or request)