
feat: API proxy token usage tracking and conversation cost analysis #1536

@lpcox

Description

Summary

Add token usage tracking to the api-proxy sidecar so that every LLM API call records input/output token counts. This data enables correlation with agent conversation turns and, ultimately, an agentic workflow that identifies opportunities to move expensive agentic work into deterministic pre-processing steps or skills.

Problem

There is currently no visibility into token consumption during agentic workflow runs. The api-proxy streams responses directly to the agent via proxyRes.pipe(res) (server.js:428) without inspecting response bodies. While request_bytes and response_bytes are logged, actual token usage from provider responses is never captured.

Without token data we cannot:

  • Measure the cost of individual workflow runs
  • Identify which conversation turns consume the most tokens
  • Find patterns where deterministic tooling (gh CLI, Python scripts, skills) could replace expensive agentic reasoning
  • Set token budgets or detect runaway consumption

Investigation Findings

Current api-proxy architecture

  • Response handling — proxyRes.pipe(res): direct stream, no buffering
  • Token capture — ❌ not implemented
  • Logging (logging.js) — structured JSON to stdout: request_id, provider, status, duration_ms, bytes
  • Metrics (metrics.js) — counters (requests, errors, bytes), histograms (duration), gauges (active)
  • Log volume — /var/log/api-proxy (writable, persisted to ${workDir}/api-proxy-logs)
  • Rate limiting — RPM/RPH/bytes only; no token-based limits

Token usage in provider responses

Anthropic (/v1/messages):

{ "usage": { "input_tokens": 150, "output_tokens": 45, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 120 } }

OpenAI / Copilot (/v1/chat/completions):

{ "usage": { "prompt_tokens": 150, "completion_tokens": 45, "total_tokens": 195 } }
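Because the two providers use different field names, the tracker would normalize both into one shape. A minimal sketch (the function name and output shape are illustrative, not existing proxy code; field names match the samples above):

```javascript
// Sketch: normalize provider-specific usage objects into one shape.
function normalizeUsage(provider, usage) {
  if (provider === "anthropic") {
    return {
      input_tokens: usage.input_tokens ?? 0,
      output_tokens: usage.output_tokens ?? 0,
      cache_read_tokens: usage.cache_read_input_tokens ?? 0,
      cache_write_tokens: usage.cache_creation_input_tokens ?? 0,
    };
  }
  // OpenAI / Copilot chat completions schema
  return {
    input_tokens: usage.prompt_tokens ?? 0,
    output_tokens: usage.completion_tokens ?? 0,
    cache_read_tokens: 0,
    cache_write_tokens: 0,
  };
}
```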

Streaming (SSE): Token usage appears in the final event before [DONE]:

  • Anthropic: data: {"type":"message_delta","usage":{"output_tokens":45}}
  • OpenAI: data: {"usage":{"prompt_tokens":150,"completion_tokens":45,"total_tokens":195}}
  • Some providers include usage in the message_start event (input) and message_delta event (output)
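Since usage can be split across events (Anthropic puts input tokens in message_start and output tokens in message_delta), the streaming path has to fold events into a running total. A sketch over already-parsed SSE events (function name is illustrative):

```javascript
// Sketch: fold usage out of a sequence of parsed SSE event objects.
// Later events overwrite earlier values, so the last report wins.
function accumulateStreamUsage(events) {
  const totals = { input_tokens: 0, output_tokens: 0 };
  for (const ev of events) {
    // Anthropic nests usage under message in message_start events
    const u = ev.usage || (ev.message && ev.message.usage);
    if (!u) continue;
    if (u.input_tokens != null || u.prompt_tokens != null)
      totals.input_tokens = u.input_tokens ?? u.prompt_tokens;
    if (u.output_tokens != null || u.completion_tokens != null)
      totals.output_tokens = u.output_tokens ?? u.completion_tokens;
  }
  return totals;
}
```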

Existing data that can be correlated

  • api-proxy request log — stdout (JSONL); correlation key: request_id (UUID), timestamp
  • Squid access log — ${workDir}/squid-logs/access.log; correlation key: timestamp + client IP (172.30.0.20)
  • Squid audit JSONL — ${workDir}/squid-logs/audit.jsonl; correlation key: timestamp + client IP
  • Agent execution log — /tmp/gh-aw/sandbox/agent/logs/; Copilot CLI JSONL, contains turn structure
  • MCP Gateway log — /tmp/gh-aw/mcp-logs/; MCP tool call correlation
  • Safe outputs — /tmp/gh-aw/safeoutputs.jsonl; output items with timestamps

Workflow artifact pipeline (already in place)

All lock.yml workflows upload these artifacts via actions/upload-artifact:

  • agent/ — prompt, agent logs, firewall logs, MCP logs, safe outputs, stdio log
  • activation/ — engine info, compiled prompt
  • detection/ — threat detection log
  • safe-output-items/ — safe output manifest

A new token-usage.jsonl file written by the api-proxy would automatically be included in the agent/ artifact (it lives under the api-proxy log volume which is already collected).

Proposed Implementation

Phase 1: Token usage capture in api-proxy

Approach: Use a Node.js Transform stream instead of proxyRes.pipe(res) to intercept response chunks without full buffering.

proxyRes → TokenUsageTransform → res (client)
                ↓
        token-usage.jsonl

For non-streaming responses: Buffer the response body, parse JSON, extract usage, write to log, then send body to client.

For streaming (SSE) responses: Pass each data: chunk through immediately. Accumulate usage fields from message_start and message_delta events. Write aggregated usage to log after [DONE].

Output schema (/var/log/api-proxy/token-usage.jsonl):

{
  "timestamp": "2026-04-01T00:30:00.123Z",
  "request_id": "uuid-v4",
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "path": "/v1/messages",
  "status": 200,
  "streaming": true,
  "input_tokens": 4200,
  "output_tokens": 850,
  "cache_read_tokens": 3800,
  "cache_write_tokens": 0,
  "duration_ms": 2340,
  "request_bytes": 12500,
  "response_bytes": 45000
}

Files to modify:

  • containers/api-proxy/server.js — Add Transform stream in response handler (~lines 355-430)
  • containers/api-proxy/metrics.js — Add token counters (input_tokens_total, output_tokens_total by provider)
  • containers/api-proxy/logging.js — Add token_usage event type

New files:

  • containers/api-proxy/token-tracker.js — Transform stream + provider-specific usage extraction

Phase 2: Model extraction from requests

Extract the model field from request bodies (already buffered for auth injection at lines 287-316) to correlate token usage with specific models.
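Since the body is already held in memory at that point, the extraction is a small, defensive helper. A sketch (function name is illustrative):

```javascript
// Sketch: pull the model field out of an already-buffered request body.
// Returns null rather than throwing on non-JSON or missing fields.
function extractModel(bodyBuffer) {
  try {
    const body = JSON.parse(bodyBuffer.toString("utf8"));
    return typeof body.model === "string" ? body.model : null;
  } catch {
    return null;
  }
}
```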

Phase 3: Conversation turn correlation

The Copilot CLI agent log (/tmp/gh-aw/sandbox/agent/logs/) contains structured JSONL with conversation turns. Each turn generates one or more API calls through the proxy. Correlation approach:

  1. Timestamp windowing: Group api-proxy token-usage entries by time windows matching agent turn boundaries
  2. Request counting: Each conversation turn typically produces 1 API call (unless tool use triggers follow-ups)
  3. Cumulative tracking: Running total of tokens consumed, with per-turn deltas

Output: token-usage-by-turn.jsonl (generated by a post-processing script)
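The timestamp-windowing step could be sketched as follows, assuming turn boundaries have already been parsed out of the agent log; the turn-object shape and function name are illustrative:

```javascript
// Sketch: assign each token-usage entry to the latest turn whose
// start time precedes the entry's timestamp.
function groupUsageByTurn(turns, usageEntries) {
  // turns: [{ turn: 1, start: "<ISO timestamp>" }, ...] sorted by start
  const starts = turns.map(t => Date.parse(t.start));
  const byTurn = new Map(turns.map(t => [t.turn, []]));
  for (const entry of usageEntries) {
    const ts = Date.parse(entry.timestamp);
    let assigned = null;
    for (let i = 0; i < starts.length; i++) {
      if (ts >= starts[i]) assigned = turns[i].turn;
    }
    if (assigned != null) byTurn.get(assigned).push(entry);
  }
  return byTurn;
}
```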

Phase 4: Analysis agentic workflow

Create an agentic workflow (token-usage-analyzer.md) that:

  1. Downloads token-usage artifacts from recent workflow runs
  2. Aggregates token consumption by workflow, turn, and model
  3. Identifies the most expensive conversation patterns:
    • Large context windows (high input tokens → candidate for summarization)
    • Repeated tool calls (high turn count → candidate for batching)
    • Simple data retrieval (low output/high input → candidate for gh CLI pre-fetching)
  4. Generates recommendations:
    • "Issue triage workflow spends 40% of tokens on fetching issue metadata → move to deterministic gh issue view pre-step"
    • "PR review workflow re-reads file contents 3x → add file content to prompt context"
    • "Security scan workflow uses 8 tool calls to list packages → replace with npm audit --json skill"
  5. Posts findings as a GitHub issue or PR comment
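The aggregation in step 2 reduces token-usage.jsonl entries into per-model totals; a sketch, using the entry shape from the proposed output schema (function name is illustrative):

```javascript
// Sketch: aggregate token-usage entries by model.
function aggregateByModel(entries) {
  const totals = {};
  for (const e of entries) {
    if (!totals[e.model]) {
      totals[e.model] = { input_tokens: 0, output_tokens: 0, calls: 0 };
    }
    const t = totals[e.model];
    t.input_tokens += e.input_tokens;
    t.output_tokens += e.output_tokens;
    t.calls += 1;
  }
  return totals;
}
```

The same reduce pattern extends to grouping by workflow or turn once those keys are attached to each entry.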

Phase 5: CLI integration

  • awf logs token-usage — Show per-run token consumption summary
  • awf logs token-usage --format markdown — For $GITHUB_STEP_SUMMARY
  • awf logs token-usage --format json — For programmatic consumption
  • Integrate into existing awf logs summary output

Key Design Decisions

  1. Transform stream vs full buffering: Transform stream preserves streaming latency while capturing usage. Full buffering would add TTFB latency for streaming responses.
  2. JSONL file output: Consistent with existing log formats (Squid audit, safe outputs). Automatically persisted via existing volume mount.
  3. Provider-specific parsing: Each provider has a different usage schema — centralize normalization in token-tracker.js.
  4. No breaking changes: Token tracking is additive. The proxy continues to work identically; the Transform stream is transparent to the agent.

Out of Scope (Future)

  • Token-based rate limiting (use existing RPM/bytes limits for now)
  • Real-time cost dashboards
  • Cross-run token budget enforcement
  • Billing integration

Labels: enhancement (New feature or request)