## Problem
Many real-world OpenClaw tasks involve external APIs that we can't reliably test against:
- Rate limits — Real APIs throttle, making benchmarks inconsistent
- Cost — Some APIs charge per call
- State mutation — Real CRMs/calendars would get polluted
- Reproducibility — External data changes over time
- Grading — Hard to verify correct API usage without seeing what was called
We need mock services that behave like real APIs but are controlled, observable, and free.
## Proposed Mock Services

### Critical (High-value OpenClaw use cases)

| Service | Why | Example Tasks |
|---|---|---|
| Calendar API | Scheduling is core OpenClaw UX | "Schedule a meeting", "Find free slots", "Check conflicts" |
| Email/SMTP | Email management is huge | "Send email", "Search inbox", "Triage messages" |
| Weather API | Common info lookup | "What's the weather", "Should I bring umbrella" |
| Web Search | Research tasks need it | "Find info about X", "Compare products" |
| Task/Todo API | Productivity workflows | "Add task", "Check overdue", "Complete item" |
| File Storage | Document workflows | "Upload file", "List files", "Share document" |
### High Value

| Service | Why | Example Tasks |
|---|---|---|
| CRM API | Business automation | "Add contact", "Log interaction", "Find customer" |
| Helpdesk API | Support workflows | "Create ticket", "Escalate issue", "Search KB" |
| RSS Feed | Content monitoring | "Summarize feed", "Find new posts" |
| Issue Tracker (JIRA/Linear style) | Dev workflows | "Create issue", "Update status", "Link PRs" |
| Database/KV Store | Data persistence | "Store value", "Query records" |
### Nice to Have

| Service | Why | Example Tasks |
|---|---|---|
| Payment API | E-commerce | "Check balance", "Process refund" |
| Shipping/Tracking | Logistics | "Track package", "Get delivery estimate" |
| Social Media | Content posting | "Post update", "Check mentions" |
| SMS/Messaging | Notifications | "Send SMS", "Check delivery" |
| Analytics API | Reporting | "Get metrics", "Generate report" |
## Implementation Options

### Option A: Sidecar Services (VM-local)
Each benchmark VM spins up mock services as Docker containers alongside the agent.
```
┌─────────────────────────────────────────┐
│               Benchmark VM              │
│  ┌─────────────┐     ┌─────────────┐    │
│  │  OpenClaw   │     │  Mock Svc   │    │
│  │   Agent     │─────│  Container  │    │
│  └─────────────┘     │ - Calendar  │    │
│                      │ - Email     │    │
│                      │ - Weather   │    │
│                      └──────┬──────┘    │
│                             │           │
│                        logs/state       │
│                             ↓           │
│              /results/api_calls.jsonl   │
└─────────────────────────────────────────┘
```
Pros:
- ✅ Complete isolation — each run gets fresh state
- ✅ All logs local — easy to collect for grading
- ✅ No network latency — fast API responses
- ✅ No central infrastructure to maintain
- ✅ Works offline / air-gapped
Cons:
- ❌ Adds to VM startup time (pull/start containers)
- ❌ Uses VM resources (RAM/CPU for mock services)
- ❌ Need to bake mock services into snapshot or pull on boot
- ❌ Harder to update mock behavior across all VMs
Grading approach: Mock services write all requests to a JSONL file. Grading script reads it directly from the workspace.
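A minimal sketch of what that grading step could look like, assuming each line of the log follows the example entry format in the draft spec (`service`, `method`, `path`, `body`, `status`); the helper names here are illustrative, not an existing grading API:

```python
import io
import json

def load_api_calls(stream):
    """Parse the mock services' JSONL request log into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]

def called(calls, service, method, path):
    """True if the agent made at least one matching API call."""
    return any(c["service"] == service and c["method"] == method
               and c["path"] == path for c in calls)

# In the grader this would be open("/results/api_calls.jsonl");
# here an in-memory sample stands in for the log file.
sample = io.StringIO(
    '{"ts":"2026-04-07T21:00:00Z","service":"calendar","method":"POST",'
    '"path":"/events","body":{"title":"Team Sync"},"status":201}\n'
)
calls = load_api_calls(sample)
assert called(calls, "calendar", "POST", "/events")
assert not called(calls, "email", "GET", "/messages")
```

Because the log is an append-only local file, checks like this need no network access and can run after the VM self-destructs, as long as the workspace is preserved.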
### Option B: Centralized Mock Hub (services.pinchbench.com)
All mock services run on a central server. VMs call out to it with a unique run ID.
```
┌────────────────┐          ┌──────────────────────────┐
│  Benchmark VM  │          │ services.pinchbench.com  │
│  ┌──────────┐  │  HTTPS   │  ┌────────────────────┐  │
│  │ OpenClaw │──┼─────────►│  │  Mock Service Hub  │  │
│  │  Agent   │  │          │  │  - Calendar        │  │
│  └──────────┘  │          │  │  - Email           │  │
│                │          │  │  - Weather         │  │
│                │          │  └────────────────────┘  │
│                │          │            │             │
│                │          │     logs per run_id      │
│                │          │            ↓             │
│                │   GET    │    /api/logs/{run_id}    │
│                │◄─────────┼──────────────────────────│
└────────────────┘          └──────────────────────────┘
```
Pros:
- ✅ VMs stay lightweight — no extra containers
- ✅ Easy to update mock behavior globally
- ✅ Central logging dashboard possible
- ✅ Can add rate limiting / chaos testing
Cons:
- ❌ Network dependency — if hub is down, benchmarks fail
- ❌ Latency — network round trips for every API call
- ❌ Multi-tenancy complexity — need run isolation
- ❌ State management — how to reset between runs?
- ❌ Cost — another server to run
Grading approach: After the run completes, the grading script fetches `/api/logs/{run_id}` from the hub.
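A sketch of the hub-side grading flow, assuming the `/api/logs/{run_id}` route from the diagram and that the hub returns the same JSONL format the local mocks would write; both of those are design assumptions, not a built API:

```python
import json
from urllib.parse import urljoin

HUB = "https://services.pinchbench.com"  # hub base URL from the diagram

def logs_url(run_id):
    # /api/logs/{run_id} as sketched above; the exact route is still TBD
    return urljoin(HUB, f"/api/logs/{run_id}")

def parse_log_payload(payload):
    """Assume the hub returns the same JSONL the local mocks would write."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

# The real grader would fetch the payload over HTTPS, e.g. with
# urllib.request.urlopen(logs_url(run_id)); a canned response stands in here.
payload = '{"service":"email","method":"GET","path":"/messages","status":200}'
calls = parse_log_payload(payload)
assert calls[0]["service"] == "email"
```

Note this is where the multi-tenancy con bites: every request the mocks receive has to carry the `run_id` (header or path) so the hub can partition logs per run.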
### Option C: Hybrid (Local-first with Central Fallback)
Mock services are bundled in the VM snapshot (local). Central hub exists for:
- Configuration updates (mock data, behavior)
- Log aggregation (optional)
- Shared fixtures (large datasets)
```
┌─────────────────────────────────────────┐
│               Benchmark VM              │
│  ┌─────────────┐     ┌─────────────┐    │
│  │  OpenClaw   │─────│ Local Mocks │    │
│  │   Agent     │     │ (embedded)  │    │
│  └─────────────┘     └──────┬──────┘    │
│                             │           │
│                        logs/state       │
│                             ↓           │
│              /results/api_calls.jsonl   │
└─────────────────────────────────────────┘
                    │
              optional sync
                    ↓
         ┌────────────────────────┐
         │  hub.pinchbench.com    │
         │  - Config updates      │
         │  - Log aggregation     │
         │  - Fixture data        │
         └────────────────────────┘
```
Pros:
- ✅ Best of both — local speed, central coordination
- ✅ Graceful degradation — works if hub is down
- ✅ Easy grading — logs are local
- ✅ Central dashboard for aggregate analytics
Cons:
- ❌ Most complex to implement
- ❌ Two codebases to maintain (local + hub)
- ❌ Sync logic can get tricky
Grading approach: Same as Option A — local JSONL files.
## Recommendation
Start with Option A (Sidecar/VM-local) because:
- Grading is simplest — logs are right there
- No new infrastructure needed
- Matches our current "self-destruct VM" model
- Can migrate to hybrid later if needed
Implementation plan:
- Create a single `pinchbench-mocks` Docker image with all services
- Each service listens on a different port (e.g., calendar:8001, email:8002)
- Services log all requests to `/shared/api_calls.jsonl`
- Mount a shared volume between the agent and mocks containers
- Grading scripts parse the JSONL to verify correct API usage
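The plan above could translate into a compose file along these lines; everything except the `pinchbench-mocks` image name and the two example ports is a placeholder (the agent image name, volume name, and mount paths are illustrative):

```yaml
# docker-compose.yml sketch; names are illustrative, not final
services:
  agent:
    image: openclaw-agent          # hypothetical agent image name
    volumes:
      - shared:/shared             # grader reads api_calls.jsonl from here
    depends_on:
      - mocks
  mocks:
    image: pinchbench-mocks        # single image, one port per service
    ports:
      - "8001:8001"                # calendar
      - "8002:8002"                # email
    volumes:
      - shared:/shared             # services append to /shared/api_calls.jsonl
volumes:
  shared:
```

Baking this into the VM snapshot (rather than pulling on boot) would address the startup-time con at the cost of larger snapshots.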
## Mock Service Spec (Draft)
Each mock should:
- Accept standard REST/JSON (or match real API format)
- Pre-seed with fixture data (e.g., 10 calendar events, 50 emails)
- Log every request with timestamp, method, path, body, response
- Support a reset endpoint (`POST /reset`) for clean state
- Return realistic response times (configurable latency)
Example log entry:
```json
{"ts":"2026-04-07T21:00:00Z","service":"calendar","method":"POST","path":"/events","body":{"title":"Team Sync","start":"..."},"status":201,"response":{"id":"evt_123"}}
```
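To make the spec concrete, here is a toy calendar mock using only the Python standard library. It covers request logging and `POST /reset` from the list above; fixture pre-seeding and configurable latency are omitted, and the route names, log path, and port handling are illustrative rather than a final design:

```python
import json
import threading
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

LOG_PATH = "/tmp/api_calls.jsonl"  # would be /shared/api_calls.jsonl in the VM

class MockCalendar(BaseHTTPRequestHandler):
    """Toy calendar mock: create events, reset state, log every request."""
    events = []  # a real mock would pre-seed fixture data here

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/reset":
            MockCalendar.events.clear()
            self._respond(200, {"ok": True}, body)
        elif self.path == "/events":
            event = {"id": f"evt_{len(MockCalendar.events) + 1}", **body}
            MockCalendar.events.append(event)
            self._respond(201, event, body)
        else:
            self._respond(404, {"error": "not found"}, body)

    def _respond(self, status, payload, body):
        # Log timestamp, method, path, body, and response for the grader
        entry = {"ts": datetime.now(timezone.utc).isoformat(),
                 "service": "calendar", "method": self.command,
                 "path": self.path, "body": body,
                 "status": status, "response": payload}
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(entry) + "\n")
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep stderr quiet; the JSONL is the log
        pass

# Port 0 lets the OS pick a free port; the plan pins calendar to 8001
server = ThreadingHTTPServer(("127.0.0.1", 0), MockCalendar)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(f"http://127.0.0.1:{server.server_address[1]}/events",
              data=json.dumps({"title": "Team Sync"}).encode(),
              headers={"Content-Type": "application/json"})
created = json.loads(urlopen(req).read())
assert created == {"id": "evt_1", "title": "Team Sync"}
```

Keeping mocks this thin is what makes the sidecar option cheap: each service is a single handler class, and the shared JSONL file is the only contract the grader depends on.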
## Open Questions
- Should mocks exactly mirror real APIs (Google Calendar, Todoist) or use simplified schemas?
- How much fixture data? (10 items vs 1000 items affects task difficulty)
- Should we add chaos/failure modes? (rate limits, 500 errors, timeouts)
- Language for mocks? (Node.js for speed? Python for familiarity? Go for single binary?)
cc @olearycrew