## Problem
Many real-world OpenClaw tasks involve external APIs that we can't reliably test against:
- Rate limits — Real APIs throttle, making benchmarks inconsistent
- Cost — Some APIs charge per call
- State mutation — Real CRMs/calendars would get polluted
- Reproducibility — External data changes over time
- Grading — Hard to verify correct API usage without seeing what was called
We need mock services that behave like real APIs but are controlled, observable, and free.
## Proposed Mock Services

### Critical (High-value OpenClaw use cases)

| Service | Why | Example Tasks |
|---|---|---|
| Calendar API | Scheduling is core OpenClaw UX | "Schedule a meeting", "Find free slots", "Check conflicts" |
| Email/SMTP | Email management is huge | "Send email", "Search inbox", "Triage messages" |
| Weather API | Common info lookup | "What's the weather", "Should I bring umbrella" |
| Web Search | Research tasks need it | "Find info about X", "Compare products" |
| Task/Todo API | Productivity workflows | "Add task", "Check overdue", "Complete item" |
| File Storage | Document workflows | "Upload file", "List files", "Share document" |
### High Value

| Service | Why | Example Tasks |
|---|---|---|
| CRM API | Business automation | "Add contact", "Log interaction", "Find customer" |
| Helpdesk API | Support workflows | "Create ticket", "Escalate issue", "Search KB" |
| RSS Feed | Content monitoring | "Summarize feed", "Find new posts" |
| Issue Tracker (JIRA/Linear style) | Dev workflows | "Create issue", "Update status", "Link PRs" |
| Database/KV Store | Data persistence | "Store value", "Query records" |
### Nice to Have

| Service | Why | Example Tasks |
|---|---|---|
| Payment API | E-commerce | "Check balance", "Process refund" |
| Shipping/Tracking | Logistics | "Track package", "Get delivery estimate" |
| Social Media | Content posting | "Post update", "Check mentions" |
| SMS/Messaging | Notifications | "Send SMS", "Check delivery" |
| Analytics API | Reporting | "Get metrics", "Generate report" |
## Implementation Options

### Option A: Sidecar Services (VM-local)
Each benchmark VM spins up mock services as Docker containers alongside the agent.
```
┌─────────────────────────────────────────┐
│               Benchmark VM              │
│  ┌─────────────┐     ┌─────────────┐    │
│  │  OpenClaw   │     │  Mock Svc   │    │
│  │   Agent     │─────│  Container  │    │
│  └─────────────┘     │ - Calendar  │    │
│                      │ - Email     │    │
│                      │ - Weather   │    │
│                      └──────┬──────┘    │
│                             │           │
│                        logs/state       │
│                             ↓           │
│              /results/api_calls.jsonl   │
└─────────────────────────────────────────┘
```
Pros:
- ✅ Complete isolation — each run gets fresh state
- ✅ All logs local — easy to collect for grading
- ✅ No network latency — fast API responses
- ✅ No central infrastructure to maintain
- ✅ Works offline / air-gapped
Cons:
- ❌ Adds to VM startup time (pull/start containers)
- ❌ Uses VM resources (RAM/CPU for mock services)
- ❌ Need to bake mock services into snapshot or pull on boot
- ❌ Harder to update mock behavior across all VMs
Grading approach: Mock services write all requests to a JSONL file. Grading script reads it directly from the workspace.
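A minimal sketch of what that grading step could look like, assuming each line of the log follows the example entry format in the draft spec (`service`, `method`, `path`, `body`, `status`); the helper names here are illustrative, not an existing grading API:

```python
import io
import json

def load_api_calls(stream):
    """Parse the mock services' JSONL request log into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]

def called(calls, service, method, path):
    """True if the agent made at least one matching API call."""
    return any(c["service"] == service and c["method"] == method
               and c["path"] == path for c in calls)

# In the grader this would be open("/results/api_calls.jsonl");
# here an in-memory sample stands in for the log file.
sample = io.StringIO(
    '{"ts":"2026-04-07T21:00:00Z","service":"calendar","method":"POST",'
    '"path":"/events","body":{"title":"Team Sync"},"status":201}\n'
)
calls = load_api_calls(sample)
assert called(calls, "calendar", "POST", "/events")
assert not called(calls, "email", "GET", "/messages")
```

Because the log is an append-only local file, checks like this need no network access and can run after the VM self-destructs, as long as the workspace is preserved.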
### Option B: Centralized Mock Hub (services.pinchbench.com)
All mock services run on a central server. VMs call out to it with a unique run ID.
```
┌────────────────┐          ┌──────────────────────────┐
│  Benchmark VM  │          │ services.pinchbench.com  │
│  ┌──────────┐  │  HTTPS   │  ┌────────────────────┐  │
│  │ OpenClaw │──┼─────────►│  │  Mock Service Hub  │  │
│  │  Agent   │  │          │  │  - Calendar        │  │
│  └──────────┘  │          │  │  - Email           │  │
│                │          │  │  - Weather         │  │
│                │          │  └────────────────────┘  │
│                │          │            │             │
│                │          │     logs per run_id      │
│                │          │            ↓             │
│                │   GET    │    /api/logs/{run_id}    │
│                │◄─────────┼──────────────────────────│
└────────────────┘          └──────────────────────────┘
```
Pros:
- ✅ VMs stay lightweight — no extra containers
- ✅ Easy to update mock behavior globally
- ✅ Central logging dashboard possible
- ✅ Can add rate limiting / chaos testing
Cons:
- ❌ Network dependency — if hub is down, benchmarks fail
- ❌ Latency — network round trips for every API call
- ❌ Multi-tenancy complexity — need run isolation
- ❌ State management — how to reset between runs?
- ❌ Cost — another server to run
Grading approach: After the run completes, the grading script fetches `/api/logs/{run_id}` from the hub.
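A sketch of the hub-side grading flow, assuming the `/api/logs/{run_id}` route from the diagram and that the hub returns the same JSONL format the local mocks would write; both of those are design assumptions, not a built API:

```python
import json
from urllib.parse import urljoin

HUB = "https://services.pinchbench.com"  # hub base URL from the diagram

def logs_url(run_id):
    # /api/logs/{run_id} as sketched above; the exact route is still TBD
    return urljoin(HUB, f"/api/logs/{run_id}")

def parse_log_payload(payload):
    """Assume the hub returns the same JSONL the local mocks would write."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

# The real grader would fetch the payload over HTTPS, e.g. with
# urllib.request.urlopen(logs_url(run_id)); a canned response stands in here.
payload = '{"service":"email","method":"GET","path":"/messages","status":200}'
calls = parse_log_payload(payload)
assert calls[0]["service"] == "email"
```

Note this is where the multi-tenancy con bites: every request the mocks receive has to carry the `run_id` (header or path) so the hub can partition logs per run.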
### Option C: Hybrid (Local-first with Central Fallback)
Mock services are bundled in the VM snapshot (local). Central hub exists for:
- Configuration updates (mock data, behavior)
- Log aggregation (optional)
- Shared fixtures (large datasets)
```
┌─────────────────────────────────────────┐
│               Benchmark VM              │
│  ┌─────────────┐     ┌─────────────┐    │
│  │  OpenClaw   │─────│ Local Mocks │    │
│  │   Agent     │     │ (embedded)  │    │
│  └─────────────┘     └──────┬──────┘    │
│                             │           │
│                        logs/state       │
│                             ↓           │
│              /results/api_calls.jsonl   │
└─────────────────────────────────────────┘
                    │
              optional sync
                    ↓
         ┌────────────────────────┐
         │  hub.pinchbench.com    │
         │  - Config updates      │
         │  - Log aggregation     │
         │  - Fixture data        │
         └────────────────────────┘
```
Pros:
- ✅ Best of both — local speed, central coordination
- ✅ Graceful degradation — works if hub is down
- ✅ Easy grading — logs are local
- ✅ Central dashboard for aggregate analytics
Cons:
- ❌ Most complex to implement
- ❌ Two codebases to maintain (local + hub)
- ❌ Sync logic can get tricky
Grading approach: Same as Option A — local JSONL files.
## Recommendation
Start with Option A (Sidecar/VM-local) because:
- Grading is simplest — logs are right there
- No new infrastructure needed
- Matches our current "self-destruct VM" model
- Can migrate to hybrid later if needed
Implementation plan:
- Create a single `pinchbench-mocks` Docker image with all services
- Each service listens on a different port (e.g., calendar:8001, email:8002)
- Services log all requests to `/shared/api_calls.jsonl`
- Mount a shared volume between the agent and mocks containers
- Grading scripts parse the JSONL to verify correct API usage
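The plan above could translate into a compose file along these lines; everything except the `pinchbench-mocks` image name and the two example ports is a placeholder (the agent image name, volume name, and mount paths are illustrative):

```yaml
# docker-compose.yml sketch; names are illustrative, not final
services:
  agent:
    image: openclaw-agent          # hypothetical agent image name
    volumes:
      - shared:/shared             # grader reads api_calls.jsonl from here
    depends_on:
      - mocks
  mocks:
    image: pinchbench-mocks        # single image, one port per service
    ports:
      - "8001:8001"                # calendar
      - "8002:8002"                # email
    volumes:
      - shared:/shared             # services append to /shared/api_calls.jsonl
volumes:
  shared:
```

Baking this into the VM snapshot (rather than pulling on boot) would address the startup-time con at the cost of larger snapshots.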
## Mock Service Spec (Draft)
Each mock should:
- Accept standard REST/JSON (or match real API format)
- Pre-seed with fixture data (e.g., 10 calendar events, 50 emails)
- Log every request with timestamp, method, path, body, response
- Support a reset endpoint (`POST /reset`) for clean state
- Return realistic response times (configurable latency)
Example log entry:
```json
{"ts":"2026-04-07T21:00:00Z","service":"calendar","method":"POST","path":"/events","body":{"title":"Team Sync","start":"..."},"status":201,"response":{"id":"evt_123"}}
```
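To make the spec concrete, here is a toy calendar mock using only the Python standard library. It covers request logging and `POST /reset` from the list above; fixture pre-seeding and configurable latency are omitted, and the route names, log path, and port handling are illustrative rather than a final design:

```python
import json
import threading
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

LOG_PATH = "/tmp/api_calls.jsonl"  # would be /shared/api_calls.jsonl in the VM

class MockCalendar(BaseHTTPRequestHandler):
    """Toy calendar mock: create events, reset state, log every request."""
    events = []  # a real mock would pre-seed fixture data here

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/reset":
            MockCalendar.events.clear()
            self._respond(200, {"ok": True}, body)
        elif self.path == "/events":
            event = {"id": f"evt_{len(MockCalendar.events) + 1}", **body}
            MockCalendar.events.append(event)
            self._respond(201, event, body)
        else:
            self._respond(404, {"error": "not found"}, body)

    def _respond(self, status, payload, body):
        # Log timestamp, method, path, body, and response for the grader
        entry = {"ts": datetime.now(timezone.utc).isoformat(),
                 "service": "calendar", "method": self.command,
                 "path": self.path, "body": body,
                 "status": status, "response": payload}
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(entry) + "\n")
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep stderr quiet; the JSONL is the log
        pass

# Port 0 lets the OS pick a free port; the plan pins calendar to 8001
server = ThreadingHTTPServer(("127.0.0.1", 0), MockCalendar)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(f"http://127.0.0.1:{server.server_address[1]}/events",
              data=json.dumps({"title": "Team Sync"}).encode(),
              headers={"Content-Type": "application/json"})
created = json.loads(urlopen(req).read())
assert created == {"id": "evt_1", "title": "Team Sync"}
```

Keeping mocks this thin is what makes the sidecar option cheap: each service is a single handler class, and the shared JSONL file is the only contract the grader depends on.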
## Open Questions
- Should mocks exactly mirror real APIs (Google Calendar, Todoist) or use simplified schemas?
- How much fixture data? (10 items vs 1000 items affects task difficulty)
- Should we add chaos/failure modes? (rate limits, 500 errors, timeouts)
- Language for mocks? (Node.js for speed? Python for familiarity? Go for single binary?)
cc @olearycrew