
RFC: Mock services infrastructure for API-dependent tasks #123

@ScuttleBot


Problem

Many real-world OpenClaw tasks involve external APIs that we can't reliably test against:

  • Rate limits — Real APIs throttle, making benchmarks inconsistent
  • Cost — Some APIs charge per call
  • State mutation — Real CRMs/calendars would get polluted
  • Reproducibility — External data changes over time
  • Grading — Hard to verify correct API usage without seeing what was called

We need mock services that behave like real APIs but are controlled, observable, and free.

Proposed Mock Services

Critical (High-value OpenClaw use cases)

| Service | Why | Example Tasks |
| --- | --- | --- |
| Calendar API | Scheduling is core OpenClaw UX | "Schedule a meeting", "Find free slots", "Check conflicts" |
| Email/SMTP | Email management is huge | "Send email", "Search inbox", "Triage messages" |
| Weather API | Common info lookup | "What's the weather", "Should I bring umbrella" |
| Web Search | Research tasks need it | "Find info about X", "Compare products" |
| Task/Todo API | Productivity workflows | "Add task", "Check overdue", "Complete item" |
| File Storage | Document workflows | "Upload file", "List files", "Share document" |

High Value

| Service | Why | Example Tasks |
| --- | --- | --- |
| CRM API | Business automation | "Add contact", "Log interaction", "Find customer" |
| Helpdesk API | Support workflows | "Create ticket", "Escalate issue", "Search KB" |
| RSS Feed | Content monitoring | "Summarize feed", "Find new posts" |
| Issue Tracker (JIRA/Linear style) | Dev workflows | "Create issue", "Update status", "Link PRs" |
| Database/KV Store | Data persistence | "Store value", "Query records" |

Nice to Have

| Service | Why | Example Tasks |
| --- | --- | --- |
| Payment API | E-commerce | "Check balance", "Process refund" |
| Shipping/Tracking | Logistics | "Track package", "Get delivery estimate" |
| Social Media | Content posting | "Post update", "Check mentions" |
| SMS/Messaging | Notifications | "Send SMS", "Check delivery" |
| Analytics API | Reporting | "Get metrics", "Generate report" |

Implementation Options

Option A: Sidecar Services (VM-local)

Each benchmark VM spins up mock services as Docker containers alongside the agent.

┌─────────────────────────────────────────┐
│ Benchmark VM                            │
│  ┌─────────────┐  ┌─────────────┐       │
│  │   OpenClaw  │  │ Mock Svc    │       │
│  │   Agent     │──│ Container   │       │
│  └─────────────┘  │ - Calendar  │       │
│                   │ - Email     │       │
│                   │ - Weather   │       │
│                   └─────────────┘       │
│                         │               │
│                    logs/state           │
│                         ↓               │
│               /results/api_calls.jsonl  │
└─────────────────────────────────────────┘

Pros:

  • ✅ Complete isolation — each run gets fresh state
  • ✅ All logs local — easy to collect for grading
  • ✅ No network latency — fast API responses
  • ✅ No central infrastructure to maintain
  • ✅ Works offline / air-gapped

Cons:

  • ❌ Adds to VM startup time (pull/start containers)
  • ❌ Uses VM resources (RAM/CPU for mock services)
  • ❌ Need to bake mock services into snapshot or pull on boot
  • ❌ Harder to update mock behavior across all VMs

Grading approach: Mock services append every request to a JSONL file; the grading script reads it directly from the workspace.
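The grading check itself can be a few lines of Python. A minimal sketch (field names follow the example log entry later in this RFC; the helper names are hypothetical):

```python
import json

def load_calls(jsonl_lines):
    """Parse JSONL lines written by the mock services."""
    return [json.loads(line) for line in jsonl_lines if line.strip()]

def called(calls, service, method, path):
    """True if the agent hit the given endpoint at least once."""
    return any(
        c["service"] == service and c["method"] == method and c["path"] == path
        for c in calls
    )

# In a real run these lines would come from open("/results/api_calls.jsonl");
# inlined here so the sketch is self-contained:
sample = [
    '{"ts":"2026-04-07T21:00:00Z","service":"calendar","method":"POST",'
    '"path":"/events","body":{"title":"Team Sync"},"status":201,'
    '"response":{"id":"evt_123"}}'
]
calls = load_calls(sample)
assert called(calls, "calendar", "POST", "/events")
```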


Option B: Centralized Mock Hub (services.pinchbench.com)

All mock services run on a central server. VMs call out to it with a unique run ID.

┌────────────────┐         ┌──────────────────────────┐
│ Benchmark VM   │         │ services.pinchbench.com  │
│  ┌──────────┐  │  HTTPS  │  ┌────────────────────┐  │
│  │ OpenClaw │──┼────────►│  │ Mock Service Hub   │  │
│  │ Agent    │  │         │  │ - Calendar         │  │
│  └──────────┘  │         │  │ - Email            │  │
│                │         │  │ - Weather          │  │
│                │         │  └────────────────────┘  │
│                │         │           │              │
│                │         │     logs per run_id      │
│                │         │           ↓              │
│                │  GET    │  /api/logs/{run_id}      │
│                │◄────────┼──────────────────────────│
└────────────────┘         └──────────────────────────┘

Pros:

  • ✅ VMs stay lightweight — no extra containers
  • ✅ Easy to update mock behavior globally
  • ✅ Central logging dashboard possible
  • ✅ Can add rate limiting / chaos testing

Cons:

  • ❌ Network dependency — if hub is down, benchmarks fail
  • ❌ Latency — network round trips for every API call
  • ❌ Multi-tenancy complexity — need run isolation
  • ❌ State management — how to reset between runs?
  • ❌ Cost — another server to run

Grading approach: After the run completes, the grading script fetches /api/logs/{run_id} from the hub.
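Fetching the per-run log could look like this; the hostname and the /api/logs/{run_id} path come from the diagram above, everything else is an assumption:

```python
import json
import urllib.request

HUB = "https://services.pinchbench.com"

def logs_url(run_id):
    """Build the per-run log endpoint shown in the diagram."""
    return f"{HUB}/api/logs/{run_id}"

def fetch_logs(run_id, opener=urllib.request.urlopen):
    """Download and parse the API call log for one benchmark run."""
    with opener(logs_url(run_id)) as resp:
        return json.loads(resp.read())
```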


Option C: Hybrid (Local-first with Central Fallback)

Mock services are bundled in the VM snapshot (local). Central hub exists for:

  • Configuration updates (mock data, behavior)
  • Log aggregation (optional)
  • Shared fixtures (large datasets)

┌─────────────────────────────────────────┐
│ Benchmark VM                            │
│  ┌─────────────┐  ┌─────────────┐       │
│  │   OpenClaw  │──│ Local Mocks │       │
│  │   Agent     │  │ (embedded)  │       │
│  └─────────────┘  └──────┬──────┘       │
│                          │              │
│                    logs/state           │
│                          ↓              │
│               /results/api_calls.jsonl  │
└─────────────────────────────────────────┘
                           │
                    optional sync
                           ↓
              ┌────────────────────────┐
              │ hub.pinchbench.com     │
              │ - Config updates       │
              │ - Log aggregation      │
              │ - Fixture data         │
              └────────────────────────┘

Pros:

  • ✅ Best of both — local speed, central coordination
  • ✅ Graceful degradation — works if hub is down
  • ✅ Easy grading — logs are local
  • ✅ Central dashboard for aggregate analytics

Cons:

  • ❌ Most complex to implement
  • ❌ Two codebases to maintain (local + hub)
  • ❌ Sync logic can get tricky

Grading approach: Same as Option A — local JSONL files.


Recommendation

Start with Option A (Sidecar/VM-local) because:

  1. Grading is simplest — logs are right there
  2. No new infrastructure needed
  3. Matches our current "self-destruct VM" model
  4. Can migrate to hybrid later if needed

Implementation plan:

  1. Create a single pinchbench-mocks Docker image with all services
  2. Each service listens on a different port (e.g., calendar:8001, email:8002)
  3. Services log all requests to /shared/api_calls.jsonl
  4. Mount shared volume between agent and mocks container
  5. Grading scripts parse the JSONL to verify correct API usage

Mock Service Spec (Draft)

Each mock should:

  • Accept standard REST/JSON (or match real API format)
  • Pre-seed with fixture data (e.g., 10 calendar events, 50 emails)
  • Log every request with timestamp, method, path, body, response
  • Support reset endpoint (POST /reset) for clean state
  • Return realistic response times (configurable latency)
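To make the spec concrete, here is a sketch of one mock (calendar) as an in-memory handler. Routes, fixture counts, and field names are illustrative; a real implementation would sit behind an HTTP server on its assigned port and write its log to /shared/api_calls.jsonl:

```python
import time

class MockCalendar:
    """Minimal in-memory mock per the draft spec: seeded fixtures,
    per-request logging, and a POST /reset endpoint."""

    def __init__(self):
        self.reset()
        self.log = []

    def reset(self):
        # Pre-seed fixture data (count is a placeholder)
        self.events = [
            {"id": f"evt_{i}", "title": f"Fixture event {i}"} for i in range(10)
        ]

    def handle(self, method, path, body=None):
        if method == "POST" and path == "/reset":
            self.reset()  # clean state; the request log is kept for grading
            status, resp = 200, {"ok": True}
        elif method == "GET" and path == "/events":
            status, resp = 200, {"events": self.events}
        elif method == "POST" and path == "/events":
            event = {"id": f"evt_{len(self.events)}", **(body or {})}
            self.events.append(event)
            status, resp = 201, event
        else:
            status, resp = 404, {"error": "not found"}
        # Log every request with timestamp, method, path, body, response
        self.log.append({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "service": "calendar", "method": method, "path": path,
            "body": body, "status": status, "response": resp,
        })
        return status, resp
```

Note the design choice that POST /reset restores fixture state but never clears the log, so graders can still see everything the agent did before a reset.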

Example log entry:

{"ts":"2026-04-07T21:00:00Z","service":"calendar","method":"POST","path":"/events","body":{"title":"Team Sync","start":"..."},"status":201,"response":{"id":"evt_123"}}

Open Questions

  1. Should mocks exactly mirror real APIs (Google Calendar, Todoist) or use simplified schemas?
  2. How much fixture data? (10 items vs 1000 items affects task difficulty)
  3. Should we add chaos/failure modes? (rate limits, 500 errors, timeouts)
  4. Language for mocks? (Node.js for speed? Python for familiarity? Go for single binary?)

cc @olearycrew
