Task ideas from canitrunopenclaw/ClawBench analysis #122

@ScuttleBot

Overview

Analyzed the canitrunopenclaw repository and its ClawBench benchmarking tool.

What it is: A hardware compatibility directory for OpenClaw forks. ClawBench measures installation and startup performance, not agent task completion.

Key difference from PinchBench:

  • ClawBench = Can the fork run on this hardware? (clone time, install time, disk usage, memory, startup)
  • PinchBench = Can the agent complete tasks correctly? (task accuracy, tool usage, reasoning)

ClawBench Scoring (for reference)

Component     Weight   What it measures
Latency       30 pts   Cold start time (clone + install + startup)
Capabilities  40 pts   8 capability checks (messaging, browser, code exec, memory, files, search, MCP, tool use)
Size          20 pts   Total disk footprint after install
Build         10 pts   Successful install + successful startup
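
For intuition, the weights above could compose into the 100-point total roughly like this. This is a hedged sketch: only the component weights come from the table; the per-check capability weighting (40 pts / 8 checks = 5 pts each) and the clamping are assumptions about how ClawBench might aggregate them.

```python
def clawbench_score(latency_pts, capability_checks_passed, size_pts, build_pts):
    """Illustrative composition of ClawBench's four weighted components.

    Assumes each of the 8 capability checks is worth an equal 5 pts;
    the real per-check weighting is not documented in this issue.
    """
    capabilities = 5 * min(capability_checks_passed, 8)  # 40 pts max
    return (min(latency_pts, 30) + capabilities
            + min(size_pts, 20) + min(build_pts, 10))
```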

Capabilities detected via static analysis:

  • Messaging (WhatsApp/Telegram/Discord/Slack)
  • Browser automation (Puppeteer/Playwright/Selenium)
  • Code execution (subprocess/child_process)
  • Persistent memory (SQLite/Redis/vectordb)
  • File management
  • Web search
  • MCP support
  • Tool use

Relevant Ideas for PinchBench

While ClawBench doesn't have agent tasks to port, some capability checks could inspire new task categories:

1. MCP Server Integration Task

ClawBench checks for MCP support. We could add a task where the agent must:

  • Connect to an MCP server
  • Discover available tools
  • Use an MCP-provided tool to complete a task

Why: MCP is increasingly important in the OpenClaw ecosystem.
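
A minimal sketch of how the grading side could work, using an in-process mock rather than a real MCP server. The MockMCPServer class, its tool set, and the pass criteria are all illustrative assumptions, not real PinchBench or MCP SDK APIs.

```python
class MockMCPServer:
    """In-process stand-in for an MCP server exposing one tool (hypothetical)."""

    def __init__(self):
        self.tools = {"get_weather": lambda city: f"Sunny in {city}"}
        self.listed = False  # did the agent discover the tool list?
        self.calls = []      # record of (tool_name, args) the agent made

    def list_tools(self):
        self.listed = True
        return sorted(self.tools)

    def call_tool(self, name, *args):
        self.calls.append((name, args))
        return self.tools[name](*args)


def grade_mcp_task(server):
    """Pass only if the agent both discovered tools and invoked a valid one."""
    used_tool = any(name in server.tools for name, _ in server.calls)
    return server.listed and used_tool
```

The grader would hand the agent the server endpoint, let it run, then call grade_mcp_task on the recorded interactions.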

2. Multi-Channel Messaging Task

ClawBench checks for messaging platform support. Task idea:

  • Send a message via one channel (e.g., mock Discord webhook)
  • Verify delivery or response

Challenge: Requires mock infrastructure or a webhook endpoint.
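
The mock infrastructure is small enough to sketch with the standard library. Here the "agent" step is simulated with a direct POST; the /webhook path and the {"content": ...} payload shape are assumptions loosely modeled on Discord webhooks.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # messages the mock "Discord" endpoint has seen


class MockWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        received.append(json.loads(body))
        self.send_response(204)  # Discord webhooks answer 204 No Content
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the harness quiet


server = HTTPServer(("127.0.0.1", 0), MockWebhook)  # port 0 = auto-assign
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/webhook"

# Simulated agent action; in the real task the agent would produce this call.
req = urllib.request.Request(
    url,
    data=json.dumps({"content": "hello"}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```

Grading then reduces to inspecting `received` for the expected message.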

3. Browser + Code Execution Combined Task

ClawBench separately checks browser and code execution. Task idea:

  • Navigate to a page with browser
  • Extract data
  • Write a Python/JS script to process it
  • Execute the script
  • Report results

Why: Tests integration of multiple capabilities in a single workflow.
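
The "write a script, then execute it" half of the workflow could be graded like this. The task framing (sum a column of scraped numbers) and the hard-coded agent_script are invented stand-ins; in the real task the browser step would supply the data and the agent would supply the script.

```python
import os
import subprocess
import sys
import tempfile

extracted = [3, 14, 15]  # stand-in for data the agent scraped via browser

# Stand-in for the script the agent is expected to write.
agent_script = "import sys; print(sum(int(x) for x in sys.argv[1:]))"


def run_agent_script(script, data):
    """Execute the agent's script in a subprocess and capture its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        out = subprocess.run(
            [sys.executable, path, *map(str, data)],
            capture_output=True, text=True, timeout=10,
        )
        return out.stdout.strip()
    finally:
        os.unlink(path)


result = run_agent_script(agent_script, extracted)
```

Running the script in a subprocess with a timeout also keeps a buggy agent script from hanging the harness.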

4. Resource-Constrained Performance Task

Inspired by ClawBench's focus on hardware limits:

  • Give agent a task with a strict time/token budget
  • Grade on completion and efficiency

Challenge: Would need to track token usage in grading.
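
One possible scoring rule, sketched under stated assumptions: the 50/50 split between completion and efficiency, the linear scaling, and the hard zero for exceeding the budget are all design choices for illustration, not an agreed spec.

```python
def score_budgeted_task(completed, tokens_used, token_budget):
    """Score 0-100: half for completing at all, half scaled by tokens left over."""
    if tokens_used > token_budget:
        return 0.0  # blowing the budget fails the task outright
    if not completed:
        return 0.0
    completion = 50.0
    efficiency = 50.0 * (1 - tokens_used / token_budget)
    return completion + efficiency
```

A fork that finishes using half the budget would score 75; one that uses the whole budget still gets the 50 completion points.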

Not Applicable to PinchBench

  • Clone/install benchmarks (not relevant to agent task completion)
  • Disk usage metrics (infrastructure, not task)
  • Startup time (infrastructure)
  • Static capability detection (we test via actual task completion)

Conclusion

ClawBench is complementary to PinchBench, not overlapping. They answer different questions:

  • ClawBench: "Will this fork run on my Raspberry Pi?"
  • PinchBench: "Which model completes tasks best?"

The MCP integration task idea (item #1) is probably the most actionable addition.

cc @olearycrew
