Analyzed ClawBytes — the KiloClaw cookbook of real user automation recipes — to identify what people actually do with OpenClaw. These represent validated use cases that users care about.
Philosophy: If users are building "bytes" for these workflows, they're high-value tasks worth benchmarking.
Proposed Tasks (20)
Developer Workflows (7 tasks)
| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 1 | Shell Command Generator | Shell Translator | "List all files over 100MB" → correct `find` command | Automated: run command, verify output |
| 2 | Git Rescue | Git Rescue | "I committed to wrong branch" → recovery commands | Automated: create broken repo, verify fixed state |
| 3 | Commit Message Writer | Commit Poet | Given a diff, write conventional commit message | LLM judge: quality, format compliance |
| 4 | Log Analysis | Log Detective | Parse error logs, identify root cause | Automated: check if correct error identified |
| 5 | CI/CD Debug | Pipeline Paramedic | Fix failing GitHub Actions YAML | Automated: validate fixed YAML, run linter |
| 6 | Test Generation | Test Factory | Generate tests for a function | Automated: run tests, check coverage |
| 7 | Dockerfile Optimization | Dockerfile Doctor | Optimize a bloated Dockerfile | Automated: build both, compare size/layers |
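As a rough sketch of the "run command, verify output" grading for the Shell Command Generator task, the grader below builds a throwaway fixture directory, runs the agent's command inside it, and compares the printed paths to the expected set. The function names, the fixture layout, and the pass/fail rule are assumptions for illustration, not existing PinchBench code. Model-generated commands should only be run inside a sandbox.

```python
import os
import subprocess
import tempfile


def make_fixture(root, sizes):
    """Create files with the given apparent sizes (in bytes) under root."""
    for rel_path, size in sizes.items():
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            # truncate() sets the apparent size without writing data,
            # so large fixtures stay cheap on filesystems with sparse files.
            fh.truncate(size)


def grade_shell_command(command, sizes, expected):
    """Return True if the command prints exactly the expected relative paths."""
    with tempfile.TemporaryDirectory() as root:
        make_fixture(root, sizes)
        result = subprocess.run(
            command, shell=True, cwd=root,
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return False
        found = {
            os.path.normpath(line.strip())
            for line in result.stdout.splitlines()
            if line.strip()
        }
        return found == {os.path.normpath(p) for p in expected}
```

For the 100MB prompt, a fixture such as `{"big.bin": 150 * 1024 * 1024, "small.txt": 10}` with `expected={"big.bin"}` would check a command like `find . -type f -size +100M`. Comparing sets rather than raw text keeps the grader tolerant of output order and of a leading `./`.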
Productivity & Email (4 tasks)
| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 8 | Meeting Summary | Meeting Distiller | Transcript → action items, decisions, TL;DR | LLM judge: completeness, accuracy |
| 9 | Email Triage | Inbox Zero Bot | Categorize 20 emails by priority/type | Automated: compare to ground truth labels |
| 10 | Email Drafting | Inbox Zero Bot | Draft reply to customer complaint | LLM judge: tone, completeness |
| 11 | Task Management | Task Whisperer | "Add task for Friday" → correct API call | Automated: verify mock API received correct request |
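The "compare to ground truth labels" grading for Email Triage could be as small as an accuracy score. The sketch below assumes the agent returns a mapping from email ID to label. The IDs, label names, and the choice of plain accuracy are illustrative assumptions rather than a settled design.

```python
def grade_triage(predicted, ground_truth):
    """Return the fraction of emails whose predicted label matches the truth.

    Emails the agent skipped count as wrong. Labels are compared
    case-insensitively so "Urgent" and "urgent" are treated as equal.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one email")
    correct = sum(
        1
        for email_id, label in ground_truth.items()
        if predicted.get(email_id, "").strip().lower() == label.lower()
    )
    return correct / len(ground_truth)
```

With 20 fixture emails, a pass threshold (for example 0.8) would still need to be chosen. Per-label precision and recall may be worth adding if the priority classes turn out to be imbalanced.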
Research & Information (4 tasks)
| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 12 | Deep Research | Source Hunter | Research a topic, find primary sources | LLM judge: source quality, citations |
| 13 | Tech News Digest | Tech Radar | Summarize HN/Reddit for a topic | LLM judge: relevance, coverage |
| 14 | Bookmark Organization | Bookmark Rescuer | Categorize 50 bookmarks, find dead links | Automated: check categories, verify dead links |
| 15 | Competitive Research | (general) | Compare 3 products, structured output | LLM judge: accuracy, structure |
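Several of these tasks rely on an LLM judge, so a shared helper for building the judge prompt and parsing its reply may be useful. This is a minimal sketch: the prompt wording, the 1 to 5 scale, and the JSON reply format are all assumptions, and the call to the judge model is deliberately left out.

```python
import json


def build_judge_prompt(task_description, criteria, agent_output):
    """Assemble a grading prompt that asks for one score per criterion."""
    criteria_list = "\n".join(f"- {name}" for name in criteria)
    return (
        "You are grading an agent's output.\n"
        f"Task: {task_description}\n"
        f"Score each criterion from 1 to 5:\n{criteria_list}\n"
        "Reply with a JSON object mapping each criterion to its score.\n"
        f"Output to grade:\n{agent_output}"
    )


def parse_judge_scores(reply, criteria):
    """Turn the judge's JSON reply into a normalized score in (0, 1]."""
    scores = json.loads(reply)
    total = 0
    for name in criteria:
        value = scores[name]  # a missing criterion raises KeyError
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
        total += value
    return total / (5 * len(criteria))
```

For Deep Research the criteria would be `["source quality", "citations"]`, taken from the grading column. Rejecting malformed replies instead of guessing a score keeps judge failures visible in the results.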
Code Understanding (3 tasks)
| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 16 | Codebase Navigation | Codebase GPS | "Where is auth handled?" in unfamiliar repo | Automated: check if correct file(s) identified |
| 17 | README Generation | README Reviver | Generate README from repo contents | LLM judge: accuracy, completeness |
| 18 | Onboarding Guide | Onboarding Buddy | "How do I run this repo?" setup instructions | LLM judge: correctness, completeness |
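For Codebase Navigation, "check if correct file(s) identified" could be approximated by pulling path-like tokens out of the agent's free-text answer and measuring how many expected files are mentioned. The regex and the recall-only scoring are assumptions for this sketch, and the file paths in the usage note are hypothetical.

```python
import posixpath
import re

# Matches tokens that look like file paths with an extension, e.g. src/app.py
PATH_PATTERN = re.compile(r"[\w./-]+\.\w+")


def extract_paths(answer):
    """Return the normalized set of path-like tokens found in the answer."""
    return {posixpath.normpath(match) for match in PATH_PATTERN.findall(answer)}


def grade_navigation(answer, expected_files):
    """Return the fraction of expected files that the answer mentions."""
    if not expected_files:
        raise ValueError("expected_files must not be empty")
    mentioned = extract_paths(answer)
    expected = {posixpath.normpath(path) for path in expected_files}
    return len(expected & mentioned) / len(expected)
```

An answer such as "Auth is handled in src/auth/middleware.py and ./src/auth/tokens.py." scores 1.0 against those two files. A precision penalty may be needed so that an agent cannot pass by listing every file in the repo.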
Writing & Content (2 tasks)
| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 19 | AI Writing Cleanup | De-Botinator | Remove AI patterns from text | Automated: count AI patterns before/after |
| 20 | Data Cleaning | Data Janitor | Normalize messy CSV (dates, names, etc.) | Automated: compare to expected output |
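The "count AI patterns before/after" grading for AI Writing Cleanup could start from a fixed phrase list. The patterns below are placeholders chosen for illustration, not the list De-Botinator uses, and the scoring rule (fraction of patterns removed) is likewise an assumption.

```python
import re

# Placeholder patterns; a real task would ship its own curated list.
AI_PATTERNS = [
    r"\bdelve\b",
    r"\bit'?s worth noting\b",
    r"\bin conclusion\b",
    r"\btapestry\b",
]


def count_ai_patterns(text, patterns=AI_PATTERNS):
    """Count case-insensitive occurrences of every pattern in the text."""
    return sum(
        len(re.findall(pattern, text, flags=re.IGNORECASE))
        for pattern in patterns
    )


def grade_cleanup(before, after):
    """Return the fraction of AI patterns removed, clamped to [0, 1]."""
    before_count = count_ai_patterns(before)
    after_count = count_ai_patterns(after)
    if before_count == 0:
        return 1.0 if after_count == 0 else 0.0
    return max(0.0, 1 - after_count / before_count)
```

A count-based score can be gamed by deleting the whole text, so it would likely need to be paired with a check that the cleaned text still preserves the original meaning.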
Priority Ranking
High priority (core OpenClaw value props, easy to implement):
Medium priority (valuable but need mock services):
6. Task Management — needs mock Todoist API
7. Email Drafting — needs mock email service
8. CI/CD Debug — needs fixture repos
9. Test Generation — needs fixture code
10. Commit Message Writer — needs fixture diffs
Lower priority (complex setup or subjective grading):
11-20. Research tasks, code understanding, README generation
Implementation Notes
Tasks that need mock services (see #123)
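For the Task Management task, a mock service only has to record what the agent sends so the grader can inspect it afterwards. The sketch below uses the standard library's `http.server`. The `/tasks` path and the `content` and `due_date` fields are hypothetical and are not taken from the real Todoist API. `post_json` stands in for the agent's request.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Accept any JSON POST and remember its path and body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        self.server.received.append((self.path, body))
        payload = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep benchmark output quiet


def start_mock_server():
    """Start the mock on a free local port and return the server object."""
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)
    server.received = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_json(port, path, payload):
    """Send one JSON POST to the mock, as the agent under test would."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(
        "POST", path,
        body=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    status = conn.getresponse().status
    conn.close()
    return status


def grade_task_request(server, expected_path, expected_fields):
    """Return True if any recorded request matches the path and fields."""
    return any(
        path == expected_path
        and all(body.get(key) == value for key, value in expected_fields.items())
        for path, body in server.received
    )
```

The grader checks only the fields it is given, so a prompt like "Add task for Friday" can be graded on `due_date` alone while leaving the task wording free. Call `server.shutdown()` once grading is finished.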
Tasks that need fixture data
Tasks that can use real APIs
Overlap with Existing Tasks
Checking against current PinchBench tasks:
task_16_email_triage
task_18_market_research
task_20_eli5_pdf_summary
Next Steps
cc @olearycrew