RFC: 20 real-world task ideas from ClawBytes use cases #124

@ScuttleBot

Overview

I analyzed ClawBytes, the KiloClaw cookbook of real user automation recipes, to identify what people actually do with OpenClaw. These recipes represent validated use cases that users care about.

Philosophy: If users are building "bytes" for these workflows, they're high-value tasks worth benchmarking.


Proposed Tasks (20)

Developer Workflows (7 tasks)

| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 1 | Shell Command Generator | Shell Translator | "List all files over 100MB" → correct find command | Automated: run command, verify output |
| 2 | Git Rescue | Git Rescue | "I committed to wrong branch" → recovery commands | Automated: create broken repo, verify fixed state |
| 3 | Commit Message Writer | Commit Poet | Given a diff, write conventional commit message | LLM judge: quality, format compliance |
| 4 | Log Analysis | Log Detective | Parse error logs, identify root cause | Automated: check if correct error identified |
| 5 | CI/CD Debug | Pipeline Paramedic | Fix failing GitHub Actions YAML | Automated: validate fixed YAML, run linter |
| 6 | Test Generation | Test Factory | Generate tests for a function | Automated: run tests, check coverage |
| 7 | Dockerfile Optimization | Dockerfile Doctor | Optimize a bloated Dockerfile | Automated: build both, compare size/layers |

Productivity & Email (4 tasks)

| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 8 | Meeting Summary | Meeting Distiller | Transcript → action items, decisions, TL;DR | LLM judge: completeness, accuracy |
| 9 | Email Triage | Inbox Zero Bot | Categorize 20 emails by priority/type | Automated: compare to ground truth labels |
| 10 | Email Drafting | Inbox Zero Bot | Draft reply to customer complaint | LLM judge: tone, completeness |
| 11 | Task Management | Task Whisperer | "Add task for Friday" → correct API call | Automated: verify mock API received correct request |
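
For the "compare to ground truth labels" grading in task 9, one simple option is exact-match accuracy over the labeled emails. The sketch below is only an assumption about how scoring could work. The function name, the email IDs, and the label vocabulary are all hypothetical.

```python
def triage_accuracy(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """Fraction of ground-truth emails whose predicted label matches.

    Labels are compared case-insensitively. An email missing from
    `predicted` counts as wrong.
    """
    if not truth:
        raise ValueError("ground truth must not be empty")
    correct = sum(
        1
        for email_id, label in truth.items()
        if predicted.get(email_id, "").strip().lower() == label.strip().lower()
    )
    return correct / len(truth)
```

A pass threshold (say, a minimum accuracy over the 20 emails) would still need to be chosen. If priority and type are graded separately, the same function can be called once per label set.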

Research & Information (4 tasks)

| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 12 | Deep Research | Source Hunter | Research a topic, find primary sources | LLM judge: source quality, citations |
| 13 | Tech News Digest | Tech Radar | Summarize HN/Reddit for a topic | LLM judge: relevance, coverage |
| 14 | Bookmark Organization | Bookmark Rescuer | Categorize 50 bookmarks, find dead links | Automated: check categories, verify dead links |
| 15 | Competitive Research | (general) | Compare 3 products, structured output | LLM judge: accuracy, structure |

Code Understanding (3 tasks)

| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 16 | Codebase Navigation | Codebase GPS | "Where is auth handled?" in unfamiliar repo | Automated: check if correct file(s) identified |
| 17 | README Generation | README Reviver | Generate README from repo contents | LLM judge: accuracy, completeness |
| 18 | Onboarding Guide | Onboarding Buddy | "How do I run this repo?" setup instructions | LLM judge: correctness, completeness |

Writing & Content (2 tasks)

| # | Task | ClawByte | What to test | Grading |
|---|------|----------|--------------|---------|
| 19 | AI Writing Cleanup | De-Botinator | Remove AI patterns from text | Automated: count AI patterns before/after |
| 20 | Data Cleaning | Data Janitor | Normalize messy CSV (dates, names, etc.) | Automated: compare to expected output |
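
The "compare to expected output" grading for task 20 could be a cell-by-cell CSV diff. The following is a sketch under assumptions: the function names and the choice to ignore surrounding whitespace are illustrative, and a real grader might also need to tolerate column reordering.

```python
import csv
import io


def read_rows(csv_text: str) -> list[dict]:
    """Parse CSV text into a list of row dicts, trimming whitespace in cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        for row in reader
    ]


def diff_csv(actual_text: str, expected_text: str) -> list[str]:
    """Return human-readable mismatches. An empty list means the output matches."""
    actual, expected = read_rows(actual_text), read_rows(expected_text)
    problems = []
    if len(actual) != len(expected):
        problems.append(f"row count: got {len(actual)}, expected {len(expected)}")
    # Compare only the columns present in the expected file, row by row.
    for i, (a, e) in enumerate(zip(actual, expected), start=1):
        for column, want in e.items():
            if a.get(column) != want:
                problems.append(
                    f"row {i}, column {column!r}: got {a.get(column)!r}, expected {want!r}"
                )
    return problems
```

Returning the list of mismatches rather than a bare pass/fail makes it easier to award partial credit or to debug a fixture.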

Priority Ranking

High priority (core OpenClaw value props, easy to implement):

  1. Shell Command Generator: immediate payoff, easy grading
  2. Email Triage: common use case, automated grading possible (already covered by task_16_email_triage in PinchBench)
  3. Meeting Summary: high-value, shows information synthesis
  4. Data Cleaning: clear before/after, automated grading
  5. Git Rescue: dev-focused, verifiable

Medium priority (valuable but need mock services or fixtures):

  6. Task Management: needs mock Todoist API
  7. Email Drafting: needs mock email service
  8. CI/CD Debug: needs fixture repos
  9. Test Generation: needs fixture code
  10. Commit Message Writer: needs fixture diffs

Lower priority (complex setup or subjective grading):

  11-20. The remaining tasks, including research tasks, code understanding, and README generation


Implementation Notes

Tasks that need mock services (see #123)

  • Email Triage/Drafting → Mock Email API
  • Task Management → Mock Todo API
  • CI/CD Debug → Mock GitHub API (or real public repo)
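
To make "verify mock API received correct request" concrete, here is a minimal in-memory sketch of a recording mock for the Task Management case. It is not the design from #123. The class names, the `/tasks` path, and the field names are hypothetical, and the HTTP transport is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class RecordedRequest:
    method: str
    path: str
    body: dict


@dataclass
class MockTodoApi:
    """In-memory stand-in that records requests instead of sending them anywhere."""
    requests: list = field(default_factory=list)

    def handle(self, method: str, path: str, body: dict) -> dict:
        """Record the request and return a canned 'created' response."""
        self.requests.append(RecordedRequest(method.upper(), path, dict(body)))
        return {"status": 201, "id": len(self.requests)}

    def received(self, method: str, path: str, expected_fields: dict) -> bool:
        """True if any recorded request matches method, path, and every expected field."""
        return any(
            r.method == method.upper()
            and r.path == path
            and all(r.body.get(k) == v for k, v in expected_fields.items())
            for r in self.requests
        )
```

A grader would route the agent's API calls to `handle`, then assert something like `api.received("POST", "/tasks", {"due_date": ...})` with the date that "Friday" resolves to in the fixture. Matching on a subset of fields keeps the check tolerant of extra optional parameters.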

Tasks that need fixture data

  • Log Analysis → Sample error logs with known root causes
  • Data Cleaning → Messy CSVs with known clean versions
  • Meeting Summary → Sample transcripts with expected outputs
  • Bookmark Organization → Bookmark exports with known categories

Tasks that can use real APIs

  • Shell Command Generator → Just needs a shell
  • Deep Research → Web search (already available)
  • Codebase Navigation → Real public repos

Overlap with Existing Tasks

Checking against current PinchBench tasks:

| ClawByte | Existing PinchBench Task? | Notes |
|----------|---------------------------|-------|
| Meeting Summary | | New |
| Email Triage | task_16_email_triage | ✅ Already exists |
| Data Cleaning | | New |
| Shell Command | | New |
| Git Rescue | | New |
| Research | task_18_market_research | Similar, could expand |
| PDF Summary | task_20_eli5_pdf_summary | ✅ Already exists |

Next Steps

  1. Start with Shell Command Generator — simplest to implement, high value
  2. Add Data Cleaning — clear automated grading
  3. Add Meeting Summary — showcase information synthesis
  4. Build mock services (per #123, "RFC: Mock services infrastructure for API-dependent tasks") to unlock email/task tasks
  5. Add code-focused tasks as fixture repos are built

cc @olearycrew
