RFC: Additional task ideas - Email, Finance, Research, and More

## Overview

This RFC proposes additional task ideas to expand PinchBench coverage. Tasks are organized by category, with notes on grading and on any mock service each task needs.

---

## Email & Communication (4 tasks)

| Task | Description | Grading | Mock Service Needed |
|------|-------------|---------|---------------------|
| **Email Reply Drafting** | Check inbox for new emails, draft appropriate replies for each unread message | LLM judge: tone, completeness, appropriateness | Mock Email API |
| **Email → Meeting Scheduler** | Read an email requesting a meeting, extract details, create calendar event with correct time/attendees | Automated: verify calendar event fields match email content | Mock Email + Calendar API |
| **Meeting Summary** | Given a meeting transcript, produce structured summary with decisions, action items, open questions, TL;DR | LLM judge: completeness, accuracy of extraction | None (fixture transcript) |
| **Contact Lookup** | Find contact information for a person/role at a company (e.g., "Find the procurement contact at Acme Corp") | Automated: verify correct contact returned from mock CRM | Mock CRM API |
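
The automated check for the Email → Meeting Scheduler task could look roughly like the sketch below. The field names (`start`, `duration_minutes`, `attendees`) and the shape of the mock Calendar API payload are assumptions made for illustration; the real mock service may expose different fields.

```python
from datetime import datetime


def check_event(expected: dict, actual: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the event passes.

    Both dicts use hypothetical keys: 'start' (ISO 8601 string with UTC
    offset), 'duration_minutes' (int), and 'attendees' (list of emails).
    """
    problems = []

    # Compare start times as instants so equivalent UTC offsets still match.
    try:
        exp_start = datetime.fromisoformat(expected["start"])
        act_start = datetime.fromisoformat(actual.get("start", ""))
        if exp_start != act_start:
            problems.append(f"start: expected {exp_start}, got {act_start}")
    except ValueError:
        problems.append("start: missing or not a valid ISO 8601 timestamp")

    if expected["duration_minutes"] != actual.get("duration_minutes"):
        problems.append("duration_minutes does not match")

    # Attendee order and letter case should not matter.
    exp_att = {a.strip().lower() for a in expected["attendees"]}
    act_att = {a.strip().lower() for a in actual.get("attendees", [])}
    if exp_att != act_att:
        problems.append(f"attendees differ: {sorted(exp_att ^ act_att)}")

    return problems
```

Returning a list of mismatches rather than a single boolean makes it easier to award partial credit or to debug a failing run.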

---

## Task & Productivity Management (2 tasks)

| Task | Description | Grading | Mock Service Needed |
|------|-------------|---------|---------------------|
| **Todo List Cleanup** | Clean up a messy to-do list: merge duplicates (mark-as-completed, never delete), flag overdue items, organize by project | Automated: verify no items deleted, duplicates merged correctly, overdue flagged | Mock Todo API |
| **Cron Job Organizer** | Given a list of desired scheduled tasks ("backup DB nightly", "send report every Monday 9am"), create properly formatted cron expressions and organize them | Automated: validate cron syntax, verify schedules match descriptions | None (output validation) |
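
As a rough sketch of the "validate cron syntax" check for Cron Job Organizer, the function below accepts standard five-field expressions with `*`, numbers, ranges, lists, and steps. It is an assumption that the task targets this dialect; month and weekday names (`JAN`, `MON`) and extensions such as `@daily` are deliberately not handled here.

```python
import re

# (min, max) for minute, hour, day-of-month, month, day-of-week.
# Day-of-week allows 0-7 because many cron implementations treat 0 and 7 as Sunday.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]

# One comma-separated part: "*", "N", or "N-M", optionally followed by "/step".
_PART = re.compile(r"^(\*|\d+(?:-\d+)?)(?:/(\d+))?$")


def is_valid_cron(expr: str) -> bool:
    """Check that expr is a syntactically valid five-field cron expression."""
    fields = expr.split()
    if len(fields) != 5:
        return False
    for field, (lo, hi) in zip(fields, FIELD_RANGES):
        for part in field.split(","):
            match = _PART.match(part)
            if not match:
                return False
            base, step = match.groups()
            if step is not None and int(step) == 0:
                return False  # a step of zero never advances
            if base != "*":
                nums = [int(n) for n in base.split("-")]
                if any(n < lo or n > hi for n in nums):
                    return False
                if len(nums) == 2 and nums[0] > nums[1]:
                    return False  # reversed range such as 5-2
    return True
```

For example, "send report every Monday 9am" would be expected to yield `0 9 * * 1`, which this check accepts, while `0 9 * * 8` is rejected. Verifying that a schedule matches its description still requires comparing against an expected expression or expanding both to concrete run times.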

---

## Finance & Business Research (7 tasks)

| Task | Description | Grading | Mock Service Needed |
|------|-------------|---------|---------------------|
| **Executive Lookup** | "Who is the CFO of GitLab, Inc as of April 7, 2026?" — Point-in-time factual lookup requiring web search | Automated: exact match on name | None (web search) |
| **Earnings Analysis - Margin** | "How many basis points did GitLab (GTLB) beat or miss Q3 2025 GAAP gross margin guidance?" — Extract from earnings reports, calculate delta | Automated: exact numeric answer (±tolerance) | None (web search + PDF reading) |
| **Earnings Analysis - Sales** | "What was Chipotle's (CMG) same-store sales growth in Q4 2025?" — Extract specific metric from earnings | Automated: exact numeric answer | None (web search + PDF reading) |
| **Financial Ratio Calculation** | "Calculate the inventory turnover ratio for United States Steel Corporation (X) in FY2024" — Requires finding financials, applying formula | Automated: numeric answer within tolerance | None (web search + calculation) |
| **Expense Report Submission** | Given receipt images/data, organize into expense report with categories, totals, required fields | Automated: verify totals, categories, required fields present | Mock Expense API |
| **Finance Report from CSV** | Given company financial CSVs, generate a structured report with key metrics, trends, summaries | LLM judge: accuracy of calculations, completeness | None (fixture CSVs) |
| **Pricing Research** | Compare pricing for a product/service across 3-5 vendors, produce structured comparison | LLM judge: accuracy, completeness, source citations | None (web search) |
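
The numeric finance tasks share two pieces: a formula and a tolerance check. The sketch below shows one way to express both. The inventory turnover formula used here (cost of goods sold divided by average inventory) is an assumption; other conventions exist, such as using ending inventory or revenue, so the task prompt should state which one the ground truth uses. The tolerance values are placeholders, not proposed thresholds.

```python
import math


def bps_delta(actual_pct: float, guided_pct: float) -> float:
    """Beat (+) or miss (-) versus guidance in basis points.

    Inputs are percentages; one percentage point equals 100 basis points.
    """
    return (actual_pct - guided_pct) * 100


def inventory_turnover(cogs: float, begin_inventory: float, end_inventory: float) -> float:
    """COGS divided by average inventory over the period (assumed convention)."""
    average_inventory = (begin_inventory + end_inventory) / 2
    return cogs / average_inventory


def within_tolerance(answer: float, truth: float, abs_tol: float) -> bool:
    """True when the answer is within abs_tol of the ground-truth value."""
    return math.isclose(answer, truth, rel_tol=0.0, abs_tol=abs_tol)
```

With made-up inputs, `bps_delta(89.5, 89.0)` gives a 50 bps beat and `inventory_turnover(1000, 180, 220)` gives 5.0. An absolute tolerance suits ratios and basis points; a relative tolerance may fit better for large dollar figures.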

---

## Technology & Product Research (5 tasks)

| Task | Description | Grading | Mock Service Needed |
|------|-------------|---------|---------------------|
| **IT Equipment Procurement** | Research and recommend laptops meeting specific requirements (budget, specs, use case), with justification | LLM judge: requirements met, reasoning quality, sources | None (web search) |
| **Open Source Alternative Research** | "Find open source alternatives to Notion" — Research, compare features, recommend with pros/cons | LLM judge: relevance, completeness, accuracy | None (web search) |
| **Competitive Research** | Compare 3 competitors in a space (features, pricing, market position), produce structured analysis | LLM judge: accuracy, depth, source quality | None (web search) |
| **New Phone Research** | "I need a new phone under $800 with great camera" — Research current options, recommend with reasoning | LLM judge: requirements addressed, current info, reasoning | None (web search) |
| **BYOK Best Practices** | "Put together best practices for BYOK (Bring Your Own Key) for AI inference" — Research and compile technical guidance | LLM judge: accuracy, completeness, source quality | None (web search) |

---

## Regulatory & Compliance (1 task)

| Task | Description | Grading | Mock Service Needed |
|------|-------------|---------|---------------------|
| **EU Regulation Research** | Research a specific EU regulation (GDPR, AI Act, etc.), summarize key requirements for a company | LLM judge: accuracy, completeness, actionability | None (web search) |

---

## Image & Document Processing (5 tasks)

| Task | Description | Grading | Mock Service Needed |
|------|-------------|---------|---------------------|
| **OCR - Sheet Music** | Extract musical notation or metadata from sheet music image | Automated: verify extracted notes/metadata match ground truth | None (fixture images) |
| **OCR - General Document** | Extract structured text from a scanned document (receipt, form, etc.) | Automated: compare extracted text to ground truth | None (fixture images) |
| **Image Identification** | Given tagged images (phone, food, menu, receipt, etc.), correctly identify and describe each | Automated: verify labels match ground truth | None (fixture images) |
| **Subway Navigation** | Given NYC subway map image + origin/destination, provide directions | LLM judge: correct route, reasonable instructions | None (fixture image) |
| **PDF Summary** | Summarize a multi-page PDF document (technical, legal, or business) | LLM judge: accuracy, completeness, key points captured | None (fixture PDF) |
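
For the OCR tasks, "compare extracted text to ground truth" probably should not be a strict string equality, since whitespace and line breaks vary between extractions. A minimal sketch using the standard library's `difflib` follows. Ignoring case and collapsing whitespace are assumptions about what the grader should forgive, and any pass threshold would need to be tuned against real fixtures.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text before comparison."""
    return " ".join(text.split()).lower()


def ocr_similarity(extracted: str, truth: str) -> float:
    """Similarity in [0, 1] between extracted text and ground truth.

    autojunk=False disables difflib's popular-character heuristic, which
    can distort scores on documents of 200 characters or more.
    """
    matcher = difflib.SequenceMatcher(
        None, normalize(extracted), normalize(truth), autojunk=False
    )
    return matcher.ratio()
```

A grader could then pass a submission when `ocr_similarity(output, ground_truth)` meets a chosen threshold, or report the score directly for partial credit.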

---

## Summary by Category

| Category | Count | Mock Services Needed |
|----------|-------|---------------------|
| Email & Communication | 4 | Email API, Calendar API, CRM API |
| Task & Productivity | 2 | Todo API |
| Finance & Business | 7 | Expense API (1 task); rest use web search or fixture CSVs |
| Technology Research | 5 | None |
| Regulatory | 1 | None |
| Image & Document | 5 | None |
| **Total** | **24** | |

---

## Priority Recommendations

**High priority (no mock services needed, high value):**
1. Executive Lookup — simple factual, automated grading
2. Financial Ratio Calculation — tests math + research
3. Competitive Research — common use case
4. Image Identification — tests vision capabilities
5. PDF Summary — core document understanding

**Medium priority (fixture data or mock services needed):**
6. Todo List Cleanup — needs mock Todo API
7. Email Reply Drafting — needs mock Email API
8. OCR tasks — needs fixture images with ground truth
9. Finance Report from CSV — needs fixture data

**Lower priority (complex grading):**
10. Meeting Summary — LLM judge, subjective
11. Research tasks — LLM judge, hard to verify accuracy

---

## Relationship to Other Issues

- **#123 (Mock Services)**: Email, Calendar, CRM, Todo, and Expense APIs are needed for 5 tasks
- **#124 (ClawBytes Tasks)**: some overlap on email triage, research, and meeting summary
- **#121 (WildClawBench)**: calendar scheduling and research tasks overlap

cc @olearycrew
