Overview
Additional task ideas to expand PinchBench coverage, organized by category, with grading and mock-service notes for each task.
Email & Communication (4 tasks)
| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| Email Reply Drafting | Check inbox for new emails, draft appropriate replies for each unread message | LLM judge: tone, completeness, appropriateness | Mock Email API |
| Email → Meeting Scheduler | Read an email requesting a meeting, extract details, create calendar event with correct time/attendees | Automated: verify calendar event fields match email content | Mock Email + Calendar API |
| Meeting Summary | Given a meeting transcript, produce structured summary with decisions, action items, open questions, TL;DR | LLM judge: completeness, accuracy of extraction | None (fixture transcript) |
| Contact Lookup | Find contact information for a person/role at a company (e.g., "Find the procurement contact at Acme Corp") | Automated: verify correct contact returned from mock CRM | Mock CRM API |
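
The automated check for Email → Meeting Scheduler could be sketched roughly as follows. This is a minimal illustration, not an existing PinchBench grader: the function names and the event keys (`start`, `duration_minutes`, `attendees`) are assumptions about what a mock Calendar API might return.

```python
from datetime import datetime


def normalize_emails(addresses):
    """Lower-case and strip attendee addresses so formatting differences are ignored."""
    return {address.strip().lower() for address in addresses}


def grade_calendar_event(expected, actual):
    """Compare a created event against the details stated in the email.

    Both arguments are plain dicts with hypothetical keys:
    'start' (ISO 8601 string with a UTC offset), 'duration_minutes', 'attendees'.
    Returns per-field results plus an overall 'passed' flag.
    """
    checks = {
        # Timezone-aware datetimes compare as instants, so 14:00-04:00 equals 18:00+00:00.
        "start": datetime.fromisoformat(expected["start"])
        == datetime.fromisoformat(actual["start"]),
        "duration_minutes": expected["duration_minutes"] == actual["duration_minutes"],
        "attendees": normalize_emails(expected["attendees"])
        == normalize_emails(actual["attendees"]),
    }
    checks["passed"] = all(checks.values())
    return checks
```

Returning per-field results rather than a single boolean makes it easier to see whether an agent got the time right but missed an attendee.
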
Task & Productivity Management (2 tasks)
| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| Todo List Cleanup | Clean up a messy to-do list: merge duplicates (mark-as-completed, never delete), flag overdue items, organize by project | Automated: verify no items deleted, duplicates merged correctly, overdue flagged | Mock Todo API |
| Cron Job Organizer | Given a list of desired scheduled tasks ("backup DB nightly", "send report every Monday 9am"), create properly formatted cron expressions and organize them | Automated: validate cron syntax, verify schedules match descriptions | None (output validation) |
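
For the Cron Job Organizer, the "validate cron syntax" step might look something like the sketch below. It assumes standard five-field expressions with numeric values only; month and weekday names, `@daily`-style shortcuts, and checking that a schedule actually matches its description are out of scope here. All names are hypothetical.

```python
import re

# Allowed numeric range for each of the five standard cron fields.
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 7),  # many cron implementations accept 0 and 7 for Sunday
]

# One list item: "*", "N", or "N-M", optionally followed by "/step".
ITEM_RE = re.compile(r"^(\*|\d+(?:-\d+)?)(?:/(\d+))?$")


def _valid_item(item, lo, hi):
    """Check a single comma-separated item against the field's numeric range."""
    match = ITEM_RE.match(item)
    if not match:
        return False
    base, step = match.group(1), match.group(2)
    if step is not None and int(step) == 0:
        return False  # a step of zero never advances
    if base == "*":
        return True
    bounds = [int(n) for n in base.split("-")]
    if any(n < lo or n > hi for n in bounds):
        return False
    return bounds == sorted(bounds)  # reject reversed ranges such as 5-1


def is_valid_cron(expression):
    """Return True if the expression has five fields and every item is in range."""
    fields = expression.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    return all(
        _valid_item(item, lo, hi)
        for field, (_name, lo, hi) in zip(fields, FIELD_RANGES)
        for item in field.split(",")
    )
```

With this sketch, `is_valid_cron("0 9 * * 1")` (one plausible answer for "send report every Monday 9am") returns `True`, while `is_valid_cron("60 9 * * 1")` returns `False` because minute 60 is out of range.
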
Finance & Business Research (7 tasks)
| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| Executive Lookup | "Who is the CFO of GitLab, Inc as of April 7, 2026?" (point-in-time factual lookup requiring web search) | Automated: exact match on name | None (web search) |
| Earnings Analysis - Margin | "How many basis points did GitLab (GTLB) beat or miss Q3 2025 GAAP gross margin guidance?" (extract from earnings reports, calculate delta) | Automated: exact numeric answer (±tolerance) | None (web search + PDF reading) |
| Earnings Analysis - Sales | "What was Chipotle's (CMG) same-store sales growth in Q4 2025?" (extract specific metric from earnings) | Automated: exact numeric answer | None (web search + PDF reading) |
| Financial Ratio Calculation | "Calculate the inventory turnover ratio for United States Steel Corporation (X) in FY2024" (requires finding financials, applying formula) | Automated: numeric answer within tolerance | None (web search + calculation) |
| Expense Report Submission | Given receipt images/data, organize into expense report with categories, totals, required fields | Automated: verify totals, categories, required fields present | Mock Expense API |
| Finance Report from CSV | Given company financial CSVs, generate a structured report with key metrics, trends, summaries | LLM judge: accuracy of calculations, completeness | None (fixture CSVs) |
| Pricing Research | Compare pricing for a product/service across 3-5 vendors, produce structured comparison | LLM judge: accuracy, completeness, source citations | None (web search) |
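
The numeric tasks above share the same grading shape: compute a reference value, then accept answers within a tolerance. The sketch below illustrates that under stated assumptions. The function names are hypothetical, and inventory turnover is taken here as cost of goods sold divided by average inventory; other definitions exist (for example, using ending inventory), so a real task would need to state which one it expects.

```python
import math


def basis_points_delta(actual_pct, guided_pct):
    """Beat (+) or miss (-) versus guidance in basis points (1 percentage point = 100 bps)."""
    return round((actual_pct - guided_pct) * 100, 6)


def inventory_turnover(cogs, beginning_inventory, ending_inventory):
    """Cost of goods sold divided by average inventory for the period."""
    average_inventory = (beginning_inventory + ending_inventory) / 2
    return cogs / average_inventory


def within_tolerance(answer, expected, abs_tol):
    """Automated check: accept a numeric answer within an absolute tolerance."""
    return math.isclose(answer, expected, rel_tol=0.0, abs_tol=abs_tol)
```

As a purely illustrative example with made-up figures (not real company data), a reported margin of 88.5% against guidance of 88.0% gives `basis_points_delta(88.5, 88.0) == 50.0`, and `within_tolerance(6.04, 6.0, abs_tol=0.05)` accepts a turnover answer of 6.04 against a reference of 6.0.
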
Technology & Product Research (5 tasks)
| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| IT Equipment Procurement | Research and recommend laptops meeting specific requirements (budget, specs, use case), with justification | LLM judge: requirements met, reasoning quality, sources | None (web search) |
| Open Source Alternative Research | "Find open source alternatives to Notion" (research, compare features, recommend with pros/cons) | LLM judge: relevance, completeness, accuracy | None (web search) |
| Competitive Research | Compare 3 competitors in a space (features, pricing, market position), produce structured analysis | LLM judge: accuracy, depth, source quality | None (web search) |
| New Phone Research | "I need a new phone under $800 with great camera" (research current options, recommend with reasoning) | LLM judge: requirements addressed, current info, reasoning | None (web search) |
| BYOK Best Practices | "Put together best practices for BYOK (Bring Your Own Key) for AI inference" (research and compile technical guidance) | LLM judge: accuracy, completeness, source quality | None (web search) |
Regulatory & Compliance (1 task)
| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| EU Regulation Research | Research a specific EU regulation (GDPR, AI Act, etc.), summarize key requirements for a company | LLM judge: accuracy, completeness, actionability | None (web search) |
Image & Document Processing (5 tasks)
| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| OCR - Sheet Music | Extract musical notation or metadata from sheet music image | Automated: verify extracted notes/metadata match ground truth | None (fixture images) |
| OCR - General Document | Extract structured text from a scanned document (receipt, form, etc.) | Automated: compare extracted text to ground truth | None (fixture images) |
| Image Identification | Given tagged images (phone, food, menu, receipt, etc.), correctly identify and describe each | Automated: verify labels match ground truth | None (fixture images) |
| Subway Navigation | Given NYC subway map image + origin/destination, provide directions | LLM judge: correct route, reasonable instructions | None (fixture image) |
| PDF Summary | Summarize a multi-page PDF document (technical, legal, or business) | LLM judge: accuracy, completeness, key points captured | None (fixture PDF) |
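
For the automated OCR and Image Identification checks, one possible approach is a similarity score against the ground-truth text and a simple label accuracy. The sketch below uses Python's standard `difflib`; the function names, the normalization rules, and any pass threshold (say, 0.95) are assumptions that a real task would need to settle.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse whitespace and case so layout differences do not count as errors."""
    return " ".join(text.lower().split())


def text_similarity(extracted, ground_truth):
    """Similarity in [0, 1] between normalized extracted text and the reference."""
    return SequenceMatcher(None, normalize(extracted), normalize(ground_truth)).ratio()


def label_accuracy(predicted, expected):
    """Fraction of images whose predicted label matches the ground-truth label.

    Both arguments map an image name to a label; missing predictions count as wrong.
    """
    if not expected:
        raise ValueError("expected labels must not be empty")
    correct = sum(
        predicted.get(image, "").strip().lower() == label.strip().lower()
        for image, label in expected.items()
    )
    return correct / len(expected)
```

A character-level ratio tolerates small OCR slips without requiring an exact match, which may or may not be the right trade-off for receipts where a single wrong digit matters.
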
Summary by Category
| Category | Count | Mock Services Needed |
|---|---|---|
| Email & Communication | 4 | Email API, Calendar API, CRM API |
| Task & Productivity | 2 | Todo API |
| Finance & Business | 7 | Expense API (1 task); the rest use web search or fixture CSVs |
| Technology Research | 5 | None |
| Regulatory | 1 | None |
| Image & Document | 5 | None |
| Total | 24 | |
Priority Recommendations
High priority (no mock services needed, high value):
1. Executive Lookup: simple factual lookup, automated grading
2. Financial Ratio Calculation: tests math + research
3. Competitive Research: common use case
4. Image Identification: tests vision capabilities
5. PDF Summary: core document understanding
Medium priority (mock services or fixture data needed):
6. Todo List Cleanup: needs mock Todo API
7. Email Reply Drafting: needs mock Email API
8. OCR tasks: need fixture images with ground truth
9. Finance Report from CSV: needs fixture data
Lower priority (complex grading):
10. Meeting Summary: LLM judge, subjective
11. Research tasks: LLM judge, hard to verify accuracy
Relationship to Other Issues
cc @olearycrew