RFC: Additional task ideas - Email, Finance, Research, and More #125

@ScuttleBot

Overview

This RFC proposes 24 additional tasks to expand PinchBench coverage. Tasks are organized by category, with notes on grading and on any mock services each one needs.


Email & Communication (4 tasks)

| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| Email Reply Drafting | Check inbox for new emails, draft appropriate replies for each unread message | LLM judge: tone, completeness, appropriateness | Mock Email API |
| Email → Meeting Scheduler | Read an email requesting a meeting, extract details, create calendar event with correct time/attendees | Automated: verify calendar event fields match email content | Mock Email + Calendar API |
| Meeting Summary | Given a meeting transcript, produce structured summary with decisions, action items, open questions, TL;DR | LLM judge: completeness, accuracy of extraction | None (fixture transcript) |
| Contact Lookup | Find contact information for a person/role at a company (e.g., "Find the procurement contact at Acme Corp") | Automated: verify correct contact returned from mock CRM | Mock CRM API |
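
As a rough illustration of the automated grading for Email → Meeting Scheduler, the sketch below compares the event an agent created against the fields stated in the email. The field names (`title`, `start`, `end`, `attendees`) and the `grade_calendar_event` helper are hypothetical; the mock Calendar API's real schema is not defined in this RFC.

```python
from datetime import datetime


def _parse_time(value):
    """Parse an ISO 8601 timestamp; return None if it is missing or malformed."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def _same_instant(expected, actual):
    """True only when both timestamps parse and denote the same moment."""
    e, a = _parse_time(expected), _parse_time(actual)
    return e is not None and a is not None and e == a


def _emails(values):
    """Normalize attendee addresses so order and letter case do not matter."""
    return {v.strip().lower() for v in values}


def grade_calendar_event(expected: dict, actual: dict) -> dict:
    """Check each field of the created event against the expected values.

    The schema used here is an assumption made for illustration.
    """
    checks = {
        "title": expected["title"].strip().lower()
        == actual.get("title", "").strip().lower(),
        "start": _same_instant(expected["start"], actual.get("start")),
        "end": _same_instant(expected["end"], actual.get("end")),
        "attendees": _emails(expected["attendees"])
        == _emails(actual.get("attendees", [])),
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Comparing timezone-aware timestamps as instants means an agent is not penalized for expressing the same meeting time in a different UTC offset. Returning per-field results makes it easier to see which extraction step failed.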

Task & Productivity Management (2 tasks)

| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| Todo List Cleanup | Clean up a messy to-do list: merge duplicates (mark-as-completed, never delete), flag overdue items, organize by project | Automated: verify no items deleted, duplicates merged correctly, overdue flagged | Mock Todo API |
| Cron Job Organizer | Given a list of desired scheduled tasks ("backup DB nightly", "send report every Monday 9am"), create properly formatted cron expressions and organize them | Automated: validate cron syntax, verify schedules match descriptions | None (output validation) |
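
The "validate cron syntax" step for Cron Job Organizer could look something like the sketch below. It assumes the standard five-field crontab format with numeric values only (no `MON` or `JAN` names, no `@daily` shortcuts); `is_valid_cron` is a hypothetical helper, not an existing PinchBench function.

```python
import re

# (min, max) per field: minute, hour, day of month, month, day of week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]

# One comma-separated part: "*", "N", or "N-M", optionally followed by "/step".
_PART = re.compile(r"(?:\*|(\d+)(?:-(\d+))?)(?:/(\d+))?", re.ASCII)


def _valid_field(field: str, lo: int, hi: int) -> bool:
    """Validate one cron field against its allowed numeric range."""
    for part in field.split(","):
        match = _PART.fullmatch(part)
        if match is None:
            return False
        start, end, step = match.groups()
        if step is not None and int(step) == 0:
            return False  # a step of zero would never advance
        if start is not None:
            first = int(start)
            last = int(end) if end is not None else first
            if not (lo <= first <= last <= hi):
                return False
    return True


def is_valid_cron(expr: str) -> bool:
    """Return True if expr is a syntactically valid five-field cron expression."""
    fields = expr.split()
    return len(fields) == 5 and all(
        _valid_field(field, lo, hi)
        for field, (lo, hi) in zip(fields, FIELD_RANGES)
    )
```

For example, `is_valid_cron("0 9 * * 1")` accepts a plausible answer for "send report every Monday 9am", while `is_valid_cron("0 9 * * 8")` is rejected. This covers syntax only; checking that a schedule matches its description would still need a comparison against expected expressions or expected run times.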

Finance & Business Research (7 tasks)

| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| Executive Lookup | "Who is the CFO of GitLab, Inc as of April 7, 2026?" - Point-in-time factual lookup requiring web search | Automated: exact match on name | None (web search) |
| Earnings Analysis - Margin | "How many basis points did GitLab (GTLB) beat or miss Q3 2025 GAAP gross margin guidance?" - Extract from earnings reports, calculate delta | Automated: exact numeric answer (±tolerance) | None (web search + PDF reading) |
| Earnings Analysis - Sales | "What was Chipotle's (CMG) same-store sales growth in Q4 2025?" - Extract specific metric from earnings | Automated: exact numeric answer | None (web search + PDF reading) |
| Financial Ratio Calculation | "Calculate the inventory turnover ratio for United States Steel Corporation (X) in FY2024" - Requires finding financials, applying formula | Automated: numeric answer within tolerance | None (web search + calculation) |
| Expense Report Submission | Given receipt images/data, organize into expense report with categories, totals, required fields | Automated: verify totals, categories, required fields present | Mock Expense API |
| Finance Report from CSV | Given company financial CSVs, generate a structured report with key metrics, trends, summaries | LLM judge: accuracy of calculations, completeness | None (fixture CSVs) |
| Pricing Research | Compare pricing for a product/service across 3-5 vendors, produce structured comparison | LLM judge: accuracy, completeness, source citations | None (web search) |
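
The numeric tasks above (basis-point delta, inventory turnover) share the same grading shape: compute a reference value, then accept answers within a tolerance. A minimal sketch follows. The function names are hypothetical, the inventory turnover formula uses average inventory (one common definition; some sources divide by ending inventory instead), and the numbers in the usage note are illustrative placeholders, not real company figures.

```python
def basis_point_delta(actual_pct: float, guided_pct: float) -> float:
    """Difference between two percentages in basis points (1 bp = 0.01 point)."""
    return (actual_pct - guided_pct) * 100


def inventory_turnover(
    cogs: float, beginning_inventory: float, ending_inventory: float
) -> float:
    """Cost of goods sold divided by average inventory for the period."""
    average_inventory = (beginning_inventory + ending_inventory) / 2
    return cogs / average_inventory


def within_tolerance(answer: float, expected: float, abs_tol: float) -> bool:
    """Accept an answer whose absolute error is at most abs_tol."""
    return abs(answer - expected) <= abs_tol
```

With placeholder inputs, `basis_point_delta(89.5, 89.0)` gives a 50 bp beat, and `inventory_turnover(1200.0, 180.0, 220.0)` gives 6.0. The tolerance would need to be chosen per task, since rounding conventions in earnings reports vary.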

Technology & Product Research (5 tasks)

| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| IT Equipment Procurement | Research and recommend laptops meeting specific requirements (budget, specs, use case), with justification | LLM judge: requirements met, reasoning quality, sources | None (web search) |
| Open Source Alternative Research | "Find open source alternatives to Notion" - Research, compare features, recommend with pros/cons | LLM judge: relevance, completeness, accuracy | None (web search) |
| Competitive Research | Compare 3 competitors in a space (features, pricing, market position), produce structured analysis | LLM judge: accuracy, depth, source quality | None (web search) |
| New Phone Research | "I need a new phone under $800 with great camera" - Research current options, recommend with reasoning | LLM judge: requirements addressed, current info, reasoning | None (web search) |
| BYOK Best Practices | "Put together best practices for BYOK (Bring Your Own Key) for AI inference" - Research and compile technical guidance | LLM judge: accuracy, completeness, source quality | None (web search) |

Regulatory & Compliance (1 task)

| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| EU Regulation Research | Research a specific EU regulation (GDPR, AI Act, etc.), summarize key requirements for a company | LLM judge: accuracy, completeness, actionability | None (web search) |

Image & Document Processing (5 tasks)

| Task | Description | Grading | Mock Service Needed |
|---|---|---|---|
| OCR - Sheet Music | Extract musical notation or metadata from sheet music image | Automated: verify extracted notes/metadata match ground truth | None (fixture images) |
| OCR - General Document | Extract structured text from a scanned document (receipt, form, etc.) | Automated: compare extracted text to ground truth | None (fixture images) |
| Image Identification | Given tagged images (phone, food, menu, receipt, etc.), correctly identify and describe each | Automated: verify labels match ground truth | None (fixture images) |
| Subway Navigation | Given NYC subway map image + origin/destination, provide directions | LLM judge: correct route, reasonable instructions | None (fixture image) |
| PDF Summary | Summarize a multi-page PDF document (technical, legal, or business) | LLM judge: accuracy, completeness, key points captured | None (fixture PDF) |
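
For the OCR tasks, "compare extracted text to ground truth" probably should not be an exact string match, since whitespace and casing often differ harmlessly. One possible approach, sketched here with Python's standard `difflib`, is a normalized similarity score with a pass threshold. The helper names and the 0.95 default threshold are assumptions for illustration, not values proposed by this RFC.

```python
import difflib


def normalize(text: str) -> str:
    """Lowercase the text and collapse all runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def ocr_similarity(extracted: str, ground_truth: str) -> float:
    """Similarity in [0, 1] between normalized extracted and reference text."""
    matcher = difflib.SequenceMatcher(
        None, normalize(extracted), normalize(ground_truth)
    )
    return matcher.ratio()


def grade_ocr(extracted: str, ground_truth: str, threshold: float = 0.95) -> bool:
    """Pass when similarity meets the threshold (0.95 is an arbitrary default)."""
    return ocr_similarity(extracted, ground_truth) >= threshold
```

A character-level ratio like this is forgiving of small recognition errors. Structured fields such as receipt totals may still warrant exact checks on top of it.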

Summary by Category

| Category | Count | Mock Services Needed |
|---|---|---|
| Email & Communication | 4 | Email API, Calendar API, CRM API |
| Task & Productivity | 2 | Todo API |
| Finance & Business | 7 | Expense API (1 task), rest use web search |
| Technology Research | 5 | None |
| Regulatory | 1 | None |
| Image & Document | 5 | None |
| Total | 24 | |

Priority Recommendations

High priority (no mock services needed, high value):

1. Executive Lookup: simple factual, automated grading
2. Financial Ratio Calculation: tests math + research
3. Competitive Research: common use case
4. Image Identification: tests vision capabilities
5. PDF Summary: core document understanding

Medium priority (fixture data or mock services needed):

6. Todo List Cleanup: needs mock Todo API
7. Email Reply Drafting: needs mock Email API
8. OCR tasks: need fixture images with ground truth
9. Finance Report from CSV: needs fixture data

Lower priority (complex grading):

10. Meeting Summary: LLM judge, subjective
11. Research tasks: LLM judge, hard to verify accuracy


Relationship to Other Issues

cc @olearycrew
