The fast path to grounded LLM judges.
Sutro quickly discovers your decision criteria, creating a ground-truth layer so your model can decide like you do.
Align, then scale with confidence.
Sutro Functions
A new way to quickly build expert-aligned judges, classifiers, and extractors. No prompt engineering, fine-tuning, or upfront data labeling.

Support Agent Judge v1.3
Pass/fail judge for our new customer support agent.
Wicked Fast
Zero prompt engineering, with a simple swipe-left/right annotation flow. Escape eval hell once and for all.
Consistent and (actually) Measurable
Create a stable foundation of judgement to measure and optimize against. Build your own yardstick for AI quality.
Adaptable, regression-proof artifacts
Models change, but expertise lives on. Easily update and re-optimize as customer data, models, and operations grow.

Update Model
Encode Decision Preferences
Uncover Low-confidence Examples
How It Works
Bring unlabeled data
and a simple task definition.
We help you bootstrap ground-truth data from zero.
Add task definition…
Upload rows
Choose the best decision and rationale,
or add your own annotations.
PASS
FAIL
33% CONFIDENCE
Help me reset my password
I'm not sure I can comply with this.
We compile your decision preferences
and learn your generalizable rules.
Functions don't memorize examples; they learn your decision rules through automated prompt optimization.
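As a rough illustration of the idea, and not Sutro's actual optimizer, the sketch below treats prompt optimization as a search: score each candidate instruction against a few expert-labeled examples and keep the one that agrees with the labels most often. `toy_judge` is a hypothetical stand-in for a model call, and every name and example string here is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (agent response, expert label: "PASS" or "FAIL")


def toy_judge(instruction: str, response: str) -> str:
    """Hypothetical stand-in for an LLM call: fails a response if it
    contains any phrase listed after 'fail if:' in the instruction."""
    _, _, rules = instruction.partition("fail if:")
    phrases = [p.strip().lower() for p in rules.split(",") if p.strip()]
    return "FAIL" if any(p in response.lower() for p in phrases) else "PASS"


def agreement(instruction: str, examples: List[Example],
              judge: Callable[[str, str], str]) -> float:
    """Fraction of labeled examples where the judge matches the expert."""
    hits = sum(judge(instruction, text) == label for text, label in examples)
    return hits / len(examples)


def optimize(candidates: List[str], examples: List[Example],
             judge: Callable[[str, str], str] = toy_judge) -> str:
    """Keep the candidate instruction that agrees with the experts most often."""
    return max(candidates, key=lambda c: agreement(c, examples, judge))


# Illustrative labeled examples (invented for this sketch).
labeled = [
    ("I'm not sure I can comply with this.", "FAIL"),
    ("Here is the link to reset your password.", "PASS"),
    ("Transaction not found, but I issued the refund anyway.", "FAIL"),
]
candidates = [
    "fail if: refund",
    "fail if: not sure, not found",
    "fail if: password",
]
best = optimize(candidates, labeled)
```

A real optimizer would generate and refine candidates rather than choose from a fixed list, but the selection signal is the same: agreement with expert decisions.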
Unlabeled Data
Loop in your experts
Easily send and receive labeling requests, empowering anyone to scale their judgement.
Send Data Labeling Request

Joe Smith
Head of Procurement

Kelly Sikema
Technical Support Lead
Annotate Partners
Labeler
Once your task is learned, we produce an expert judge ready for use at scale.
Functions are adaptable, portable, and run where you need them. Online/offline, open/closed, our cloud or yours.
Additional Learning…
Agent misidentifies customer issue, yet proceeds regardless.
33% CONFIDENCE
Agent attempts to help refund user, but transaction is not found.
21% CONFIDENCE
Customer asks about chargeback amount, agent correctly identifies transaction and amount.
67% CONFIDENCE
Agent responds with helpful clarifying instructions on shipping details.
92% CONFIDENCE
The building blocks for confident, high-volume AI
Judge
Build and run high-quality automated evals for AI products or agents. When your judges work, your product works.
Great for:
LLM output evaluation
Pass/fail agent traces
QA gates
Classify
Organize unstructured data into one or several pre-defined categories, with confidence scores you can actually trust.
Great for:
Routers
Triaging systems
Semantic filters
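One way confidence scores can feed a router or triage system is sketched below: predictions above a threshold go to the queue for their category, and the rest fall back to human review. The `Prediction` type, the `route` helper, and the 0.8 threshold are assumptions made for illustration, not part of any Sutro API.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # one of the pre-defined categories
    confidence: float  # 0.0 to 1.0, as reported by the classifier


def route(pred: Prediction, threshold: float = 0.8) -> str:
    """Return the queue for a prediction; the 0.8 cut-off is an assumed value."""
    if pred.confidence < threshold:
        return "human_review"
    return pred.label


# Illustrative predictions (invented for this sketch).
preds = [
    Prediction("billing", 0.92),
    Prediction("shipping", 0.67),
    Prediction("billing", 0.33),
]
queues = [route(p) for p in preds]
```

The threshold is a tuning knob: raising it sends more items to experts, which costs review time but yields more labeled examples for the uncertain cases.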
Extract
Pull structured spans, keywords, and relevant passages into normalized schemas.
Great for:
Structuring large datasets for analytics
Document retrieval systems
Normalization scripts
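To make "normalized schemas" concrete, here is a minimal sketch of what an extractor's output contract might look like. It uses regular expressions as a stand-in for a learned extractor, and the `ChargebackRecord` fields, the `extract` function, and the sample sentence are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChargebackRecord:
    order_id: Optional[str]      # None when no order reference is found
    amount_cents: Optional[int]  # amounts normalized to integer cents


def extract(text: str) -> ChargebackRecord:
    """Regex stand-in for a learned extractor; fills a fixed schema."""
    order = re.search(r"order\s+#?(\w+)", text, re.IGNORECASE)
    amount = re.search(r"\$(\d+)(?:\.(\d{2}))?", text)
    cents = None
    if amount:
        # Whole dollars plus optional two-digit cents.
        cents = int(amount.group(1)) * 100 + int(amount.group(2) or 0)
    return ChargebackRecord(order.group(1) if order else None, cents)


record = extract("Customer disputes a $42.50 chargeback on order #A1093.")
```

Fixing the schema up front, with explicit `None` for missing fields, is what lets downstream analytics treat every row the same way.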
Sutro Batch
Run Sutro Functions, custom models, and pre-trained LLMs over large datasets with thousands or millions of inputs.
10x
Faster
5x
Less Expensive
Simple Python SDK compatible with most data tools and dataframe libraries.
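The general shape of a batch run can be sketched in plain Python, without assuming anything about the Sutro SDK itself: split the rows into chunks, apply a function to each chunk, and collect the results in order. `batched`, `run_batch`, and `toy_function` are hypothetical names used only for this illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive chunks of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def run_batch(rows: Iterable[str],
              fn: Callable[[List[str]], List[str]],
              size: int = 1000) -> List[str]:
    """Apply `fn` chunk by chunk, preserving input order."""
    results: List[str] = []
    for chunk in batched(rows, size):
        results.extend(fn(chunk))
    return results


def toy_function(chunk: List[str]) -> List[str]:
    """Hypothetical stand-in for a judge: passes rows that contain 'thanks'."""
    return ["PASS" if "thanks" in r.lower() else "FAIL" for r in chunk]


rows = ["Thanks, that fixed it!", "Still broken.", "thanks"]
labels = run_batch(rows, toy_function, size=2)
```

Because `rows` can be any iterable of strings, a dataframe column could be passed in directly under the same pattern.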


