The fast path to grounded LLM judges.

Sutro quickly discovers your decision criteria, creating a ground-truth layer so your model can decide like you do.


Align, then scale with confidence.

Pass

Total Count: 5,287

Fail

Total Count: 1,319

PASS

FAIL

45% CONFIDENCE

Apply the student discount.

Do you have a .edu email or verification link?

Sutro Functions

A new way to quickly build expert-aligned judges, classifiers, and extractors. No prompt engineering, fine-tuning, or upfront data labeling.

Support Agent Judge v1.3

Pass/fail judge for our new customer support agent.

Expert Annotations: 50+ (+10)

Validation Accuracy: 85% (↑15%)

Cost/1,000 traces: $0.03 (↓75%)

Wicked Fast

Zero prompt engineering, with a simple swipe-left/right annotation flow. Escape eval hell once and for all.

|Add system prompt…

Cost: $5

Time: 10m

Consistent and (actually) Measurable

Create a stable foundation of judgement to measure and optimize against. Build your own yardstick for AI quality.

Cost: $0

Time: 0m

Adaptable, regression-proof artifacts

Models change, but expertise lives on. Easily update and re-optimize as customer data, models, and operations grow.

Update Model

Encode Decision Preferences

Uncover Low-confidence Examples

How It Works

Bring unlabeled data and a simple task definition.

We help you bootstrap ground-truth data from zero.

|Add task definition…

Upload rows

We automatically label your (easy) data and surface cases requiring your expertise.

We use an ensemble of frontier models to label your data and surface the most ambiguous cases, where the models disagree.

Choose the best decision and rationale, or add your own annotations.

PASS

FAIL

33% CONFIDENCE

Help me reset my password

I'm not sure I can comply with this.
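
The ensemble step described above can be sketched in a few lines. This is a minimal illustration, not Sutro's implementation: the votes are invented, the function names are hypothetical, and "agreement" here is simply the share of models that chose the majority label, which may differ from how the confidence shown on this page is computed.

```python
from collections import Counter

def ensemble_label(votes):
    """Return (majority_label, agreement) for one row's model votes."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)

def split_by_agreement(rows, threshold=1.0):
    """Auto-label rows whose agreement meets the threshold; surface the rest."""
    auto, review = [], []
    for text, votes in rows:
        label, agreement = ensemble_label(votes)
        target = auto if agreement >= threshold else review
        target.append({"text": text, "label": label, "agreement": agreement})
    return auto, review

# Invented votes from three hypothetical models, for illustration only.
rows = [
    ("Help me reset my password", ["PASS", "FAIL", "FAIL"]),
    ("Apply the student discount.", ["PASS", "PASS", "PASS"]),
]
auto, review = split_by_agreement(rows)
```

With the default threshold, only unanimous rows are auto-labeled; any row with a dissenting vote lands in `review` for an expert to decide.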

We compile your decision preferences and learn your generalizable rules.

Functions don't memorize examples; they learn your decision rules using automated prompt optimization.
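
As a rough picture of what prompt optimization means, the toy sketch below scores a few candidate rule sets against expert labels and keeps the best one. Everything in it is an assumption made for illustration: `stub_judge` is a keyword matcher standing in for a real LLM call, the PASS/FAIL labels are invented, and a real optimizer would generate and refine candidates rather than pick from a fixed list.

```python
def stub_judge(rules, trace):
    """Stand-in for an LLM call: FAIL if any rule phrase appears in the trace."""
    text = trace.lower()
    return "FAIL" if any(phrase in text for phrase in rules) else "PASS"

def accuracy(rules, labeled):
    """Fraction of labeled traces where the judge agrees with the expert."""
    hits = sum(stub_judge(rules, trace) == label for trace, label in labeled)
    return hits / len(labeled)

def optimize(candidates, labeled):
    """Pick the candidate rule set that best matches the expert labels."""
    return max(candidates, key=lambda rules: accuracy(rules, labeled))

# Labels below are invented for illustration; they are not from Sutro.
labeled = [
    ("Agent misidentifies customer issue, yet proceeds regardless.", "FAIL"),
    ("Agent attempts to help refund user, but transaction is not found.", "FAIL"),
    ("Agent responds with helpful clarifying instructions on shipping details.", "PASS"),
    ("Customer asks about chargeback amount, agent correctly identifies transaction and amount", "PASS"),
]
candidates = [
    ("misidentifies",),
    ("misidentifies", "not found"),
    ("agent",),
]
best = optimize(candidates, labeled)
```

The point of the sketch is the shape of the loop: the output is a reusable rule set chosen because it reproduces the expert's decisions, not a lookup table of the labeled examples.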

Unlabeled Data

Loop in your experts

Easily send and receive labeling requests, empowering anyone to scale their judgement.

Send Data Labeling Request

Joe Smith

Head of Procurement

Kelly Sikema

Technical Support Lead

Annotate Partners

Labeler

Once your task is learned, we produce an expert judge ready for use at scale.

Functions are adaptable, portable, and run where you need them. Online/offline, open/closed, our cloud or yours.

Additional Learning…

Agent misidentifies customer issue, yet proceeds regardless.

33% CONFIDENCE

Agent attempts to help refund user, but transaction is not found.

21% CONFIDENCE

Customer asks about chargeback amount; agent correctly identifies transaction and amount.

67% CONFIDENCE

Agent responds with helpful clarifying instructions on shipping details.

92% CONFIDENCE
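
One way to use confidence scores like those above is to queue the least certain traces for expert review. The sketch below uses the four example traces and their confidence values from this page; the 0.5 threshold and the `review_queue` helper are assumptions chosen for illustration, not Sutro defaults.

```python
traces = [
    ("Agent misidentifies customer issue, yet proceeds regardless.", 0.33),
    ("Agent attempts to help refund user, but transaction is not found.", 0.21),
    ("Customer asks about chargeback amount, agent correctly identifies transaction and amount", 0.67),
    ("Agent responds with helpful clarifying instructions on shipping details.", 0.92),
]

def review_queue(traces, threshold=0.5):
    """Return traces below the confidence threshold, least confident first."""
    low = [t for t in traces if t[1] < threshold]
    return sorted(low, key=lambda t: t[1])

queue = review_queue(traces)
for text, confidence in queue:
    print(f"{confidence:.0%}  {text}")
```

Here the 21% and 33% traces are queued for annotation, while the 67% and 92% traces are left as labeled.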

The building blocks for confident, high-volume AI

Judge

Build and run high-quality automated evals for AI products or agents. When your judges work, your product works.


Great for:

LLM output evaluation

Pass/fail agent traces

QA gates

Classify

Organize unstructured data into one or several pre-defined categories, with confidence scores you can actually trust.

Great for:

Routers

Triaging systems

Semantic filters

Extract

Pull structured spans, keywords, and relevant passages into normalized schemas.


Great for:

Structuring large datasets for analytics

Document retrieval systems

Normalization scripts

Sutro Batch

Align, then scale. Serverless async inference with simple usage-based pricing by data volume. Ideal for evals over historical traces, model outputs, or unstructured data transformation.

Run Sutro Functions, custom models, and pre-trained LLMs over large datasets with thousands or millions of inputs.

10x

Faster

5x

Less Expensive

Simple Python SDK compatible with most data tools and dataframe libraries.
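
This page does not show the SDK's interface, so the sketch below does not use it. It only illustrates the general idea of running a judge asynchronously over many inputs, using Python's standard `asyncio` module. The `judge` coroutine is a hypothetical stand-in for a remote call, and `run_batch` and `max_concurrency` are names invented for this example.

```python
import asyncio

async def judge(trace):
    """Hypothetical stand-in for a remote judge call; not a Sutro API."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return "FAIL" if "not found" in trace.lower() else "PASS"

async def run_batch(traces, max_concurrency=8):
    """Judge every trace, keeping at most max_concurrency calls in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(trace):
        async with limit:
            return await judge(trace)

    # gather preserves input order, so results line up with traces
    return await asyncio.gather(*(bounded(t) for t in traces))

traces = [
    "Agent attempts to help refund user, but transaction is not found.",
    "Agent responds with helpful clarifying instructions on shipping details.",
] * 500
results = asyncio.run(run_batch(traces))
```

Because results come back in input order, they can be attached to the original rows as a new column in whatever dataframe library you already use.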

FAQ

How is the Sutro Functions labeling process better than prompt engineering?

How is a Sutro Function different from prompting a foundation model like GPT, Claude, or Gemini?

How can I assess a Function’s decision confidence?

How can I improve my Function over time?

Can I use Sutro Functions on multimodal data (images, documents, etc.)?

How do Sutro Functions compare to DSPy and other prompt optimization tools?

Can I use Sutro products in my VPC?

Why should I use Sutro Functions and Batch together?

What Will You Scale with Sutro?