# Introduction
Source: https://docs.zeroeval.com/autotune/introduction
Run evaluations on models and prompts to find the best variants for your agents
Prompt optimization takes a different approach from traditional evals. Instead of requiring you to set up complex eval pipelines, ZeroEval ingests your production traces and lets you optimize your prompts based on the feedback you provide.
## How it works
1. Replace hardcoded prompts with `ze.prompt()` calls in Python or `ze.prompt({...})` in TypeScript
2. Each time you modify your prompt content, a new version is automatically created and tracked
3. ZeroEval automatically tracks all LLM interactions and their outcomes
4. Use the UI to run experiments, vote on outputs, and identify the best prompt/model combinations
5. Winning configurations are automatically deployed to your application without code changes
# Models
Source: https://docs.zeroeval.com/autotune/prompts/models
Evaluate your agent's performance across multiple models
ZeroEval lets you evaluate real production traces of specific agent tasks across different models, then ranks the models over time. This helps you pick the best model for each part of your agent.
# Prompts
Source: https://docs.zeroeval.com/autotune/prompts/prompts
Use feedback on production traces to generate and validate better prompts
ZeroEval derives prompt optimization suggestions directly from feedback on your production traces. By capturing preferences and correctness signals, we provide concrete prompt edits you can test and use for your agents.
## Submitting Feedback
Feedback is the foundation of prompt optimization. You can submit feedback for completions through the ZeroEval dashboard, the Python SDK, or the public API. Feedback helps ZeroEval understand what good and bad outputs look like for your specific use case.
### Feedback through the dashboard
The easiest way to provide feedback is through the ZeroEval dashboard. Navigate to your task's "Suggestions" tab, review incoming completions, and provide thumbs up/down feedback with optional reasons and expected outputs.
### Feedback through the SDK
For programmatic feedback submission, use the Python or TypeScript SDK. This is useful when you have automated evaluation systems or want to collect feedback from your application in production.
```python Python theme={null}
import zeroeval as ze
ze.init()
# Send feedback for a specific completion
ze.send_feedback(
prompt_slug="support-bot",
completion_id="550e8400-e29b-41d4-a716-446655440000",
thumbs_up=False,
reason="Response was too verbose",
expected_output="A concise 2-3 sentence response"
)
```
```typescript TypeScript theme={null}
import * as ze from 'zeroeval';
ze.init();
// Send feedback for a specific completion
await ze.sendFeedback({
promptSlug: "support-bot",
completionId: "550e8400-e29b-41d4-a716-446655440000",
thumbsUp: false,
reason: "Response was too verbose",
expectedOutput: "A concise 2-3 sentence response"
});
```
#### Parameters
| Python | TypeScript | Type | Required | Description |
| ----------------- | ---------------- | ---------------- | -------- | ------------------------------------------------------------ |
| `prompt_slug` | `promptSlug` | `str`/`string` | Yes | The slug/name of your prompt (same as used in `ze.prompt()`) |
| `completion_id` | `completionId` | `str`/`string` | Yes | The UUID of the completion to provide feedback on |
| `thumbs_up` | `thumbsUp` | `bool`/`boolean` | Yes | `True`/`true` for positive, `False`/`false` for negative |
| `reason` | `reason` | `str`/`string` | No | Optional explanation of why you gave this feedback |
| `expected_output` | `expectedOutput` | `str`/`string` | No | Optional description of what the expected output should be |
| `metadata` | `metadata` | `dict`/`object` | No | Optional additional metadata to attach to the feedback |
The `completion_id` is automatically tracked when you use `ze.prompt()` with automatic tracing enabled. You can access it from the OpenAI response object's `id` field, or retrieve it from your traces in the dashboard.
#### Complete example with feedback
```python Python theme={null}
import zeroeval as ze
from openai import OpenAI
ze.init()
client = OpenAI()
# Define your prompt - ZeroEval will automatically use the latest optimized
# version from your dashboard if one exists, falling back to this content
system_prompt = ze.prompt(
name="support-bot",
content="You are a helpful customer support agent."
)
# Make a completion
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": "How do I reset my password?"}
]
)
# Get the completion ID and text
completion_id = response.id
completion_text = response.choices[0].message.content
# Evaluate the response (manually or automatically)
is_good_response = evaluate_response(completion_text)  # your own evaluation logic
# Send feedback based on evaluation
ze.send_feedback(
prompt_slug="support-bot",
completion_id=completion_id,
thumbs_up=is_good_response,
reason="Clear step-by-step instructions" if is_good_response else "Missing link to reset page",
expected_output=None if is_good_response else "Should include direct link: https://app.example.com/reset"
)
```
```typescript TypeScript theme={null}
import * as ze from 'zeroeval';
import { OpenAI } from 'openai';
ze.init();
const client = ze.wrap(new OpenAI());
// Define your prompt - ZeroEval will automatically use the latest optimized
// version from your dashboard if one exists, falling back to this content
const systemPrompt = await ze.prompt({
name: "support-bot",
content: "You are a helpful customer support agent."
});
// Make a completion
const response = await client.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: "How do I reset my password?" }
]
});
// Get the completion ID and text
const completionId = response.id;
const completionText = response.choices[0].message.content;
// Evaluate the response (manually or automatically)
const isGoodResponse = evaluateResponse(completionText);  // your own evaluation logic
// Send feedback based on evaluation
await ze.sendFeedback({
promptSlug: "support-bot",
completionId: completionId,
thumbsUp: isGoodResponse,
reason: isGoodResponse ? "Clear step-by-step instructions" : "Missing link to reset page",
expectedOutput: isGoodResponse ? undefined : "Should include direct link: https://app.example.com/reset"
});
```
**Auto-optimization**: When you use `ze.prompt()` with `content`, ZeroEval automatically fetches the latest optimized version from your dashboard if one exists. Your `content` serves as a fallback for initial setup. This means your prompts improve automatically as you tune them, without any code changes.
If you need to test the hardcoded content specifically (e.g., for debugging or A/B testing), use `from_="explicit"` (Python) or `from: "explicit"` (TypeScript):
```python Python theme={null}
# Bypass auto-optimization and always use this exact content
prompt = ze.prompt(
name="support-bot",
from_="explicit",
content="You are a helpful customer support agent."
)
```
```typescript TypeScript theme={null}
// Bypass auto-optimization and always use this exact content
const prompt = await ze.prompt({
name: "support-bot",
from: "explicit",
content: "You are a helpful customer support agent."
});
```
### Feedback through the API
For integration from any language or for direct API access, you can submit feedback using the public HTTP API.
#### Endpoint
```
POST /v1/prompts/{prompt_slug}/completions/{completion_id}/feedback
```
#### Authentication
Requires API key authentication via the `Authorization` header:
```
Authorization: Bearer YOUR_API_KEY
```
#### Request body
```json theme={null}
{
"thumbs_up": false,
"reason": "Response was inaccurate",
"expected_output": "The correct answer should mention X, Y, and Z",
"metadata": {
"evaluated_by": "automated_system",
"evaluation_score": 0.45
}
}
```
#### Response
```json theme={null}
{
"id": "fb123e45-67f8-90ab-cdef-1234567890ab",
"completion_id": "550e8400-e29b-41d4-a716-446655440000",
"prompt_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"prompt_version_id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
"project_id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
"thumbs_up": false,
"reason": "Response was inaccurate",
"expected_output": "The correct answer should mention X, Y, and Z",
"metadata": {
"evaluated_by": "automated_system",
"evaluation_score": 0.45
},
"created_by": "user_id",
"created_at": "2025-11-22T10:30:00Z",
"updated_at": "2025-11-22T10:30:00Z"
}
```
#### Example with cURL
```bash theme={null}
curl -X POST https://api.zeroeval.com/v1/prompts/support-bot/completions/550e8400-e29b-41d4-a716-446655440000/feedback \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"thumbs_up": false,
"reason": "Response was too vague",
"expected_output": "Should provide specific steps",
"metadata": {
"user_satisfaction": "low"
}
}'
```
If feedback already exists for the same completion from the same user, it will be updated with the new values. This allows you to correct or refine feedback as needed.
## Prompt optimizations from feedback
Once you've collected enough feedback on the incoming traffic for a task, you can generate prompt optimizations from that feedback by clicking the "Optimize Prompt" button in the task's "Suggestions" tab.
Once you've generated a new prompt, you can test it with various models and see how it performs against the feedback you've already given.
# Reference
Source: https://docs.zeroeval.com/autotune/reference
Parameters and configuration for ze.prompt
`ze.prompt` creates or fetches versioned prompts from the Prompt Library and returns decorated content for downstream LLM calls.
**TypeScript differences**: In TypeScript, `ze.prompt()` is an async function that returns `Promise<string>`. Parameters use camelCase and are passed as an options object: `ze.prompt({ name: "...", content: "..." })`.
## Parameters
| Python | TypeScript | Type | Required | Default | Description |
| ----------- | ----------- | ----------- | -------- | ------------------ | ---------------------------------------------------------- |
| `name` | `name` | string | yes | — | Task name associated with the prompt in the library |
| `content` | `content` | string | no | `None`/`undefined` | Raw prompt content to ensure/create a version by content |
| `from_` | `from` | string | no | `None`/`undefined` | Either `"latest"`, `"explicit"`, or a 64‑char SHA‑256 hash |
| `variables` | `variables` | dict/object | no | `None`/`undefined` | Template variables to render `{{variable}}` tokens |
Notes:
* In Python, use `from_` (with underscore) as `from` is a reserved keyword. TypeScript uses `from` directly.
* Exactly one of `content` or `from` must be provided (except when using `from: "explicit"` with `content`).
* `from="latest"` fetches the latest version bound to the task; otherwise `from` must be a 64‑char hex SHA‑256 hash.
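To pin a version by hash, you need the 64-char hex SHA-256 digest of its content. The sketch below shows one way to compute it with the standard library; note that the SDK hashes *normalized* content, and the exact normalization is an assumption here (leading/trailing whitespace stripped), so verify the digest against the version's hash shown in your dashboard:

```python
import hashlib

def content_hash(content: str) -> str:
    """Compute a 64-char hex SHA-256 digest of prompt content.

    NOTE: the SDK normalizes content before hashing; stripping
    surrounding whitespace is an assumption, not the documented
    normalization.
    """
    normalized = content.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

digest = content_hash("You are a helpful assistant for {{product}}.")
print(len(digest))  # 64
```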
## Behavior
* **content provided**: Computes a normalized SHA‑256 hash, ensures a prompt version exists for `name`, and returns decorated content.
* **from="latest"**: Fetches the latest version for `name` and returns decorated content.
* **from=`<sha-256 hash>`**: Fetches by content hash for `name` and returns decorated content.
Decoration adds a compact metadata header used by integrations:
* `task`, `prompt_slug`, `prompt_version`, `prompt_version_id`, `variables`, and (when created by content) `content_hash`.
OpenAI integration: when `prompt_version_id` is present, the SDK will automatically patch the `model` parameter to the model bound to that prompt version.
## Return Value
* **Python**: `str` - Decorated prompt content ready to pass into LLM clients.
* **TypeScript**: `Promise<string>` - Async function resolving to decorated prompt content.
## Errors
| Python | TypeScript | When |
| --------------------- | --------------------- | -------------------------------------------------------------------------------------- |
| `ValueError` | `Error` | Both `content` and `from` provided (except explicit), or neither; invalid `from` value |
| `PromptRequestError` | `PromptRequestError` | `from="latest"` but no versions exist for `name` |
| `PromptNotFoundError` | `PromptNotFoundError` | `from` is a hash that does not exist for `name` |
## Examples
```python Python theme={null}
import zeroeval as ze
# Create/ensure a version by content
system = ze.prompt(
name="support-triage",
content="You are a helpful assistant for {{product}}.",
variables={"product": "Acme"},
)
# Fetch the latest version for this task
system = ze.prompt(name="support-triage", from_="latest")
# Fetch a specific version by content hash
system = ze.prompt(name="support-triage", from_="c6a7...deadbeef...0123")
```
```typescript TypeScript theme={null}
import * as ze from 'zeroeval';
// Create/ensure a version by content
const system = await ze.prompt({
name: "support-triage",
content: "You are a helpful assistant for {{product}}.",
variables: { product: "Acme" },
});
// Fetch the latest version for this task
const system = await ze.prompt({ name: "support-triage", from: "latest" });
// Fetch a specific version by content hash
const system = await ze.prompt({ name: "support-triage", from: "c6a7...deadbeef...0123" });
```
# Setup
Source: https://docs.zeroeval.com/autotune/setup
Getting started with autotune
ZeroEval's autotune feature allows you to continuously improve your prompts and automatically deploy the best-performing models. The setup is simple and powerful.
## Getting started (\<5 mins)
Replace hardcoded prompts with `ze.prompt()` and include the name of the specific part of your agent that you want to tune.
```python Python theme={null}
# Before
prompt = "You are a helpful assistant"
# After - with autotune
prompt = ze.prompt(
name="assistant",
content="You are a helpful assistant"
)
```
```typescript TypeScript theme={null}
// Before
const prompt = "You are a helpful assistant";
// After - with autotune
const prompt = await ze.prompt({
name: "assistant",
content: "You are a helpful assistant"
});
```
That's it! You'll start seeing production traces in your dashboard for this specific task at [`ZeroEval › Prompts › [task_name]`](https://app.zeroeval.com).
**Auto-tune behavior:** When you provide `content`, ZeroEval automatically uses the latest optimized version from your dashboard if one exists. The `content` parameter serves as a fallback for when no optimized versions are available yet. This means you can hardcode a default prompt in your code, but ZeroEval will seamlessly swap in tuned versions without any code changes.
To explicitly use the hardcoded content and bypass auto-optimization, use `from_="explicit"` (Python) or `from: "explicit"` (TypeScript):
```python Python theme={null}
prompt = ze.prompt(
name="assistant",
from_="explicit",
content="You are a helpful assistant"
)
```
```typescript TypeScript theme={null}
const prompt = await ze.prompt({
name: "assistant",
from: "explicit",
content: "You are a helpful assistant"
});
```
## Pushing models to production
Once you see a model that performs well in the dashboard, you can send it to production with a single click. From then on, the model you specify in code is replaced automatically whenever you use the prompt from `ze.prompt()`, as seen below.
```python Python theme={null}
# You write this
response = client.chat.completions.create(
model="gpt-4", # ← Gets replaced!
messages=[{"role": "system", "content": prompt}]
)
```
```typescript TypeScript theme={null}
// You write this
const response = await openai.chat.completions.create({
model: "gpt-4", // ← Gets replaced!
messages: [{ role: "system", content: prompt }]
});
```
## Example
Here's autotune in action for a simple customer support bot:
```python Python theme={null}
import zeroeval as ze
from openai import OpenAI
ze.init()
client = OpenAI()
# Define your prompt with version tracking
system_prompt = ze.prompt(
name="support-bot",
content="""You are a customer support agent for {{company}}.
Be helpful, concise, and professional.""",
variables={"company": "TechCorp"}
)
# Use it normally - model gets patched automatically
response = client.chat.completions.create(
model="gpt-4", # This might run claude-3-sonnet in production!
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": "I need help with my order"}
]
)
```
```typescript TypeScript theme={null}
import * as ze from 'zeroeval';
import { OpenAI } from 'openai';
ze.init();
const client = ze.wrap(new OpenAI());
// Define your prompt with version tracking
const systemPrompt = await ze.prompt({
name: "support-bot",
content: `You are a customer support agent for {{company}}.
Be helpful, concise, and professional.`,
variables: { company: "TechCorp" }
});
// Use it normally - model gets patched automatically
const response = await client.chat.completions.create({
model: "gpt-4", // This might run claude-3-sonnet in production!
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: "I need help with my order" }
]
});
```
## Understanding Prompt Versions
ZeroEval automatically manages prompt versions for you. When you use `ze.prompt()` with `content`, the SDK will:
1. **Check for optimized versions**: First, it tries to fetch the latest optimized version from your dashboard
2. **Fall back to your content**: If no optimized versions exist yet, it uses the `content` you provided
3. **Create a version**: Your provided content is stored as the initial version for this task
This means you get the best of both worlds: hardcoded fallback prompts in your code, with automatic optimization in production.
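The resolution order above can be sketched as follows. This is a hypothetical illustration of the behavior, not the SDK's actual implementation; `fetch_latest` stands in for the dashboard lookup:

```python
# Hypothetical sketch of the version-resolution order described above.
def resolve_prompt(name: str, content, fetch_latest) -> str:
    """Return the latest optimized version if one exists; otherwise
    fall back to the hardcoded content (stored as the initial version)."""
    latest = fetch_latest(name)  # stand-in: returns None if no versions exist
    if latest is not None:
        return latest            # 1. optimized version wins
    if content is None:
        raise ValueError("no versions exist and no fallback content given")
    return content               # 2-3. fallback content, stored as v1

# With no optimized version yet, the hardcoded content is used:
assert resolve_prompt("assistant", "You are helpful.", lambda n: None) == "You are helpful."
```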
```python Python theme={null}
# This will use the latest optimized version if one exists in your dashboard
# Otherwise, it uses the content you provide here
prompt = ze.prompt(
name="customer-support",
content="You are a helpful assistant."
)
```
```typescript TypeScript theme={null}
// This will use the latest optimized version if one exists in your dashboard
// Otherwise, it uses the content you provide here
const prompt = await ze.prompt({
name: "customer-support",
content: "You are a helpful assistant."
});
```
### Explicit version control
If you need more control over which version to use:
```python Python theme={null}
# Always use the latest optimized version (fails if none exists)
prompt = ze.prompt(
name="customer-support",
from_="latest"
)
# Always use the hardcoded content (bypass auto-optimization)
prompt = ze.prompt(
name="customer-support",
from_="explicit",
content="You are a helpful assistant."
)
# Use a specific version by its content hash
prompt = ze.prompt(
name="customer-support",
from_="a1b2c3d4..." # 64-character SHA-256 hash
)
```
```typescript TypeScript theme={null}
// Always use the latest optimized version (fails if none exists)
const prompt = await ze.prompt({
name: "customer-support",
from: "latest"
});
// Always use the hardcoded content (bypass auto-optimization)
const prompt = await ze.prompt({
name: "customer-support",
from: "explicit",
content: "You are a helpful assistant."
});
// Use a specific version by its content hash
const prompt = await ze.prompt({
name: "customer-support",
from: "a1b2c3d4..." // 64-character SHA-256 hash
});
```
### When to use each mode
| Mode | Use Case | Behavior |
| ----------------------------------------------------- | --------------------------------------------------- | ----------------------------------- |
| `content` only | **Recommended for most cases** | Auto-optimization with fallback |
| `from_="explicit"` (Python) / `from: "explicit"` (TS) | Testing, debugging, or A/B testing specific prompts | Always use hardcoded content |
| `from_="latest"` (Python) / `from: "latest"` (TS) | Production where optimization is required | Fail if no optimized version exists |
| `from_="<hash>"` (Python) / `from: "<hash>"` (TS) | Pinning to specific tested versions | Use exact version by 64-char SHA-256 hash |
**Best practice**: Use `content` parameter alone for local development and production. ZeroEval will automatically use optimized versions when available. Only use `from_="explicit"` (Python) or `from: "explicit"` (TypeScript) when you specifically need to test or debug the hardcoded content.
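One way to apply this best practice is to select the mode per environment. The helper below is a hypothetical pattern, not part of the SDK; the `APP_ENV` variable is an assumption you would adapt to your own configuration:

```python
import os

# Hypothetical helper: choose ze.prompt() arguments per environment.
# APP_ENV is an assumed variable; adapt to your own config.
def prompt_kwargs(name: str, fallback: str) -> dict:
    if os.getenv("APP_ENV", "production") == "debug":
        # Debugging: always test the hardcoded content
        return {"name": name, "from_": "explicit", "content": fallback}
    # Recommended default: auto-optimization with hardcoded fallback
    return {"name": name, "content": fallback}

kwargs = prompt_kwargs("customer-support", "You are a helpful assistant.")
# prompt = ze.prompt(**kwargs)
```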
# Introduction
Source: https://docs.zeroeval.com/judges/introduction
Continuously evaluate your production traffic with judges that learn over time
Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score outputs according to criteria you define. They get better over time the more you refine and correct their evaluations.
## When to use
Use a judge when you want consistent, scalable evaluation of:
* Hallucinations, safety/policy violations
* Response quality (helpfulness, tone, structure)
* Latency, cost, and error patterns tied to specific criteria
# Multimodal Evaluation
Source: https://docs.zeroeval.com/judges/multimodal-evaluation
Evaluate screenshots and images with LLM judges
LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.
## How it works
1. **Attach images to spans** using SDK methods or structured output data
2. **Images are uploaded** during span ingestion (base64 data is stripped from the span)
3. **Judges fetch images** when evaluating the span and send them to a vision-capable LLM
4. **Evaluation results** appear in the dashboard like any other judge evaluation
The LLM sees both the span's text data (input/output) and any attached images, giving it full context for evaluation.
## Attaching images to spans
There are two ways to attach images to spans, depending on your workflow.
### Option 1: SDK helper methods
The SDK provides `add_screenshot()` and `add_image()` methods for attaching images with metadata.
**Screenshots with viewport context**
For browser agents or responsive testing, use `add_screenshot()` to capture different viewports:
```python theme={null}
import zeroeval as ze
with ze.span(name="homepage_test", tags={"has_screenshots": "true"}) as span:
# Desktop viewport
span.add_screenshot(
base64_data=desktop_base64,
viewport="desktop",
width=1920,
height=1080,
label="Homepage - Desktop"
)
# Mobile viewport
span.add_screenshot(
base64_data=mobile_base64,
viewport="mobile",
width=375,
height=812,
label="Homepage - Mobile"
)
span.set_io(
input_data="Load homepage and capture screenshots",
output_data="Captured 2 viewport screenshots"
)
```
**Generic images**
For charts, diagrams, or UI component states, use `add_image()`:
```python theme={null}
with ze.span(name="button_hover_test") as span:
span.add_image(
base64_data=before_hover_base64,
label="Button - Default State"
)
span.add_image(
base64_data=after_hover_base64,
label="Button - Hover State"
)
span.set_io(
input_data="Test button hover interaction",
output_data="Button changes color on hover"
)
```
### Option 2: Structured output\_data
If your workflow already produces screenshot data as structured output (common with browser automation agents), you can include images directly in the span's `output_data`. ZeroEval automatically detects and extracts images from JSON arrays containing `base64` fields.
```python theme={null}
import zeroeval as ze
with ze.span(
name="screenshot_capture",
kind="llm",
tags={"has_screenshots": "true", "screenshot_count": "2"}
) as span:
# Set input as conversation messages
span.input_data = [
{
"role": "system",
"content": "You are a screenshot capture service."
},
{
"role": "user",
"content": "Navigate to the homepage and capture screenshots"
}
]
# Set output as array of screenshot objects with base64 data
span.output_data = [
{
"viewport": "mobile",
"width": 768,
"height": 1024,
"base64": mobile_screenshot_base64
},
{
"viewport": "desktop",
"width": 1920,
"height": 1080,
"base64": desktop_screenshot_base64
}
]
```
When ZeroEval ingests this span, it:
1. Extracts each object with a `base64` field as an attachment
2. Uploads the images to storage
3. Strips the base64 data from `output_data` to keep the database lean
4. Preserves the metadata (viewport, width, height) for display
This approach works well when your browser agent or automation tool already produces structured screenshot output.
Both methods produce the same result: images stored and available for multimodal judge evaluation. Choose whichever fits your workflow better.
## Creating a multimodal judge
Multimodal judges work like regular judges, but with criteria that reference attached images. The judge prompt should describe what to look for in the visual content.
### Example: UI consistency judge
```
Evaluate whether the UI renders correctly across viewports.
Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports
Score 1 if all viewports render correctly, 0 if there are visual issues.
```
### Example: Brand compliance judge
```
Check if the page follows brand guidelines.
Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace
Score 1 for full compliance, 0 for violations.
```
### Example: Accessibility judge
```
Evaluate visual accessibility of the interface.
Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances
Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
```
## Filtering spans for multimodal evaluation
Use tags to identify which spans should be evaluated by your multimodal judge:
```python theme={null}
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
span.add_screenshot(...)
```
Then configure your judge to only evaluate spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn't apply.
## Supported image formats
* JPEG
* PNG
* WebP
* GIF
Images are validated during ingestion. The maximum size is 10MB per image, with up to 5 images per span.
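If you want to fail fast before ingestion, you can mirror these limits client-side. The helper below is an illustrative sketch, not part of the SDK:

```python
import base64

MAX_BYTES = 10 * 1024 * 1024  # 10MB per image
MAX_IMAGES = 5                # images per span

def validate_attachments(base64_images: list) -> None:
    """Client-side pre-flight check mirroring the stated ingestion
    limits. Illustrative helper, not part of the SDK."""
    if len(base64_images) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per span")
    for i, data in enumerate(base64_images):
        size = len(base64.b64decode(data))
        if size > MAX_BYTES:
            raise ValueError(f"image {i} is {size} bytes, over the 10MB limit")
```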
## Viewing images in the dashboard
Screenshots appear in two places:
1. **Span details view** - Images show in the Data tab with viewport labels and dimensions
2. **Judge evaluation modal** - When reviewing an evaluation, you'll see the images the judge analyzed
Images display with their labels, viewport type (for screenshots), and dimensions when available.
## Model support
Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.
Multimodal evaluation works best with specific, measurable criteria. Vague prompts like "does this look good?" will produce inconsistent results. Be explicit about what visual properties to check.
# Pulling Evaluations
Source: https://docs.zeroeval.com/judges/pull-evaluations
Retrieve judge evaluations via SDK or REST API
Retrieve judge evaluations programmatically for reporting, analysis, or integration into your own workflows.
## Finding your IDs
Before making API calls, you'll need these identifiers:
| ID | Where to find it |
| -------------- | --------------------------------------------------------------------------- |
| **Project ID** | Settings → Project, or in any URL after `/projects/` |
| **Judge ID** | Click a judge in the dashboard; the ID is in the URL (`/judges/{judge_id}`) |
| **Span ID** | In trace details, or returned by your instrumentation code |
## Python SDK
### Get available criteria for a judge
Use this before submitting criterion-level feedback to discover valid criterion keys.
```python theme={null}
import zeroeval as ze
ze.init(api_key="YOUR_API_KEY")
criteria = ze.get_judge_criteria(
project_id="your-project-id",
judge_id="your-judge-id",
)
print(criteria["evaluation_type"])
for criterion in criteria["criteria"]:
print(criterion["key"], criterion.get("description"))
```
### Get evaluations by judge
Fetch all evaluations for a specific judge with pagination and optional filters.
```python theme={null}
import zeroeval as ze
ze.init(api_key="YOUR_API_KEY")
response = ze.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=100,
offset=0,
)
print(f"Total: {response['total']}")
for evaluation in response["evaluations"]:
    print(f"Span: {evaluation['span_id']}")
    print(f"Result: {'PASS' if evaluation['evaluation_result'] else 'FAIL'}")
    print(f"Score: {evaluation.get('score')}")  # For scored judges
    print(f"Reason: {evaluation['evaluation_reason']}")
```
**Optional filters:**
```python theme={null}
response = ze.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=100,
offset=0,
start_date="2025-01-01T00:00:00Z",
end_date="2025-01-31T23:59:59Z",
evaluation_result=True, # Only passing evaluations
feedback_state="with_user_feedback", # Only calibrated items
)
```
### Get evaluations by span
Fetch all judge evaluations for a specific span (useful when a span has been evaluated by multiple judges).
```python theme={null}
response = ze.get_span_evaluations(
project_id="your-project-id",
span_id="your-span-id",
)
for evaluation in response["evaluations"]:
    print(f"Judge: {evaluation['judge_name']}")
    print(f"Result: {'PASS' if evaluation['evaluation_result'] else 'FAIL'}")
    if evaluation.get('evaluation_type') == 'scored':
        print(f"Score: {evaluation['score']} / {evaluation['score_max']}")
```
## REST API
Use these endpoints directly with your API key in the `Authorization` header.
### Get available criteria for a judge
```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/criteria" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY"
```
### Get evaluations by judge
```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/evaluations?limit=100&offset=0" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY"
```
**Query parameters:**
| Parameter | Type | Description |
| ------------------- | ------ | ----------------------------------------------- |
| `limit` | int | Results per page (1-500, default 100) |
| `offset` | int | Pagination offset (default 0) |
| `start_date` | string | Filter by date (ISO 8601) |
| `end_date` | string | Filter by date (ISO 8601) |
| `evaluation_result` | bool | `true` for passing, `false` for failing |
| `feedback_state` | string | `with_user_feedback` or `without_user_feedback` |
### Get evaluations by span
```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/spans/{span_id}/evaluations" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY"
```
## Response format
### Judge evaluations response
```json theme={null}
{
"evaluations": [...],
"total": 142,
"limit": 100,
"offset": 0
}
```
### Judge criteria response
```json theme={null}
{
"judge_id": "judge-uuid",
"evaluation_type": "scored",
"score_min": 0,
"score_max": 5,
"pass_threshold": 3.5,
"criteria": [
{
"key": "CTA_text",
"label": "CTA_text",
"description": "CTA clarity and visibility"
}
]
}
```
### Span evaluations response
```json theme={null}
{
"span_id": "abc-123",
"evaluations": [...]
}
```
### Evaluation object
| Field | Type | Description |
| ------------------- | ------------- | ---------------------------------- |
| `id` | string | Unique evaluation ID |
| `span_id` | string | The evaluated span |
| `evaluation_result` | bool | Pass (`true`) or fail (`false`) |
| `evaluation_reason` | string | Judge's reasoning |
| `confidence_score` | float | Model confidence (0-1) |
| `score` | float \| null | Numeric score (scored judges only) |
| `score_min` | float \| null | Minimum possible score |
| `score_max` | float \| null | Maximum possible score |
| `pass_threshold` | float \| null | Score required to pass |
| `model_used` | string | LLM model that ran the evaluation |
| `created_at` | string | ISO 8601 timestamp |
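With these fields, you can aggregate fetched evaluations into simple report metrics. The helper below is an illustrative sketch (not part of the SDK) operating on evaluation objects shaped as above:

```python
def summarize(evaluations: list) -> dict:
    """Aggregate evaluation objects into a pass rate and average score.
    Illustrative helper, not part of the SDK."""
    if not evaluations:
        return {"pass_rate": None, "avg_score": None}
    passed = sum(1 for e in evaluations if e["evaluation_result"])
    scores = [e["score"] for e in evaluations if e.get("score") is not None]
    return {
        "pass_rate": passed / len(evaluations),
        "avg_score": sum(scores) / len(scores) if scores else None,
    }

sample = [
    {"evaluation_result": True, "score": 4.0},
    {"evaluation_result": False, "score": 2.0},
]
print(summarize(sample))  # {'pass_rate': 0.5, 'avg_score': 3.0}
```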
## Pagination example
For large result sets, paginate through all evaluations:
```python theme={null}
all_evaluations = []
offset = 0
limit = 100
while True:
response = ze.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=limit,
offset=offset,
)
all_evaluations.extend(response["evaluations"])
if len(response["evaluations"]) < limit:
break
offset += limit
print(f"Fetched {len(all_evaluations)} total evaluations")
```
## Related
* [Submitting Feedback](/judges/submit-feedback) - Programmatically submit feedback for judge evaluations
# Setup
Source: https://docs.zeroeval.com/judges/setup
Create and calibrate an AI judge in minutes
## Creating a judge (\<5 mins)
1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/judges).
2. Specify the criteria that you want to evaluate from your production traffic.
3. Tweak the prompt of the judge until it matches what you are looking for!
That's it! Historical and future traces will be scored automatically and shown in the dashboard.
## Calibrating your judge
For each evaluated item, you can mark the judge's verdict as correct or incorrect. This feedback is stored automatically and used to improve the judge over time.
# Submitting Feedback
Source: https://docs.zeroeval.com/judges/submit-feedback
Programmatically submit feedback for judge evaluations via SDK
## Overview
When calibrating judges, you can submit feedback programmatically using the SDK.
This is useful for:
* Bulk feedback submission from automated pipelines
* Integration with custom review workflows
* Syncing feedback from external labeling tools
Your existing `send_feedback` integrations remain valid. Criterion-level feedback is an optional extension for scored judges.
## Important: Using the Correct IDs
Judge evaluations involve two related spans:
| ID | Description |
| ---------------------- | -------------------------------------------------- |
| **Source Span ID** | The original LLM call that was evaluated |
| **Judge Call Span ID** | The span created when the judge ran its evaluation |
When submitting feedback, always include the `judge_id` parameter to ensure
feedback is correctly associated with the judge evaluation.
## Python SDK
### From the UI (Recommended)
The easiest way to get the correct IDs is from the Judge Evaluation modal:
1. Open a judge evaluation in the dashboard
2. Expand the "SDK Integration" section
3. Click "Copy" to copy the pre-filled Python code
4. Paste and customize the generated code
### Manual Submission
```python theme={null}
from zeroeval import ZeroEval
client = ZeroEval()
# Submit feedback for a judge evaluation
client.send_feedback(
prompt_slug="your-judge-task-slug", # The task/prompt associated with the judge
completion_id="span-id-here", # The span ID from the evaluation
thumbs_up=True, # True = correct, False = incorrect
reason="Optional explanation",
judge_id="automation-id-here", # Required for judge feedback
)
```
### Parameters
| Parameter | Type | Required | Description |
| ------------------- | ----- | -------- | ---------------------------------------------------------- |
| `prompt_slug` | str | Yes | The task slug associated with the judge |
| `completion_id` | str | Yes | The span ID being evaluated |
| `thumbs_up` | bool | Yes | `True` if judge was correct, `False` if wrong |
| `reason` | str | No | Explanation of the feedback |
| `judge_id` | str | Yes\* | The judge automation ID (\*required for judge feedback) |
| `expected_score` | float | No | For scored judges: the expected score value |
| `score_direction` | str | No | For scored judges: `"too_high"` or `"too_low"` |
| `criteria_feedback` | dict | No | For scored judges: per-criterion expected score/reason map |
`expected_score` and `score_direction` are only valid for scored judges
(judges with `evaluation_type: "scored"`). The API will return a 400 error
if these fields are provided for binary judges.
### Step 1: Discover Available Criteria (Scored Judges)
Before sending `criteria_feedback`, fetch valid criterion keys for the judge.
```python theme={null}
from zeroeval import ZeroEval
client = ZeroEval()
criteria = client.get_judge_criteria(
project_id="your-project-id",
judge_id="automation-id-here",
)
print(criteria["evaluation_type"]) # "scored" or "binary"
print(criteria["criteria"]) # [{"key": "...", "label": "...", "description": "..."}]
```
```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/criteria" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY"
```
### Step 2: Score-Based Feedback (General Score)
For judges using scored rubrics (not binary pass/fail), you can provide additional
feedback about the overall expected score:
```python theme={null}
from zeroeval import ZeroEval
client = ZeroEval()
# Submit feedback for a scored judge evaluation
client.send_feedback(
prompt_slug="quality-scorer",
completion_id="span-id-here",
thumbs_up=False, # The judge was incorrect
judge_id="automation-id-here",
expected_score=3.5, # What the score should have been
score_direction="too_high", # The judge scored too high
reason="Score should have been lower due to grammar issues",
)
```
### Step 3: Score-Based Feedback (Per-Criterion)
For scored judges, you can send corrections for specific criteria:
```python theme={null}
from zeroeval import ZeroEval
client = ZeroEval()
client.send_feedback(
prompt_slug="quality-scorer",
completion_id="span-id-here",
thumbs_up=False,
judge_id="automation-id-here",
reason="Criterion-level score adjustments",
criteria_feedback={
"CTA_text": {
"expected_score": 4.0,
"reason": "CTA is clear and prominent"
},
"CX-004": {
"expected_score": 1.0,
"reason": "Required phone number is missing"
}
}
)
```
## REST API
### Binary Judge Feedback
```bash theme={null}
curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"thumbs_up": true,
"reason": "Judge correctly identified the issue",
"judge_id": "automation-uuid-here"
}'
```
### Scored Judge Feedback
For scored judges, include `expected_score` and `score_direction`:
```bash theme={null}
curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"thumbs_up": false,
"reason": "Score should have been lower",
"judge_id": "automation-uuid-here",
"expected_score": 3.5,
"score_direction": "too_high"
}'
```
### Scored Judge Feedback (Criterion-Level)
```bash theme={null}
curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"thumbs_up": false,
"judge_id": "automation-uuid-here",
"reason": "Criterion-level corrections",
"criteria_feedback": {
"CTA_text": {
"expected_score": 4.0,
"reason": "CTA is clear and visible"
},
"CX-004": {
"expected_score": 1.0,
"reason": "Phone number is missing"
}
}
}'
```
## Criteria Payload Shape
`criteria_feedback` uses this shape:
```json theme={null}
{
"criteria_feedback": {
"criterion_key": {
"expected_score": 4.0,
"reason": "Optional explanation"
}
}
}
```
Validation rules:
* `judge_id` is required when sending `criteria_feedback`
* `criteria_feedback` is allowed only for scored judges (`evaluation_type: "scored"`)
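These rules, together with the restriction on `expected_score` and `score_direction` noted earlier, can be checked client-side before calling the API. `preflight_feedback` below is a hypothetical helper, not part of the SDK:

```python theme={null}
def preflight_feedback(payload, evaluation_type):
    """Raise locally for payloads the API would reject with a 400."""
    if "criteria_feedback" in payload and not payload.get("judge_id"):
        raise ValueError("judge_id is required when sending criteria_feedback")
    if evaluation_type != "scored":
        for field in ("criteria_feedback", "expected_score", "score_direction"):
            if field in payload:
                raise ValueError(f"{field} is only valid for scored judges")

# Valid: scored judge, judge_id present
preflight_feedback(
    {
        "thumbs_up": False,
        "judge_id": "automation-uuid",
        "criteria_feedback": {"CTA_text": {"expected_score": 4.0}},
    },
    evaluation_type="scored",
)
```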
## Finding Your IDs
| ID | Where to Find It |
| ------------- | ------------------------------------------------------------------ |
| **Task Slug** | In the judge settings, or the URL when editing the judge's prompt |
| **Span ID** | In the evaluation modal, or via `get_judge_evaluations()` response |
| **Judge ID** | In the URL when viewing a judge (`/judges/{judge_id}`) |
## Bulk Feedback Submission
To submit feedback on multiple evaluations, iterate over the results of `get_judge_evaluations()`:
```python theme={null}
from zeroeval import ZeroEval
client = ZeroEval()
# Get evaluations to review
evaluations = client.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=100,
)
# Submit feedback for each
for evaluation in evaluations["evaluations"]:
    # Your logic to determine whether the judge's verdict was correct
    is_correct = your_review_logic(evaluation)
    client.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=evaluation["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```
## Related
* [Pulling Evaluations](/judges/pull-evaluations) - Retrieve judge evaluations programmatically
* [Python SDK Reference](/tracing/sdks/python/reference) - Full SDK API reference
* [Judge Setup](/judges/setup) - Configure and deploy judges
# Manual Instrumentation
Source: https://docs.zeroeval.com/tracing/manual-instrumentation
Create spans manually for LLM calls and custom operations
This guide covers how to manually instrument your code to create spans, particularly for LLM operations. You'll learn how to use both the SDK and direct API calls to send trace data to ZeroEval.
## SDK Manual Instrumentation
### Basic LLM Span with SDK
The simplest way to create an LLM span is using the SDK's span decorator or context manager:
```python Python (Decorator) theme={null}
import zeroeval as ze
import openai
client = openai.OpenAI()
@ze.span(name="chat_completion", kind="llm")
def generate_response(messages: list) -> str:
"""Create an LLM span with automatic input/output capture"""
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
temperature=0.7
)
# The SDK automatically captures function arguments as input
# and return values as output
return response.choices[0].message.content
```
```python Python (Context Manager) theme={null}
import zeroeval as ze
import openai
client = openai.OpenAI()
def generate_response(messages: list) -> str:
"""Create an LLM span with manual control"""
with ze.span(name="chat_completion", kind="llm") as span:
# Set input data
span.set_io(input_data=str(messages))
# Make the API call
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
temperature=0.7
)
# Set output data
span.set_io(output_data=response.choices[0].message.content)
# Add LLM-specific attributes
span.set_attributes({
"llm.model": "gpt-4",
"llm.provider": "openai",
"llm.input_tokens": response.usage.prompt_tokens,
"llm.output_tokens": response.usage.completion_tokens,
"llm.total_tokens": response.usage.total_tokens,
"llm.temperature": 0.7
})
return response.choices[0].message.content
```
### Advanced LLM Span with Metrics
For production use, capture comprehensive metrics for better observability:
```python theme={null}
import zeroeval as ze
import openai
import time
import json
@ze.span(name="chat_completion_advanced", kind="llm")
def generate_with_metrics(messages: list, **kwargs):
"""Create a comprehensive LLM span with all metrics"""
# Get the current span to add attributes
span = ze.get_current_span()
# Track timing
start_time = time.time()
first_token_time = None
# Prepare the request
model = kwargs.get("model", "gpt-4")
temperature = kwargs.get("temperature", 0.7)
max_tokens = kwargs.get("max_tokens", None)
# Set pre-request attributes
span.set_attributes({
"llm.model": model,
"llm.provider": "openai",
"llm.temperature": temperature,
"llm.max_tokens": max_tokens,
"llm.streaming": kwargs.get("stream", False)
})
# Store input messages in the expected format
span.set_io(input_data=json.dumps([
{"role": msg["role"], "content": msg["content"]}
for msg in messages
]))
try:
client = openai.OpenAI()
# Handle streaming responses
if kwargs.get("stream", False):
stream = client.chat.completions.create(
model=model,
messages=messages,
temperature=temperature,
max_tokens=max_tokens,
stream=True
)
full_response = ""
tokens = 0
for chunk in stream:
if chunk.choices[0].delta.content:
if first_token_time is None:
first_token_time = time.time()
ttft_ms = (first_token_time - start_time) * 1000
span.set_attributes({"llm.ttft_ms": ttft_ms})
full_response += chunk.choices[0].delta.content
tokens += 1
# Calculate throughput
total_time = time.time() - start_time
span.set_attributes({
"llm.output_tokens": tokens,
"llm.throughput_tokens_per_sec": tokens / total_time if total_time > 0 else 0,
"llm.duration_ms": total_time * 1000
})
span.set_io(output_data=full_response)
return full_response
else:
# Non-streaming response
response = client.chat.completions.create(
model=model,
messages=messages,
temperature=temperature,
max_tokens=max_tokens
)
# Capture all response metadata
span.set_attributes({
"llm.input_tokens": response.usage.prompt_tokens,
"llm.output_tokens": response.usage.completion_tokens,
"llm.total_tokens": response.usage.total_tokens,
"llm.finish_reason": response.choices[0].finish_reason,
"llm.system_fingerprint": response.system_fingerprint,
"llm.response_id": response.id,
"llm.duration_ms": (time.time() - start_time) * 1000
})
content = response.choices[0].message.content
span.set_io(output_data=content)
return content
except Exception as e:
# Capture error details
span.set_status("error")
span.set_attributes({
"error.type": type(e).__name__,
"error.message": str(e)
})
raise
```
## Provider-Specific Manual Instrumentation
If you call the OpenAI or Gemini APIs directly, without the SDK's automatic instrumentation, the guides below show how to instrument those calls yourself, including cost calculation and conversation formatting.
### OpenAI API Manual Instrumentation
When calling the OpenAI API directly (using `requests`, `httpx`, or similar), you'll want to capture all the metrics that the automatic integration would provide:
```python Python (OpenAI Direct API) theme={null}
import requests
import json
import time
import uuid
from datetime import datetime, timezone
class OpenAITracer:
    def __init__(self, api_key: str, zeroeval_api_key: str):
self.openai_api_key = api_key
self.zeroeval_api_key = zeroeval_api_key
self.zeroeval_url = "https://api.zeroeval.com/api/v1/spans"
def chat_completion_with_tracing(self, messages: list, model: str = "gpt-4o", **kwargs):
"""Make OpenAI API call with full ZeroEval instrumentation"""
# Generate span identifiers
trace_id = str(uuid.uuid4())
span_id = str(uuid.uuid4())
# Track timing
start_time = time.time()
# Prepare OpenAI request
openai_payload = {
"model": model,
"messages": messages,
**kwargs # temperature, max_tokens, etc.
}
# Add stream_options for token usage in streaming calls
is_streaming = kwargs.get("stream", False)
if is_streaming and "stream_options" not in kwargs:
openai_payload["stream_options"] = {"include_usage": True}
try:
# Make the OpenAI API call
response = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {self.openai_api_key}",
"Content-Type": "application/json"
},
json=openai_payload,
stream=is_streaming
)
response.raise_for_status()
end_time = time.time()
duration_ms = (end_time - start_time) * 1000
if is_streaming:
# Handle streaming response
full_response = ""
input_tokens = 0
output_tokens = 0
finish_reason = None
response_id = None
system_fingerprint = None
first_token_time = None
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data_str = line[6:]
if data_str == '[DONE]':
break
try:
data = json.loads(data_str)
# Capture first token timing
if data.get('choices') and data['choices'][0].get('delta', {}).get('content'):
if first_token_time is None:
first_token_time = time.time()
full_response += data['choices'][0]['delta']['content']
# Capture final metadata
if 'usage' in data:
input_tokens = data['usage']['prompt_tokens']
output_tokens = data['usage']['completion_tokens']
if data.get('choices') and data['choices'][0].get('finish_reason'):
finish_reason = data['choices'][0]['finish_reason']
if 'id' in data:
response_id = data['id']
if 'system_fingerprint' in data:
system_fingerprint = data['system_fingerprint']
except json.JSONDecodeError:
continue
# Send ZeroEval span for streaming
self._send_span(
span_id=span_id,
trace_id=trace_id,
model=model,
messages=messages,
response_text=full_response,
input_tokens=input_tokens,
output_tokens=output_tokens,
duration_ms=duration_ms,
start_time=start_time,
finish_reason=finish_reason,
response_id=response_id,
system_fingerprint=system_fingerprint,
streaming=True,
first_token_time=first_token_time,
**kwargs
)
return full_response
else:
# Handle non-streaming response
response_data = response.json()
# Extract response details
content = response_data['choices'][0]['message']['content']
usage = response_data.get('usage', {})
# Send ZeroEval span
self._send_span(
span_id=span_id,
trace_id=trace_id,
model=model,
messages=messages,
response_text=content,
input_tokens=usage.get('prompt_tokens', 0),
output_tokens=usage.get('completion_tokens', 0),
duration_ms=duration_ms,
start_time=start_time,
finish_reason=response_data['choices'][0].get('finish_reason'),
response_id=response_data.get('id'),
system_fingerprint=response_data.get('system_fingerprint'),
streaming=False,
**kwargs
)
return content
except Exception as e:
# Send error span
end_time = time.time()
duration_ms = (end_time - start_time) * 1000
self._send_error_span(
span_id=span_id,
trace_id=trace_id,
model=model,
messages=messages,
duration_ms=duration_ms,
start_time=start_time,
error=e,
**kwargs
)
raise
def _send_span(self, span_id: str, trace_id: str, model: str, messages: list,
response_text: str, input_tokens: int, output_tokens: int,
duration_ms: float, start_time: float, finish_reason: str = None,
response_id: str = None, system_fingerprint: str = None,
streaming: bool = False, first_token_time: float = None, **kwargs):
"""Send successful span to ZeroEval"""
# Calculate throughput metrics
throughput = output_tokens / (duration_ms / 1000) if duration_ms > 0 else 0
ttft_ms = None
if streaming and first_token_time:
ttft_ms = (first_token_time - start_time) * 1000
# Prepare span attributes following ZeroEval's expected format
attributes = {
# Core LLM attributes (these are used for cost calculation)
"provider": "openai", # Key for cost calculation
"model": model, # Key for cost calculation
"inputTokens": input_tokens, # Key for cost calculation
"outputTokens": output_tokens, # Key for cost calculation
# OpenAI-specific attributes
"temperature": kwargs.get("temperature"),
"max_tokens": kwargs.get("max_tokens"),
"top_p": kwargs.get("top_p"),
"frequency_penalty": kwargs.get("frequency_penalty"),
"presence_penalty": kwargs.get("presence_penalty"),
"streaming": streaming,
"finish_reason": finish_reason,
"response_id": response_id,
"system_fingerprint": system_fingerprint,
# Performance metrics
"throughput": throughput,
"duration_ms": duration_ms,
}
if ttft_ms:
attributes["ttft_ms"] = ttft_ms
# Clean up None values
attributes = {k: v for k, v in attributes.items() if v is not None}
# Format messages for good conversation display
formatted_messages = self._format_messages_for_display(messages)
span_data = {
"id": span_id,
"trace_id": trace_id,
"name": f"{model}_completion",
"kind": "llm", # Critical: must be "llm" for cost calculation
"started_at": datetime.fromtimestamp(start_time, timezone.utc).isoformat(),
"ended_at": datetime.fromtimestamp(start_time + duration_ms/1000, timezone.utc).isoformat(),
"status": "ok",
"attributes": attributes,
"input_data": json.dumps(formatted_messages),
"output_data": response_text,
"tags": {
"provider": "openai",
"model": model,
"streaming": str(streaming).lower()
}
}
# Send to ZeroEval
response = requests.post(
self.zeroeval_url,
headers={
"Authorization": f"Bearer {self.zeroeval_api_key}",
"Content-Type": "application/json"
},
json=[span_data]
)
if response.status_code != 200:
print(f"Warning: Failed to send span to ZeroEval: {response.text}")
def _send_error_span(self, span_id: str, trace_id: str, model: str,
messages: list, duration_ms: float, start_time: float,
error: Exception, **kwargs):
"""Send error span to ZeroEval"""
attributes = {
"provider": "openai",
"model": model,
"temperature": kwargs.get("temperature"),
"max_tokens": kwargs.get("max_tokens"),
"streaming": kwargs.get("stream", False),
"error_type": type(error).__name__,
"error_message": str(error),
"duration_ms": duration_ms,
}
# Clean up None values
attributes = {k: v for k, v in attributes.items() if v is not None}
formatted_messages = self._format_messages_for_display(messages)
span_data = {
"id": span_id,
"trace_id": trace_id,
"name": f"{model}_completion",
"kind": "llm",
"started_at": datetime.fromtimestamp(start_time, timezone.utc).isoformat(),
"ended_at": datetime.fromtimestamp(start_time + duration_ms/1000, timezone.utc).isoformat(),
"status": "error",
"attributes": attributes,
"input_data": json.dumps(formatted_messages),
"output_data": "",
"error_message": str(error),
"tags": {
"provider": "openai",
"model": model,
"error": "true"
}
}
requests.post(
self.zeroeval_url,
headers={
"Authorization": f"Bearer {self.zeroeval_api_key}",
"Content-Type": "application/json"
},
json=[span_data]
)
def _format_messages_for_display(self, messages: list) -> list:
"""Format messages for optimal display in ZeroEval UI"""
formatted = []
for msg in messages:
# Handle both dict and object formats
if hasattr(msg, 'role'):
role = msg.role
content = msg.content
else:
role = msg.get('role', 'user')
content = msg.get('content', '')
# Handle multimodal content
if isinstance(content, list):
# Extract text parts for display
text_parts = []
for part in content:
if isinstance(part, dict) and part.get('type') == 'text':
text_parts.append(part['text'])
elif isinstance(part, str):
text_parts.append(part)
content = '\n'.join(text_parts) if text_parts else '[Multimodal content]'
formatted.append({
"role": role,
"content": content
})
return formatted
# Usage example
tracer = OpenAITracer(
api_key="your-openai-api-key",
zeroeval_api_key="your-zeroeval-api-key"
)
# Non-streaming call
response = tracer.chat_completion_with_tracing([
{"role": "user", "content": "What is the capital of France?"}
], model="gpt-4o", temperature=0.7)
# Streaming call
response = tracer.chat_completion_with_tracing([
{"role": "user", "content": "Write a short story"}
], model="gpt-4o", stream=True, temperature=0.9)
```
### Gemini API Manual Instrumentation
Gemini's API structure differs from OpenAI's: requests use `contents` instead of `messages`, and several parameters have different names (for example, `maxOutputTokens` instead of `max_tokens`). Here's how to instrument Gemini API calls:
```python Python (Gemini Direct API) theme={null}
import requests
import json
import time
import uuid
from datetime import datetime, timezone
class GeminiTracer:
    def __init__(self, api_key: str, zeroeval_api_key: str):
self.gemini_api_key = api_key
self.zeroeval_api_key = zeroeval_api_key
self.zeroeval_url = "https://api.zeroeval.com/api/v1/spans"
def generate_content_with_tracing(self, messages: list, model: str = "gemini-1.5-flash", **kwargs):
"""Make Gemini API call with full ZeroEval instrumentation"""
trace_id = str(uuid.uuid4())
span_id = str(uuid.uuid4())
start_time = time.time()
# Convert OpenAI-style messages to Gemini contents format
contents, system_instruction = self._convert_messages_to_contents(messages)
# Prepare Gemini request payload
gemini_payload = {
"contents": contents
}
# Add generation config
generation_config = {}
if kwargs.get("temperature") is not None:
generation_config["temperature"] = kwargs["temperature"]
if kwargs.get("max_tokens"):
generation_config["maxOutputTokens"] = kwargs["max_tokens"]
if kwargs.get("top_p") is not None:
generation_config["topP"] = kwargs["top_p"]
if kwargs.get("top_k") is not None:
generation_config["topK"] = kwargs["top_k"]
if kwargs.get("stop"):
stop = kwargs["stop"]
generation_config["stopSequences"] = stop if isinstance(stop, list) else [stop]
if generation_config:
gemini_payload["generationConfig"] = generation_config
# Add system instruction if present
if system_instruction:
gemini_payload["systemInstruction"] = {"parts": [{"text": system_instruction}]}
# Add tools if provided
if kwargs.get("tools"):
gemini_payload["tools"] = kwargs["tools"]
if kwargs.get("tool_choice"):
gemini_payload["toolConfig"] = {
"functionCallingConfig": {"mode": kwargs["tool_choice"]}
}
# Choose endpoint based on streaming
is_streaming = kwargs.get("stream", False)
endpoint = "streamGenerateContent" if is_streaming else "generateContent"
url = f"https://generativelanguage.googleapis.com/v1beta/models/{model}:{endpoint}"
try:
response = requests.post(
url,
headers={
"x-goog-api-key": self.gemini_api_key,
"Content-Type": "application/json"
},
json=gemini_payload,
stream=is_streaming
)
response.raise_for_status()
end_time = time.time()
duration_ms = (end_time - start_time) * 1000
if is_streaming:
# Handle streaming response
full_response = ""
input_tokens = 0
output_tokens = 0
finish_reason = None
model_version = None
first_token_time = None
for line in response.iter_lines():
if line:
try:
# Gemini streaming sends JSON objects separated by newlines
data = json.loads(line.decode('utf-8'))
if 'candidates' in data and data['candidates']:
candidate = data['candidates'][0]
# Extract content
if 'content' in candidate and 'parts' in candidate['content']:
for part in candidate['content']['parts']:
if 'text' in part:
if first_token_time is None:
first_token_time = time.time()
full_response += part['text']
# Extract finish reason
if 'finishReason' in candidate:
finish_reason = candidate['finishReason']
# Extract usage metadata (usually in final chunk)
if 'usageMetadata' in data:
usage = data['usageMetadata']
input_tokens = usage.get('promptTokenCount', 0)
output_tokens = usage.get('candidatesTokenCount', 0)
# Extract model version
if 'modelVersion' in data:
model_version = data['modelVersion']
except json.JSONDecodeError:
continue
self._send_span(
span_id=span_id, trace_id=trace_id, model=model,
original_messages=messages, response_text=full_response,
input_tokens=input_tokens, output_tokens=output_tokens,
duration_ms=duration_ms, start_time=start_time,
finish_reason=finish_reason, model_version=model_version,
streaming=True, first_token_time=first_token_time,
**kwargs
)
return full_response
else:
# Handle non-streaming response
response_data = response.json()
# Extract response content
content = ""
if 'candidates' in response_data and response_data['candidates']:
candidate = response_data['candidates'][0]
if 'content' in candidate and 'parts' in candidate['content']:
content_parts = []
for part in candidate['content']['parts']:
if 'text' in part:
content_parts.append(part['text'])
content = ''.join(content_parts)
# Extract usage
usage = response_data.get('usageMetadata', {})
input_tokens = usage.get('promptTokenCount', 0)
output_tokens = usage.get('candidatesTokenCount', 0)
# Extract other metadata
finish_reason = None
if 'candidates' in response_data and response_data['candidates']:
finish_reason = response_data['candidates'][0].get('finishReason')
model_version = response_data.get('modelVersion')
self._send_span(
span_id=span_id, trace_id=trace_id, model=model,
original_messages=messages, response_text=content,
input_tokens=input_tokens, output_tokens=output_tokens,
duration_ms=duration_ms, start_time=start_time,
finish_reason=finish_reason, model_version=model_version,
streaming=False, **kwargs
)
return content
except Exception as e:
end_time = time.time()
duration_ms = (end_time - start_time) * 1000
self._send_error_span(
span_id=span_id, trace_id=trace_id, model=model,
original_messages=messages, duration_ms=duration_ms,
start_time=start_time, error=e, **kwargs
)
raise
def _convert_messages_to_contents(self, messages: list) -> tuple:
"""Convert OpenAI-style messages to Gemini contents format"""
contents = []
system_instruction = None
for msg in messages:
role = msg.get('role', 'user') if isinstance(msg, dict) else msg.role
content = msg.get('content', '') if isinstance(msg, dict) else msg.content
if role == 'system':
# Collect system instructions
if system_instruction:
system_instruction += f"\n{content}"
else:
system_instruction = content
continue
# Convert content to parts
if isinstance(content, list):
# Handle multimodal content
parts = []
for item in content:
if isinstance(item, dict) and item.get('type') == 'text':
parts.append({"text": item['text']})
# Add support for images, etc. if needed
else:
parts = [{"text": str(content)}]
# Convert role
gemini_role = "user" if role == "user" else "model"
contents.append({"role": gemini_role, "parts": parts})
return contents, system_instruction
def _send_span(self, span_id: str, trace_id: str, model: str,
original_messages: list, response_text: str,
input_tokens: int, output_tokens: int, duration_ms: float,
start_time: float, finish_reason: str = None,
model_version: str = None, streaming: bool = False,
first_token_time: float = None, **kwargs):
"""Send successful span to ZeroEval"""
# Calculate performance metrics
throughput = output_tokens / (duration_ms / 1000) if duration_ms > 0 else 0
ttft_ms = None
if streaming and first_token_time:
ttft_ms = (first_token_time - start_time) * 1000
# Prepare attributes following ZeroEval's expected format
attributes = {
# Core attributes for cost calculation (use provider naming)
"provider": "gemini", # Key for cost calculation
"model": model, # Key for cost calculation
"inputTokens": input_tokens, # Key for cost calculation
"outputTokens": output_tokens, # Key for cost calculation
# Gemini-specific attributes
"temperature": kwargs.get("temperature"),
"max_tokens": kwargs.get("max_tokens"), # maxOutputTokens
"top_p": kwargs.get("top_p"),
"top_k": kwargs.get("top_k"),
"stop_sequences": kwargs.get("stop"),
"streaming": streaming,
"finish_reason": finish_reason,
"model_version": model_version,
# Performance metrics
"throughput": throughput,
"duration_ms": duration_ms,
}
if ttft_ms:
attributes["ttft_ms"] = ttft_ms
# Include tool information if present
if kwargs.get("tools"):
attributes["tools_count"] = len(kwargs["tools"])
attributes["tool_choice"] = kwargs.get("tool_choice")
# Clean up None values
attributes = {k: v for k, v in attributes.items() if v is not None}
# Format original messages for display (convert back to OpenAI format for consistency)
formatted_messages = self._format_messages_for_display(original_messages)
span_data = {
"id": span_id,
"trace_id": trace_id,
"name": f"{model}_completion",
"kind": "llm", # Critical: must be "llm" for cost calculation
"started_at": datetime.fromtimestamp(start_time, timezone.utc).isoformat(),
"ended_at": datetime.fromtimestamp(start_time + duration_ms/1000, timezone.utc).isoformat(),
"status": "ok",
"attributes": attributes,
"input_data": json.dumps(formatted_messages),
"output_data": response_text,
"tags": {
"provider": "gemini",
"model": model,
"streaming": str(streaming).lower()
}
}
# Send to ZeroEval
response = requests.post(
self.zeroeval_url,
headers={
"Authorization": f"Bearer {self.zeroeval_api_key}",
"Content-Type": "application/json"
},
json=[span_data]
)
if response.status_code != 200:
print(f"Warning: Failed to send span to ZeroEval: {response.text}")
def _send_error_span(self, span_id: str, trace_id: str, model: str,
original_messages: list, duration_ms: float,
start_time: float, error: Exception, **kwargs):
"""Send error span to ZeroEval"""
attributes = {
"provider": "gemini",
"model": model,
"temperature": kwargs.get("temperature"),
"max_tokens": kwargs.get("max_tokens"),
"streaming": kwargs.get("stream", False),
"error_type": type(error).__name__,
"error_message": str(error),
"duration_ms": duration_ms,
}
# Clean up None values
attributes = {k: v for k, v in attributes.items() if v is not None}
formatted_messages = self._format_messages_for_display(original_messages)
span_data = {
"id": span_id,
"trace_id": trace_id,
"name": f"{model}_completion",
"kind": "llm",
"started_at": datetime.fromtimestamp(start_time, timezone.utc).isoformat(),
"ended_at": datetime.fromtimestamp(start_time + duration_ms/1000, timezone.utc).isoformat(),
"status": "error",
"attributes": attributes,
"input_data": json.dumps(formatted_messages),
"output_data": "",
"error_message": str(error),
"tags": {
"provider": "gemini",
"model": model,
"error": "true"
}
}
requests.post(
self.zeroeval_url,
headers={
"Authorization": f"Bearer {self.zeroeval_api_key}",
"Content-Type": "application/json"
},
json=[span_data]
)
def _format_messages_for_display(self, messages: list) -> list:
"""Format messages for optimal display in ZeroEval UI"""
formatted = []
for msg in messages:
if hasattr(msg, 'role'):
role = msg.role
content = msg.content
else:
role = msg.get('role', 'user')
content = msg.get('content', '')
# Handle multimodal content
if isinstance(content, list):
text_parts = []
for part in content:
if isinstance(part, dict) and part.get('type') == 'text':
text_parts.append(part['text'])
elif isinstance(part, str):
text_parts.append(part)
content = '\n'.join(text_parts) if text_parts else '[Multimodal content]'
formatted.append({
"role": role,
"content": content
})
return formatted
# Usage example
tracer = GeminiTracer(
api_key="your-gemini-api-key",
zeroeval_api_key="your-zeroeval-api-key"
)
# Non-streaming call
response = tracer.generate_content_with_tracing([
{"role": "user", "content": "What is the capital of France?"}
], model="gemini-1.5-flash", temperature=0.7)
# Streaming call
response = tracer.generate_content_with_tracing([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a short story"}
], model="gemini-1.5-flash", stream=True, temperature=0.9)
```
### Key Attributes for Cost Calculation
For accurate cost calculation, ZeroEval requires these specific attributes in your span:
| Attribute | Required | Description | Example Values |
| -------------- | -------- | -------------------------------------- | ------------------------------------- |
| `provider` | ✅ | Provider identifier for pricing lookup | `"openai"`, `"gemini"`, `"anthropic"` |
| `model` | ✅ | Model identifier for pricing lookup | `"gpt-4o"`, `"gemini-1.5-flash"` |
| `inputTokens` | ✅ | Number of input tokens consumed | `150` |
| `outputTokens` | ✅ | Number of output tokens generated | `75` |
| `kind` | ✅ | Must be set to `"llm"` | `"llm"` |
**Cost Calculation Process:**
1. ZeroEval looks up pricing in the `provider_models` table using `provider` and `model`
2. Calculates: `(inputTokens × inputPrice + outputTokens × outputPrice) / 1,000,000`
3. Stores the result in the span's `cost` field
4. The cost is stored in cents and automatically converted to dollars for display in the UI
**Current Supported Models for Cost Calculation:**
* **OpenAI**: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-3.5-turbo`
* **Gemini**: `gemini-1.5-flash`, `gemini-1.5-pro`, `gemini-1.0-pro`
* **Anthropic**: `claude-3-5-sonnet`, `claude-3-haiku`, `claude-3-opus`
If your model isn't listed, the cost will be `0` and you'll see a warning in the logs. Contact support to add pricing for new models.
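As a sanity check, the calculation above can be sketched client-side. Note that the prices below are hypothetical placeholders, not ZeroEval's actual `provider_models` entries:

```python theme={null}
# Sketch of the documented cost formula. Prices are hypothetical values
# expressed per one million tokens, NOT ZeroEval's real pricing table.
PRICING = {
    ("openai", "gpt-4o"): {"input": 2.50, "output": 10.00},  # placeholder prices
}

def estimate_cost(provider: str, model: str, input_tokens: int, output_tokens: int) -> float:
    """(inputTokens * inputPrice + outputTokens * outputPrice) / 1,000,000."""
    prices = PRICING.get((provider, model))
    if prices is None:
        return 0.0  # unknown model: cost falls back to 0, as noted above
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
```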
### Conversation Formatting Best Practices
To ensure your conversations display properly in the ZeroEval UI, follow these formatting guidelines:
```python Python Message Formatting theme={null}
def format_messages_for_zeroeval(messages: list) -> list:
"""Format messages for optimal display in ZeroEval UI"""
formatted = []
for msg in messages:
# Handle both dict and object formats
if hasattr(msg, 'role'):
role = msg.role
content = msg.content
else:
role = msg.get('role', 'user')
content = msg.get('content', '')
# Standardize role names
if role in ['assistant', 'bot', 'ai']:
role = 'assistant'
elif role in ['human', 'user']:
role = 'user'
elif role == 'system':
role = 'system'
# Handle multimodal content - extract text for display
if isinstance(content, list):
text_parts = []
for part in content:
if isinstance(part, dict):
if part.get('type') == 'text':
text_parts.append(part['text'])
elif part.get('type') == 'image_url':
text_parts.append(f"[Image: {part.get('image_url', {}).get('url', 'Unknown')}]")
elif isinstance(part, str):
text_parts.append(part)
# Join text parts with newlines for readability
content = '\n'.join(text_parts) if text_parts else '[Multimodal content]'
# Ensure content is a string
if not isinstance(content, str):
content = str(content)
# Trim excessive whitespace but preserve meaningful formatting
content = content.strip()
formatted.append({
"role": role,
"content": content
})
return formatted
# Usage in span creation
span_data = {
"input_data": json.dumps(format_messages_for_zeroeval(original_messages)),
"output_data": response_text.strip(), # Clean response text too
# ... other fields
}
```
**Key Formatting Rules:**
1. **Standardize Role Names**: Use `"user"`, `"assistant"`, and `"system"` consistently
2. **Handle Multimodal Content**: Extract text content and add descriptive placeholders for non-text elements
3. **Clean Whitespace**: Trim excessive whitespace while preserving intentional formatting
4. **Ensure String Types**: Convert all content to strings to avoid serialization issues
5. **Preserve Conversation Flow**: Maintain the original message order and context
**UI Display Features:**
* **Message Bubbles**: Conversations appear as chat bubbles with clear role distinction
* **Token Counts**: Hover over messages to see token usage breakdown
* **Copy Functionality**: Users can copy individual messages or entire conversations
* **Search**: Well-formatted messages are easily searchable within traces
* **Export**: Clean formatting ensures readable exports to various formats
**Common Formatting Issues to Avoid:**
* ❌ Mixed role naming (`bot` vs `assistant`)
* ❌ Nested objects in content fields
* ❌ Excessive line breaks or whitespace
* ❌ Empty or null content fields
* ❌ Non-string data types in content
**Pro Tips:**
* Keep system messages concise but informative
* Use consistent formatting across your application
* Include relevant context in message content for better debugging
* Consider truncating very long messages (>10k characters) with an ellipsis
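For the last tip, a minimal truncation helper might look like this (the 10k-character threshold is the suggestion above, not an SDK limit):

```python theme={null}
MAX_CHARS = 10_000  # threshold suggested above; tune for your own traces

def truncate_content(content: str, max_chars: int = MAX_CHARS) -> str:
    """Truncate overly long message content, appending an ellipsis marker."""
    if len(content) <= max_chars:
        return content
    return content[:max_chars].rstrip() + "…"
```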
### Creating Child Spans
Create nested spans to track sub-operations within an LLM call:
```python theme={null}
import zeroeval as ze
@ze.span(name="rag_pipeline", kind="generic")
def answer_with_context(question: str) -> str:
# Retrieval step
with ze.span(name="retrieve_context", kind="vector_store") as retrieval_span:
context = vector_db.search(question, k=5)
retrieval_span.set_attributes({
"vector_store.query": question,
"vector_store.k": 5,
"vector_store.results": len(context)
})
# LLM generation step
with ze.span(name="generate_answer", kind="llm") as llm_span:
messages = [
{"role": "system", "content": f"Context: {context}"},
{"role": "user", "content": question}
]
response = generate_response(messages)
llm_span.set_attributes({
"llm.model": "gpt-4",
"llm.context_length": len(str(context))
})
return response
```
## Direct API Instrumentation
If you prefer to send spans directly to the API without using an SDK, here's how to do it:
### API Authentication
First, obtain an API key from your [Settings → API Keys](https://app.zeroeval.com/settings?section=api-keys) page.
Include the API key in your request headers:
```bash theme={null}
Authorization: Bearer YOUR_API_KEY
```
### Basic Span Creation
Send a POST request to `/api/v1/spans` with your span data:
```bash cURL theme={null}
curl -X POST https://api.zeroeval.com/api/v1/spans \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '[{
"id": "550e8400-e29b-41d4-a716-446655440000",
"trace_id": "550e8400-e29b-41d4-a716-446655440001",
"name": "chat_completion",
"kind": "llm",
"started_at": "2024-01-15T10:30:00Z",
"ended_at": "2024-01-15T10:30:02Z",
"status": "ok",
"attributes": {
"llm.model": "gpt-4",
"llm.provider": "openai",
"llm.temperature": 0.7,
"llm.input_tokens": 150,
"llm.output_tokens": 230,
"llm.total_tokens": 380
},
"input_data": "[{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]",
"output_data": "The capital of France is Paris."
}]'
```
```python Python (Requests) theme={null}
import requests
import json
from datetime import datetime, timezone
import uuid
def send_llm_span(messages, response_text, model="gpt-4", tokens=None):
"""Send an LLM span directly to the ZeroEval API"""
# Generate IDs
span_id = str(uuid.uuid4())
trace_id = str(uuid.uuid4())
# Prepare the span data
span_data = {
"id": span_id,
"trace_id": trace_id,
"name": "chat_completion",
"kind": "llm",
"started_at": datetime.now(timezone.utc).isoformat(),
"ended_at": datetime.now(timezone.utc).isoformat(),
"status": "ok",
"attributes": {
"llm.model": model,
"llm.provider": "openai",
"llm.temperature": 0.7
},
"input_data": json.dumps(messages),
"output_data": response_text
}
# Add token counts if provided
if tokens:
span_data["attributes"].update({
"llm.input_tokens": tokens.get("prompt_tokens"),
"llm.output_tokens": tokens.get("completion_tokens"),
"llm.total_tokens": tokens.get("total_tokens")
})
# Send to API
response = requests.post(
"https://api.zeroeval.com/api/v1/spans",
headers={
"Authorization": f"Bearer {YOUR_API_KEY}",
"Content-Type": "application/json"
},
json=[span_data] # Note: API expects an array
)
if response.status_code == 200:
return response.json()
else:
raise Exception(f"Failed to send span: {response.text}")
```
### Complete LLM Span with Session
Create a full trace with session context:
```python theme={null}
import requests
import json
from datetime import datetime, timezone
import uuid
import time
class ZeroEvalClient:
def __init__(self, api_key: str):
self.api_key = api_key
self.base_url = "https://api.zeroeval.com/api/v1"
self.session_id = str(uuid.uuid4())
def create_llm_span(
self,
messages: list,
response: dict,
model: str = "gpt-4",
trace_id: str = None,
parent_span_id: str = None,
start_time: float = None,
end_time: float = None
):
"""Create a comprehensive LLM span with all metadata"""
if not trace_id:
trace_id = str(uuid.uuid4())
if not start_time:
start_time = time.time()
if not end_time:
end_time = time.time()
span_id = str(uuid.uuid4())
# Calculate duration
duration_ms = (end_time - start_time) * 1000
# Prepare comprehensive span data
span_data = {
"id": span_id,
"trace_id": trace_id,
"parent_span_id": parent_span_id,
"name": f"{model}_completion",
"kind": "llm",
"started_at": datetime.fromtimestamp(start_time, timezone.utc).isoformat(),
"ended_at": datetime.fromtimestamp(end_time, timezone.utc).isoformat(),
"duration_ms": duration_ms,
"status": "ok",
# Session context
"session": {
"id": self.session_id,
"name": "API Client Session"
},
# Core attributes
"attributes": {
"llm.model": model,
"llm.provider": "openai",
"llm.temperature": 0.7,
"llm.max_tokens": 1000,
"llm.streaming": False,
# Token metrics
"llm.input_tokens": response.get("usage", {}).get("prompt_tokens"),
"llm.output_tokens": response.get("usage", {}).get("completion_tokens"),
"llm.total_tokens": response.get("usage", {}).get("total_tokens"),
# Performance metrics
"llm.duration_ms": duration_ms,
"llm.throughput_tokens_per_sec": (
response.get("usage", {}).get("completion_tokens", 0) /
(duration_ms / 1000) if duration_ms > 0 else 0
),
# Response metadata
"llm.finish_reason": response.get("choices", [{}])[0].get("finish_reason"),
"llm.response_id": response.get("id"),
"llm.system_fingerprint": response.get("system_fingerprint")
},
# Tags for filtering
"tags": {
"environment": "production",
"version": "1.0.0",
"user_id": "user_123"
},
# Input/Output
"input_data": json.dumps(messages),
"output_data": response.get("choices", [{}])[0].get("message", {}).get("content", ""),
# Cost calculation (optional - will be calculated server-side if not provided)
"cost": self.calculate_cost(
model,
response.get("usage", {}).get("prompt_tokens", 0),
response.get("usage", {}).get("completion_tokens", 0)
)
}
# Send the span
response = requests.post(
f"{self.base_url}/spans",
headers={
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
},
json=[span_data]
)
if response.status_code != 200:
raise Exception(f"Failed to send span: {response.text}")
return span_id
def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
"""Calculate cost based on model and token usage"""
# Example pricing (adjust based on actual pricing)
pricing = {
"gpt-4": {"input": 0.03 / 1000, "output": 0.06 / 1000},
"gpt-3.5-turbo": {"input": 0.001 / 1000, "output": 0.002 / 1000}
}
if model in pricing:
input_cost = input_tokens * pricing[model]["input"]
output_cost = output_tokens * pricing[model]["output"]
return input_cost + output_cost
return 0.0
```
## Span Schema Reference
### Required Fields
| Field | Type | Description |
| ------------ | ----------------- | ------------------------------- |
| `trace_id` | string (UUID) | Unique identifier for the trace |
| `name` | string | Descriptive name for the span |
| `started_at` | ISO 8601 datetime | When the span started |
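Put differently, the smallest payload described by the table contains just these three fields. A sketch:

```python theme={null}
import uuid
from datetime import datetime, timezone

# Minimal span payload: only the three required fields from the table above.
minimal_span = {
    "trace_id": str(uuid.uuid4()),
    "name": "my_operation",
    "started_at": datetime.now(timezone.utc).isoformat(),
}

# POST it as a one-element array, e.g.:
# requests.post("https://api.zeroeval.com/api/v1/spans", headers=headers, json=[minimal_span])
```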
### Recommended Fields for LLM Spans
| Field | Type | Description |
| ------------- | ----------------- | ------------------------------------------------------- |
| `id` | string (UUID) | Unique span identifier (auto-generated if not provided) |
| `kind` | string | Set to `"llm"` for LLM spans |
| `ended_at` | ISO 8601 datetime | When the span completed |
| `status` | string | `"ok"`, `"error"`, or `"unset"` |
| `input_data` | string | JSON string of input messages |
| `output_data` | string | Generated text response |
| `duration_ms` | number | Total duration in milliseconds |
| `cost` | number | Calculated cost (auto-calculated if not provided) |
### LLM-Specific Attributes
Store these in the `attributes` field:
| Attribute | Type | Description |
| ------------------------------- | ------- | -------------------------------------------- |
| `llm.model` | string | Model identifier (e.g., "gpt-4", "claude-3") |
| `llm.provider` | string | Provider name (e.g., "openai", "anthropic") |
| `llm.temperature` | number | Temperature parameter |
| `llm.max_tokens` | number | Maximum tokens limit |
| `llm.input_tokens` | number | Number of input tokens |
| `llm.output_tokens` | number | Number of output tokens |
| `llm.total_tokens` | number | Total tokens used |
| `llm.streaming` | boolean | Whether response was streamed |
| `llm.ttft_ms` | number | Time to first token (streaming only) |
| `llm.throughput_tokens_per_sec` | number | Token generation rate |
| `llm.finish_reason` | string | Why generation stopped |
| `llm.response_id` | string | Provider's response ID |
| `llm.system_fingerprint` | string | Model version identifier |
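For the streaming-specific attributes (`llm.ttft_ms`, `llm.throughput_tokens_per_sec`), one approach is to time the stream as you consume it. A sketch, where `chunks` stands in for any provider's streaming iterator:

```python theme={null}
import time

def consume_stream_with_metrics(chunks):
    """Time a token stream, recording TTFT and throughput as span attributes."""
    start = time.monotonic()
    ttft_ms = None
    tokens = 0
    for _chunk in chunks:
        if ttft_ms is None:
            ttft_ms = (time.monotonic() - start) * 1000  # llm.ttft_ms
        tokens += 1
    duration_s = time.monotonic() - start
    return {
        "llm.streaming": True,
        "llm.ttft_ms": ttft_ms,
        "llm.output_tokens": tokens,
        "llm.throughput_tokens_per_sec": tokens / duration_s if duration_s > 0 else 0.0,
    }
```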
### Optional Context Fields
| Field | Type | Description |
| ---------------- | ------------- | --------------------------------------------- |
| `parent_span_id` | string (UUID) | Parent span for nested operations |
| `session` | object | Session context with `id` and optional `name` |
| `tags` | object | Key-value pairs for filtering |
| `signals` | object | Custom signals for alerting |
| `error_message` | string | Error description if status is "error" |
| `error_stack` | string | Stack trace for debugging |
## Best Practices
1. **Always set the `kind` field**: Use `"llm"` for LLM spans to enable specialized features like embeddings and cost tracking.
2. **Include token counts**: These are essential for cost calculation and performance monitoring.
3. **Capture timing metrics**: For streaming responses, track TTFT (time to first token) and throughput.
4. **Use consistent naming**: Follow a pattern like `{model}_completion` or `{provider}_{operation}`.
5. **Add context with tags**: Use tags for environment, version, user ID, etc., to enable powerful filtering.
6. **Handle errors gracefully**: Set status to "error" and include error details in attributes.
7. **Link related spans**: Use `parent_span_id` to create hierarchical traces for complex workflows.
8. **Batch span submissions**: When sending multiple spans, include them in a single API call as an array.
## Examples
### Multi-Step LLM Pipeline
Here's a complete example of tracking a RAG (Retrieval-Augmented Generation) pipeline:
```python theme={null}
import zeroeval as ze
import time
import json
@ze.span(name="rag_query", kind="generic")
def rag_pipeline(user_query: str) -> dict:
trace_id = ze.get_current_trace()
# Step 1: Query embedding
with ze.span(name="embed_query", kind="llm") as embed_span:
start = time.time()
embedding = create_embedding(user_query)
embed_span.set_attributes({
"llm.model": "text-embedding-3-small",
"llm.provider": "openai",
"llm.input_tokens": len(user_query.split()),
"llm.duration_ms": (time.time() - start) * 1000
})
# Step 2: Vector search
with ze.span(name="vector_search", kind="vector_store") as search_span:
results = vector_db.similarity_search(embedding, k=5)
search_span.set_attributes({
"vector_store.index": "knowledge_base",
"vector_store.k": 5,
"vector_store.results_count": len(results)
})
# Step 3: Rerank results
with ze.span(name="rerank_results", kind="llm") as rerank_span:
reranked = rerank_documents(user_query, results)
rerank_span.set_attributes({
"llm.model": "rerank-english-v2.0",
"llm.provider": "cohere",
"rerank.input_documents": len(results),
"rerank.output_documents": len(reranked)
})
# Step 4: Generate response
with ze.span(name="generate_response", kind="llm") as gen_span:
context = "\n".join([doc.content for doc in reranked[:3]])
messages = [
{"role": "system", "content": f"Use this context to answer: {context}"},
{"role": "user", "content": user_query}
]
response = generate_with_metrics(messages, model="gpt-4")
gen_span.set_attributes({
"llm.context_documents": 3,
"llm.context_length": len(context)
})
return {
"answer": response,
"sources": [doc.metadata for doc in reranked[:3]],
"trace_id": trace_id
}
```
This comprehensive instrumentation provides full visibility into your LLM operations, enabling you to monitor performance, track costs, and debug issues effectively.
## Next Steps
Complete guide to environment variables, initialization parameters, and
runtime configuration options.
For automatic instrumentation of popular LLM libraries, check out our [SDK
integrations](/tracing/sdks/python/integrations) which handle all of this
automatically.
# OpenTelemetry
Source: https://docs.zeroeval.com/tracing/opentelemetry
Send traces to ZeroEval using the OpenTelemetry collector
ZeroEval provides native support for the OpenTelemetry Protocol (OTLP), allowing you to send traces from any OpenTelemetry-instrumented application directly to ZeroEval's API. This guide shows you how to configure the OpenTelemetry collector to export traces to ZeroEval.
## Prerequisites
* A ZeroEval API key (get one from your [workspace settings](https://app.zeroeval.com/settings/api-keys))
* OpenTelemetry collector installed ([installation guide](https://opentelemetry.io/docs/collector/getting-started/))
## Configuration
Create a collector configuration file (`otel-collector-config.yaml`):
```yaml theme={null}
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
# ZeroEval-specific attributes
attributes:
actions:
- key: deployment.environment
value: "production" # or staging, development, etc.
action: upsert
exporters:
otlphttp:
endpoint: https://api.zeroeval.com
headers:
Authorization: "Bearer YOUR_ZEROEVAL_API_KEY"
traces_endpoint: https://api.zeroeval.com/v1/traces
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, attributes]
exporters: [otlphttp]
```
## Docker Deployment
For containerized deployments, use this Docker Compose configuration:
```yaml theme={null}
version: '3.8'
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
container_name: otel-collector
command: ["--config=/etc/otel-collector-config.yaml"]
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
ports:
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
- "8888:8888" # Prometheus metrics
environment:
- ZEROEVAL_API_KEY=${ZEROEVAL_API_KEY}
restart: unless-stopped
```
## Environment-based Configuration
To avoid hardcoding sensitive information, use environment variables:
```yaml theme={null}
exporters:
otlphttp:
endpoint: https://api.zeroeval.com
headers:
Authorization: "Bearer ${env:ZEROEVAL_API_KEY}"
traces_endpoint: https://api.zeroeval.com/v1/traces
```
Then set the environment variable:
```bash theme={null}
export ZEROEVAL_API_KEY="your-api-key-here"
```
## Kubernetes Deployment
For Kubernetes environments, use this ConfigMap and Deployment:
```yaml theme={null}
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
otel-collector-config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
k8sattributes:
extract:
metadata:
- k8s.namespace.name
- k8s.deployment.name
- k8s.pod.name
exporters:
otlphttp:
endpoint: https://api.zeroeval.com
headers:
Authorization: "Bearer ${env:ZEROEVAL_API_KEY}"
traces_endpoint: https://api.zeroeval.com/v1/traces
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, k8sattributes]
exporters: [otlphttp]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
spec:
replicas: 2
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
args: ["--config=/etc/otel-collector-config.yaml"]
env:
- name: ZEROEVAL_API_KEY
valueFrom:
secretKeyRef:
name: zeroeval-secret
key: api-key
ports:
- containerPort: 4317
name: otlp-grpc
- containerPort: 4318
name: otlp-http
volumeMounts:
- name: config
mountPath: /etc/otel-collector-config.yaml
subPath: otel-collector-config.yaml
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
spec:
selector:
app: otel-collector
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
- name: otlp-http
port: 4318
targetPort: 4318
```
# Quickstart
Source: https://docs.zeroeval.com/tracing/quickstart
Get started with tracing and observability in ZeroEval
### Get your API key
Create an API key from your [Settings → API Keys](https://app.zeroeval.com/settings?section=api-keys) page.
### Install the SDK
Get started with one of our SDKs:
For Python applications using frameworks like FastAPI, Django, or Flask
For TypeScript and JavaScript applications using Node.js or Bun
# Reference
Source: https://docs.zeroeval.com/tracing/reference
Environment variables and configuration parameters for the ZeroEval tracer
Configure the ZeroEval tracer through environment variables, initialization parameters, or runtime methods.
## Environment Variables
Set before importing ZeroEval to configure default behavior.
| Variable | Type | Default | Description |
| -------------------------------- | ------- | ---------------------------- | --------------------------------------- |
| `ZEROEVAL_API_KEY` | string | `""` | API key for authentication |
| `ZEROEVAL_API_URL` | string | `"https://api.zeroeval.com"` | API endpoint URL |
| `ZEROEVAL_WORKSPACE_NAME` | string | `"Personal Workspace"` | Workspace name |
| `ZEROEVAL_SESSION_ID` | string | auto-generated | Session ID for grouping traces |
| `ZEROEVAL_SESSION_NAME` | string | `""` | Human-readable session name |
| `ZEROEVAL_SAMPLING_RATE` | float | `"1.0"` | Sampling rate (0.0-1.0) |
| `ZEROEVAL_DISABLED_INTEGRATIONS` | string | `""` | Comma-separated integrations to disable |
| `ZEROEVAL_DEBUG` | boolean | `"false"` | Enable debug logging |
**Activation:** Set environment variables before importing the SDK.
```bash theme={null}
export ZEROEVAL_API_KEY="ze_1234567890abcdef"
export ZEROEVAL_SAMPLING_RATE="0.1"
export ZEROEVAL_DEBUG="true"
```
## Initialization Parameters
Configure via `ze.init()`; these parameters override environment variables.
| Parameter | Type | Default | Description |
| ----------------------- | --------------- | ---------------------------- | -------------------------------- |
| `api_key` | string | `None` | API key for authentication |
| `workspace_name` | string | `"Personal Workspace"` | Workspace name |
| `debug` | boolean | `False` | Enable debug logging with colors |
| `api_url` | string | `"https://api.zeroeval.com"` | API endpoint URL |
| `disabled_integrations` | list\[str] | `None` | Integrations to disable |
| `enabled_integrations` | list\[str] | `None` | Only enable these integrations |
| `setup_otlp` | boolean | `True` | Setup OpenTelemetry OTLP export |
| `service_name` | string | `"zeroeval-app"` | OTLP service name |
| `tags` | dict\[str, str] | `None` | Global tags for all spans |
| `sampling_rate` | float | `None` | Sampling rate (0.0-1.0) |
**Activation:** Pass parameters to `ze.init()`.
```python theme={null}
ze.init(
api_key="ze_1234567890abcdef",
sampling_rate=0.1,
disabled_integrations=["langchain"],
debug=True
)
```
## Runtime Configuration
Configure after initialization via `ze.tracer.configure()`.
| Parameter | Type | Default | Description |
| ---------------------- | ---------------- | ------- | ------------------------------------ |
| `flush_interval` | float | `1.0` | Flush frequency in seconds |
| `max_spans` | int | `20` | Buffer size before forced flush |
| `collect_code_details` | boolean | `True` | Capture code details in spans |
| `integrations` | dict\[str, bool] | `{}` | Enable/disable specific integrations |
| `sampling_rate` | float | `None` | Sampling rate (0.0-1.0) |
**Activation:** Call `ze.tracer.configure()` anytime after initialization.
```python theme={null}
ze.tracer.configure(
flush_interval=0.5,
max_spans=100,
sampling_rate=0.05,
integrations={"openai": True, "langchain": False}
)
```
## Available Integrations
| Integration | User-Friendly Name | Auto-Instruments |
| ---------------------- | ------------------ | -------------------- |
| `OpenAIIntegration` | `"openai"` | OpenAI client calls |
| `GeminiIntegration` | `"gemini"` | Google Gemini calls |
| `LangChainIntegration` | `"langchain"` | LangChain components |
| `LangGraphIntegration` | `"langgraph"` | LangGraph workflows |
| `HttpxIntegration` | `"httpx"` | HTTPX requests |
| `VocodeIntegration` | `"vocode"` | Vocode voice SDK |
**Control via:**
* Environment: `ZEROEVAL_DISABLED_INTEGRATIONS="langchain,langgraph"`
* Init: `disabled_integrations=["langchain"]` or `enabled_integrations=["openai"]`
* Runtime: `ze.tracer.configure(integrations={"langchain": False})`
## Configuration Examples
### Production Setup
```python theme={null}
# High-volume production with sampling
ze.init(
api_key="your_key",
sampling_rate=0.05, # 5% sampling
debug=False,
disabled_integrations=["langchain"]
)
ze.tracer.configure(
flush_interval=0.5, # Faster flushes
max_spans=100 # Larger buffer
)
```
### Development Setup
```python theme={null}
# Full tracing with debug info
ze.init(
api_key="your_key",
debug=True, # Colored logs
sampling_rate=1.0 # Capture everything
)
```
### Memory-Optimized Setup
```python theme={null}
# Minimize memory usage
ze.tracer.configure(
max_spans=5, # Small buffer
collect_code_details=False, # No code capture
flush_interval=2.0 # Less frequent flushes
)
```
# Integrations
Source: https://docs.zeroeval.com/tracing/sdks/python/integrations
Automatic instrumentation for popular AI/ML frameworks
The [ZeroEval Python SDK](https://pypi.org/project/zeroeval/) automatically instruments the supported integrations; all you need to do is initialize the SDK before importing the frameworks you want to trace.
## OpenAI
```python theme={null}
import zeroeval as ze
ze.init()
import openai
client = openai.OpenAI()
# This call is automatically traced
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
# Streaming is also automatically traced
stream = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
```
## LangChain
```python theme={null}
import zeroeval as ze
ze.init()
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# All components are automatically traced
model = ChatOpenAI()
prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
chain = prompt | model
response = chain.invoke({"topic": "AI"})
```
## LangGraph
```python theme={null}
import zeroeval as ze
ze.init()
from langgraph.graph import StateGraph, START, END
from langchain_core.messages import HumanMessage
# Define a multi-node graph
workflow = StateGraph(AgentState)
workflow.add_node("reasoning", reasoning_node)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)
workflow.add_conditional_edges(
"agent",
should_continue,
{"tools": "tools", "end": END}
)
app = workflow.compile()
# Full graph execution is automatically traced
result = app.invoke({"messages": [HumanMessage(content="Help me plan a trip")]})
# Streaming is also supported
for chunk in app.stream({"messages": [HumanMessage(content="Hello")]}):
print(chunk)
```
## PydanticAI
PydanticAI agents are automatically traced, including multi-turn conversations. The SDK ensures that all LLM calls within an agent execution share the same trace, and consecutive conversation turns share the same trace ID when using shared message history.
```python theme={null}
import zeroeval as ze
ze.init()
from pydantic_ai import Agent
from pydantic import BaseModel
class Response(BaseModel):
message: str
sentiment: str
# Create an agent with structured output
agent = Agent(
model="openai:gpt-4o-mini",
output_type=Response,
system_prompt="You are a helpful assistant."
)
# Single execution - automatically traced
result = await agent.run("Hello!")
# Multi-turn conversation - all turns share the same trace
message_history = []
async with agent.iter("First message", message_history=message_history) as run:
async for node in run:
pass
message_history = run.result.all_messages()
# Second turn reuses the same trace_id
async with agent.iter("Follow-up message", message_history=message_history) as run:
async for node in run:
pass
message_history = run.result.all_messages()
```
When you pass the same `message_history` list across multiple agent runs, ZeroEval automatically groups all runs under a single trace. This provides a unified view of the entire conversation.
## LiveKit
The SDK automatically creates traces for LiveKit agents, including events from the following plugins:
* Cartesia (TTS)
* Deepgram (STT)
* OpenAI (LLM)
```python theme={null}
import zeroeval as ze
ze.init()
from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import openai
async def entrypoint(ctx: agents.JobContext):
await ctx.connect()
# All agent sessions are automatically traced
session = AgentSession(
llm=openai.realtime.RealtimeModel(voice="coral")
)
await session.start(
room=ctx.room,
agent=Agent(instructions="You are a helpful voice AI assistant.")
)
# Agent interactions are automatically captured
await session.generate_reply(
instructions="Greet the user and offer your assistance."
)
if __name__ == "__main__":
agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```
Need help? Contact us at [founders@zeroeval.com](mailto:founders@zeroeval.com) or join our [Discord](https://discord.gg/MuExkGMNVz).
# Reference
Source: https://docs.zeroeval.com/tracing/sdks/python/reference
Complete API reference for the Python SDK
## Installation
```bash theme={null}
pip install zeroeval
```
## Core Functions
### `init()`
Initializes the ZeroEval SDK. Must be called before using any other SDK features.
```python theme={null}
def init(
api_key: str = None,
workspace_name: str = "Personal Workspace",
debug: bool = False,
api_url: str = "https://api.zeroeval.com"
) -> None
```
**Parameters:**
* `api_key` (str, optional): Your ZeroEval API key. If not provided, uses `ZEROEVAL_API_KEY` environment variable
* `workspace_name` (str, optional): The name of your workspace. Defaults to `"Personal Workspace"`
* `debug` (bool, optional): If True, enables detailed logging for debugging. Can also be enabled by setting `ZEROEVAL_DEBUG=true` environment variable
* `api_url` (str, optional): The URL of the ZeroEval API. Defaults to `"https://api.zeroeval.com"`
**Example:**
```python theme={null}
import zeroeval as ze
ze.init(
    api_key="your-api-key",
    workspace_name="My Workspace",
    debug=True
)
```
## Decorators
### `@span`
Decorator and context manager for creating spans around code blocks.
```python theme={null}
@span(
    name: str,
    session_id: Optional[str] = None,
    session: Optional[Union[str, dict[str, str]]] = None,
    attributes: Optional[dict[str, Any]] = None,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None,
    tags: Optional[dict[str, str]] = None
)
```
**Parameters:**
* `name` (str): Name of the span
* `session_id` (str, optional): **Deprecated** - Use `session` parameter instead
* `session` (Union\[str, dict], optional): Session information. Can be:
* A string containing the session ID
* A dict with `{"id": "...", "name": "..."}`
* `attributes` (dict, optional): Additional attributes to attach to the span
* `input_data` (str, optional): Manual input data override
* `output_data` (str, optional): Manual output data override
* `tags` (dict, optional): Tags to attach to the span
**Usage as Decorator:**
```python theme={null}
import zeroeval as ze
@ze.span(name="calculate_sum")
def add_numbers(a: int, b: int) -> int:
    return a + b  # Parameters and return value automatically captured

# With manual I/O
@ze.span(name="process_data", input_data="manual input", output_data="manual output")
def process():
    # Process logic here
    pass

# With session
@ze.span(name="user_action", session={"id": "123", "name": "John's Session"})
def user_action():
    pass
```
**Usage as Context Manager:**
```python theme={null}
import zeroeval as ze
with ze.span(name="data_processing") as current_span:
    result = process_data()
    current_span.set_io(input_data="input", output_data=str(result))
```
### `@experiment`
Decorator that attaches dataset and model information to a function.
```python theme={null}
@experiment(
    dataset: Optional[Dataset] = None,
    model: Optional[str] = None
)
```
**Parameters:**
* `dataset` (Dataset, optional): Dataset to use for the experiment
* `model` (str, optional): Model identifier
**Example:**
```python theme={null}
import zeroeval as ze
dataset = ze.Dataset.pull("my-dataset")

@ze.experiment(dataset=dataset, model="gpt-4")
def my_experiment():
    # Experiment logic
    pass
```
## Classes
### `Dataset`
A class to represent a named collection of dictionary records.
#### Constructor
```python theme={null}
Dataset(
    name: str,
    data: list[dict[str, Any]],
    description: Optional[str] = None
)
```
**Parameters:**
* `name` (str): The name of the dataset
* `data` (list\[dict]): A list of dictionaries containing the data
* `description` (str, optional): A description of the dataset
**Example:**
```python theme={null}
dataset = Dataset(
    name="Capitals",
    description="Country to capital mapping",
    data=[
        {"input": "France", "output": "Paris"},
        {"input": "Germany", "output": "Berlin"}
    ]
)
```
#### Methods
##### `push()`
Push the dataset to the backend, creating a new version if it already exists.
```python theme={null}
def push(self, create_new_version: bool = False) -> Dataset
```
**Parameters:**
* `self`: The Dataset instance
* `create_new_version` (bool, optional): Deprecated and kept for backward compatibility; new versions are created automatically when a dataset with the same name already exists. Defaults to False
**Returns:** Returns self for method chaining
##### `pull()`
Static method to pull a dataset from the backend.
```python theme={null}
@classmethod
def pull(
    cls,
    dataset_name: str,
    version_number: Optional[int] = None
) -> Dataset
```
**Parameters:**
* `cls`: The Dataset class itself (automatically provided when using `@classmethod`)
* `dataset_name` (str): The name of the dataset to pull from the backend
* `version_number` (int, optional): Specific version number to pull. If not provided, pulls the latest version
**Returns:** A new Dataset instance populated with data from the backend
##### `add_rows()`
Add new rows to the dataset.
```python theme={null}
def add_rows(self, new_rows: list[dict[str, Any]]) -> None
```
**Parameters:**
* `self`: The Dataset instance
* `new_rows` (list\[dict]): A list of dictionaries representing the rows to add
##### `add_image()`
Add an image to a specific row.
```python theme={null}
def add_image(
    self,
    row_index: int,
    column_name: str,
    image_path: str
) -> None
```
**Parameters:**
* `self`: The Dataset instance
* `row_index` (int): Index of the row to update (0-based)
* `column_name` (str): Name of the column to add the image to
* `image_path` (str): Path to the image file to add
##### `add_audio()`
Add audio to a specific row.
```python theme={null}
def add_audio(
    self,
    row_index: int,
    column_name: str,
    audio_path: str
) -> None
```
**Parameters:**
* `self`: The Dataset instance
* `row_index` (int): Index of the row to update (0-based)
* `column_name` (str): Name of the column to add the audio to
* `audio_path` (str): Path to the audio file to add
##### `add_media_url()`
Add a media URL to a specific row.
```python theme={null}
def add_media_url(
    self,
    row_index: int,
    column_name: str,
    media_url: str,
    media_type: str = "image"
) -> None
```
**Parameters:**
* `self`: The Dataset instance
* `row_index` (int): Index of the row to update (0-based)
* `column_name` (str): Name of the column to add the media URL to
* `media_url` (str): URL pointing to the media file
* `media_type` (str, optional): Type of media - "image", "audio", or "video". Defaults to "image"
#### Properties
* `name` (str): The name of the dataset
* `description` (str): The description of the dataset
* `columns` (list\[str]): List of all unique column names
* `data` (list\[dict]): List of the data portion for each row
* `backend_id` (str): The ID in the backend (after pushing)
* `version_id` (str): The version ID in the backend
* `version_number` (int): The version number in the backend
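When rows have differing keys, `columns` contains the union of all key names. The sketch below illustrates that behavior; it is not the SDK's implementation, and the first-seen ordering is an assumption — the docs only guarantee that all unique names are included:

```python
from typing import Any

def unique_columns(data: list[dict[str, Any]]) -> list[str]:
    # Union of every row's keys. First-seen ordering is an assumption
    # here; the SDK only guarantees that all unique names appear.
    seen: dict[str, None] = {}
    for row in data:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)

cols = unique_columns([
    {"input": "France", "output": "Paris"},
    {"input": "Germany", "output": "Berlin", "note": "added later"},
])
```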
#### Example
```python theme={null}
import zeroeval as ze
# Create a dataset
dataset = ze.Dataset(
    name="Capitals",
    description="Country to capital mapping",
    data=[
        {"input": "France", "output": "Paris"},
        {"input": "Germany", "output": "Berlin"}
    ]
)

# Push to backend
dataset.push()

# Pull from backend
dataset = ze.Dataset.pull("Capitals", version_number=1)

# Add rows
dataset.add_rows([{"input": "Italy", "output": "Rome"}])

# Add multimodal data
dataset.add_image(0, "flag", "flags/france.png")
dataset.add_audio(0, "anthem", "anthems/france.mp3")
dataset.add_media_url(0, "video_url", "https://example.com/video.mp4", "video")
```
### `Experiment`
Represents an experiment that runs a task on a dataset with optional evaluators.
#### Constructor
```python theme={null}
Experiment(
    dataset: Dataset,
    task: Callable[[Any], Any],
    evaluators: Optional[list[Callable[[Any, Any], Any]]] = None,
    name: Optional[str] = None,
    description: Optional[str] = None
)
```
**Parameters:**
* `dataset` (Dataset): The dataset to run the experiment on
* `task` (Callable): Function that processes each row and returns output
* `evaluators` (list\[Callable], optional): List of evaluator functions that take (row, output) and return evaluation result
* `name` (str, optional): Name of the experiment. Defaults to task function name
* `description` (str, optional): Description of the experiment. Defaults to task function's docstring
**Example:**
```python theme={null}
import zeroeval as ze
ze.init()

# Pull dataset
dataset = ze.Dataset.pull("Capitals")

# Define task
def capitalize_task(row):
    return row["input"].upper()

# Define evaluator
def exact_match(row, output):
    return row["output"].upper() == output

# Create and run experiment
exp = ze.Experiment(
    dataset=dataset,
    task=capitalize_task,
    evaluators=[exact_match],
    name="Capital Uppercase Test"
)
results = exp.run()

# Or run task and evaluators separately
results = exp.run_task()
exp.run_evaluators([exact_match], results)
```
#### Methods
##### `run()`
Run the complete experiment (task + evaluators).
```python theme={null}
def run(
    self,
    subset: Optional[list[dict]] = None
) -> list[ExperimentResult]
```
**Parameters:**
* `self`: The Experiment instance
* `subset` (list\[dict], optional): Subset of dataset rows to run the experiment on. If None, runs on entire dataset
**Returns:** List of experiment results for each row
##### `run_task()`
Run only the task without evaluators.
```python theme={null}
def run_task(
    self,
    subset: Optional[list[dict]] = None,
    raise_on_error: bool = False
) -> list[ExperimentResult]
```
**Parameters:**
* `self`: The Experiment instance
* `subset` (list\[dict], optional): Subset of dataset rows to run the task on. If None, runs on entire dataset
* `raise_on_error` (bool, optional): If True, raises exceptions encountered during task execution. If False, captures errors. Defaults to False
**Returns:** List of experiment results for each row
##### `run_evaluators()`
Run evaluators on existing results.
```python theme={null}
def run_evaluators(
    self,
    evaluators: Optional[list[Callable[[Any, Any], Any]]] = None,
    results: Optional[list[ExperimentResult]] = None
) -> list[ExperimentResult]
```
**Parameters:**
* `self`: The Experiment instance
* `evaluators` (list\[Callable], optional): List of evaluator functions to run. If None, uses evaluators from the Experiment instance
* `results` (list\[ExperimentResult], optional): List of results to evaluate. If None, uses results from the Experiment instance
**Returns:** The evaluated results
### `Span`
Represents a span in the tracing system. Usually created via the `@span` decorator.
#### Methods
##### `set_io()`
Set input and output data for the span.
```python theme={null}
def set_io(
    self,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None
) -> None
```
**Parameters:**
* `self`: The Span instance
* `input_data` (str, optional): Input data to attach to the span. Will be converted to string if not already
* `output_data` (str, optional): Output data to attach to the span. Will be converted to string if not already
##### `set_tags()`
Set tags on the span.
```python theme={null}
def set_tags(self, tags: dict[str, str]) -> None
```
**Parameters:**
* `self`: The Span instance
* `tags` (dict\[str, str]): Dictionary of tags to set on the span
##### `set_attributes()`
Set attributes on the span.
```python theme={null}
def set_attributes(self, attributes: dict[str, Any]) -> None
```
**Parameters:**
* `self`: The Span instance
* `attributes` (dict\[str, Any]): Dictionary of attributes to set on the span
##### `set_error()`
Set error information for the span.
```python theme={null}
def set_error(
    self,
    code: str,
    message: str,
    stack: Optional[str] = None
) -> None
```
**Parameters:**
* `self`: The Span instance
* `code` (str): Error code or exception class name
* `message` (str): Error message
* `stack` (str, optional): Stack trace information
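These fields map naturally onto a caught exception. The helper below is an illustrative sketch (not part of the SDK) showing one way to derive `set_error()` arguments from an exception:

```python
import traceback


def exception_to_error_fields(exc: Exception) -> dict:
    """Derive set_error() arguments from a caught exception (illustrative)."""
    return {
        "code": type(exc).__name__,
        "message": str(exc),
        "stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


try:
    raise ValueError("row 3 is missing the 'output' column")
except ValueError as exc:
    fields = exception_to_error_fields(exc)
    # Inside a `with ze.span(...) as span:` block you would then call:
    # span.set_error(**fields)
```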
##### `add_screenshot()`
Attach a screenshot to the span for visual evaluation by LLM judges. Screenshots are uploaded during ingestion and can be evaluated alongside text data.
```python theme={null}
def add_screenshot(
    self,
    base64_data: str,
    viewport: str = "desktop",
    width: Optional[int] = None,
    height: Optional[int] = None,
    label: Optional[str] = None
) -> None
```
**Parameters:**
* `self`: The Span instance
* `base64_data` (str): Base64 encoded image data. Accepts raw base64 or data URL format (`data:image/png;base64,...`)
* `viewport` (str, optional): Viewport type - `"desktop"`, `"mobile"`, or `"tablet"`. Defaults to `"desktop"`
* `width` (int, optional): Image width in pixels
* `height` (int, optional): Image height in pixels
* `label` (str, optional): Human-readable description of the screenshot
**Example:**
```python theme={null}
import zeroeval as ze
with ze.span(name="browser_test", tags={"test": "visual"}) as span:
    # Capture and attach a desktop screenshot
    span.add_screenshot(
        base64_data=desktop_screenshot_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Also capture mobile view
    span.add_screenshot(
        base64_data=mobile_screenshot_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - iPhone"
    )

    span.set_io(
        input_data="Navigate to homepage",
        output_data="Captured viewport screenshots"
    )
```
##### `add_image()`
Attach a generic image to the span for visual evaluation. Use this for non-screenshot images like charts, diagrams, or UI component states.
```python theme={null}
def add_image(
    self,
    base64_data: str,
    label: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None
) -> None
```
**Parameters:**
* `self`: The Span instance
* `base64_data` (str): Base64 encoded image data. Accepts raw base64 or data URL format
* `label` (str, optional): Human-readable description of the image
* `metadata` (dict, optional): Additional metadata to store with the image
**Example:**
```python theme={null}
import zeroeval as ze
with ze.span(name="chart_generation") as span:
    # Generate a chart and attach it
    chart_base64 = generate_chart(data)
    span.add_image(
        base64_data=chart_base64,
        label="Monthly Revenue Chart",
        metadata={"chart_type": "bar", "data_points": 12}
    )

    span.set_io(
        input_data="Generate revenue chart for Q4",
        output_data="Chart generated with 12 data points"
    )
```
Images attached to spans can be evaluated by LLM judges configured for multimodal evaluation. See the [Multimodal Evaluation](/judges/multimodal-evaluation) guide for setup instructions.
## Context Functions
### `get_current_span()`
Returns the currently active span, if any.
```python theme={null}
def get_current_span() -> Optional[Span]
```
**Returns:** The currently active Span instance, or None if no span is active
### `get_current_trace()`
Returns the current trace ID.
```python theme={null}
def get_current_trace() -> Optional[str]
```
**Returns:** The current trace ID, or None if no trace is active
### `get_current_session()`
Returns the current session ID.
```python theme={null}
def get_current_session() -> Optional[str]
```
**Returns:** The current session ID, or None if no session is active
### `set_tag()`
Sets tags on a span, trace, or session.
```python theme={null}
def set_tag(
    target: Union[Span, str],
    tags: dict[str, str]
) -> None
```
**Parameters:**
* `target`: The target to set tags on
* `Span`: Sets tags on the specific span
* `str`: Sets tags on the trace (if valid trace ID) or session (if valid session ID)
* `tags` (dict\[str, str]): Dictionary of tags to set
**Example:**
```python theme={null}
import zeroeval as ze
# Set tags on current span
current_span = ze.get_current_span()
if current_span:
    ze.set_tag(current_span, {"user_id": "12345", "environment": "production"})

# Set tags on trace
trace_id = ze.get_current_trace()
if trace_id:
    ze.set_tag(trace_id, {"version": "1.5"})
```
### `set_signal()`
Send a signal to a span, trace, or session.
```python theme={null}
def set_signal(
    target: Union[Span, str],
    signals: dict[str, Union[str, bool, int, float]]
) -> bool
```
**Parameters:**
* `target`: The entity to attach signals to
* `Span`: Sends signals to the specific span
* `str`: Sends signals to the trace (if a valid trace ID) or session (if a valid session ID)
* `signals` (dict): Dictionary of signal names to values
**Returns:** True if signals were sent successfully, False otherwise
**Example:**
```python theme={null}
import zeroeval as ze
# Send signals to current span
current_span = ze.get_current_span()
if current_span:
    ze.set_signal(current_span, {
        "accuracy": 0.95,
        "is_successful": True,
        "error_count": 0
    })

# Send signals to trace
trace_id = ze.get_current_trace()
if trace_id:
    ze.set_signal(trace_id, {"model_score": 0.85})
```
## Judge Feedback APIs
### `send_feedback()`
Programmatically submit user feedback for a completion or judge evaluation.
```python theme={null}
def send_feedback(
    *,
    prompt_slug: str,
    completion_id: str,
    thumbs_up: bool,
    reason: Optional[str] = None,
    expected_output: Optional[str] = None,
    metadata: Optional[dict] = None,
    judge_id: Optional[str] = None,
    expected_score: Optional[float] = None,
    score_direction: Optional[str] = None,
    criteria_feedback: Optional[dict] = None
) -> dict
```
**Notes:**
* Existing usage without `criteria_feedback` is unchanged.
* `criteria_feedback` is optional and supported for scored judges.
* `judge_id` is required when sending `expected_score`, `score_direction`, or `criteria_feedback`.
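The `judge_id` rule above can be checked client-side before calling `send_feedback()`. The validator below is an illustrative sketch of that documented constraint, not part of the SDK:

```python
def validate_feedback_kwargs(**kwargs) -> None:
    """Raise if judge-level fields are sent without a judge_id.

    Mirrors the documented rule: judge_id is required when sending
    expected_score, score_direction, or criteria_feedback.
    """
    judge_fields = ("expected_score", "score_direction", "criteria_feedback")
    provided = [f for f in judge_fields if kwargs.get(f) is not None]
    if provided and not kwargs.get("judge_id"):
        raise ValueError(
            f"judge_id is required when sending: {', '.join(provided)}"
        )

# Plain thumbs feedback needs no judge_id
validate_feedback_kwargs(prompt_slug="support-bot", thumbs_up=True)
```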
### `get_judge_criteria()`
Fetch normalized criteria metadata for a judge (useful before criterion-level feedback).
```python theme={null}
def get_judge_criteria(
    project_id: str,
    judge_id: str
) -> dict
```
**Returns:**
* `judge_id`
* `evaluation_type`
* `score_min`, `score_max`, `pass_threshold`
* `criteria` (list of `{key, label, description}`)
## CLI Commands
The ZeroEval SDK includes a CLI tool for running experiments and setup.
### `zeroeval run`
Run a Python script containing ZeroEval experiments.
```bash theme={null}
zeroeval run script.py
```
### `zeroeval setup`
Interactive setup to configure API credentials.
```bash theme={null}
zeroeval setup
```
## Environment Variables
The SDK uses the following environment variables:
* `ZEROEVAL_API_KEY`: Your ZeroEval API key
* `ZEROEVAL_API_URL`: API endpoint URL (defaults to `https://api.zeroeval.com`)
* `ZEROEVAL_DEBUG`: Set to `true` to enable debug logging
* `ZEROEVAL_DISABLED_INTEGRATIONS`: Comma-separated list of integrations to disable
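For example, `ZEROEVAL_DISABLED_INTEGRATIONS=openai,langchain` would disable those two integrations. The sketch below shows one way such a value can be parsed; the SDK's exact normalization rules are an assumption here:

```python
from typing import Optional

def parse_disabled_integrations(value: Optional[str]) -> set[str]:
    # Split a comma-separated list, dropping blanks and normalizing case.
    # Illustrative only; the SDK's exact parsing may differ.
    if not value:
        return set()
    return {name.strip().lower() for name in value.split(",") if name.strip()}

disabled = parse_disabled_integrations("openai, LangChain,")
```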
# Setup
Source: https://docs.zeroeval.com/tracing/sdks/python/setup
Get started with ZeroEval tracing in Python applications
The [ZeroEval Python SDK](https://pypi.org/project/zeroeval/) provides seamless integration with your Python applications through automatic instrumentation and a simple decorator-based API.
## Installation
```bash pip theme={null}
pip install zeroeval
```
```bash poetry theme={null}
poetry add zeroeval
```
## Basic Setup
```python theme={null}
import zeroeval as ze
# Option 1: read ZEROEVAL_API_KEY from your environment
ze.init()

# Option 2: pass your API key directly
# (create one at https://app.zeroeval.com/settings?tab=api-keys)
ze.init(api_key="YOUR_API_KEY")
```
Run `zeroeval setup` once to save your API key securely to
`~/.config/zeroeval/config.json`
## Patterns
### Decorators
The `@span` decorator is the easiest way to add tracing:
```python theme={null}
import zeroeval as ze
@ze.span(name="fetch_data")
def fetch_data(user_id: str):
    # Function arguments are automatically captured as inputs
    # Return values are automatically captured as outputs
    return {"user_id": user_id, "name": "John Doe"}

@ze.span(name="process_data", attributes={"version": "1.0"})
def process_data(data: dict):
    # Add custom attributes for better filtering
    return f"Welcome, {data['name']}!"
```
### Context Manager
For more control over span lifecycles:
```python theme={null}
import zeroeval as ze
def complex_workflow():
    with ze.span(name="data_pipeline") as pipeline_span:
        # Fetch stage
        with ze.span(name="fetch_stage") as fetch_span:
            data = fetch_external_data()
            fetch_span.set_io(output_data=str(data))

        # Process stage
        with ze.span(name="process_stage") as process_span:
            processed = transform_data(data)
            process_span.set_io(
                input_data=str(data),
                output_data=str(processed)
            )

        # Save stage
        with ze.span(name="save_stage") as save_span:
            result = save_to_database(processed)
            save_span.set_io(output_data=f"Saved {result} records")
```
## Advanced Configuration
Fine-tune the tracer behavior:
```python theme={null}
from zeroeval.observability.tracer import tracer
# Configure tracer settings
tracer.configure(
    flush_interval=5.0,        # Flush every 5 seconds
    max_spans=200,             # Buffer up to 200 spans
    collect_code_details=True  # Capture source code context
)
```
## Context
Access current context information:
```python theme={null}
# Get the current span
current_span = ze.get_current_span()
# Get the current trace ID
trace_id = ze.get_current_trace()
# Get the current session ID
session_id = ze.get_current_session()
```
## CLI Tooling
The Python SDK includes helpful CLI commands:
```bash theme={null}
# Save your API key securely
zeroeval setup
# Run scripts with automatic tracing
zeroeval run my_script.py
```
# Integrations
Source: https://docs.zeroeval.com/tracing/sdks/typescript/integrations
Automatic tracing for popular AI/ML libraries
The ZeroEval TypeScript SDK provides automatic tracing for popular AI libraries through the `wrap()` function.
## OpenAI
Wrap your OpenAI client to automatically trace all API calls:
```typescript theme={null}
import { OpenAI } from 'openai';
import * as ze from 'zeroeval';
ze.init();
const openai = ze.wrap(new OpenAI());
// Chat completions are automatically traced
const completion = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Hello!' }]
});

// Streaming is also automatically traced
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
```
### Supported Methods
The OpenAI integration automatically traces:
* `chat.completions.create()` (streaming and non-streaming)
* `embeddings.create()`
* `images.generate()`, `images.edit()`, `images.createVariation()`
* `audio.transcriptions.create()`, `audio.translations.create()`
## Vercel AI SDK
Wrap the Vercel AI SDK module to trace all AI operations:
```typescript theme={null}
import * as ai from 'ai';
import { openai } from '@ai-sdk/openai';
import * as ze from 'zeroeval';
ze.init();
const wrappedAI = ze.wrap(ai);
// Text generation
const { text } = await wrappedAI.generateText({
  model: openai('gpt-4'),
  prompt: 'Write a haiku about coding'
});

// Streaming
const { textStream } = await wrappedAI.streamText({
  model: openai('gpt-4'),
  messages: [{ role: 'user', content: 'Hello!' }]
});

for await (const delta of textStream) {
  process.stdout.write(delta);
}

// Structured output
import { z } from 'zod';

const { object } = await wrappedAI.generateObject({
  model: openai('gpt-4'),
  schema: z.object({
    name: z.string(),
    age: z.number()
  }),
  prompt: 'Generate a random person'
});
```
### Supported Methods
The Vercel AI SDK integration automatically traces:
* `generateText()`, `streamText()`
* `generateObject()`, `streamObject()`
* `embed()`, `embedMany()`
* `generateImage()`
* `transcribe()`
* `generateSpeech()`
## LangChain / LangGraph
Use the callback handler for LangChain and LangGraph applications:
```typescript theme={null}
import {
ZeroEvalCallbackHandler,
setGlobalCallbackHandler
} from 'zeroeval/langchain';
// Option 1: Set globally (recommended)
setGlobalCallbackHandler(new ZeroEvalCallbackHandler());
// All chain invocations are now automatically traced
const result = await chain.invoke({ topic: 'AI' });
```
```typescript theme={null}
import { ZeroEvalCallbackHandler } from 'zeroeval/langchain';
// Option 2: Per-invocation
const handler = new ZeroEvalCallbackHandler();
const result = await chain.invoke(
  { topic: 'AI' },
  { callbacks: [handler] }
);
```
## Auto-Detection
The `wrap()` function automatically detects which client you're wrapping:
```typescript theme={null}
import { OpenAI } from 'openai';
import * as ai from 'ai';
import * as ze from 'zeroeval';
ze.init();
// Automatically detected as OpenAI client
const openai = ze.wrap(new OpenAI());
// Automatically detected as Vercel AI SDK
const wrappedAI = ze.wrap(ai);
```
If `ze.init()` hasn't been called and `ZEROEVAL_API_KEY` is set in your environment, the SDK will automatically initialize when you first use `wrap()`.
## Using with Prompts
The integrations automatically extract ZeroEval metadata from prompts created with `ze.prompt()`:
```typescript theme={null}
import { OpenAI } from 'openai';
import * as ze from 'zeroeval';
ze.init();
const openai = ze.wrap(new OpenAI());
// Create a version-tracked prompt
const systemPrompt = await ze.prompt({
  name: 'customer-support',
  content: 'You are a helpful customer support agent for {{company}}.',
  variables: { company: 'TechCorp' }
});

// The integration automatically:
// 1. Extracts the prompt metadata
// 2. Links the completion to the prompt version
// 3. Patches the model if one is bound to the prompt version
const response = await openai.chat.completions.create({
  model: 'gpt-4', // May be replaced by bound model
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: 'I need help with my order' }
  ]
});
```
Need help? Check out our [GitHub examples](https://github.com/zeroeval/zeroeval-ts-sdk/tree/main/examples) or reach out on [Discord](https://discord.gg/MuExkGMNVz).
# Reference
Source: https://docs.zeroeval.com/tracing/sdks/typescript/reference
Complete API reference for the TypeScript SDK
## Installation
```bash theme={null}
npm install zeroeval
```
## Core Functions
### `init()`
Initializes the ZeroEval SDK. Must be called before using any other SDK features.
```typescript theme={null}
function init(opts?: InitOptions): void
```
#### Parameters
| Option | Type | Default | Description |
| -------------------- | ------------------------- | -------------------------- | --------------------------------------- |
| `apiKey` | `string` | `ZEROEVAL_API_KEY` env | Your ZeroEval API key |
| `apiUrl` | `string` | `https://api.zeroeval.com` | Custom API URL |
| `flushInterval` | `number` | `10` | Interval in seconds to flush spans |
| `maxSpans` | `number` | `100` | Maximum spans to buffer before flushing |
| `collectCodeDetails` | `boolean` | `true` | Capture source code context |
| `integrations`       | `Record<string, boolean>` | —                          | Enable/disable specific integrations    |
| `debug` | `boolean` | `false` | Enable debug logging |
#### Example
```typescript theme={null}
import * as ze from 'zeroeval';
ze.init({
  apiKey: 'your-api-key',
  debug: true
});
```
***
## Wrapper Functions
### `wrap()`
Wraps a supported AI client to automatically trace all API calls.
```typescript theme={null}
function wrap<T>(client: T): WrappedClient<T>
```
#### Supported Clients
* OpenAI SDK (`openai` package)
* Vercel AI SDK (`ai` package)
#### Examples
```typescript theme={null}
// OpenAI
import { OpenAI } from 'openai';
import * as ze from 'zeroeval';
const openai = ze.wrap(new OpenAI());
// Vercel AI SDK
import * as ai from 'ai';
import * as ze from 'zeroeval';
const wrappedAI = ze.wrap(ai);
```
***
## Spans API
### `withSpan()`
Wraps a function execution in a span, automatically capturing timing and errors.
```typescript theme={null}
function withSpan<T>(
  opts: SpanOptions,
  fn: () => Promise<T> | T
): Promise<T> | T
```
#### SpanOptions
| Option | Type | Required | Description |
| ------------- | ------------------------- | -------- | ------------------------------------- |
| `name` | `string` | Yes | Name of the span |
| `sessionId` | `string` | No | Session ID to associate with the span |
| `sessionName` | `string` | No | Human-readable session name |
| `tags`        | `Record<string, string>`  | No       | Tags to attach to the span            |
| `attributes`  | `Record<string, unknown>` | No       | Additional attributes                 |
| `inputData` | `any` | No | Manual input data override |
| `outputData` | `any` | No | Manual output data override |
#### Example
```typescript theme={null}
import * as ze from 'zeroeval';
const result = await ze.withSpan(
  { name: 'fetch-user-data' },
  async () => {
    const user = await fetchUser(userId);
    return user;
  }
);
```
### `@span` Decorator
Decorator for class methods to automatically create spans.
```typescript theme={null}
span(opts: SpanOptions): MethodDecorator
```
#### Example
```typescript theme={null}
import * as ze from 'zeroeval';
class UserService {
  @ze.span({ name: 'get-user' })
  async getUser(id: string): Promise<User> {
    return await db.users.findById(id);
  }
}
```
Requires `experimentalDecorators: true` in your `tsconfig.json`.
***
## Context Functions
### `getCurrentSpan()`
Returns the currently active span, if any.
```typescript theme={null}
function getCurrentSpan(): Span | undefined
```
### `getCurrentTrace()`
Returns the current trace ID.
```typescript theme={null}
function getCurrentTrace(): string | undefined
```
### `getCurrentSession()`
Returns the current session ID.
```typescript theme={null}
function getCurrentSession(): string | undefined
```
### `setTag()`
Sets tags on a span, trace, or session.
```typescript theme={null}
function setTag(
  target: Span | string | undefined,
  tags: Record<string, string>
): void
```
#### Parameters
| Parameter | Description |
| ----------- | --------------------------------------- |
| `Span` | Sets tags on the specific span |
| `string` | Sets tags on the trace or session by ID |
| `undefined` | Sets tags on the current span |
***
## Prompts API
### `prompt()`
Creates or fetches versioned prompts from the Prompt Library. Returns decorated content for downstream LLM calls.
```typescript theme={null}
async function prompt(options: PromptOptions): Promise<string>
```
#### PromptOptions
| Option | Type | Required | Description |
| ----------- | ------------------------ | -------- | -------------------------------------------------------------------- |
| `name` | `string` | Yes | Task name associated with the prompt |
| `content` | `string` | No | Raw prompt content (used as fallback or for explicit mode) |
| `variables` | `Record<string, string>` | No       | Template variables to interpolate `{{variable}}` tokens              |
| `from` | `string` | No | Version control: `"latest"`, `"explicit"`, or a 64-char SHA-256 hash |
#### Behavior
* **Auto-optimization (default)**: If `content` is provided without `from`, tries to fetch the latest optimized version first, falls back to provided content
* **Explicit mode** (`from: "explicit"`): Always uses provided `content`, bypasses auto-optimization
* **Latest mode** (`from: "latest"`): Requires an optimized version to exist, fails if none found
* **Hash mode** (`from: "<hash>"`): Fetches a specific version by its 64-character SHA-256 content hash
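A version hash is a 64-character hex digest. The snippet below shows what such a digest looks like; whether the platform hashes exactly the raw content string is an assumption here — in practice, copy version hashes from the ZeroEval dashboard:

```typescript
import { createHash } from 'node:crypto';

// Illustrative: derive a 64-char SHA-256 hex digest from prompt content.
function contentHash(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex');
}

const hash = contentHash('You are a helpful assistant.');
```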
#### Examples
```typescript theme={null}
import * as ze from 'zeroeval';
// Auto-optimization mode (recommended)
const prompt = await ze.prompt({
  name: 'customer-support',
  content: 'You are a helpful {{role}} assistant.',
  variables: { role: 'customer service' }
});

// Explicit mode - bypass auto-optimization
const prompt = await ze.prompt({
  name: 'customer-support',
  content: 'You are a helpful assistant.',
  from: 'explicit'
});

// Latest mode - require optimized version
const prompt = await ze.prompt({
  name: 'customer-support',
  from: 'latest'
});

// Hash mode - specific version
const prompt = await ze.prompt({
  name: 'customer-support',
  from: 'a1b2c3d4e5f6...' // 64-char SHA-256 hash
});
```
#### Return Value
Returns a decorated prompt string with metadata header used by integrations:
```
{"task":"...", "prompt_version": 1, ...}Your prompt content here
```
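The wrapped integrations strip this header before the content reaches the model. If you ever need to separate the two yourself, a brace-matching split like the following works for the format shown above; it is an illustrative helper, not an SDK export:

```typescript
// Split a decorated prompt into its leading JSON metadata object and the
// content that follows it. Illustrative only; integrations do this internally.
function splitDecoratedPrompt(
  decorated: string
): { meta: Record<string, unknown>; content: string } {
  if (!decorated.startsWith('{')) return { meta: {}, content: decorated };
  let depth = 0;
  let inString = false;
  for (let i = 0; i < decorated.length; i++) {
    const ch = decorated[i];
    if (inString) {
      if (ch === '\\') i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === '{') depth++;
    else if (ch === '}' && --depth === 0) {
      return {
        meta: JSON.parse(decorated.slice(0, i + 1)),
        content: decorated.slice(i + 1),
      };
    }
  }
  return { meta: {}, content: decorated };
}

const { meta, content } = splitDecoratedPrompt(
  '{"task":"customer-support","prompt_version":1}You are a helpful assistant.'
);
```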
#### Errors
| Error | When |
| --------------------- | -------------------------------------------------------------------------- |
| `Error` | Both `content` and `from` provided (except `from: "explicit"`), or neither |
| `PromptRequestError` | `from: "latest"` but no versions exist |
| `PromptNotFoundError` | `from` is a hash that does not exist |
***
### `sendFeedback()`
Sends feedback for a completion to enable prompt optimization.
```typescript theme={null}
async function sendFeedback(options: SendFeedbackOptions): Promise<void>
```
#### SendFeedbackOptions
| Option | Type | Required | Description |
| ---------------- | ------------------------- | -------- | ----------------------------------------- |
| `promptSlug` | `string` | Yes | The slug of the prompt (task name) |
| `completionId` | `string` | Yes | UUID of the span/completion |
| `thumbsUp` | `boolean` | Yes | `true` for positive, `false` for negative |
| `reason` | `string` | No | Explanation of the feedback |
| `expectedOutput` | `string` | No | What the expected output should be |
| `metadata`       | `Record<string, unknown>` | No       | Additional metadata                       |
| `judgeId` | `string` | No | Judge automation ID for judge feedback |
| `expectedScore` | `number` | No | Expected score for scored judges |
| `scoreDirection` | `'too_high' \| 'too_low'` | No | Score direction for scored judges |
#### Example
```typescript theme={null}
import * as ze from 'zeroeval';
await ze.sendFeedback({
promptSlug: 'support-bot',
completionId: '550e8400-e29b-41d4-a716-446655440000',
thumbsUp: false,
reason: 'Response was too verbose',
expectedOutput: 'A concise 2-3 sentence response'
});
```
***
## Signals API
### `sendSignal()`
Send a signal to a specific entity.
```typescript theme={null}
async function sendSignal(
entityType: 'session' | 'trace' | 'span' | 'completion',
entityId: string,
name: string,
value: string | boolean | number,
signalType?: 'boolean' | 'numerical'
): Promise<void>
```
### `sendTraceSignal()`
Send a signal to the current trace.
```typescript theme={null}
function sendTraceSignal(
name: string,
value: string | boolean | number,
signalType?: 'boolean' | 'numerical'
): void
```
### `sendSessionSignal()`
Send a signal to the current session.
```typescript theme={null}
function sendSessionSignal(
name: string,
value: string | boolean | number,
signalType?: 'boolean' | 'numerical'
): void
```
### `sendSpanSignal()`
Send a signal to the current span.
```typescript theme={null}
function sendSpanSignal(
name: string,
value: string | boolean | number,
signalType?: 'boolean' | 'numerical'
): void
```
### `getEntitySignals()`
Retrieve signals for a specific entity.
```typescript theme={null}
async function getEntitySignals(
entityType: 'session' | 'trace' | 'span' | 'completion',
entityId: string
): Promise<Record<string, Signal>>
```
#### Example
```typescript theme={null}
import * as ze from 'zeroeval';
await ze.withSpan({ name: 'process-request' }, async () => {
// Process something...
// Send signals
ze.sendSpanSignal('success', true);
ze.sendSpanSignal('latency_ms', 150);
ze.sendTraceSignal('user_satisfied', true);
});
```
***
## Utility Functions
### `renderTemplate()`
Render a template string with variable substitution.
```typescript theme={null}
function renderTemplate(
template: string,
variables: Record<string, string>,
options?: { ignoreMissing?: boolean }
): string
```
### `extractVariables()`
Extract variable names from a template string.
```typescript theme={null}
function extractVariables(template: string): Set<string>
```
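As an illustration of how these two utilities relate, the sketches below implement the same behavior, assuming a `{{variable}}` placeholder syntax (the exact template grammar is defined by the SDK, so treat these as approximations, not the real implementations):

```typescript
// Illustrative sketches only -- the real implementations live in the SDK.
// Assumes {{variable}} placeholder syntax.
function renderTemplateSketch(
  template: string,
  variables: Record<string, string>,
  options: { ignoreMissing?: boolean } = {}
): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) => {
    if (name in variables) return variables[name];
    if (options.ignoreMissing) return match; // leave the placeholder untouched
    throw new Error(`Missing template variable: ${name}`);
  });
}

function extractVariablesSketch(template: string): Set<string> {
  const names = new Set<string>();
  for (const match of template.matchAll(/\{\{\s*(\w+)\s*\}\}/g)) {
    names.add(match[1]);
  }
  return names;
}
```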
### `sha256Hex()`
Compute SHA-256 hash of text.
```typescript theme={null}
async function sha256Hex(text: string): Promise<string>
```
### `normalizePromptText()`
Normalize prompt text for consistent hashing.
```typescript theme={null}
function normalizePromptText(text: string): string
```
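The digest itself is standard SHA-256, so in Node you can reproduce it synchronously with `node:crypto` when you need to compare hashes outside the SDK. This is an independent sketch, not the SDK's `sha256Hex()` (which is async), and it assumes the input has already been normalized:

```typescript
import { createHash } from 'node:crypto';

// Standard SHA-256 hex digest -- should match what sha256Hex() returns
// for the same (already-normalized) input text.
function sha256HexSync(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}
```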
***
## Error Classes
### `PromptNotFoundError`
Thrown when a specific prompt version (by hash) is not found.
```typescript theme={null}
class PromptNotFoundError extends Error {
constructor(message: string)
}
```
### `PromptRequestError`
Thrown when a prompt request fails (e.g., no versions exist for `from: "latest"`).
```typescript theme={null}
class PromptRequestError extends Error {
constructor(message: string, statusCode?: number)
}
```
***
## Types
### `Prompt`
```typescript theme={null}
interface Prompt {
id: string;
prompt_id: string;
content: string;
content_hash: string;
version: number;
model_id?: string;
}
```
### `PromptMetadata`
```typescript theme={null}
interface PromptMetadata {
task: string;
prompt_slug?: string;
prompt_version?: number;
prompt_version_id?: string;
content_hash?: string;
variables?: Record<string, string>;
}
```
### `Signal`
```typescript theme={null}
interface Signal {
value: string | boolean | number;
type: 'boolean' | 'numerical';
}
```
Need help? Check out our [GitHub examples](https://github.com/zeroeval/zeroeval-ts-sdk/tree/main/examples) or reach out on [Discord](https://discord.gg/MuExkGMNVz).
# Setup
Source: https://docs.zeroeval.com/tracing/sdks/typescript/setup
Get started with ZeroEval tracing in TypeScript and JavaScript applications
The [ZeroEval TypeScript SDK](https://www.npmjs.com/package/zeroeval) provides tracing for Node.js applications through wrapper functions and integration callbacks.
## Installation
```bash npm theme={null}
npm install zeroeval
```
```bash yarn theme={null}
yarn add zeroeval
```
```bash pnpm theme={null}
pnpm add zeroeval
```
## Basic Setup
```typescript theme={null}
import * as ze from 'zeroeval';
// Option 1: read ZEROEVAL_API_KEY from an environment variable
ze.init();
// Option 2: Provide API key directly
ze.init({ apiKey: 'YOUR_API_KEY' });
// Option 3: With additional configuration
ze.init({
apiKey: 'YOUR_API_KEY',
apiUrl: 'https://api.zeroeval.com', // optional
flushInterval: 10, // seconds
maxSpans: 100,
});
```
## Patterns
The SDK offers two ways to add tracing to your TypeScript/JavaScript code:
### Function Wrapping
Use `withSpan()` to wrap function executions:
```typescript theme={null}
import * as ze from 'zeroeval';
// Wrap synchronous functions
const fetchData = (userId: string) =>
ze.withSpan({ name: 'fetch_data' }, () => ({
userId,
name: 'John Doe'
}));
// Wrap async functions
const processData = async (data: { name: string }) =>
ze.withSpan(
{
name: 'process_data',
attributes: { version: '1.0' }
},
async () => {
const result = await transform(data);
return `Welcome, ${result.name}!`;
}
);
// Complex workflows with nested spans
async function complexWorkflow() {
return ze.withSpan({ name: 'data_pipeline' }, async () => {
const data = await ze.withSpan(
{ name: 'fetch_stage' },
fetchExternalData
);
const processed = await ze.withSpan(
{ name: 'process_stage' },
() => transformData(data)
);
const result = await ze.withSpan(
{ name: 'save_stage' },
() => saveToDatabase(processed)
);
return result;
});
}
```
### Decorators
Use the `@span` decorator for class methods:
```typescript theme={null}
import { span } from 'zeroeval';
class DataService {
@span({
name: 'fetch_user_data',
tags: { service: 'user_api' }
})
async fetchUser(userId: string) {
const response = await fetch(`/api/users/${userId}`);
return response.json();
}
@span({
name: 'process_order',
attributes: { version: '2.0' }
})
processOrder(orderId: string, items: string[]) {
return { orderId, processed: true };
}
}
```
**Decorators require TypeScript configuration**: Enable `experimentalDecorators` in your `tsconfig.json`:
```json theme={null}
{
"compilerOptions": {
"experimentalDecorators": true
}
}
```
When using runtime tools like `tsx` or `ts-node`, pass the `--experimental-decorators` flag.
## Sessions
Group related spans into sessions:
```typescript theme={null}
import { v4 as uuidv4 } from 'uuid';
import * as ze from 'zeroeval';
const sessionId = uuidv4();
async function userJourney(userId: string) {
return ze.withSpan(
{
name: 'user_journey',
sessionId: sessionId,
sessionName: 'User Onboarding'
},
async () => {
// All nested spans inherit the session
await ze.withSpan({ name: 'step_1' }, () => welcome(userId));
await ze.withSpan({ name: 'step_2' }, () => setupProfile(userId));
await ze.withSpan({ name: 'step_3' }, () => sendConfirmation(userId));
}
);
}
```
## Context
Access current context information:
```typescript theme={null}
import * as ze from 'zeroeval';
// Get the current span
const currentSpan = ze.getCurrentSpan();
// Get the current trace ID
const traceId = ze.getCurrentTrace();
// Get the current session ID
const sessionId = ze.getCurrentSession();
```
## Tagging
Attach tags for filtering and organization:
```typescript theme={null}
import * as ze from 'zeroeval';
// Set tags on the current span
ze.setTag(undefined, { user_id: '12345', environment: 'production' });
// Set tags on a specific trace
const traceId = ze.getCurrentTrace();
if (traceId) {
ze.setTag(traceId, { feature: 'checkout' });
}
// Set tags on a span object
const span = ze.getCurrentSpan();
if (span) {
ze.setTag(span, { action: 'process_payment' });
}
```
## Advanced Configuration
Fine-tune the SDK behavior:
```typescript theme={null}
import * as ze from 'zeroeval';
ze.init({
apiKey: 'YOUR_API_KEY',
apiUrl: 'https://api.zeroeval.com',
flushInterval: 5, // Flush every 5 seconds
maxSpans: 200, // Buffer up to 200 spans
collectCodeDetails: true, // Capture source code context
debug: false, // Enable debug logging
integrations: {
openai: true, // Enable OpenAI integration
vercelAI: true, // Enable Vercel AI SDK integration
}
});
```
Need help? Check out our [GitHub examples](https://github.com/zeroeval/zeroeval-ts-sdk/tree/main/examples) or reach out on [Discord](https://discord.gg/MuExkGMNVz).
# Sessions
Source: https://docs.zeroeval.com/tracing/sessions
Group related spans into sessions for better organization and analysis
Sessions provide a powerful way to group related spans together, making it easier to track and analyze complex workflows, user interactions, or multi-step processes. This guide covers everything you need to know about working with sessions.
For complete API documentation, see the [Python SDK Reference](/tracing/sdks/python/reference).
## Creating Sessions
### Basic Session with ID
The simplest way to create a session is by providing a session ID:
```python theme={null}
import uuid
import zeroeval as ze
# Generate a unique session ID
session_id = str(uuid.uuid4())
@ze.span(name="process_request", session=session_id)
def process_request(data):
# This span belongs to the session
return transform_data(data)
```
### Named Sessions
For better organization in the ZeroEval dashboard, you can provide both an ID and a descriptive name:
```python theme={null}
@ze.span(
name="user_interaction",
session={
"id": session_id,
"name": "Customer Support Chat - User #12345"
}
)
def handle_support_chat(user_id, message):
# Process the support request
return generate_response(message)
```
## Session Inheritance
Child spans automatically inherit the session from their parent span:
```python theme={null}
session_info = {
"id": str(uuid.uuid4()),
"name": "Order Processing Pipeline"
}
@ze.span(name="process_order", session=session_info)
def process_order(order_id):
# These nested calls automatically belong to the same session
validate_order(order_id)
charge_payment(order_id)
fulfill_order(order_id)
@ze.span(name="validate_order")
def validate_order(order_id):
# Automatically part of the parent's session
return check_inventory(order_id)
@ze.span(name="charge_payment")
def charge_payment(order_id):
# Also inherits the session
return process_payment(order_id)
```
## Advanced Session Patterns
### Multi-Agent RAG System
Track complex retrieval-augmented generation workflows with multiple specialized agents:
```python theme={null}
session = {
"id": str(uuid.uuid4()),
"name": "Multi-Agent RAG Pipeline"
}
@ze.span(name="rag_coordinator", session=session)
async def process_query(query):
# Retrieval
docs = await retrieval_agent(query)
# Reranking
ranked = await reranking_agent(query, docs)
# Generation
response = await generation_agent(query, ranked)
return response
@ze.span(name="retrieval_agent")
async def retrieval_agent(query):
# Inherits session from parent
embeddings = await embed(query)
return await vector_search(embeddings)
@ze.span(name="generation_agent")
async def generation_agent(query, context):
return await llm.generate(query, context)
```
### Conversational AI Session
Track a complete conversation with an AI assistant:
```python theme={null}
class ChatSession:
def __init__(self, user_id):
self.session = {
"id": f"chat-{user_id}-{uuid.uuid4()}",
"name": f"AI Chat - User {user_id}"
}
self.history = []
@ze.span(name="process_message", session=lambda self: self.session)
async def process_message(self, message):
# Add to history
self.history.append({"role": "user", "content": message})
# Generate response
response = await self.generate_response()
self.history.append({"role": "assistant", "content": response})
return response
@ze.span(name="generate_response", session=lambda self: self.session)
async def generate_response(self):
return await llm.chat(self.history)
```
### Batch LLM Processing
Process multiple documents with LLMs in a single session:
```python theme={null}
async def batch_summarize(documents):
session = {
"id": f"batch-{uuid.uuid4()}",
"name": f"Batch Summarization - {len(documents)} docs"
}
@ze.span(name="batch_processor", session=session)
async def process():
summaries = []
for i, doc in enumerate(documents):
with ze.span(name=f"summarize_doc_{i}", session=session) as span:
try:
summary = await llm.summarize(doc)
span.set_io(
input_data=f"Doc: {doc['title']}",
output_data=summary[:100]
)
summaries.append(summary)
except Exception as e:
span.set_error(
code=type(e).__name__,
message=str(e)
)
return summaries
return await process()
```
## Context Manager Sessions
You can also use sessions with the context manager pattern:
```python theme={null}
session_info = {
"id": str(uuid.uuid4()),
"name": "Data Pipeline Run"
}
with ze.span(name="etl_pipeline", session=session_info) as pipeline_span:
# Extract phase
with ze.span(name="extract_data") as extract_span:
raw_data = fetch_from_source()
extract_span.set_io(output_data=f"Extracted {len(raw_data)} records")
# Transform phase
with ze.span(name="transform_data") as transform_span:
clean_data = transform_records(raw_data)
transform_span.set_io(
input_data=f"{len(raw_data)} raw records",
output_data=f"{len(clean_data)} clean records"
)
# Load phase
with ze.span(name="load_data") as load_span:
result = save_to_destination(clean_data)
load_span.set_io(output_data=f"Loaded to {result['location']}")
```
# Signals
Source: https://docs.zeroeval.com/tracing/signals
Capture real-world feedback and metrics to enrich your traces, spans, and sessions.
Signals are any piece of user feedback, behavior, or metric you care about – thumbs-up, a 5-star rating, dwell time, task completion, error rates … you name it. Signals help you understand how your AI system performs in the real world by connecting user outcomes to your traces.
You can attach signals to:
* **Completions** (LLM responses)
* **Spans** (individual operations)
* **Sessions** (user interactions)
* **Traces** (entire request flows)
For complete signals API documentation, see the [Python SDK Reference](/tracing/sdks/python/reference#signals).
## Using signals in code
### With the Python SDK
```python theme={null}
import zeroeval as ze
# Initialize the tracer
ze.init(api_key="your-api-key")
# Start a span and add a signal
with ze.trace("user_query") as span:
# Your AI logic here
response = process_user_query(query)
# Add a signal to the current span
ze.set_signal("user_satisfaction", True)
ze.set_signal("response_quality", 4.5)
ze.set_signal("task_completed", "success")
```
### Setting signals on different targets
```python theme={null}
# On the current span
ze.set_signal("helpful", True)
# On a specific span
span = ze.current_span()
ze.set_signal(span, {"rating": 5, "category": "excellent"})
# On the current trace
ze.set_trace_signal("conversion", True)
# On the current session
ze.set_session_signal("user_engaged", True)
```
## API endpoint
For direct API calls, send signals to:
```
POST https://api.zeroeval.com/workspaces/{workspace_id}/signals
```
Auth is the same bearer API key you use for tracing.
### Payload schema
| field | type | required | notes |
| -------------- | ------------------------------ | -------- | ---------------------------------------------- |
| completion\_id | string | ❌ | **OpenAI completion ID** (for LLM completions) |
| span\_id | string | ❌ | **Span ID** (for specific spans) |
| trace\_id | string | ❌ | **Trace ID** (for entire traces) |
| session\_id | string | ❌ | **Session ID** (for user sessions) |
| name | string | ✅ | e.g. `user_satisfaction` |
| value | string \| bool \| int \| float | ✅ | your data – see examples below |
You must provide at least one of: `completion_id`, `span_id`, `trace_id`, or
`session_id`.
## Common signal patterns
Below are some quick copy-paste snippets for the most common cases.
### 1. Binary feedback (👍 / 👎)
```python Python SDK theme={null}
import zeroeval as ze
# On current span
ze.set_signal("thumbs_up", True)
# On specific span
ze.set_signal(span, {"helpful": False})
```
```python API theme={null}
import requests
payload = {
"span_id": span.id,
"name": "thumbs_up",
    "value": True  # or False
}
requests.post(
f"https://api.zeroeval.com/workspaces/{WORKSPACE_ID}/signals",
json=payload,
headers={"Authorization": f"Bearer {ZE_API_KEY}"}
)
```
### 2. Star rating (1–5)
```python theme={null}
ze.set_signal("star_rating", 4)
```
### 3. Continuous metrics
```python theme={null}
# Response time
ze.set_signal("response_time_ms", 1250.5)
# Task completion time
ze.set_signal("time_on_task_sec", 12.85)
# Accuracy score
ze.set_signal("accuracy", 0.94)
```
### 4. Categorical outcomes
```python theme={null}
ze.set_signal("task_status", "success")
ze.set_signal("error_type", "timeout")
ze.set_signal("user_intent", "purchase")
```
### 5. Session-level signals
```python theme={null}
# Track user engagement across an entire session
ze.set_session_signal("pages_visited", 5)
ze.set_session_signal("converted", True)
ze.set_session_signal("user_tier", "premium")
```
### 6. Trace-level signals
```python theme={null}
# Track outcomes for an entire request flow
ze.set_trace_signal("request_successful", True)
ze.set_trace_signal("total_cost", 0.045)
ze.set_trace_signal("model_used", "gpt-4o")
```
## Signal types
Signals are automatically categorized based on their values:
* **Boolean**: `true`/`false` values → useful for success/failure, yes/no feedback
* **Numerical**: integers and floats → useful for ratings, scores, durations, costs
* **Categorical**: strings → useful for status, categories, error types
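The mapping is driven entirely by the value you pass. As a rough sketch (illustrative only, not the SDK's actual code), the rule looks like this:

```python
def categorize_signal(value):
    """Illustrative sketch of how signal values map to signal types."""
    # bool must be checked before int: in Python, bool is a subclass of int,
    # so isinstance(True, int) is also True.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numerical"
    return "categorical"
```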
## Putting it all together
```python theme={null}
import zeroeval as ze
# Initialize tracing
ze.init(api_key="your-api-key")
# Start a session for user interaction
with ze.trace("user_chat_session", session_name="Customer Support") as session:
# Process user query
with ze.trace("process_query") as span:
response = llm_client.chat.completions.create(...)
# Signal on the LLM completion
ze.set_signal("response_generated", True)
ze.set_signal("response_length", len(response.choices[0].message.content))
# Capture user feedback
user_rating = get_user_feedback() # Your feedback collection logic
# Signal on the session
ze.set_session_signal("user_rating", user_rating)
ze.set_session_signal("issue_resolved", user_rating >= 4)
# Signal on the entire trace
ze.set_trace_signal("interaction_complete", True)
```
That's it! Your signals will appear in the ZeroEval dashboard, helping you understand how your AI system performs in real-world scenarios.
# Tags
Source: https://docs.zeroeval.com/tracing/tagging
Simple ways to attach rich, query-able tags to your traces.
Tags are key–value pairs that can be attached to any **span**, **trace**, or **session**. They power the facet filters in the console so you can slice-and-dice your telemetry by *user*, *plan*, *model*, *tenant*, or anything else that matters to your business.
For complete tagging API documentation, see the [Python SDK Reference](/tracing/sdks/python/reference#tags).
## 1. Tag once, inherit everywhere
When you add a `tags` dictionary to the **first** span you create, every child span automatically gets the same tags. That means you set them once and they flow down the entire call-stack.
```python theme={null}
import zeroeval as ze
@ze.span(
name="handle_request",
tags={
"user_id": "42", # who triggered the request
"tenant": "acme-corp", # multi-tenant identifier
"plan": "enterprise" # commercial plan
}
)
def handle_request():
authenticate()
fetch_data()
process()
# Two nested child spans – they automatically inherit *all* the tags
with ze.span(name="fetch_data"):
...
with ze.span(name="process", tags={"stage": "post"}):
...
```
## 2. Tag a single span
If you want to tag only a **single** span (or override a tag inherited from a parent) simply provide the `tags` argument on that specific decorator or context manager.
```python theme={null}
import zeroeval as ze
@ze.span(name="top_level")
def top_level():
# Child span with its own tags – *not* inherited by siblings
with ze.span(name="db_call", tags={"table": "customers", "operation": "SELECT"}):
query_database()
# Another child span without tags – it has no knowledge of the db_call tags
with ze.span(name="render"):
render_template()
```
Under the hood these tags live only on that single span, they are **not** copied to siblings or parents.
## 3. Granular tagging (session, trace, or span)
You can add granular tags at the session, trace, or span level after they've been created:
```python theme={null}
import uuid
from langchain_core.messages import HumanMessage
import zeroeval as ze
DEMO_TAGS = {"example": "langgraph_tags_demo", "project": "zeroeval"}
SESSION_ID = str(uuid.uuid4())
SESSION_INFO = {"id": SESSION_ID, "name": "Tags Demo Session"}
with ze.span(
name="demo.root_invoke",
session=SESSION_INFO,
tags={**DEMO_TAGS, "run": "invoke"},
):
# 1️⃣ Tag the *current* span only
current_span = ze.get_current_span()
ze.set_tag(current_span, {"phase": "pre-run"})
# 2️⃣ Tag the whole trace – root + all children (past *and* future)
current_trace = ze.get_current_trace()
ze.set_tag(current_trace, {"run_mode": "invoke"})
# 3️⃣ Tag the entire session
current_session = ze.get_current_session()
ze.set_tag(current_session, {"env": "local"})
result = app.invoke({"messages": [HumanMessage(content="hello")], "count": 0})
```