Use the callable target for any agent or multi-agent system with a Python entry function. This is the universal integration boundary — frameworks (LangGraph, CrewAI, OpenAI Agents SDK, DSPy, LlamaIndex, AutoGen / MAF, …), custom orchestration, REST clients, and thin model wrappers all qualify.
The callable target has two integration paths:
- Recommended (happy path): OTel-traced agent — central auto-instrumentation helper across supported OpenInference frameworks. The judge cites tool calls, routing decisions, model calls, and latency as evidence.
- Customization: for unsupported frameworks (emit your own OTel spans) or for cases where instrumentation is impossible or unnecessary (plain callable / HTTP endpoint, no traces).
Pick the path that exposes enough internals for the judge to score what matters. OTel is recommended because every other path is strictly narrower.
| Observability for the judge | Plain str return |
LiteLLM-style response | OTel traces (recommended) |
|---|---|---|---|
| Final response text | ✅ | ✅ | ✅ |
| Final tool calls (names + arguments) | — | ✅ | ✅ |
| Token usage | — | ✅ | ✅ |
| Model name | — | ✅ | ✅ |
| Intermediate tool calls (per step) | — | — | ✅ |
| Routing / sub-agent decisions | — | — | ✅ |
| Intermediate model calls | — | — | ✅ |
| Per-span latency | — | — | ✅ |
| Total | 1 / 8 | 4 / 8 | 8 / 8 |
When your agent emits OpenTelemetry spans, the judge can cite tool arguments, routing decisions, model calls, and latency as evidence — not just the final response. This is the integration shape every flagship example uses.
For 33 supported frameworks (OpenAI Agents SDK, LangChain/LangGraph, CrewAI, DSPy, LlamaIndex, AutoGen, MAF, Pydantic AI, Smolagents, Instructor, Haystack, …), instrumentation is a small helper call at the top of your callable module:
# e.g. examples/travel_planner_langgraph/auto_trace.py
from assert_ai import auto_trace
auto_trace.enable() # installs available OpenInference instrumentors without starting Phoenix
def chat_sync(message: str, history: list[dict[str, str]] | None = None) -> str:
return run_my_agent(message, history)Wire the target up in your config:
pipeline:
inference:
target:
callable: examples.travel_planner_langgraph.auto_trace:chat_sync
trace:
backend: phoenix
group_by: session.idSee examples/phoenix_auto_trace/ for one runnable file per framework.
The judge can only score what it sees. With final text only:
- it cannot tell if the agent used the right tool with the right arguments
- it cannot tell which sub-agent or branch made a decision
- "the answer was right but for the wrong reason" looks like a pass
With trace capture, the judge cites specific spans as evidence and catches process failures even when the surface answer looks fine.
The customization paths exist as fallbacks. The judge sees less, so use them only when (a) you cannot instrument the target or (b) you are validating the eval pipeline itself, not the agent.
If your framework is not in the auto-instrument list — or you have custom orchestration — emit OTel spans yourself with the OpenTelemetry SDK. ASSERT's target.trace block reads the same span data either way.
pipeline:
inference:
target:
callable: examples.travel_planner_neurosan.agent:plan_trip_sync
trace:
backend: phoenix
group_by: session.idexamples/travel_planner_neurosan/agent.py shows ~20 lines that wrap a multi-agent flow in tracer.start_as_current_span(...) calls following OpenInference semantic conventions. Same trace visibility as auto-instrumentation; the judge cannot tell the difference.
Omit target.trace only when:
- your target is a black-box API you cannot instrument
- you are smoke-testing a thin wrapper around a hosted model
- you are validating the eval pipeline itself, not the agent
For real agents this is not recommended — the visibility table above shows what the judge loses. To recover tool-call visibility without OTel, return the response object from LiteLLM (a unified Python interface supporting 100+ model providers — Azure OpenAI, Anthropic, Bedrock, Vertex, Ollama, …) directly:
import litellm
def chat(message: str, history: list[dict[str, str]]) -> "litellm.ModelResponse":
return litellm.completion(model="azure/gpt-4o-mini", messages=history)The judge then sees final tool calls, token usage, and model name — still narrower than OTel (no intermediate routing or sub-agent decisions).
Sync or async function with one of two signatures:
def chat(message: str) -> str: ... # single-turn
def chat(message: str, history: list[dict[str, str]]) -> str: # multi-turn
...history follows the OpenAI / LiteLLM chat-messages format, filtered to user / assistant roles only. The current user turn is at history[-1]; message is a convenience for callables that ignore history. System prompts are owned by your callable (target.system_prompt is consumed only by the Prompt Agent target).
To round-trip directly into LiteLLM, pass history as messages — do not re-append message (it is already at history[-1]):
import litellm
def chat(message: str, history: list[dict[str, str]]) -> str:
response = litellm.completion(model="azure/gpt-4o-mini", messages=history)
return response.choices[0].message.contentReturn types and what the judge sees:
| Return type | Judge sees |
|---|---|
str, or dict with text / content |
final response text only |
Any object with a .choices attribute — litellm.ModelResponse, OpenAI's ChatCompletion, etc. — or a assert_ai.core.model_client.ModelResponse returned directly |
final response text plus final tool calls, token usage, and model name (the .choices form is normalized to assert_ai.core.model_client.ModelResponse internally) |
When your agent runs as a service you cannot import as Python, point at its URL:
pipeline:
inference:
target:
endpoint: https://my-agent.internal/chatThe runtime POSTs {"message": "...", "history": [...]} (same history shape as above) and expects {"response": "..."} back. Same black-box visibility as a plain string-returning callable. Requires aiohttp (pip install aiohttp).