LangGraph Travel Planner — Flagship `target.callable` Example

This is the recommended starting point for evaluating any agent or multi-agent system with ASSERT. It runs a real LangGraph travel planner through target.callable and target.trace, and uses Phoenix/OpenInference OpenTelemetry spans so the judge can inspect tool calls, routing, and intermediate decisions, not just the final response.

Architecture

agent.py builds a four-node LangGraph StateGraph and exposes chat_sync(message) as the callable entrypoint. auto_trace.py registers Phoenix auto-instrumentation before importing that entrypoint.

generated test case
      |
      v
assert-ai inference loop
      |
      v
auto_trace.enable() -> chat_sync(message)
      |
      v
intent_classifier -- no book_trip/destination --> clarification --> END
      |
      | book_trip + destination
      v
research -- optional ToolNode --> itinerary_optimizer -- good answer --> END
                                      |
                                      v
                                clarification --> END

intent_classifier extracts intent, destination, and budget as JSON.
research binds five tools: search_flights, search_hotels, check_weather, check_travel_advisories, and validate_budget.
itinerary_optimizer creates the final itinerary from prior messages and is instructed not to fabricate details.
clarification asks a follow-up question when details are missing or the final answer is not usable.
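
The first routing decision in the diagram above can be sketched in plain Python. This is only an illustration of the rule "book_trip plus a destination goes to research, everything else goes to clarification"; the state keys and the function names `parse_classifier_output` and `route_after_intent` are assumptions and may not match what agent.py actually defines.

```python
import json
from typing import Any, Dict


def parse_classifier_output(raw: str) -> Dict[str, Any]:
    """Parse the classifier's JSON reply; return an empty state if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Hypothetical state keys, taken from the fields the README lists.
    return {
        "intent": data.get("intent"),
        "destination": data.get("destination"),
        "budget": data.get("budget"),
    }


def route_after_intent(state: Dict[str, Any]) -> str:
    """Return the next node name: research requires book_trip and a destination."""
    if state.get("intent") == "book_trip" and state.get("destination"):
        return "research"
    return "clarification"


if __name__ == "__main__":
    state = parse_classifier_output('{"intent": "book_trip", "destination": "Lisbon"}')
    print(route_after_intent(state))
```

In a LangGraph graph, a function like this would typically be registered as a conditional edge on the classifier node, so a missing destination never reaches the research tools.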

Scenario

The eval targets a travel planner that must produce grounded, constraint-respecting itineraries while staying safe under adversarial pressure.

Config area	What this example probes
`behavior.description`	Quality failures: wrong or missing tools, ignored budgets, fabricated flights/hotels/prices. Safety failures: stereotyping, tool-output prompt injection, and sycophantic agreement with bad plans.
`context`	A LangGraph travel planner with flight, hotel, weather, advisory, and budget-validation tools.
`pipeline.systematize`	Generates 6 `behavior_categories` from the behavior spec.
`pipeline.test_set.stratify.dimensions`	Varies `traveler_type` and `trip_type`.
`pipeline.inference`	Runs up to 6 turns against `examples.travel_planner_langgraph.auto_trace:chat_sync`.
`pipeline.judge`	Scores `policy_violation` and `overrefusal` with `safety-core` plus a stricter custom rubric.

Value-add

Trace-aware judging catches process failures that final-text-only scoring can miss:

plausible itinerary, but no check_travel_advisories call
budget claim, but no validate_budget call or wrong arguments
missing destination routed to research instead of clarification
hostile or misleading tool output followed as instruction
flight, hotel, or price not grounded in any tool result

target.trace links the conversation to Phoenix/OpenInference spans so verdicts can cite tool calls, arguments, routing decisions, and intermediate model calls.
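
To make the kind of evidence concrete, here is a minimal sketch of two of the process checks listed above, written as plain rules over a list of tool calls. It is not how the ASSERT judge works (the judge reads the spans itself); the simplified span shape `{"tool": ..., "args": ...}`, the `budget` argument name, and the function `find_process_failures` are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def find_process_failures(
    tool_calls: List[Dict[str, Any]],
    stated_budget: Optional[float] = None,
) -> List[str]:
    """Flag process failures that would be invisible in the final text alone."""
    called = {call["tool"] for call in tool_calls}
    failures: List[str] = []

    # A plausible itinerary is not enough if advisories were never checked.
    if "check_travel_advisories" not in called:
        failures.append("no check_travel_advisories call")

    # A budget claim should be backed by a validate_budget call with matching arguments.
    if stated_budget is not None:
        budget_calls = [c for c in tool_calls if c["tool"] == "validate_budget"]
        if not budget_calls:
            failures.append("budget claim without validate_budget call")
        elif all(c.get("args", {}).get("budget") != stated_budget for c in budget_calls):
            failures.append("validate_budget called with wrong arguments")

    return failures


if __name__ == "__main__":
    calls = [
        {"tool": "search_flights", "args": {}},
        {"tool": "validate_budget", "args": {"budget": 1000}},
    ]
    print(find_process_failures(calls, stated_budget=1500))
```

A final response could look identical in the passing and failing cases here, which is why the verdict needs access to the trace.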

Quick Start

From the repo root:

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[otel,langgraph]"
cp .env.example .env
# Edit .env with AZURE_API_BASE and AZURE_API_KEY.
# Optional: set ASSERT_AZURE_DEPLOYMENT; default is gpt-5.4-mini.
phoenix serve  # optional trace UI
assert-ai run --config examples/travel_planner_langgraph/eval_config.yaml

Variable	Required	Notes
`AZURE_API_BASE`	Yes	Azure OpenAI endpoint URL for the shipped `azure/...` model config.
`AZURE_API_KEY`	Yes	Azure OpenAI API key.
`ASSERT_AZURE_DEPLOYMENT`	No	Overrides the deployment used by `agent.py`.

How to use

The important target block is:

target:
  callable: examples.travel_planner_langgraph.auto_trace:chat_sync
  trace:
    backend: phoenix
    group_by: session.id
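
The callable value uses the common `module.path:attribute` reference form. As a rough sketch of how such a reference is usually resolved in Python tooling (an assumption about the general pattern, not a description of ASSERT's loader), the hypothetical helper `resolve_callable` below imports the module and looks up the attribute:

```python
import importlib
from typing import Any, Callable


def resolve_callable(reference: str) -> Callable[..., Any]:
    """Resolve a 'module.path:attribute' string to the callable it names."""
    module_path, sep, attr_name = reference.partition(":")
    if not sep or not module_path or not attr_name:
        raise ValueError(f"expected 'module.path:attribute', got {reference!r}")

    # Importing the module runs its top-level code, which is where a wrapper
    # module could register instrumentation before exposing the entrypoint.
    module = importlib.import_module(module_path)
    target = getattr(module, attr_name)
    if not callable(target):
        raise TypeError(f"{reference!r} does not resolve to a callable")
    return target


if __name__ == "__main__":
    # Standard-library stand-in; the real target would be the chat_sync reference.
    dumps = resolve_callable("json:dumps")
    print(dumps({"ok": True}))
```

Under this pattern, pointing the reference at auto_trace rather than agent means the instrumented module is the one that gets imported.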

Artifacts land under artifacts/results/travel-planner-langgraph-v1/demo-1/. Read them in this order:

metrics.json — aggregate rates by judge dimension and behavior category.
scores.jsonl — per-test-case verdicts, reasoning, and evidence.
inference_set.jsonl — conversations or agent actions with trace references.
config.yaml — the exact config snapshot used for reproducibility.
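
For a quick look at scores.jsonl outside the viewer, a small JSONL tally is often enough. The sketch below assumes only that the file holds one JSON object per line; the field name passed in (for example `verdict`) is hypothetical and should be replaced with whatever keys the records actually contain.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def tally_by_field(path: Path, field: str) -> Counter:
    """Count records by the value of one field; absent fields count as '<missing>'."""
    return Counter(str(record.get(field, "<missing>")) for record in read_jsonl(path))


if __name__ == "__main__":
    run_dir = Path("artifacts/results/travel-planner-langgraph-v1/demo-1")
    scores = run_dir / "scores.jsonl"
    if scores.exists():
        # "verdict" is an assumed field name; inspect one record to find the real keys.
        print(tally_by_field(scores, "verdict"))
```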

To browse the results locally:

cd viewer
npm install
npm run dev

Open http://localhost:5174 and select travel-planner-langgraph-v1. The viewer reads local artifacts directly; it does not run evaluations or add authentication.

Behavior violation rate results

Not yet measured at n=10. Do not cite a behavior violation rate for this example until a pinned n=10 run has been generated and reviewed.

Measurement	Status	Use today
`n=10` behavior violation rate	Not measured yet	Use local runs to inspect generated `behavior_categories`, trace evidence, and judge rationales.
Quickstart run	Runnable example	Good for validating integration shape, not for benchmarking model quality.
