Requirements
Evals require a signed-in MCPJam account. As a guest you’ll see a sign-in prompt instead of the testing interface.
How it’s organized
Suite
A group of cases plus the defaults they share: attached servers, models to run against, default checks, judge config, and argument-matching mode.
Case
One scenario you want to verify — a prompt (or a sequence of prompts), the tools you expect to fire, an optional expected output, and the checks that decide pass/fail.
Authoring a case
Each case has four moving parts:
- Scenario — short label so you can find it later (“Draw a rectangle”, “Refuses unsafe delete”).
- Prompt turns — one or more user messages. Multi-turn is supported; expected tools and checks attach per turn, so you can model “ask, get a result, follow up.”
- Expected tools — for each turn, the tool calls the model should make. Arguments can be exact values or typed placeholders ("string", "number") so you’re not chasing flaky literals.
- Expected output — optional free-text description of what a good final answer looks like. Used by the judge.
A negative case is just a case whose checks say “this tool should not fire.” Meta questions (“what params does search take?”), conversational drift, and ambiguous prompts are the usual shape.
Checks — the deterministic gate
Checks are what actually decide pass/fail. They’re pure functions of the iteration transcript, so the verdict is the same every time you replay it — which is the property you want if you’re using Evaluate as a regression gate. Set defaults on the suite; override per case with inherit (use suite defaults), replace (use only the case’s list), or extend (suite defaults, then the case’s list).
| Check | Passes when |
|---|---|
| Tool was called with… | The named tool was called with arguments matching your spec (per the suite’s argumentMatching mode). |
| Tool was called at least once | The named tool fired at least once across the turn. |
| Tool was never called | The named tool did not fire. The core of negative cases. |
| First tool called was… | The first tool call in the turn matched the named tool. |
| Response contains… | The final assistant message contains a substring. |
| Response matches regex… | The final assistant message matches a regex. |
| No tool errors | No tool call returned an error. |
| Final message non-empty | The model produced a final assistant message. |
| Token budget under N | Total token usage stayed below the cap. |
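To make the "pure function of the transcript" idea concrete, here is a minimal sketch of three checks from the table and of the inherit / replace / extend override modes. The transcript shape (a dict with `tool_calls` and `final_message`) and every function name are assumptions made for illustration, not MCPJam's internal representation.

```python
from typing import Callable

# Hypothetical transcript shape: {"tool_calls": [{"name": ..., "args": ...}], "final_message": str}
Transcript = dict
Check = Callable[[Transcript], bool]


def tool_never_called(name: str) -> Check:
    # Passes when no tool call in the turn used this tool name.
    return lambda t: all(call["name"] != name for call in t["tool_calls"])


def first_tool_called_was(name: str) -> Check:
    # Passes when at least one call exists and the first one matches.
    return lambda t: bool(t["tool_calls"]) and t["tool_calls"][0]["name"] == name


def final_message_non_empty(t: Transcript) -> bool:
    return bool(t["final_message"].strip())


def resolve_checks(suite_defaults: list, case_checks: list, mode: str) -> list:
    # Mirrors the three override modes described above.
    if mode == "inherit":
        return list(suite_defaults)
    if mode == "replace":
        return list(case_checks)
    if mode == "extend":
        return list(suite_defaults) + list(case_checks)
    raise ValueError(f"unknown override mode: {mode}")


def passes(transcript: Transcript, checks: list) -> bool:
    # Same transcript in, same verdict out: no clock, network, or randomness.
    return all(check(transcript) for check in checks)


# A negative case: a meta question that should not trigger `search`.
suite_defaults = [final_message_non_empty]
case_checks = [tool_never_called("search")]
answered = {"tool_calls": [], "final_message": "search takes a query and an optional limit."}
misfired = {"tool_calls": [{"name": "search", "args": {}}], "final_message": "Here are results."}
```

With `extend`, the `answered` transcript passes and `misfired` fails; with `inherit`, both pass because only the suite default is applied.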
Argument matching
Tool-call argument comparison runs in one of three modes, configured at the suite:
- partial (default) — every expected key must be present and match; extra keys in the actual call are ignored. Best for “I care about query and limit, not what else the model put in.”
- exact — actual args must equal expected args, key-for-key.
- ignore — only the tool name is checked.
"string", "number", "boolean") match any value of that type, which lets you assert shape without locking in a literal.
Validator settings
Each suite has default validator settings controlling how tool calls are matched. Override them at three levels:
| Level | Where | Persists? |
|---|---|---|
| Suite default | Suite settings → Default validators | Yes |
| Case override | Test case editor → Validators | Yes |
| Run override | Suite header → validators (sliders) icon | No — one run |
LLM as judge
For cases where “did the right tools fire” isn’t enough — anything graded on the quality of the final answer — the judge grades the run against your expected output if you set one, and against the user prompt otherwise. It’s advisory: it produces a score and a rationale, but doesn’t gate the run unless you ask it to.
- On by default at the suite level. The cost is gated by an explicit Run judge click on the run-detail page — it won’t run for every iteration unless you turn auto-run on.
- Calibrate per suite. Judge scores aren’t comparable across domains; a 0.7 on one suite isn’t a 0.7 on another.
- When grading against the prompt rather than an expected output, scores are capped at 0.85 — you can’t get a “perfect” without saying what perfect means.
To run it, choose a judge model (for example openai/gpt-5.4-mini) and a threshold (default 0.7), then click Run judge. Each case gets a score, an advisory verdict, a one-line reason, and rubric hits. The judge never runs automatically unless you enable Auto-run in suite settings.
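The interaction between the 0.85 cap and the 0.7 default threshold is easy to get wrong, so here is a minimal sketch of it. The function name, the returned dict, the "pass" / "fail" labels, and the use of `>=` for the threshold comparison are all assumptions; only the two numbers come from the text above.

```python
PROMPT_ONLY_CAP = 0.85     # cap when grading against the prompt (from the docs above)
DEFAULT_THRESHOLD = 0.7    # default judge threshold (from the docs above)


def advisory_verdict(raw_score: float, has_expected_output: bool,
                     threshold: float = DEFAULT_THRESHOLD) -> dict:
    # Without an expected output the judge grades against the prompt,
    # and the score cannot exceed the cap.
    score = raw_score if has_expected_output else min(raw_score, PROMPT_ONLY_CAP)
    # Assumption: a score equal to the threshold counts as a pass.
    verdict = "pass" if score >= threshold else "fail"
    return {"score": score, "verdict": verdict}
```

One practical consequence: a suite threshold set above 0.85 can never be met by cases that have no expected output.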
Running
A run needs three things, all picked from the suite header:
- Servers — one or more attached to the suite. Cases can attach their own subsets if they only need part of the surface.
- Models — the multi-model picker is the whole point. Each model produces its own iteration per case, so you can see where Claude passes and ChatGPT trips.
- Run all — kicks off every case × every model. Run one from a case row runs just that case.
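The size of a run follows directly from the case × model product. A tiny sketch, using hypothetical case and model names and assuming that Run one still fans out across every selected model:

```python
from itertools import product


def plan_iterations(cases: list, models: list) -> list:
    # One iteration per (case, model) pair.
    return list(product(cases, models))


cases = ["Draw a rectangle", "Refuses unsafe delete", "Meta question"]
models = ["model-a", "model-b"]  # placeholder names, not real model IDs

run_all = plan_iterations(cases, models)       # every case x every model
run_one = plan_iterations(cases[:1], models)   # a single case row
```

Three cases against two models give six iterations, which is worth keeping in mind when estimating token cost.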
Frozen execution snapshots
The first run of a suite saves the set of MCP servers used as a frozen snapshot; reruns reuse it, so connecting new servers can’t silently change what a suite runs against. Click Update snapshot in the suite header to re-save the current servers and start a new run.
Suite execution config
Each eval suite has a Default Execution Config section that controls the model, system prompt, temperature, tool approval, connection settings, capabilities, and host context used when running the suite.
- Model and prompt — Set the default model ID, system prompt, and temperature for all runs in the suite.
- Tool approval — Toggle requireToolApproval to pause before each tool call during a run.
- Server selection — Servers are not configured here. They come from the suite’s environment. The server picker is intentionally hidden in this editor.
- Save / Reset — Click Save config to persist changes. Click Reset to revert to the last saved state. Unsaved edits are preserved if the page refreshes the config from the server, but are discarded when you switch to a different suite.
How suite defaults apply to test cases
The suite-level system prompt and temperature are runtime defaults: when a test case does not set its own system prompt or temperature, the suite values are used for that iteration. A per-case override always wins — the suite default only fills the gap.
If you have existing suites where test cases do not specify a system prompt, those cases will now run with the suite’s system prompt applied. Cases that already set their own system prompt are unaffected.
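The fallback rule can be sketched in a few lines. The key names (`system_prompt`, `temperature`) and the convention that `None` means "not set" are assumptions for illustration; the actual config fields may differ.

```python
from typing import Any

INHERITED_KEYS = ("system_prompt", "temperature")


def effective_settings(suite: dict, case: dict) -> dict:
    resolved: dict[str, Any] = {}
    for key in INHERITED_KEYS:
        case_value = case.get(key)
        # Compare against None rather than truthiness, so an explicit
        # temperature of 0 on the case is kept and not replaced by the suite value.
        resolved[key] = case_value if case_value is not None else suite.get(key)
    return resolved


suite_config = {"system_prompt": "You are a careful assistant.", "temperature": 0.7}
case_with_override = {"temperature": 0}
case_without_override = {}
```

In this sketch `case_with_override` keeps its temperature of 0 and picks up the suite's system prompt, while `case_without_override` inherits both values.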
Reading results
Suite view
- Suite accuracy — pass rate of the most recent run, with the last three runs’ trend so you can see whether you’re improving or regressing.
- Run insights — an AI-written diff against your previous completed run: which cases moved, which tools changed behavior, which models diverged. Skim this first; it usually points at the right rabbit hole.
- Runs tab — every run with its summary metrics. Click in for the iteration-level breakdown.
- Cases tab — every case with its latest verdict and a quick replay button.
- Executions tab — a flat, filterable list of every individual test execution across all cases, sorted most-recent-first. Each row shows the case name, result (passed/failed/pending/cancelled), and timestamp. Click any row to open it in the compare view.
Cross-host matrix
When two or more host configurations are attached to a suite, a By case / By host toggle appears. By host shows a matrix — one column per host, one row per case. Each cell shows pass/fail dots, pass rate, median latency, and token usage. A host detached after runs were recorded stays visible, labelled historical.
Run view
- Per-iteration row with case, model, pass/fail, tokens, duration, and the tool calls that actually happened.
- Expected vs actual tool calls side-by-side when an iteration fails. For failed iterations, a Categorized diff above the raw Expected/Actual grids groups discrepancies into four categories:
  - Missing — expected tool calls that were never made
  - Extra — actual calls that weren’t expected (reported but non-fatal by default)
  - Out of order — calls that happened in the wrong sequence (when order checking is enabled)
  - Arg mismatch — right tool name, wrong arguments (shown side-by-side)
- Full trace — every turn, every tool call, every token. This is the thing you couldn’t see before; spend time here.
- Per-model breakdown to compare how the same case behaves across models.
- AI Triage — after a run completes, an AI Triage panel ranks tool-quality and workflow issues by impact. Each issue has a Copy button that copies a ready-to-use fix prompt (tool description + input schema) for a coding agent; Copy top 3 copies the top three combined.
- Predicate Gate — when successPredicates are configured on a case, an expandable Predicate Gate section in the iteration detail lists each predicate with a PASS/FAIL verdict, a one-line summary, and the evaluator’s reason. Hidden when no predicates are configured.
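The four buckets of the categorized diff described in the Run view list (Missing, Extra, Out of order, Arg mismatch) can be approximated with a simple pairing pass. This is one plausible way to compute such a diff, not MCPJam's algorithm: it pairs each expected call with the first unused actual call of the same name and compares arguments with plain equality instead of the suite's argument-matching mode. The tool names are hypothetical.

```python
def categorize(expected: list, actual: list, check_order: bool = False) -> dict:
    diff = {"missing": [], "extra": [], "out_of_order": [], "arg_mismatch": []}
    used = set()    # indices of actual calls already paired
    paired = []     # (tool name, actual index), in expected order

    for exp in expected:
        idx = next(
            (i for i, act in enumerate(actual) if i not in used and act["name"] == exp["name"]),
            None,
        )
        if idx is None:
            diff["missing"].append(exp["name"])
            continue
        used.add(idx)
        paired.append((exp["name"], idx))
        if actual[idx]["args"] != exp["args"]:
            diff["arg_mismatch"].append(
                {"tool": exp["name"], "expected": exp["args"], "actual": actual[idx]["args"]}
            )

    # Anything left unpaired on the actual side was not expected.
    diff["extra"] = [act["name"] for i, act in enumerate(actual) if i not in used]

    if check_order:
        highest = -1
        for name, idx in paired:
            # A paired call that appears before an earlier expectation's call is out of order.
            if idx < highest:
                diff["out_of_order"].append(name)
            highest = max(highest, idx)
    return diff


expected_calls = [
    {"name": "search", "args": {"q": "cats"}},
    {"name": "fetch", "args": {"id": 1}},
    {"name": "summarize", "args": {}},
]
actual_calls = [
    {"name": "fetch", "args": {"id": 2}},
    {"name": "search", "args": {"q": "cats"}},
    {"name": "log", "args": {}},
]
result = categorize(expected_calls, actual_calls, check_order=True)
```

A single call can land in more than one bucket: in this example `fetch` is both out of order and an argument mismatch.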
Comparing two runs
Select any two completed runs and click Compare. The diff view shows per-case status changes: Passed, Still failing, Regressed (pass→fail), Fixed (fail→pass), New, Removed, and Changed (config differed). Summary metrics show deltas for tokens, cost, and duration.
Case view
- Pass rate across runs — is this case stable, flaky, or trending down?
- Pass rate by model — does this case only work on one model?
- Every past iteration with its trace, so you can A/B a regression against a working run.
Generating cases from your tools
The Generate button reads your attached servers’ tool catalog and drafts realistic cases — a mix of positive (“call search with a query”) and negative (“don’t call delete_user on a meta-question”). Treat it as a draft: skim, edit the prompts to match how your users actually talk, tighten the checks, then save.
When the suite has a saved server attachment spanning two or more servers, the generator produces coverage for each server individually plus at least one cross-server case. Suites without a saved attachment treat all available servers as a single pool.
What to author first
If you’re new to the surface, the shortest useful loop is:
- Attach the server you’re shipping.
- Generate a starting set of cases.
- Delete the ones that don’t match real usage; tighten checks on the rest.
- Add the two or three models your users will hit.
- Run all, open the failures, and decide whether the bug is in your server, your prompt, or the model.

