Evaluate - MCPJam Inspector

A benchmark score doesn’t tell you whether your server holds up in production. Evaluate lets you pin down the behaviors you care about — which tools fire, with what arguments, what the final answer looks like — and run them across the models your users actually use.

Requirements

Evals require a signed-in MCPJam account. As a guest you’ll see a sign-in prompt instead of the testing interface.

How it’s organized

Project

Holds your servers and your suites. Everything below lives in one project.

Suite

A group of cases plus the defaults they share: attached servers, models to run against, default checks, judge config, and argument-matching mode.

Case

One scenario you want to verify — a prompt (or a sequence of prompts), the tools you expect to fire, an optional expected output, and the checks that decide pass/fail.

Run

One execution of a suite. Produces iterations — one per case × model. Each iteration has its own transcript, tool calls, tokens, duration, and verdict.
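To make the case × model expansion concrete, here is a minimal sketch in Python. The function name and dictionary keys are illustrative assumptions, not MCPJam's actual data model.

```python
from itertools import product


def plan_iterations(cases, models):
    """Return one planned iteration per (case, model) pair, in case-major order.

    Hypothetical helper: it only illustrates the counting rule above.
    """
    return [{"case": case, "model": model} for case, model in product(cases, models)]


iterations = plan_iterations(
    ["Draw a rectangle", "Refuses unsafe delete"],
    ["model-a", "model-b", "model-c"],
)
print(len(iterations))  # 2 cases x 3 models = 6
```

Adding a model to a suite therefore adds one iteration per case, which is worth keeping in mind when estimating run cost.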

Authoring a case

Each case has four moving parts:

Scenario — short label so you can find it later (“Draw a rectangle”, “Refuses unsafe delete”).
Prompt turns — one or more user messages. Multi-turn is supported; expected tools and checks attach per turn, so you can model “ask, get a result, follow up.”
Expected tools — for each turn, the tool calls the model should make. Arguments can be exact values or typed placeholders ("string", "number") so you’re not chasing flaky literals.
Expected output — optional free-text description of what a good final answer looks like. Used by the judge.

A negative case is just a case whose checks say “this tool should not fire.” Meta questions (“what params does search take?”), conversational drift, and ambiguous prompts are the usual shape.
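One way to picture the parts of a case is as plain data. The sketch below is an assumption for illustration only: the class names, field names, tool names, and check shape are hypothetical and do not reflect MCPJam's stored format.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ExpectedToolCall:
    tool: str
    # Values may be literals or typed placeholders such as "string" or "number".
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Turn:
    prompt: str
    expected_tools: list[ExpectedToolCall] = field(default_factory=list)
    checks: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class Case:
    scenario: str
    turns: list[Turn]
    expected_output: Optional[str] = None  # only read by the judge


# Positive case: one turn, one expected tool call with placeholder arguments.
positive = Case(
    scenario="Draw a rectangle",
    turns=[
        Turn(
            prompt="Draw a 200 by 100 rectangle",
            expected_tools=[
                ExpectedToolCall("draw_rectangle", {"width": "number", "height": "number"})
            ],
        )
    ],
    expected_output="Confirms that a rectangle was drawn.",
)

# Negative case: no expected tools, and a check asserting the tool stays silent.
negative = Case(
    scenario="Meta question about search",
    turns=[
        Turn(
            prompt="What params does search take?",
            checks=[{"type": "tool_never_called", "tool": "search"}],
        )
    ],
)
```

Attaching expected tools and checks to the turn rather than to the case is what lets a multi-turn case assert different behavior at each step.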

Checks — the deterministic gate

Checks are what actually decide pass/fail. They’re pure functions of the iteration transcript, so the verdict is the same every time you replay it — which is the property you want if you’re using Evaluate as a regression gate. Set defaults on the suite; override per case with inherit (use suite defaults), replace (use only the case’s list), or extend (suite defaults, then the case’s list).

Check	Passes when
Tool was called with…	The named tool was called with arguments matching your spec (per the suite’s `argumentMatching` mode).
Tool was called at least once	The named tool fired at least once across the turn.
Tool was never called	The named tool did not fire. The core of negative cases.
First tool called was…	The first tool call in the turn matched the named tool.
Response contains…	The final assistant message contains a substring.
Response matches regex…	The final assistant message matches a regex.
No tool errors	No tool call returned an error.
Final message non-empty	The model produced a final assistant message.
Token budget under N	Total token usage stayed below the cap.
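The sketch below illustrates two ideas from this section: checks as pure functions of a transcript, and the three override modes. The transcript shape and every function name are assumptions made for the example, not the product's internals.

```python
def tool_never_called(transcript, tool):
    return all(call["tool"] != tool for call in transcript["tool_calls"])


def first_tool_was(transcript, tool):
    calls = transcript["tool_calls"]
    return bool(calls) and calls[0]["tool"] == tool


def response_contains(transcript, substring):
    return substring in transcript["final_message"]


def no_tool_errors(transcript):
    return not any(call.get("error") for call in transcript["tool_calls"])


def resolve_checks(suite_defaults, case_checks, mode="inherit"):
    """Combine suite and case checks according to the override mode."""
    if mode == "inherit":
        return list(suite_defaults)
    if mode == "replace":
        return list(case_checks)
    if mode == "extend":
        return list(suite_defaults) + list(case_checks)
    raise ValueError(f"unknown override mode: {mode!r}")


# Hypothetical transcript for one iteration.
transcript = {
    "tool_calls": [{"tool": "search", "arguments": {"query": "cats"}, "error": None}],
    "final_message": "Found 3 results about cats.",
}

print(tool_never_called(transcript, "delete_user"))  # True
print(resolve_checks(["no_tool_errors"], ["first_tool_was:search"], "extend"))
```

Because each check reads only the transcript and its own parameters, replaying the same transcript always yields the same verdict.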

Argument matching

Tool-call argument comparison runs in one of three modes, configured at the suite:

partial (default) — every expected key must be present and match; extra keys in the actual call are ignored. Best for “I care about query and limit, not what else the model put in.”
exact — actual args must equal expected args, key-for-key.
ignore — only the tool name is checked.

Type placeholders ("string", "number", "boolean") match any value of that type, which lets you assert shape without locking in a literal.
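The three modes and the type placeholders can be sketched as follows. This is a minimal illustration under stated assumptions: it compares top-level keys only, it treats the strings "string", "number", and "boolean" as placeholders in both partial and exact modes, and the function names are hypothetical.

```python
def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


PLACEHOLDERS = {
    "string": lambda value: isinstance(value, str),
    "number": is_number,
    "boolean": lambda value: isinstance(value, bool),
}


def value_matches(expected, actual):
    """A placeholder matches by type; anything else must be equal."""
    check = PLACEHOLDERS.get(expected) if isinstance(expected, str) else None
    return check(actual) if check else expected == actual


def args_match(expected, actual, mode="partial"):
    if mode == "ignore":
        return True  # only the tool name matters, and that is checked elsewhere
    if mode not in ("partial", "exact"):
        raise ValueError(f"unknown argument-matching mode: {mode!r}")
    if mode == "exact" and set(expected) != set(actual):
        return False  # exact mode rejects missing or extra keys
    return all(
        key in actual and value_matches(value, actual[key])
        for key, value in expected.items()
    )


expected = {"query": "string", "limit": 10}
actual = {"query": "cats", "limit": 10, "offset": 0}

print(args_match(expected, actual, "partial"))  # True: the extra "offset" is ignored
print(args_match(expected, actual, "exact"))    # False: "offset" was not expected
print(args_match(expected, actual, "ignore"))   # True
```

One consequence of this design is that a literal expected value equal to the word "string" cannot be told apart from the placeholder, so a real implementation would need some way to escape it.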

Validator settings

Each suite has default validator settings that control how tool calls are matched. You can set them at three levels:

Level	Where	Persists?
Suite default	Suite settings → Default validators	Yes
Case override	Test case editor → Validators	Yes
Run override	Suite header → validators (sliders) icon	No — one run

A run override shows an override badge; click Reset in the popover to clear it.
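A plausible reading of the table is that the most specific level wins: run override, then case override, then suite default. The sketch below assumes that precedence and assumes an override replaces the settings as a whole. Neither detail is confirmed here, and the function name is hypothetical.

```python
def resolve_validators(suite_default, case_override=None, run_override=None):
    """Return (settings, source), preferring the most specific level that is set."""
    for source, settings in (("run", run_override), ("case", case_override)):
        if settings is not None:
            return settings, source
    return suite_default, "suite"


suite = {"argumentMatching": "partial"}
case = {"argumentMatching": "exact"}

settings, source = resolve_validators(suite, case_override=case)
print(settings, source)  # {'argumentMatching': 'exact'} case

# A run override is not persisted, so it is passed in per run only.
settings, source = resolve_validators(
    suite, case_override=case, run_override={"argumentMatching": "ignore"}
)
print(source)  # run
```

Returning the source alongside the settings mirrors the override badge: the UI can show it whenever the source is the run.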

LLM as judge

For cases where “did the right tools fire” isn’t enough — anything graded on the quality of the final answer — the judge grades the run against your expected output if you set one, and against the user prompt otherwise. It’s advisory: it produces a score and a rationale, but doesn’t gate the run unless you ask it to.

On by default at the suite level. The cost is gated by an explicit Run judge click on the run-detail page — it won’t run for every iteration unless you turn auto-run on.
Calibrate per suite. Judge scores aren’t comparable across domains; a 0.7 on one suite isn’t a 0.7 on another.
When grading against the prompt rather than an expected output, scores are capped at 0.85 — you can’t get a “perfect” without saying what perfect means.

Open a completed run and find the Goal completion panel. Pick a judge model (default: openai/gpt-5.4-mini) and a threshold (default 0.7), then click Run judge. Each case gets a score, an advisory verdict, a one-line reason, and rubric hits. To skip the manual click, enable Auto-run in suite settings.
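The interaction between the 0.85 cap and the threshold can be sketched like this. It assumes the cap is applied before the threshold comparison and that a score equal to the threshold passes. Both are guesses, and the names are hypothetical.

```python
PROMPT_ONLY_CAP = 0.85   # cap when no expected output is set (from the text above)
DEFAULT_THRESHOLD = 0.7  # default threshold (from the text above)


def advisory_verdict(raw_score, has_expected_output, threshold=DEFAULT_THRESHOLD):
    """Return the capped score and an advisory pass flag. It does not gate the run."""
    score = raw_score if has_expected_output else min(raw_score, PROMPT_ONLY_CAP)
    return {"score": score, "passed": score >= threshold}


print(advisory_verdict(0.95, has_expected_output=False))  # score capped at 0.85
print(advisory_verdict(0.95, has_expected_output=True))   # score stays 0.95
```

Under these assumptions, a threshold above 0.85 can never be met by a case without an expected output, which is one more reason to write one for cases you care about.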

Running

A run needs three things, all picked from the suite header:

Servers — one or more attached to the suite. Cases can attach their own subsets if they only need part of the surface.
Models — the multi-model picker is the whole point. Each model produces its own iteration per case, so you can see where Claude passes and ChatGPT trips.
Run all — kicks off every case × every model. Run one from a case row runs just that case.

If Run all is disabled, a connected server is missing or no models are selected. The header pickers will tell you which.

Frozen execution snapshots

The first run of a suite saves the set of MCP servers used as a frozen snapshot; reruns reuse it, so connecting new servers can’t silently change what a suite runs against. Click Update snapshot in the suite header to re-save the current servers and start a new run.
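The snapshot behavior amounts to "save on first run, reuse afterwards, replace only on request". A minimal sketch, with a hypothetical class and method names that are not part of MCPJam:

```python
class SuiteSnapshot:
    """Tracks which servers a suite runs against."""

    def __init__(self):
        self.servers = None  # nothing saved until the first run

    def servers_for_run(self, connected):
        # First run: freeze the currently connected servers.
        if self.servers is None:
            self.servers = tuple(connected)
        # Reruns: ignore whatever is connected now and reuse the snapshot.
        return self.servers

    def update(self, connected):
        # Equivalent of clicking Update snapshot: re-save, then run.
        self.servers = tuple(connected)
        return self.servers


snapshot = SuiteSnapshot()
print(snapshot.servers_for_run(["github"]))           # ('github',)
print(snapshot.servers_for_run(["github", "slack"]))  # still ('github',)
print(snapshot.update(["github", "slack"]))           # ('github', 'slack')
```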

Suite execution config

Each eval suite has a Default Execution Config section that controls the model, system prompt, temperature, tool approval, connection settings, capabilities, and host context used when running the suite.

Model and prompt — Set the default model ID, system prompt, and temperature for all runs in the suite.
Tool approval — Toggle requireToolApproval to pause before each tool call during a run.
Server selection — Servers are not configured here. They come from the suite’s environment. The server picker is intentionally hidden in this editor.
Save / Reset — Click Save config to persist changes. Click Reset to revert to the last saved state. Unsaved edits are preserved if the page refreshes the config from the server, but are discarded when you switch to a different suite.

Changes to the suite execution config apply to future runs only. Existing run snapshots are not affected.

How suite defaults apply to test cases

The suite-level system prompt and temperature are runtime defaults: when a test case does not set its own system prompt or temperature, the suite values are used for that iteration. A per-case override always wins — the suite default only fills the gap.

If you have existing suites where test cases do not specify a system prompt, those cases will now run with the suite’s system prompt applied. Cases that already set their own system prompt are unaffected.
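The gap-filling rule can be sketched as a per-field fallback. The key names below are assumptions chosen for the example.

```python
def effective_settings(suite, case):
    """Use the case value when it is set; otherwise fall back to the suite default."""
    resolved = {}
    for key in ("system_prompt", "temperature"):
        case_value = case.get(key)
        # Compare against None rather than truthiness so that a case-level
        # temperature of 0 still counts as an explicit override.
        resolved[key] = case_value if case_value is not None else suite.get(key)
    return resolved


suite = {"system_prompt": "You are a careful assistant.", "temperature": 0.7}

print(effective_settings(suite, {}))
# {'system_prompt': 'You are a careful assistant.', 'temperature': 0.7}

print(effective_settings(suite, {"temperature": 0}))
# {'system_prompt': 'You are a careful assistant.', 'temperature': 0}
```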

Reading results

Suite view

Suite accuracy — pass rate of the most recent run, with the last three runs’ trend so you can see whether you’re improving or regressing.
Run insights — an AI-written diff against your previous completed run: which cases moved, which tools changed behavior, which models diverged. Skim this first; it usually points at the right rabbit hole.
Runs tab — every run with its summary metrics. Click in for the iteration-level breakdown.
Cases tab — every case with its latest verdict and a quick replay button.
Executions tab — a flat, filterable list of every individual test execution across all cases, sorted most-recent-first. Each row shows the case name, result (passed/failed/pending/cancelled), and timestamp. Click any row to open it in the compare view.

When you run a suite against multiple host configurations at once, the Runs list groups them into a single collapsible run group row showing mean accuracy and longest duration across all hosts. Expand it to inspect each host’s individual run.
The Runs / Cases segmented control switches between the run history list and the test-cases overview.
Performance by Model in the run summary appears only when more than one model was used.
To compare two runs, check their boxes in the Runs list and click Compare.

Cross-host matrix

When two or more host configurations are attached to a suite, a By case / By host toggle appears. By host shows a matrix — one column per host, one row per case. Each cell shows pass/fail dots, pass rate, median latency, and token usage. A host detached after runs were recorded stays visible, labelled historical.
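As a rough illustration of what one matrix cell aggregates, the sketch below groups iterations by case and host. The record fields are hypothetical, and summing tokens per cell is an assumption; the product may report a different aggregate.

```python
from collections import defaultdict
from statistics import median


def build_matrix(iterations):
    """Return {case: {host: cell}} with dots, pass rate, median latency, tokens."""
    grouped = defaultdict(list)
    for iteration in iterations:
        grouped[(iteration["case"], iteration["host"])].append(iteration)

    matrix = defaultdict(dict)
    for (case, host), items in grouped.items():
        matrix[case][host] = {
            "dots": [item["passed"] for item in items],
            "pass_rate": sum(item["passed"] for item in items) / len(items),
            "median_latency_ms": median(item["latency_ms"] for item in items),
            "tokens": sum(item["tokens"] for item in items),
        }
    return {case: dict(hosts) for case, hosts in matrix.items()}


iterations = [
    {"case": "search", "host": "host-a", "passed": True, "latency_ms": 100, "tokens": 50},
    {"case": "search", "host": "host-a", "passed": False, "latency_ms": 300, "tokens": 70},
    {"case": "search", "host": "host-b", "passed": True, "latency_ms": 250, "tokens": 60},
]

matrix = build_matrix(iterations)
print(matrix["search"]["host-a"]["pass_rate"])  # 0.5
```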

Run view

Per-iteration row with case, model, pass/fail, tokens, duration, and the tool calls that actually happened.
Expected vs actual tool calls side-by-side when an iteration fails. A Categorized diff above the raw Expected/Actual grids groups the discrepancies into four categories:
- Missing — expected tool calls that were never made
- Extra — actual calls that weren’t expected (reported but non-fatal by default)
- Out of order — calls that happened in the wrong sequence (when order checking is enabled)
- Arg mismatch — right tool name, wrong arguments (shown side-by-side)
Full trace — every turn, every tool call, every token. This is the thing you couldn’t see before; spend time here.
Per-model breakdown to compare how the same case behaves across models.
AI Triage — after a run completes, an AI Triage panel ranks tool-quality and workflow issues by impact. Each issue has a Copy button that copies a ready-to-use fix prompt (tool description + input schema) for a coding agent; Copy top 3 copies the top three combined.
Predicate Gate — when successPredicates are configured on a case, an expandable Predicate Gate section in the iteration detail lists each predicate with a PASS/FAIL verdict, a one-line summary, and the evaluator’s reason. Hidden when no predicates are configured.
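The four categories of the Categorized diff can be approximated with a greedy pairing of expected and actual calls by tool name. This is a sketch under assumptions, not the product's algorithm: the call shape is hypothetical, arguments are compared as literals in partial style, and order is judged by whether paired calls appear in increasing position.

```python
def categorize_diff(expected, actual, check_order=True):
    diff = {"missing": [], "extra": [], "out_of_order": [], "arg_mismatch": []}
    used = set()
    pairs = []

    # Pair each expected call with the first unused actual call of the same name.
    for exp in expected:
        match = next(
            (j for j, act in enumerate(actual)
             if j not in used and act["tool"] == exp["tool"]),
            None,
        )
        if match is None:
            diff["missing"].append(exp["tool"])
            continue
        used.add(match)
        pairs.append((exp, match))

    # Anything left unpaired was not expected.
    diff["extra"] = [act["tool"] for j, act in enumerate(actual) if j not in used]

    last_seen = -1
    for exp, j in pairs:
        actual_args = actual[j].get("arguments", {})
        if any(key not in actual_args or actual_args[key] != value
               for key, value in exp.get("arguments", {}).items()):
            diff["arg_mismatch"].append(exp["tool"])
        if check_order:
            if j < last_seen:
                diff["out_of_order"].append(exp["tool"])
            else:
                last_seen = j
    return diff


expected = [
    {"tool": "search", "arguments": {"query": "cats"}},
    {"tool": "fetch", "arguments": {"id": 7}},
    {"tool": "summarize"},
]
actual = [
    {"tool": "fetch", "arguments": {"id": 9}},
    {"tool": "search", "arguments": {"query": "cats", "limit": 5}},
    {"tool": "log"},
]

print(categorize_diff(expected, actual))
# {'missing': ['summarize'], 'extra': ['log'],
#  'out_of_order': ['fetch'], 'arg_mismatch': ['fetch']}
```

A single call can land in more than one category, as fetch does here: it ran before search and with the wrong id.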

Comparing two runs

Select any two completed runs and click Compare. The diff view shows per-case status changes: Passed, Still failing, Regressed (pass→fail), Fixed (fail→pass), New, Removed, and Changed (config differed). Summary metrics show deltas for tokens, cost, and duration.
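The status labels follow from comparing each case's verdict in the two runs. The sketch below assumes a run can be reduced to a mapping from case name to a pass flag and a config identifier, and that Changed takes precedence over the pass/fail transition. Both are assumptions made for illustration.

```python
TRANSITIONS = {
    (True, True): "Passed",
    (False, False): "Still failing",
    (True, False): "Regressed",
    (False, True): "Fixed",
}


def classify(before, after):
    if before is None:
        return "New"
    if after is None:
        return "Removed"
    if before["config"] != after["config"]:
        return "Changed"
    return TRANSITIONS[(before["passed"], after["passed"])]


def compare_runs(run_a, run_b):
    """Return {case: status} for every case that appears in either run."""
    return {
        case: classify(run_a.get(case), run_b.get(case))
        for case in sorted(run_a.keys() | run_b.keys())
    }


run_a = {
    "search": {"passed": True, "config": "v1"},
    "delete": {"passed": False, "config": "v1"},
    "old": {"passed": True, "config": "v1"},
}
run_b = {
    "search": {"passed": False, "config": "v1"},
    "delete": {"passed": True, "config": "v1"},
    "fresh": {"passed": True, "config": "v1"},
}

print(compare_runs(run_a, run_b))
# {'delete': 'Fixed', 'fresh': 'New', 'old': 'Removed', 'search': 'Regressed'}
```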

Case view

Pass rate across runs — is this case stable, flaky, or trending down?
Pass rate by model — does this case only work on one model?
Every past iteration with its trace, so you can A/B a regression against a working run.

Generating cases from your tools

The Generate button reads your attached servers’ tool catalog and drafts realistic cases — a mix of positive (“call search with a query”) and negative (“don’t call delete_user on a meta-question”). Treat it as a draft: skim, edit the prompts to match how your users actually talk, tighten the checks, then save. When the suite has a saved server attachment spanning two or more servers, the generator produces coverage for each server individually plus at least one cross-server case. Suites without a saved attachment treat all available servers as a single pool.

What to author first

If you’re new to the surface, the shortest useful loop is:

Attach the server you’re shipping.
Generate a starting set of cases.
Delete the ones that don’t match real usage; tighten checks on the rest.
Add the two or three models your users will hit.
Run all, open the failures, and decide whether the bug is in your server, your prompt, or the model.

That last step is the one that benefits most from spending time in the trace view.
