Synbad the legendary sailor

Synbad is a tool for detecting bugs in LLM inference providers, especially open-source ones. Synbad is maintained by Synthetic, as part of our efforts to keep our inference quality as high as possible.

If you find bugs in Synthetic's model hosting, please contribute the bugs here! We will fix them.

Install

Synbad is distributed through npm. Install it with:

npm install -g @syntheticlab/synbad

Results

We keep a running tally of provider+model results for tool calling and reasoning parsing for GLM-4.7, Kimi K2 Thinking, and MiniMax M2. Feel free to add more provider results!

Provider        Model              Success Rate
Synthetic.new   GLM-4.7            ✅ 100%
Synthetic.new   Kimi K2 Thinking   ✅ 100%
Synthetic.new   MiniMax M2         ✅ 100%
Fireworks       GLM-4.7            ❌ 83%
Fireworks       Kimi K2 Thinking   ❌ 92%
Fireworks       MiniMax M2         ✅ 100%
Together        Kimi K2 Thinking   ❌ 66%
Parasail        GLM-4.7            ❌ 83%
Parasail        Kimi K2 Thinking   ❌ 75%

Note for attempting reproductions: most tests reproduce with --count 1 and --count 1 --stream. The response-in-reasoning eval is the exception: because the bug is intermittent, you will generally need a high count; --count 40 and --count 40 --stream are typically sufficient.

An eval only counts as a pass if it succeeds both with and without Synbad's --stream parameter (which exercises the streaming API).

How do I contribute inference bugs?

If you already have some problematic JSON, head over to the Contributing section. If you don't, don't worry! Synbad makes it easy to capture the problematic JSON you're encountering.

First, run the Synbad Proxy, specifying the local port you want to use and the inference host you want to target. For example, to forward requests from localhost:3000 to Synthetic's API, you'd do:

synbad proxy -p 3000 -t https://api.synthetic.new/openai/v1

Then, configure your coding agent — or whichever local tool you're using — to point to http://localhost:3000 (or whichever port you selected). The Synbad Proxy will log all request bodies to stdout, so all you need to do is reproduce the bug by using your tool or coding agent, and then copy the JSON it printed to stdout.

Now you have reproducible JSON to file a bug via Synbad!

Contributing

First, clone this repo from GitHub. Then cd into it and run:

npm install

All inference evals are stored in the evals/ directory. They're written in TypeScript. You need to export two things from an eval:

  1. The JSON that reproduces the problem, as the const json. It doesn't have to reproduce it 100% of the time; if the bug appears even 5% of the time, that's fine.
  2. A test function that runs some asserts on the returned assistant message, which detect the error.

For example, we can test parallel tool call support very simply (as we do in the evals/tools/parallel-tool.ts file):

import * as assert from "../../source/asserts.ts";
import { ChatMessage } from "../../source/chat-completion.ts";

export function test({ tool_calls }: ChatMessage) {
  assert.isNotNullish(tool_calls);
  assert.isNotEmptyArray(tool_calls);
  assert.strictEqual(tool_calls.length, 2);
}

export const json = {
  "messages": [
    {"role": "user", "content": "What's the weather in Paris and London?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City name"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "parallel_tool_calls": true,
  "tool_choice": "auto",
}
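For reference, a well-behaved provider responding to the request above should return an assistant message shaped roughly like this (this is the standard OpenAI chat-completions format, not anything Synbad-specific; the ids and argument strings are illustrative). The test function above would accept it:

```typescript
// Hypothetical example of a passing assistant message: two parallel
// tool calls, one per city, in OpenAI chat-completions format.
const exampleAssistantMessage = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_1",
      type: "function",
      function: { name: "get_weather", arguments: '{"location":"Paris"}' },
    },
    {
      id: "call_2",
      type: "function",
      function: { name: "get_weather", arguments: '{"location":"London"}' },
    },
  ],
};
```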

The asserts.ts file re-exports all of the built-in Node.js assertion functions, and also adds a few extra ones, e.g. isNotNullish, which asserts that a value is neither null nor undefined.

To run your new eval, use the synbad.sh script in this repo, which auto-recompiles everything (including your new test!) before running the evals. Assuming you're testing the evals/reasoning/reasoning-parsing test, for GLM-4.6 on Synthetic, and you want to run it 5 times since it isn't consistently failing:

./synbad.sh eval --env-var SYNTHETIC_API_KEY \
  --base-url "https://api.synthetic.new/openai/v1" \
  --only evals/reasoning/reasoning-parsing \
  --model "hf:zai-org/GLM-4.6" \
  --count 5

Handling reasoning parsing

The OpenAI spec didn't originally include reasoning content parsing, since the original OpenAI models didn't reason. The open-source community added support for reasoning later, but there are two competing specs:

  1. Storing the reasoning content in message.reasoning_content, or
  2. Storing the reasoning content in message.reasoning.

To make sure your evals work with a wider range of inference providers, use the getReasoning function when testing reasoning parsing like so:

import { getReasoning } from "../../source/chat-completion.ts";

// In your test:

const reasoning = getReasoning(message);

This ensures your test will use the correct reasoning content data regardless of which spec the underlying inference provider is using.
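To make the normalization concrete, here is a hedged sketch of what getReasoning does. The real implementation lives in source/chat-completion.ts; the exact fallback order shown here is an assumption:

```typescript
// The two competing field conventions described above.
type MessageWithReasoning = {
  reasoning_content?: string | null; // spec 1
  reasoning?: string | null;         // spec 2
};

// Return the reasoning text from whichever field the provider populated.
function getReasoningSketch(message: MessageWithReasoning): string | null {
  return message.reasoning_content ?? message.reasoning ?? null;
}
```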

Running Synbad

First, install it:

npm install -g @syntheticlab/synbad

Then run:

synbad eval --env-var SYNTHETIC_API_KEY \
  --base-url "https://api.synthetic.new/openai/v1" \
  --model "hf:zai-org/GLM-4.6"
