A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.
composer require aysnc/llm-eval

Create llm-eval.php in your project root:
<?php
use Aysnc\AI\LlmEval\Providers\AnthropicProvider;
return [
'provider' => new AnthropicProvider(getenv('ANTHROPIC_API_KEY')),
'directory' => __DIR__ . '/evals',
'cache' => true,
'cacheTtl' => 0,
'parallel' => false,
'concurrency' => 5,
];

| Option | Type | Default | Description |
|---|---|---|---|
| provider | ProviderInterface | — | The LLM provider shared across all eval files |
| directory | string | 'evals' | Directory containing your eval files |
| cache | bool\|string | false | true uses .llm-cache/, or pass a custom path |
| cacheTtl | int | 0 | Cache lifetime in seconds (0 = forever) |
| parallel | bool | false | Run evals in parallel by default |
| concurrency | int | 0 | Max concurrent requests when parallel (0 = unlimited) |
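For instance, caching into a custom directory with a one-hour TTL plus parallel runs would look like this (a sketch based on the options above; the path and limits are illustrative):

<?php

use Aysnc\AI\LlmEval\Providers\AnthropicProvider;

return [
    'provider' => new AnthropicProvider(getenv('ANTHROPIC_API_KEY')),
    'directory' => __DIR__ . '/evals',
    // A string value is used as a custom cache directory instead of .llm-cache/
    'cache' => __DIR__ . '/.llm-eval-cache',
    // Expire cached responses after one hour instead of keeping them forever
    'cacheTtl' => 3600,
    // Run evals in parallel, capped at 10 concurrent requests
    'parallel' => true,
    'concurrency' => 10,
];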
1. Create an eval file in your evals directory. Each file returns an LlmEval instance:
<?php
// evals/simple.php
use Aysnc\AI\LlmEval\Dataset\Dataset;
use Aysnc\AI\LlmEval\LlmEval;
$dataset = Dataset::fromArray([
['prompt' => 'What is 2+2? Reply with just the number.', 'expected' => '4'],
['prompt' => 'What is the capital of France? Reply with just the city name.', 'expected' => 'Paris'],
['prompt' => 'Is the sky blue? Reply with just yes or no.', 'expected' => 'yes'],
]);
return LlmEval::create('quick-start')
->dataset($dataset)
->assertions(function ($expect, $testCase): void {
$expect->contains($testCase->getExpected(), caseSensitive: false);
});

2. Run it:

vendor/bin/llm-eval run

LLM-Eval Runner
===============
PASS quick-start - Case 0
PASS quick-start - Case 1
PASS quick-start - Case 2
Summary
-------
Total 3
Passed 3
Failed 0
Pass Rate 100.0%
Duration 1.24s
The config provides the LLM provider; the eval file defines what to test — no ->provider() or ->runAll() calls are needed in the file.
An evaluation has three parts: a provider (which LLM to call), a dataset (prompts + expected answers), and assertions (how to check the response).
A dataset is a collection of test cases. Each test case has a prompt and an optional expected value.
// Inline array
$dataset = Dataset::fromArray([
['prompt' => 'What is 2+2?', 'expected' => '4'],
]);
// CSV file (columns: prompt, expected)
$dataset = Dataset::fromCsv(__DIR__ . '/data/capitals.csv');
// JSON file (array of objects with prompt + expected keys)
$dataset = Dataset::fromJson(__DIR__ . '/data/questions.json');

The expected key can be a single value or multiple named values:
// Single — accessed via $testCase->getExpected()
['prompt' => 'What is 2+2?', 'expected' => '4']
// Multiple — accessed via $testCase->getExpected('name'), $testCase->getExpected('age')
['prompt' => 'Return JSON with name and age.', 'expected' => ['name' => 'Alice', 'age' => '30']]

CSV files use column prefixes for multiple values: expected_name, expected_age.
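As a sketch of that convention (the file name and rows are illustrative), a CSV with multiple expected values could look like this:

// data/people.csv (illustrative contents):
//
//   prompt,expected_name,expected_age
//   "Return JSON with name Alice and age 30. Only output JSON.",Alice,30

$dataset = Dataset::fromCsv(__DIR__ . '/data/people.csv');

// Each row becomes a test case; the prefixed columns are read back via
// $testCase->getExpected('name') and $testCase->getExpected('age').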
Any keys that aren't prompt or expected become metadata, accessible via $testCase->getData('key').
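A minimal sketch, reusing the array dataset from above — the category key is just an illustrative extra field:

$dataset = Dataset::fromArray([
    // 'category' is neither prompt nor expected, so it becomes metadata
    ['prompt' => 'What is 2+2?', 'expected' => '4', 'category' => 'arithmetic'],
]);

// Inside the assertions() callback:
// $testCase->getData('category'); // 'arithmetic'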
Assertions define what "correct" means for a response. You chain them inside the assertions() callback.
Text
$expect->contains('Paris');
$expect->contains('paris', caseSensitive: false);
$expect->notContains('London');
$expect->matchesRegex('/\d{4}-\d{2}-\d{2}/');
$expect->minLength(10);
$expect->maxLength(500);

JSON
$expect->isJson();

Custom
$expect->assert(new MyCustomAssertion());

There are also assertions for tool calls, multi-turn conversations, and LLM-as-judge — covered in the sections below.
Validate that the LLM returns well-formed JSON with the right content. Combine isJson() with contains() or multiple expected values.
// evals/json-output.php
$dataset = Dataset::fromArray([
[
'prompt' => 'Return a JSON object with keys "name" and "age". Use name "Alice" and age 30. Only output JSON.',
'expected' => ['name' => 'Alice', 'age' => '30'],
],
[
'prompt' => 'Return a JSON array of three colors: red, green, blue. Only output JSON.',
'expected' => 'red',
],
]);
return LlmEval::create('json-output')
->dataset($dataset)
->assertions(function ($expect, $testCase): void {
$expect->isJson()
->contains($testCase->getExpected())
->contains($testCase->getExpected('name'));
});

Test that your LLM calls tools with the right parameters — without executing a full conversation loop. This uses LlmEval::create() (not createConversation) since you're only checking the first response.
// evals/tool-test.php
$tools = [
[
'name' => 'get_weather',
'description' => 'Get weather for a location',
'input_schema' => [
'type' => 'object',
'properties' => [
'location' => ['type' => 'string'],
],
'required' => ['location'],
],
],
];
return LlmEval::create('tool-test')
->option('tools', $tools)
->dataset($dataset)
->assertions(function ($expect): void {
$expect->calledTool('get_weather');
$expect->toolCallHasParam('get_weather', 'location', 'Paris');
});

Available tool call assertions:
$expect->calledTool('get_weather');
$expect->calledTool('get_weather', times: 2);
$expect->toolCallHasParam('get_weather', 'location');
$expect->toolCallHasParam('get_weather', 'location', 'Paris');
$expect->calledToolCount(3);
$expect->didNotCallTool('dangerous_function');

Test agentic workflows where the LLM calls tools, receives results, and continues reasoning. Use LlmEval::createConversation() with a tool executor that returns simulated results.
// evals/math-agent.php
use Aysnc\AI\LlmEval\Dataset\Dataset;
use Aysnc\AI\LlmEval\LlmEval;
use Aysnc\AI\LlmEval\Providers\CallableToolExecutor;
use Aysnc\AI\LlmEval\Providers\ToolCall;
use Aysnc\AI\LlmEval\Providers\ToolResult;
$tools = [
[
'name' => 'calculate',
'description' => 'Evaluate a math expression',
'input_schema' => [
'type' => 'object',
'properties' => [
'expression' => ['type' => 'string'],
],
'required' => ['expression'],
],
],
];
$executor = new CallableToolExecutor([
'calculate' => function (ToolCall $tc): ToolResult {
$expr = $tc->getParam('expression');
$result = match ($expr) {
'6 * 7', '6*7' => '42',
default => 'unknown',
};
return new ToolResult($tc->id, $result);
},
]);
$dataset = Dataset::fromArray([
['prompt' => 'Use the calculate tool to compute 6 * 7.', 'expected' => '42'],
]);
return LlmEval::createConversation('math-agent')
->withTools($tools)
->executor($executor)
->dataset($dataset)
->assertions(function ($expect, $testCase): void {
$expect->contains($testCase->getExpected())
->usedTool('calculate')
->turnCount(2);
});

Available conversation assertions:
$expect->turnCount(2);
$expect->usedTool('calculate');
$expect->conversationContains('42');

Use a turns array to test conversations with multiple user messages. Each turn has its own prompt and optional expected values for per-turn assertions. Use getTurn() to access the 1-indexed turn number.
$dataset = Dataset::fromArray([
[
'turns' => [
['prompt' => 'What is the weather in Paris?', 'expected' => '22'],
['prompt' => 'Now check Tokyo', 'expected' => '18'],
['prompt' => 'Which city was warmer?', 'expected' => 'Paris'],
],
],
]);
return LlmEval::createConversation('multi-turn')
->withTools($tools)
->executor($executor)
->dataset($dataset)
->assertions(function ($expect, $testCase): void {
$expect->contains($testCase->getExpected());
if ($testCase->getTurn() <= 2) {
$expect->usedTool('get_weather');
}
});

Use one LLM to evaluate another's response quality. Instead of checking for exact strings, you describe what "good" looks like and a judge model scores the response 0-100%.
// evals/quality-check.php
use Aysnc\AI\LlmEval\Providers\AnthropicProvider;
$judge = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));
return LlmEval::create('quality-check')
->dataset($dataset)
->assertions(function ($expect) use ($judge): void {
$expect->judgedBy(
judge: $judge,
criteria: 'Is this response helpful, accurate, and concise?',
threshold: 0.8,
);
});For multi-turn conversations, you can use judgedBy() inside assertions() to judge per-turn, or use ->judge() on the eval to run a single evaluation after all turns complete. The judge receives the full conversation history — all messages, tool calls, and results.
return LlmEval::createConversation('multi-turn')
->withTools($tools)
->executor($executor)
->dataset($dataset)
->assertions(function ($expect, $testCase): void {
$expect->contains($testCase->getExpected());
})
->judge($judge, 'Did the model correctly identify the warmer city based on the earlier temperatures?');

# Run all eval files in the evals directory
vendor/bin/llm-eval run
# Run a specific eval file
vendor/bin/llm-eval run my-test
# Run in parallel
vendor/bin/llm-eval run --parallel --concurrency=10
# Verbose mode — shows judge reasoning and tool calls for passing tests
vendor/bin/llm-eval run -v
# JSON output
vendor/bin/llm-eval run --format=json
# Clear response cache
vendor/bin/llm-eval cache:clear
# Scaffold a new eval file
vendor/bin/llm-eval init

LLM-Eval Runner
===============
Running evaluations...
PASS simple - Case 0
PASS simple - Case 1
FAIL simple - Case 2
Got: "The sky appears blue due to Rayleigh scattering..."
→ Text does not contain "yes"
PASS conversation-json - compare-two-cities - Turn 1
PASS conversation-json - compare-two-cities - Turn 2
PASS conversation-json - compare-two-cities - Turn 3
→ Score: 100% (threshold: 70%) - The response correctly identifies Paris as the warmer city.
PASS llm-judge - photosynthesis
→ Score: 95% (threshold: 70%) - Clear, accurate explanation mentioning plants and sunlight.
Summary
-------
Total 7
Passed 6
Failed 1
Pass Rate 85.7%
Duration 4.32s
With -v, passing tests also show judge scores and tool call details.
Direct API access. Get your key at console.anthropic.com.
$provider = new AnthropicProvider(
apiKey: getenv('ANTHROPIC_API_KEY'),
);

Default model: claude-sonnet-4-20250514
Uses the Converse API — works with Claude, Titan, Llama, Mistral, and other Bedrock models. Requires composer require aws/aws-sdk-php. See AWS Bedrock docs.
use Aysnc\AI\LlmEval\Providers\BedrockProvider;
// Explicit credentials
$provider = new BedrockProvider(
region: 'us-east-1',
accessKeyId: 'AKIA...',
secretAccessKey: 'secret...',
);
// Or default credential chain (env vars, ~/.aws/credentials, IAM role)
$provider = new BedrockProvider(region: 'us-east-1');

Default model: anthropic.claude-3-5-sonnet-20241022-v2:0
Use ->model() to override the default model for any provider:
return LlmEval::create('eval-name')
->model('claude-opus-4-20250514')
->dataset($dataset)
->assertions($assertions);

This works with both AnthropicProvider (Anthropic model IDs) and BedrockProvider (Bedrock model IDs).
You can also set ->maxTokens(2048) to override the default max tokens (1024).
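For example (the eval name and values are illustrative):

return LlmEval::create('long-answers')
    ->model('claude-opus-4-20250514') // override the provider's default model
    ->maxTokens(2048)                 // raise the default of 1024 tokens
    ->dataset($dataset)
    ->assertions($assertions);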
If you need to run evals from PHP code — inside a test suite, a CI script, or anywhere you want to work with the results directly — use ->provider() and ->runAll():
$provider = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));
$results = LlmEval::create('quick-start')
->provider($provider)
->dataset($dataset)
->assertions(function ($expect, $testCase): void {
$expect->contains($testCase->getExpected());
})
->runAll();
echo "Pass rate: {$results->passRatePercent()}\n";
// Pass rate: 100.0%

- PHP 8.3+
- guzzlehttp/guzzle ^7.10
- aws/aws-sdk-php ^3.0 (optional, for Bedrock)