Athanor is an experiment harness, designed for (but not limited to) AI research, built as an Elixir/Phoenix umbrella application. It provides a framework for defining, configuring, executing, and monitoring experiments.
- Code-Defined Experiments - Define experiments as versioned Elixir modules
- Real-Time Web UI - Monitor running experiments with live logs, results, and progress
- Supervised Execution - Each run executes in isolation with graceful cancellation
- MCP Server - Programmatic access via Model Context Protocol for AI agents
- Flexible Results - Store arbitrary structured data for later analysis
The core umbrella contains two applications:
- `athanor` - Core business logic and runtime system
- `athanor_web` - Phoenix web interface with LiveView for real-time experiment management
```shell
# Install dependencies
mix setup

# Set up the database; see `config/dev.exs` for credentials
mix ecto.setup

# Start the server
iex -S mix phx.server
```

The web interface runs at http://localhost:4000; you can choose a different port by setting the PORT environment variable when starting the app.
Experiments are Elixir modules that use Athanor.Experiment. Each experiment defines its configuration schema and execution logic in code, making experiments versioned and reproducible.
- Instance: A configured experiment with a name, description, and configuration values
- Run: A single execution of an instance
This separation allows the same configuration to be executed multiple times for reproducibility.
Each run executes in its own GenServer under a DynamicSupervisor, isolating failures and enabling cancellation.
Create a module that uses Athanor.Experiment:
```elixir
defmodule MyExperiment do
  use Athanor.Experiment

  alias Athanor.Experiment

  @impl true
  def experiment do
    Experiment.Definition.new()
    |> Experiment.Definition.name("my_experiment")
    |> Experiment.Definition.description("Tests something interesting")
    |> Experiment.Definition.configuration(config())
  end

  defp config do
    Experiment.ConfigSchema.new()
    |> Experiment.ConfigSchema.field(:iterations, :integer,
      default: 10,
      min: 1,
      max: 100,
      label: "Iterations",
      description: "Number of test iterations"
    )
    |> Experiment.ConfigSchema.field(:model, :string,
      default: "gpt-4",
      label: "Model",
      required: true
    )
  end

  @impl true
  def run(ctx) do
    config = Athanor.Runtime.config(ctx)
    total = config["iterations"]

    Athanor.Runtime.log(ctx, :info, "Starting experiment with #{total} iterations")
    Athanor.Runtime.progress(ctx, 0, total)

    for i <- 1..total do
      # Check for cancellation
      if Athanor.Runtime.cancelled?(ctx), do: throw(:cancelled)

      # Do work...
      result = perform_iteration(config, i)

      # Record result and update progress
      Athanor.Runtime.result(ctx, "iteration_#{i}", result)
      Athanor.Runtime.progress(ctx, i, total)
    end

    Athanor.Runtime.complete(ctx)
  catch
    :cancelled -> {:error, "Cancelled by user"}
  end

  defp perform_iteration(config, i) do
    # Your experiment logic here
    %{iteration: i, model: config["model"], output: "..."}
  end
end
```

Experiments are auto-discovered by the system at runtime.
The Athanor.Runtime module provides the interface for experiments to interact with the harness during execution:
```elixir
# Get the instance configuration as a map
config = Athanor.Runtime.config(ctx)
```

```elixir
# Log messages at different levels
Athanor.Runtime.log(ctx, :info, "Processing item")
Athanor.Runtime.log(ctx, :warn, "Retrying request", %{attempt: 2})
Athanor.Runtime.log(ctx, :error, "Failed to connect")

# Batch multiple log entries
Athanor.Runtime.log_batch(ctx, [
  {:info, "Step 1 complete", nil},
  {:info, "Step 2 complete", nil}
])
```

Results are persisted to the database and displayed in the web UI:
```elixir
# Store a result with a key and value
Athanor.Runtime.result(ctx, "model_response", %{
  input: prompt,
  output: response,
  tokens: token_count
})
```

Progress updates are broadcast to the web UI in real-time:
```elixir
# Update progress (current, total, optional message)
Athanor.Runtime.progress(ctx, 5, 100)
Athanor.Runtime.progress(ctx, 50, 100, "Halfway done")
```

```elixir
# Mark the run as successfully completed
Athanor.Runtime.complete(ctx)

# Mark the run as failed with an error message
Athanor.Runtime.fail(ctx, "API rate limit exceeded")
```

```elixir
# Check if the user has requested cancellation
if Athanor.Runtime.cancelled?(ctx) do
  # Clean up and exit
end
```

To run an experiment from the web UI:

- Navigate to `/experiments`
- Click "New" to create an instance
- Select an experiment module and configure it
- Click "Run" to execute
- Watch logs, results, and progress update in real-time
```elixir
# Start a run for an existing instance
{:ok, run} = Athanor.Runtime.start_run(instance)

# Cancel a running experiment
Athanor.Runtime.cancel_run(run)
```

Athanor includes a Model Context Protocol (MCP) server that allows AI agents to programmatically manage experiments, runs, logs, and results. The server exposes 15 tools for complete experiment lifecycle management.
Endpoint: http://localhost:4000/mcp
Available Operations:
- List, create, and update experiments
- Discover available experiment modules and their schemas
- Start, monitor, and cancel runs
- Query logs and results
For detailed documentation on all available tools and usage examples, see docs/MCP_SERVER.md.
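On the wire, an MCP tool invocation is a JSON-RPC 2.0 `tools/call` request. As a rough sketch in Python of what a client sends (the tool name `start_run` is taken from the list above, but the `instance_id` argument is a hypothetical parameter; the real argument schemas are in docs/MCP_SERVER.md):

```python
import json

def mcp_tool_call(tool_name, arguments, request_id=1):
    """Build the JSON-RPC 2.0 body an MCP client sends to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# "instance_id" is an assumed argument name for illustration only
body = json.dumps(mcp_tool_call("start_run", {"instance_id": 42}))
```

An MCP client library normally handles this framing for you; the sketch only shows what crosses the HTTP endpoint.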
```
# Connect an MCP client to the server
# The client can then call tools like:
# - list_experiments
# - create_experiment
# - start_run
# - get_run_logs
```

Results are stored as a simple key/value store in the run_results table. Each result has:

- `run_id` - The run it belongs to
- `key` - A string identifier (e.g., `"iteration_1"`, `"model_response"`)
- `value` - A JSONB column containing arbitrary data
This structure makes results easy to query and analyze outside of Athanor.
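As a quick illustration of what that analysis looks like in Python, here is a sketch that flattens the JSON `value` column with pandas (sample rows standing in for real `run_results` data):

```python
import pandas as pd

# Sample rows mimicking the run_results key/value layout
rows = [
    {"key": "iteration_1", "value": {"iteration": 1, "model": "gpt-4", "output": "..."}},
    {"key": "iteration_2", "value": {"iteration": 2, "model": "gpt-4", "output": "..."}},
]
df = pd.DataFrame(rows)

# Expand the JSON 'value' column into top-level columns
df = pd.concat([df.drop(columns="value"), pd.json_normalize(df["value"].tolist())], axis=1)
print(df.columns.tolist())  # ['key', 'iteration', 'model', 'output']
```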
```elixir
# Get all results for a run
Athanor.Experiments.list_results(run_id)

# Query directly with Ecto
import Ecto.Query

Athanor.Repo.all(
  from r in Athanor.Experiments.Result,
    where: r.run_id == ^run_id,
    where: r.key == "model_response"
)
```

Results can also be loaded directly into notebooks (Jupyter with Python, or Livebook with Elixir) for analysis:
```python
import psycopg2
import pandas as pd

conn = psycopg2.connect("postgresql://localhost/athanor_dev")

# Load results for a specific run
df = pd.read_sql("""
    SELECT key, value, inserted_at
    FROM run_results
    WHERE run_id = %s
    ORDER BY inserted_at
""", conn, params=[run_id])

# The 'value' column contains JSON - expand it
df = pd.concat([df, pd.json_normalize(df['value'])], axis=1)
```

Or with Livebook (Elixir):
```elixir
# In a Livebook connected to your Athanor node
results = Athanor.Experiments.list_results(run_id)

# Convert to a table for analysis
results
|> Enum.map(fn r -> Map.merge(%{key: r.key}, r.value) end)
|> Kino.DataTable.new()
```

The substrate_shift app contains a complete example experiment that tests whether LLMs can detect when they're running on a different underlying model.
Configuration options:
- `runs_per_pair` - Number of test runs per model pair
- `parallelism` - Concurrent pairs to test
- `model_pairs` - List of model pairs to compare
See apps/substrate_shift/lib/substrate_shift.ex for the full implementation.
```shell
# Run tests
mix test

# Format code and run checks
mix precommit

# Start interactive shell with server
iex -S mix phx.server
```

The database schema has four tables:

- `experiment_instances` - Configured experiments with name, description, and configuration
- `experiment_runs` - Execution records with status, timing, and error info
- `run_results` - Key-value results from each run
- `run_logs` - Log entries with level, message, and metadata
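To make the `run_results` layout concrete, here is an illustrative sketch using an in-memory SQLite stand-in for the Postgres table (column types simplified, JSONB stored as TEXT; the inserted row is fabricated sample data):

```python
import json
import sqlite3

# In-memory stand-in for the Postgres run_results table
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE run_results (
        run_id      INTEGER,
        key         TEXT,
        value       TEXT,  -- JSONB in the real schema
        inserted_at TEXT
    )
""")
conn.execute(
    "INSERT INTO run_results VALUES (?, ?, ?, ?)",
    (1, "model_response", json.dumps({"tokens": 128}), "2024-01-01T00:00:00"),
)

# Fetch one result by run and key, then decode its JSON value
row = conn.execute(
    "SELECT value FROM run_results WHERE run_id = ? AND key = ?",
    (1, "model_response"),
).fetchone()
tokens = json.loads(row[0])["tokens"]
print(tokens)  # 128
```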
