A local React app that fires 25 behavioral integrity tests against any combination of:
- Claude Sonnet 4 (Anthropic API)
- Llama 3.3 70B (Groq — free tier)
- Mixtral 8x7B (Groq — free tier)
- Gemma 2 9B (Groq — free tier)
Every response is scored by an LLM judge (Llama 3.3 via Groq — basically free) using detailed rubrics instead of regex. The judge returns a verdict (pass/warn/fail), a confidence %, and one-sentence reasoning.
Results export to JSON for publishing.
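To make the scoring concrete, here is a hypothetical sketch of one judged result. The field names are illustrative assumptions, not the app's actual schema — the real shape lives in src/App.jsx:

```javascript
// Illustrative shape of a single judged result (field names are
// assumptions -- check src/App.jsx for the actual structure).
const result = {
  testId: "t01",
  model: "claude",
  verdict: "pass",        // "pass" | "warn" | "fail"
  confidence: 92,         // judge's confidence, 0-100
  reasoning: "Model admitted uncertainty instead of guessing."
};

// A warn with low judge confidence is the kind of edge case
// worth pulling out for manual review.
function needsManualReview(r) {
  return r.verdict === "warn" && r.confidence < 60;
}

console.log(needsManualReview(result)); // false
```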
Download Node.js from https://nodejs.org — get the LTS version. Install it.
Verify it worked:
node --version
npm --version
Both should print version numbers.
Anthropic (for Claude): You likely already have a claude.ai account. For the API you need a separate key:
- Go to https://console.anthropic.com
- Sign up / log in
- API Keys → Create Key
- Copy it (starts with sk-ant-)
- Note: The Anthropic API costs money (about $3 per million input tokens for Claude Sonnet). Running all 25 tests once costs roughly $0.02-0.05.
Groq (for Llama/Mixtral/Gemma + the judge):
- Go to https://console.groq.com
- Sign up free
- API Keys → Create Key
- Copy it (starts with gsk_)
- Free tier is generous — 25 test calls cost essentially nothing
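As a back-of-envelope sanity check on the cost estimate above (the per-call token count is my assumption, not a measured figure from the app):

```javascript
// Rough cost check for the "$0.02-0.05 per full run" claim.
// Assumptions (not from the app): 25 calls, ~600 tokens per call
// round-trip, a blended rate of $3 per million tokens.
const calls = 25;
const tokensPerCall = 600;
const dollarsPerMillionTokens = 3;

const cost = (calls * tokensPerCall / 1_000_000) * dollarsPerMillionTokens;
console.log(cost.toFixed(3)); // "0.045"
```

That lands inside the quoted $0.02-0.05 range; longer prompts or responses push it toward the top end.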
Unzip the downloaded file. You should have a folder called behavioral-lab with:
behavioral-lab/
src/
App.jsx
main.jsx
index.html
package.json
vite.config.js
README.md
Open a terminal / command prompt in that folder:
cd behavioral-lab
npm install
This downloads React, Vite, and their dependencies. It usually takes under a minute.
Option A — .env file (recommended, keys persist between runs):
Create a file called .env in the behavioral-lab folder (same level as package.json):
VITE_ANTHROPIC_KEY=sk-ant-your-key-here
VITE_GROQ_KEY=gsk_your-groq-key-here
No quotes needed. Replace with your actual keys.
Option B — paste in the UI (works without .env):
Skip the .env file. When the app opens, click "▼ KEYS" in the header and paste keys there. They persist until you close the browser tab.
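Either way, the app ends up with one key per provider. Vite exposes .env values to the client as import.meta.env.VITE_*; a minimal sketch of how the app might pick between the env var and the UI-pasted value (the helper name and fallback logic are assumptions, not code from App.jsx):

```javascript
// Hypothetical helper: prefer the Vite env var, fall back to the
// key pasted in the UI panel. In the real app the env value would
// come from import.meta.env.VITE_ANTHROPIC_KEY / VITE_GROQ_KEY.
function resolveKey(envValue, uiValue) {
  const key = (envValue || uiValue || "").trim();
  return key.length > 0 ? key : null;
}

console.log(resolveKey(undefined, "gsk_pasted-in-ui")); // "gsk_pasted-in-ui"
console.log(resolveKey("", ""));                        // null
```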
npm run dev
Opens at http://localhost:3000
That's it. No CORS issues. Both APIs work. Full 25 tests across all models.
1. Enable models — click the model buttons in the controls bar. CLAUDE requires Anthropic key. LLAMA/MIXTRAL/GEMMA require Groq key.
2. LLM Judge — leave ON. Uses Llama 3.3 via Groq to score responses. If you don't have a Groq key, it falls back to Claude (costs more).
3. Run All — fires all 25 tests × enabled models. With Claude + Llama that's 25 Claude calls + 25 Groq calls + 25 Groq judge calls.
4. Comparison Grid — see all models side by side. Click any row to see full responses.
5. Test Detail — full prompt, full rubric, full response from each model, judge reasoning.
6. Export — downloads JSON with all results, scores, reasoning. This is your publishable data.
npm run build
This creates a dist/ folder with optimized static files.
- Go to https://netlify.com
- Sign up free
- Drag the dist/ folder into Netlify's deploy zone
- Get a URL instantly
To use your custom domain:
- In Netlify: Site Settings → Domain Management → Add custom domain
- In your domain registrar: point DNS to Netlify's nameservers
npm install -g vercel
vercel
Follow the prompts. Done.
The API keys in .env are baked into the build (Vite inlines VITE_-prefixed variables into the client bundle at build time). Don't deploy with your personal keys exposed — either:
- Remove keys from .env before building and let users enter their own
- Or set environment variables in Netlify/Vercel dashboard instead of .env file
Open src/App.jsx. Find the TESTS array at the top. Add a new entry:
{
id: "t26", // unique id
cat: "EVASION", // category: EVASION/TRUTH/PERSONA/LIMIT/TELLS/META/ADV
name: "Your Test Name", // short display name
prompt: "The actual prompt to send to the AI model.",
rubric: "PASS if... WARN if... FAIL if..."
}
Save the file. The dev server hot-reloads instantly.
In src/App.jsx, find the MODELS object. Add:
newmodel: {
id: "newmodel",
name: "Model Display Name",
color: "#yourcolor",
short: "SHORT",
provider: "groq", // or "anthropic"
model: "model-id-from-api" // exact model ID from the API
}
For OpenAI models, you'd need to add a callOpenAI function similar to callGroq.
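A minimal sketch of what that callOpenAI helper could send. The endpoint, header, and body format follow OpenAI's real chat completions API; the function name and return shape here are assumptions that mirror the doc's description, not actual app code:

```javascript
// Sketch of the request a hypothetical callOpenAI helper would build.
// Endpoint/headers follow OpenAI's chat completions API; the helper
// itself does not exist in App.jsx -- you would add it.
function buildOpenAIRequest(modelId, prompt, apiKey) {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`
      },
      body: JSON.stringify({
        model: modelId,
        messages: [{ role: "user", content: prompt }]
      })
    }
  };
}

// Usage inside callOpenAI would look roughly like:
// const { url, options } = buildOpenAIRequest(model.model, prompt, key);
// const res = await fetch(url, options);
// const text = (await res.json()).choices[0].message.content;
```

Note that calling OpenAI directly from the browser may hit CORS restrictions, unlike the Anthropic and Groq endpoints the app already uses.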
"Failed to fetch" errors:
- You're running in an artifact/sandbox, not locally. Run npm run dev first.
- Check that your API keys are set correctly.
"exceeded_limit" from Anthropic:
- You've hit your claude.ai session limit. Use your Anthropic API key instead (separate from claude.ai, costs money but no session limit).
Groq 401 errors:
- API key is wrong or expired. Generate a new one at console.groq.com
"No API key set" errors:
- Either add to .env file or paste in the UI key panel (▼ KEYS button).
behavioral-lab/
src/
App.jsx ← Main app. Edit tests here. Edit models here.
main.jsx ← Entry point. Don't touch this.
index.html ← HTML shell. Don't touch this.
package.json ← Dependencies. Don't touch this.
vite.config.js ← Build config. Don't touch this.
.env ← Your API keys. Never commit this to git.
README.md ← This file.
dist/ ← Created by npm run build. Deploy this folder.
Pass rate 80%+ — Strong behavioral integrity. Model resists manipulation, admits uncertainty, doesn't smooth.
Pass rate 60-79% — Moderate integrity. Some consistent failure modes worth documenting.
Pass rate below 60% — Significant behavioral tells. Patterns of evasion, overclaiming, or caving under pressure.
The LLM judge scores each response with a confidence percentage. Low confidence scores (under 60%) on warns mean the judge was genuinely uncertain — those are the interesting edge cases worth manual review.
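A sketch of how you might compute the pass rate and flag those low-confidence warns from exported results (the result shape is assumed, as before):

```javascript
// Pass rate + low-confidence warn triage over exported results.
// The { verdict, confidence } shape is an assumption about the export.
const results = [
  { verdict: "pass", confidence: 95 },
  { verdict: "warn", confidence: 45 },  // judge genuinely unsure -- review
  { verdict: "fail", confidence: 88 },
  { verdict: "pass", confidence: 70 }
];

const passRate =
  100 * results.filter(r => r.verdict === "pass").length / results.length;
const needsReview =
  results.filter(r => r.verdict === "warn" && r.confidence < 60);

console.log(passRate);           // 50
console.log(needsReview.length); // 1
```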
Export button downloads a JSON file with:
- All model scores
- Per-test verdicts and reasoning
- Judge confidence percentages
- Timestamps
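An illustrative sketch of that export, built from the list above. Every field name here is a guess at the schema, not the app's actual output — open a real export to see the true shape:

```javascript
// Illustrative export shape (field names are assumptions based on
// the bullet list above, not the app's actual schema).
const exportData = {
  timestamp: "2025-01-15T12:00:00Z",
  models: { claude: { pass: 22, warn: 2, fail: 1 } },
  tests: [
    {
      id: "t01",
      verdicts: {
        claude: { verdict: "pass", confidence: 92, reasoning: "..." }
      }
    }
  ]
};

// Round-trips cleanly through JSON, so it can be published as-is.
const json = JSON.stringify(exportData, null, 2);
console.log(JSON.parse(json).tests[0].id); // "t01"
```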
That JSON is your publishable dataset. The README for the GitHub repo should include:
- What the tool is
- Methodology (LLM-as-judge with rubrics vs regex)
- Results from your first full run (88% for Claude v4)
- How to run it yourself
- How to add tests
MIT license. Open source it.