Evaluate a model's suitability as a main Hermes Agent — in a clean-room Docker environment, orchestrated via kanban.
Groktobench runs a standardized 3-phase test battery against a model running inside an isolated Hermes Agent Docker container. It measures three axes:
- Skill Recognition — does the model load the right Hermes skill when given a task?
- Skill Fidelity — once loaded, does it follow the skill's instructions?
- Workflow Chaining — can it chain multiple skills across a complete workflow without losing context?

Prerequisites:
- Docker (the Hermes Agent Docker image is pulled automatically)
- A model API key (OpenAI-compatible endpoint)
- Hermes Agent with kanban support
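
Once the credentials from the setup steps are exported, it can help to confirm that nothing is missing before building images. The following is a minimal sketch, not a script shipped with Groktobench: the helper name `require_env` is hypothetical, and the variable names are the ones used in the setup steps.

```bash
# Hypothetical preflight helper (not part of Groktobench).
# Returns non-zero and names the first variable that is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

One possible use is `command -v docker >/dev/null || echo "docker not found on PATH" >&2` followed by `require_env GROKTOBENCH_API_KEY GROKTOBENCH_MODEL GROKTOBENCH_BASE_URL`. The helper only checks that the values are non-empty; it does not validate the key against the endpoint.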
# 1. Clone the repo
git clone https://github.com/groktopus/groktobench.git
cd groktobench
# 2. Set your model credentials
export GROKTOBENCH_API_KEY="sk-..."
export GROKTOBENCH_MODEL="your-model-name"
export GROKTOBENCH_BASE_URL="https://api.openai.com/v1" # or your provider
# 3. Build the Hermes Agent base image (required: Groktobench builds on top of this)
git clone https://github.com/NousResearch/hermes-agent.git /tmp/hermes-agent
docker build -t hermes-agent:latest /tmp/hermes-agent
# 4. Build the Groktobench image
docker build -t groktobench-hermes -f docker/Dockerfile .
# 5. Start the clean-room Hermes container
# Reuses the values exported in step 2
GROKTOBENCH_API_KEY="$GROKTOBENCH_API_KEY" GROKTOBENCH_MODEL="$GROKTOBENCH_MODEL" docker compose -f docker/docker-compose.yml up -d
# 6. Run the evaluation
# If Docker is local:
./scripts/run-full-suite.sh groktobench
# If Docker is on a remote host:
./scripts/deploy-and-run.sh user@host groktobench

The HARP score ranges from 0 to 100:
| Score | Verdict | Meaning |
|---|---|---|
| 85-100 | Daily driver | Use confidently as main agent |
| 65-84 | Viable with caveats | Good for structured work, watch for specific gaps |
| 45-64 | Experimental | Expect course-correction; good for aux roles |
| <45 | Not suitable | Will frustrate in any role |
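
The bands in the table can be applied mechanically. The following is a minimal sketch that assumes the score is a whole number; the function name `harp_verdict` is hypothetical and is not part of Groktobench.

```bash
# Hypothetical helper (not part of Groktobench): map an integer HARP score
# to the verdict bands from the table above.
harp_verdict() {
  score="$1"
  # Reject empty input and anything containing a non-digit character.
  case "$score" in
    ''|*[!0-9]*) echo "invalid score: $score" >&2; return 1 ;;
  esac
  if [ "$score" -gt 100 ]; then
    echo "invalid score: $score" >&2
    return 1
  fi
  if   [ "$score" -ge 85 ]; then echo "Daily driver"
  elif [ "$score" -ge 65 ]; then echo "Viable with caveats"
  elif [ "$score" -ge 45 ]; then echo "Experimental"
  else                           echo "Not suitable"
  fi
}
```

For example, `harp_verdict 72` prints `Viable with caveats`. Boundary values belong to the higher band, so 85 is a daily driver and 84 is not.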

Each phase targets one of the three axes:
- Skill Recognition: one problem per stock Hermes skill category. Does the model reach for skill_view() before executing?
- Skill Fidelity: does the model respect the skill's instructions once loaded, or does it load the right skill and then ignore it?
- Workflow Chaining: end-to-end tasks that chain 3+ skills. Does context survive across skill boundaries?

Groktobench uses only synthetic data. No real projects, no real infrastructure, no personal information. The Docker container is isolated — nothing from your host Hermes config leaks into the evaluation.
MIT — see LICENSE