Groktobench — Hermes Agent Readiness Protocol (HARP)

Evaluate a model's suitability as a main Hermes Agent in a clean-room Docker environment, orchestrated via kanban.

Groktobench runs a standardized 3-phase test battery against a model running inside an isolated Hermes Agent Docker container. It measures three axes:

Skill Recognition — does the model load the right Hermes skill when given a task?
Skill Fidelity — once loaded, does it follow the skill's instructions?
Workflow Chaining — can it chain multiple skills across a complete workflow without losing context?

Prerequisites

Docker (the Hermes Agent Docker image is pulled automatically)
A model API key (OpenAI-compatible endpoint)
Hermes Agent with kanban support

Quick Start

# 1. Clone the repo
git clone https://github.com/groktopus/groktobench.git
cd groktobench

# 2. Set your model credentials
export GROKTOBENCH_API_KEY="sk-..."
export GROKTOBENCH_MODEL="your-model-name"
export GROKTOBENCH_BASE_URL="https://api.openai.com/v1"  # or your provider

# 3. Build the Hermes Agent base image (required: Groktobench builds on top of this)
git clone https://github.com/NousResearch/hermes-agent.git /tmp/hermes-agent
docker build -t hermes-agent:latest /tmp/hermes-agent

# 4. Build the Groktobench image
docker build -t groktobench-hermes -f docker/Dockerfile .

# 5. Start the clean-room Hermes container
GROKTOBENCH_API_KEY="sk-..." GROKTOBENCH_MODEL="your-model-name" docker compose -f docker/docker-compose.yml up -d

# 6. Run the evaluation
#    If Docker is local:
./scripts/run-full-suite.sh groktobench
#    If Docker is on a remote host:
./scripts/deploy-and-run.sh user@host groktobench
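
Before step 5 it can help to confirm that the three variables from step 2 are actually set in the current shell. The helper below is a minimal sketch of such a check; the function name groktobench_missing_vars is hypothetical and is not part of the Groktobench repo.

```shell
# Hypothetical preflight helper (an assumption, not shipped with Groktobench).
# Prints the name of each required variable that is unset or empty.
# Returns 0 when all three are set, 1 otherwise.
groktobench_missing_vars() {
  _gb_missing=0
  for _gb_name in GROKTOBENCH_API_KEY GROKTOBENCH_MODEL GROKTOBENCH_BASE_URL; do
    # Look up the variable whose name is stored in _gb_name (POSIX-safe indirection).
    eval "_gb_value=\${$_gb_name:-}"
    if [ -z "$_gb_value" ]; then
      echo "$_gb_name"
      _gb_missing=1
    fi
  done
  return "$_gb_missing"
}
```

A possible use is `groktobench_missing_vars || echo "set the variables listed above first"`, run right before starting the container.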

Scoring

The HARP score ranges from 0 to 100:

Score	Verdict	Meaning
85-100	Daily driver	Use confidently as main agent
65-84	Viable with caveats	Good for structured work, watch for specific gaps
45-64	Experimental	Expect course-correction; good for aux roles
<45	Not suitable	Will frustrate in any role
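
The bands in the table can be expressed as a small lookup. This is a sketch only: the function name harp_verdict is hypothetical, and it assumes the score arrives as a whole number, which the table does not state.

```shell
# Hypothetical helper mapping a HARP score to the verdict from the table above.
# Assumes an integer score between 0 and 100; anything else is rejected with status 2.
harp_verdict() {
  _harp_score=$1
  case "$_harp_score" in
    ''|*[!0-9]*)
      echo "invalid score: $_harp_score" >&2
      return 2
      ;;
  esac
  if [ "$_harp_score" -gt 100 ]; then
    echo "invalid score: $_harp_score" >&2
    return 2
  elif [ "$_harp_score" -ge 85 ]; then
    echo "Daily driver"
  elif [ "$_harp_score" -ge 65 ]; then
    echo "Viable with caveats"
  elif [ "$_harp_score" -ge 45 ]; then
    echo "Experimental"
  else
    echo "Not suitable"
  fi
}
```

For example, `harp_verdict 72` prints "Viable with caveats".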

Protocol Overview

Phase 1: Skill Recognition (8 probes, ~30 min)

One problem per stock Hermes skill category. Does the model reach for skill_view() before executing?
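
To make the Phase 1 question concrete, here is a rough sketch of the kind of check it implies. It does not reflect how Groktobench actually scores probes. It assumes a hypothetical trace file that lists one tool call per line in the order the model made them, and the function name loads_skill_first is likewise an assumption.

```shell
# Hypothetical check: did the model call skill_view() before anything else?
# Input: path to a trace file with one tool call per line (assumed format).
# Returns 0 if the first non-blank line starts with "skill_view(", 1 otherwise.
loads_skill_first() {
  # Take the first whitespace-delimited token of the first non-blank line.
  _gb_first=$(awk 'NF { print $1; exit }' "$1")
  case "$_gb_first" in
    "skill_view("*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With a trace whose first line is `skill_view("some-skill")` the function succeeds; with a trace that starts with any other call, or an empty trace, it fails. The skill name here is a placeholder.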

Phase 2: Skill Fidelity (5 probes, ~30 min)

Does the model follow the skill's instructions once loaded, or does it load the right skill and then ignore it?

Phase 3: Workflow Chaining (2 workflows, ~30 min)

End-to-end tasks that chain 3+ skills. Does context survive across skill boundaries?

Privacy

Groktobench uses only synthetic data. No real projects, no real infrastructure, no personal information. The Docker container is isolated — nothing from your host Hermes config leaks into the evaluation.

License

MIT — see LICENSE

