This repository is a starting point for evaluating skills. It shows a simple workflow: run a task without a skill, run it again with a skill, verify the output locally, and compare the results. From here, you can move to more complex methods for evaluating skills.
The examples are adapted from SkillsBench, but the goal here is not to recreate a full benchmark harness. The goal is to give you a practical starting point for testing your own skills with OpenHands Cloud or a local agent server. All the code can be run in your local environment, and if you want to use an open source observability platform like Laminar, just add your key.
There is also an accompanying blog post and a YouTube video that explain the background of this project and demonstrate the evaluation of skills.
Current task examples:
- software-dependency-audit
- sec-financial-report
- sales-pivot-analysis
Outcomes:
This repo helps you assess three distinct task examples. You can see overall performance, performance by model, and traces. As you go through these, you will see that sometimes adding a skill leads to large improvements, sometimes to marginal ones, and on occasion a skill can be counterproductive. I intentionally chose these examples to broaden your thinking about skill evaluation.
- For a deeper dive into evaluating skills, check out docs/METHODOLOGY.md
- To add a new task, follow docs/ADDING_A_TASK.md
You can run this on OpenHands Cloud or with a local OpenHands Agent:
Requirements:
- Python 3.12+
- uv
- OpenHands credentials
- Docker Desktop if you want the local agent-server path
Install:
uv sync

Choose a routed model:
export LLM_MODEL=openhands/claude-sonnet-4-5-20250929

For OpenHands Cloud:
export OPENHANDS_CLOUD_API_KEY=...

- OPENHANDS_CLOUD_API_KEY: your OpenHands Cloud API key (https://docs.openhands.dev/openhands/usage/cloud/cloud-api)
- GitHub token: create a token with `repo` scope if you are using a token-based GitHub connection for repo-backed Cloud runs (https://docs.openhands.dev/usage/cloud/github-installation)
For the local agent-server path:
export LLM_API_KEY=...

- LLM_API_KEY: your OpenAI, Anthropic, or OpenHands LLM key (https://docs.openhands.dev/openhands/usage/settings/api-keys-settings)
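Before launching a run, it can help to fail fast on missing configuration. The sketch below is a small, hedged helper that assumes the variables named above are the only ones required per backend; adjust it for your own setup.

```python
# Hedged sketch: check that the env vars named above are set for the
# chosen backend before launching an evaluation.
import os

REQUIRED = {
    "cloud": ["LLM_MODEL", "OPENHANDS_CLOUD_API_KEY"],
    "agent-server": ["LLM_MODEL", "LLM_API_KEY"],
}

def missing_vars(backend: str, env=None) -> list[str]:
    """Return required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED[backend] if not env.get(name)]

if __name__ == "__main__":
    for backend in REQUIRED:
        print(backend, "missing:", missing_vars(backend))
```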
Optional tracing using Laminar:
export LMNR_PROJECT_API_KEY=...

This is the main tutorial path.
Run against the GitHub repo directly:
uv run python scripts/run_sec_financial_report_eval.py \
--backend cloud \
--execution-mode repo \
--condition no-skill \
--cloud-repo rajshah4/evaluating-skills-tutorial
uv run python scripts/run_sec_financial_report_eval.py \
--backend cloud \
--execution-mode repo \
--condition improved-skill \
  --cloud-repo rajshah4/evaluating-skills-tutorial

The same pattern works for sales-pivot-analysis:
uv run python scripts/run_sales_pivot_eval.py --backend cloud --execution-mode repo --condition no-skill --cloud-repo rajshah4/evaluating-skills-tutorial
uv run python scripts/run_sales_pivot_eval.py --backend cloud --execution-mode repo --condition improved-skill --cloud-repo rajshah4/evaluating-skills-tutorial

Each task has a thin wrapper script in scripts/ so the tutorial reads like one evaluation per task instead of one giant command with --task everywhere. If you add your own task, copying one of these wrappers is the simplest way to create a task-specific entrypoint while still reusing the shared engine in scripts/run_eval.py.
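For orientation, a wrapper can be little more than pinning the task and forwarding everything else. This is a hypothetical sketch, not the repo's actual wrapper code; the real wrappers in scripts/ may import the engine directly rather than shelling out, so check scripts/run_eval.py and docs/ADDING_A_TASK.md for the actual interface.

```python
# Hypothetical wrapper sketch, e.g. scripts/run_my_task_eval.py.
# Assumes the shared engine accepts a --task flag on its CLI; the real
# wrappers in scripts/ may wire this up differently.
import sys

TASK = "my-new-task"  # hypothetical task name

def build_command(passthrough_args: list[str]) -> list[str]:
    # Pin the task; forward flags like --backend and --condition untouched.
    return ["uv", "run", "python", "scripts/run_eval.py",
            "--task", TASK, *passthrough_args]

if __name__ == "__main__":
    print("would run:", " ".join(build_command(sys.argv[1:])))
```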
Note on software-dependency-audit: This task requires separate conversations for skill vs. no-skill testing to avoid leaking the pinned Trivy report. The OpenHands CLI handles this correctly; alternatively, use the local agent-server path for a clean skill comparison.
Skills are authored next to each task under tasks/<task>/skills/<variant>/SKILL.md. After editing task-local skills, regenerate the compatibility copies used by Cloud V1 and AGENTS with:
uv run python scripts/sync_skills.py

Use a local agent server when you want a local runtime with a similar client-to-server shape.
Start the server:
./scripts/start_local_agent_server.sh

Run an evaluation:
uv run python scripts/run_sec_financial_report_eval.py \
--backend agent-server \
--execution-mode repo \
  --condition improved-skill

For software-dependency-audit, use the default upload mode locally as well:
uv run python scripts/run_dependency_audit_eval.py --backend agent-server --condition no-skill
uv run python scripts/run_dependency_audit_eval.py --backend agent-server --condition improved-skill

Recommended local env vars:
export OPENHANDS_AGENT_SERVER_URL=http://127.0.0.1:8000

For the exact local setup, see IMPLEMENTATION.md.
Validated live after the task-local skill refactor:
uv run python scripts/run_dependency_audit_eval.py --backend cloud --condition improved-skill
uv run python scripts/run_sec_financial_report_eval.py --backend agent-server --execution-mode repo --condition improved-skill
uv run python scripts/run_sales_pivot_eval.py --backend agent-server --execution-mode repo --condition improved-skill
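Each run ends in an artifact that verify.py checks locally, and each task expects a different artifact type. As a hedged sketch (the task-to-artifact mapping below mirrors this tutorial's examples; verify.py itself is the authoritative source), you can pair tasks with their expected artifacts before shelling out:

```python
# Hedged sketch: pair each task with the artifact suffix its verifier
# expects. The mapping mirrors this tutorial's verify.py examples;
# verify.py itself is the authoritative source.
ARTIFACT_EXT = {
    "software-dependency-audit": ".json",   # report.json
    "sec-financial-report": ".json",        # answers.json
    "sales-pivot-analysis": ".xlsx",        # result.xlsx
}

def verify_command(task: str, artifact_path: str) -> list[str]:
    """Build a verify.py invocation, rejecting obviously wrong artifacts."""
    ext = ARTIFACT_EXT.get(task)
    if ext is None:
        raise ValueError(f"unknown task: {task!r}")
    if not artifact_path.endswith(ext):
        raise ValueError(f"{task} expects a {ext} artifact")
    return ["uv", "run", "python", "verify.py", "--task", task, artifact_path]
```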
Verify a saved run:
uv run python verify.py --task software-dependency-audit /path/to/report.json
uv run python verify.py --task sec-financial-report /path/to/answers.json
uv run python verify.py --task sales-pivot-analysis /path/to/result.xlsx

Generate summaries and visuals:
uv run python scripts/compare_runs.py
uv run python scripts/export_metrics_summary.py
uv run python scripts/generate_visuals.py

Included outputs:
You can also compare results across multiple models. The examples here use OpenHands-routed models in the format openhands/<model>.
Validated examples:
- openhands/claude-sonnet-4-5-20250929
- openhands/minimax-m2.5
- openhands/gemini-3-pro-preview
- openhands/gemini-3-flash-preview
- openhands/kimi-k2-0711-preview
Example:
uv run python scripts/run_model_matrix.py \
--task sec-financial-report \
--backend agent-server \
--condition improved-skill \
--model openhands/claude-sonnet-4-5-20250929 \
--model openhands/minimax-m2.5 \
--model openhands/gemini-3-pro-preview \
--model openhands/gemini-3-flash-preview \
  --model openhands/kimi-k2-0711-preview

This tutorial uses Laminar as the example tracing backend, but the evaluation loop is not tied to Laminar. Traces help explain behavior; the verifier decides correctness. OpenHands is OTEL-compatible, so you can use the observability tool of your choice.
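Long --model lists like the one in the matrix example can be generated rather than hand-typed. Here is a hedged sketch, assuming run_model_matrix.py keeps the flags shown in that example:

```python
# Hedged sketch: assemble a run_model_matrix.py invocation from a list
# of routed models, matching the flags in the matrix example above.
MODELS = [
    "openhands/claude-sonnet-4-5-20250929",
    "openhands/minimax-m2.5",
]

def matrix_command(task: str, backend: str, condition: str,
                   models: list[str]) -> list[str]:
    cmd = ["uv", "run", "python", "scripts/run_model_matrix.py",
           "--task", task, "--backend", backend, "--condition", condition]
    for model in models:
        cmd += ["--model", model]
    return cmd

if __name__ == "__main__":
    print(" ".join(matrix_command("sec-financial-report", "agent-server",
                                  "improved-skill", MODELS)))
```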
This tutorial is inspired by SkillsBench and reuses its core idea of evaluating skills on deterministic tasks with local verifiers.