by adewale
Agent Skill evaluation harness for paired variants, trace artifacts, and runner adapters
# Add to your Claude Code skills
git clone https://github.com/adewale/skill-eval-harness
skill-eval-harness is an open-source Agent Skill evaluation harness for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by adewale. It has 50 GitHub stars.
skill-eval-harness is primarily written in Python. It is open-source under adewale on GitHub, so you can review or fork the full source.
Skill Eval Harness is a Python CLI for testing whether an Agent Skill changes observable output. It reads evals/shared-benchmark.json, emits answer-key-safe task rows, grades files under eval-runs/, and writes benchmark reports you can diff across variants.
The main question is narrow: when the same case runs with and without the skill, what changed, what passed, and did the eval itself leak the answer?
- evals/shared-benchmark.json: prompt, split, fixture files, variants, assertions, and ablations.
- skill-benchmark prepare; generation rows omit expected_behavior and judge rubrics unless you explicitly request them.
- output.md and optional metadata.json.
- script oracles.
- with_skill, without_skill, optional old_skill, and ablation:<id>.
- tune, holdout, and holdback stay separate.
- judge/rubric assertions can be exported or run through a user-supplied --judge-cmd; the harness does not choose a model for you.

Requires Python 3.10+ and uv. Install from GitHub first:
uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.2
Run these from a skill repo that has evals/shared-benchmark.json:
# 1. Check manifest shape and fixture paths.
skill-benchmark validate evals/shared-benchmark.json
# 2. Emit answer-key-safe task rows for a runner.
skill-benchmark prepare evals/shared-benchmark.json \
--split tune \
--runs-per-variant 3 \
--out /tmp/tasks.jsonl
# 3. Run each task with your agent runner and save:
# eval-runs/latest/<case_id>/<variant>/run-<n>/output.md
# eval-runs/latest/<case_id>/<variant>/run-<n>/metadata.json
# 4. Grade saved outputs. Add --allow-scripts only if you trust repo-owned oracles.
skill-benchmark benchmark evals/shared-benchmark.json \
--runs eval-runs/latest \
--split tune \
--allow-scripts \
--out benchmark.json
# 5. Open a static review page.
skill-benchmark render-viewer \
--benchmark benchmark.json \
--runs eval-runs/latest \
--out review.html
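Step 3 is left to your own runner. As a minimal sketch of the saved layout, assuming each task row carries `case_id`, `variant`, and `run` keys (hypothetical names; check the rows `prepare` actually emits) and with `run_agent` standing in for your real agent call:

```python
import json
import tempfile
from pathlib import Path


def run_dir(root: Path, case_id: str, variant: str, run: int) -> Path:
    """Build <root>/<case_id>/<variant>/run-<n>, the layout shown in step 3."""
    return root / case_id / variant / f"run-{run}"


def save_run(root: Path, task: dict, output: str, metadata: dict) -> Path:
    """Write output.md and metadata.json for one task row."""
    # "case_id", "variant", and "run" are assumed key names, not the harness schema.
    target = run_dir(root, task["case_id"], task["variant"], task["run"])
    target.mkdir(parents=True, exist_ok=True)
    (target / "output.md").write_text(output, encoding="utf-8")
    (target / "metadata.json").write_text(json.dumps(metadata), encoding="utf-8")
    return target


def run_agent(task: dict) -> str:
    """Placeholder for your real agent runner."""
    return f"stub output for {task['case_id']} ({task['variant']})"


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp()) / "eval-runs" / "latest"
    task = {"case_id": "pos-security-meaningless-test", "variant": "with_skill", "run": 1}
    # The metadata key here is illustrative only.
    saved = save_run(root, task, run_agent(task), {"elapsed_seconds": 0.0})
    print(saved.relative_to(root))
```

Keeping the path logic in one small function makes it harder for a runner and the grader to disagree about where a run lives.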
Expected landmarks:
validate -> OK: <skill-name> — <case-count> cases, <ablation-count> ablations
prepare -> /tmp/tasks.jsonl, one JSON object per case/variant/run
benchmark -> benchmark.json with summary, results, and case_flags
viewer -> review.html with assertion evidence and output previews
benchmark.json records one row per case/variant/run, plus aggregate pass rates, timing/token summaries, and flags for saturated, no-lift, flaky, or with-skill-failed cases.
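To make the flag names concrete, here is a rough sketch of how per-run rows could roll up into pass rates and flags. The row keys (`case_id`, `variant`, `passed`) and every flag rule below are assumptions for illustration, not the harness's actual schema or thresholds:

```python
from collections import defaultdict


def pass_rates(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Aggregate the pass rate per (case_id, variant) from per-run rows."""
    buckets: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for row in rows:
        buckets[(row["case_id"], row["variant"])].append(bool(row["passed"]))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def case_flags(rows: list[dict]) -> dict[str, list[str]]:
    """Flag each case using illustrative rules (assumed, not the harness's own)."""
    rates = pass_rates(rows)
    flags: dict[str, list[str]] = {}
    for case_id in sorted({row["case_id"] for row in rows}):
        with_skill = rates.get((case_id, "with_skill"))
        without = rates.get((case_id, "without_skill"))
        found = []
        if with_skill is not None and without is not None:
            if with_skill == 1.0 and without == 1.0:
                found.append("saturated")  # both variants always pass
            elif with_skill <= without:
                found.append("no-lift")  # the skill did not raise the pass rate
        if with_skill is not None and with_skill < 1.0:
            found.append("with-skill-failed")  # at least one with_skill run failed
        if any(0.0 < r < 1.0 for (cid, _), r in rates.items() if cid == case_id):
            found.append("flaky")  # mixed pass/fail across repeated runs
        flags[case_id] = found
    return flags
```

A saturated case cannot tell you whether the skill helps, which is why it is worth flagging separately from a plain no-lift result.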
uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.2
skill-benchmark --help
skill-pi-trigger-eval --help
# One-shot without installing globally:
uvx --from git+https://github.com/adewale/skill-eval-harness.git@v0.4.2 skill-benchmark --help
The installed commands are:
| Command | What it does |
|---|---|
| skill-benchmark | Validate manifests, prepare tasks, grade outputs, compare variants, run judges, and import/export runner formats. |
| skill-pi-trigger-eval | Runs Pi without forced --skill and checks whether the model loads the skill from stream events. |
git clone https://github.com/adewale/skill-eval-harness.git
cd skill-eval-harness
uv tool install --editable .
skill-benchmark --help
| File | Use it for |
|---|---|
| README.md | Manifest shape, run layout, and command contracts. |
| CHANGELOG.md | Release history and unreleased repo-surface changes. |
| CONTRIBUTING.md | Local setup, validation commands, and eval-safety rules. |
| LESSONS_LEARNED.md | Design lessons from the multi-skill saturation work. |
| docs/vocabulary.md | Glossary of harness terms: variants, splits, ablations, assertions, trace artifacts, and report flags. |
| docs/evals-are-not-tests.md | Why a skill eval is not a unit test, and what that changes about reading results. |
| docs/jetty-support-spec.md | Jetty payload/import contract and live-token unknowns. |
| docs/trace-aware-eval-spec.md | Trace artifact contract, shipped v0.4.1 runner support, process/efficiency assertions, and remaining trace work. |
| docs/repo-effectiveness-audit.md | good-repo audit, score, package metadata fixes, and manual GitHub settings checklist. |
| TODO.md | Remaining Jetty work: streaming/concurrency, live API validation, materialized ablations, judge export, and per-variant overrides. |
| examples/adewale-workspace/ | Adewale-specific runners for Pi smoke, trigger, ablation, and aggregate reports. |
| tests/test_skill_benchmark.py | Executable examples for grading, leakage lint, script assertions, judge commands, Jetty export/import, trace artifacts, and trigger detection. |
Each skill repo owns an evals/shared-benchmark.json manifest. Add a harness block so readers know which external harness/version to install.
{
"version": 1,
"skill_name": "good-pr",
"harness": {
"name": "skill-eval-harness",
"url": "https://github.com/adewale/skill-eval-harness",
"version": ">=0.4.1"
},
"skill_paths": ["skills/good-pr/SKILL.md"],
"variants": ["with_skill", "without_skill"],
"optional_variants": ["old_skill"],
"split_policy": {
"tune": "Visible cases used during iteration.",
"holdout": "Hidden cases scored only at end-of-round or merge.",
"holdback": "Examples not exposed in skill/docs/eval descriptions until after scoring."
},
"cases": [
{
"id": "pos-security-meaningless-test",
"split": "tune",
"kind": "pr-review",
"domain": "pull-request-quality",
"difficulty": "core",
"trigger_type": "explicit",
"success_goals": ["outcome", "style"],
"prompt": "Security fix PR includes `expect(result).toBeDefined()` as the only auth-bypass test...",
"files": ["fixtures/security-pr/diff.patch"],
"expected_behavior": ["Flag the weak test and require regression proof."],
"assertions": [
{"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]},
{"name": "qualitative-review", "type": "judge", "rubric": ["Specific", "maintainer-friendly"]}
],
"tags": ["security", "testing"]
}
],
"ablations": [
{
"id": "no-regression-proof",
"removed_component": "regression-proof requirement",
"expected_regressions": ["Accepts weak tests that still pass without the fix"]
}
]
}
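The manifest mixes runner inputs with answer keys, so anything handed to a generation runner should drop the grading fields first. The sketch below illustrates that idea using the field names from the example above. The row shape and the `to_generation_row` and `expand` helpers are hypothetical and do not describe what `skill-benchmark prepare` actually emits:

```python
import copy

# Fields from the example manifest that would reveal the expected answer.
ANSWER_KEY_FIELDS = ("expected_behavior", "assertions")


def to_generation_row(case: dict, variant: str, run: int) -> dict:
    """Return a runner-facing row with the answer-key fields removed."""
    row = {k: copy.deepcopy(v) for k, v in case.items() if k not in ANSWER_KEY_FIELDS}
    row["variant"] = variant
    row["run"] = run
    return row


def expand(manifest: dict, split: str, runs_per_variant: int) -> list[dict]:
    """Produce one row per case/variant/run for the requested split."""
    return [
        to_generation_row(case, variant, run)
        for case in manifest["cases"]
        if case["split"] == split
        for variant in manifest["variants"]
        for run in range(1, runs_per_variant + 1)
    ]
```

With the example manifest, `expand(manifest, "tune", 3)` would yield six rows: one case, two variants, three runs each.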
| Split | Purpose | Prompt storage |
|---|---|---|
| tune | Visible cases used while editing the skill and evals. | Inline prompt is fine. |
| holdout | Hidden cases scored at end-of-round or merge. | Prefer private prompt_ref. |
| holdback | Not shown in skill/docs/evals until after scoring; detects memorization. | Prefer private prompt_ref and ignored answer keys. |
prepare fails on missing hidden prompts unless --allow-missing-prompts is used for dry-run planning.
Use optional files for fixture-backed evals. Paths are relative to the manifest's evals/ directory, validated by validate, and emitted by prepare as absolute input_files for the runner.
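As a sketch of that path rule, assuming the manifest sits directly inside `evals/` (as `evals/shared-benchmark.json` does) and using a hypothetical `resolve_input_files` helper in place of the harness's own code:

```python
import tempfile
from pathlib import Path


def resolve_input_files(manifest_path: Path, files: list[str]) -> list[str]:
    """Resolve fixture paths against the manifest's directory and require they exist."""
    base = manifest_path.resolve().parent  # the evals/ directory
    resolved = []
    for rel in files:
        candidate = (base / rel).resolve()
        if not candidate.is_file():
            raise FileNotFoundError(f"fixture not found: {rel}")
        resolved.append(str(candidate))
    return resolved


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        evals = Path(tmp) / "evals"
        fixture = evals / "fixtures" / "security-pr" / "diff.patch"
        fixture.parent.mkdir(parents=True)
        fixture.write_text("--- a\n+++ b\n", encoding="utf-8")
        manifest = evals / "shared-benchmark.json"
        print(resolve_input_files(manifest, ["fixtures/security-pr/diff.patch"]))
```

Failing early on a missing fixture keeps a broken path from surfacing later as a confusing runner error.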
Objective assertion types:
| Type | Checks |
|---|---|
| contains | One substring is present. |
| contains_any | At least one substring is present. |
| contains_all | Every listed substring is present. |
| excludes_any | No listed substring is present. |
| regex | Regex matches output. |
| not_regex | Regex does not match output. |
| file_exists | A file exists relative to the run directory. |
| json_field_equals | A JSON field equals an expected value. |
| script | Opt-in deterministic oracle command against the output directory. |
| skill_invoked | Trace/process check that the runner loaded the skill, or did not, as expected. |
| command_ran / command_not_ran | Trace/process checks over normalized command events. |
| command_order | Trace/process check that commands appeared in a required order. |
| tool_count_le / no_repeated_command_loop | Trace/process budgets for tool use and thrashing. |
| total_tokens_le / elapsed_seconds_le / command_count_le | Efficiency checks over metrics.json, metadata.json, or normalized events. |
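For intuition, the text-matching rows of this table can be approximated in a few lines. This is a sketch, not the harness's grader: `values` follows the manifest example, while the `value` and `pattern` keys and the `check` helper are assumed names.

```python
import re


def check(assertion: dict, output: str) -> bool:
    """Evaluate one text assertion against an output string."""
    kind = assertion["type"]
    values = assertion.get("values", [])
    if kind == "contains":
        return assertion["value"] in output  # "value" is an assumed key
    if kind == "contains_any":
        return any(v in output for v in values)
    if kind == "contains_all":
        return all(v in output for v in values)
    if kind == "excludes_any":
        return not any(v in output for v in values)
    if kind == "regex":
        return re.search(assertion["pattern"], output) is not None  # assumed key
    if kind == "not_regex":
        return re.search(assertion["pattern"], output) is None  # assumed key
    raise ValueError(f"unsupported assertion type: {kind}")


if __name__ == "__main__":
    output = "This test is weak: expect(result).toBeDefined() proves nothing."
    assertion = {"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]}
    print(check(assertion, output))
```

Checks like these only see surface strings, so a keyword match can pass on an answer that is wrong in substance.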
Use script when a keyword check is too weak for the property you care about. The command sees the candidate run directory, so it can inspect output.md, generated files under outputs/, or metadata. Script assertions are blocked unless you pass --allow-scripts to grade, benchmark, aggregate, or export-anthropic:
{
"name": "oracle-pass",
"type": "script",