skill-eval-harness

Name: skill-eval-harness
Author: adewale

Agent Skill evaluation harness for paired variants, trace artifacts, and runner adapters

Stars: 50
Forks: 2
Language: Python

Installation

# Add to your Claude Code skills
git clone https://github.com/adewale/skill-eval-harness

Frequently Asked Questions

What is skill-eval-harness?

skill-eval-harness is an open-source AI agent skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by adewale. It is an evaluation harness for Agent Skills that covers paired variants, trace artifacts, and runner adapters. It has 50 GitHub stars.

How do I install skill-eval-harness?

Clone the repository with "git clone https://github.com/adewale/skill-eval-harness" and add it to your Claude Code skills directory (see the Installation section above).

What programming language is skill-eval-harness written in?

skill-eval-harness is primarily written in Python. It is open-source under adewale on GitHub, so you can review or fork the full source.

Skill Eval Harness

Skill Eval Harness is a Python CLI for testing whether an Agent Skill changes observable output. It reads evals/shared-benchmark.json, emits answer-key-safe task rows, grades files under eval-runs/, and writes benchmark reports you can diff across variants.

The main question is narrow: when the same case runs with and without the skill, what changed, what passed, and did the eval itself leak the answer?

Core loop

1. Describe cases in evals/shared-benchmark.json: prompt, split, fixture files, variants, assertions, and ablations.
2. Prepare tasks with skill-benchmark prepare; generation rows omit expected_behavior and judge rubrics unless you explicitly request them.
3. Run tasks with Pi, Claude Code, Jetty, or another runner; each run writes output.md and optional metadata.json.
4. Grade outputs with deterministic assertions: string, regex, file, JSON field, and opt-in script oracles.
5. Inspect the report for pass rates, flaky repeated runs, no-lift cases, saturated assertions, judge tasks, and trigger/no-trigger results.
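
The prepare step can be pictured as a projection of each case onto the fields a runner needs. The sketch below is a minimal illustration under stated assumptions, not the harness's implementation: the helper name `to_task_rows` and the row keys `case_id`, `variant`, and `run` are hypothetical, and the sketch drops the whole `assertions` list rather than only the judge rubrics.

```python
import json


def to_task_rows(case, variants, runs_per_variant):
    """Yield one generation row per case/variant/run.

    Only runner-facing fields are copied, so expected_behavior and the
    assertions (including judge rubrics) never reach the runner.
    """
    for variant in variants:
        for run in range(1, runs_per_variant + 1):
            yield {
                "case_id": case["id"],  # assumed row key, not a documented schema
                "variant": variant,
                "run": run,  # assumed 1-based run index
                "prompt": case["prompt"],
                # Left relative here; the README says prepare emits absolute paths.
                "input_files": case.get("files", []),
            }


case = {
    "id": "pos-security-meaningless-test",
    "prompt": "Review this security fix PR.",  # shortened placeholder prompt
    "files": ["fixtures/security-pr/diff.patch"],
    "expected_behavior": ["Flag the weak test and require regression proof."],
    "assertions": [
        {"name": "qualitative-review", "type": "judge", "rubric": ["Specific"]}
    ],
}

rows = list(to_task_rows(case, ["with_skill", "without_skill"], runs_per_variant=3))
for row in rows:
    print(json.dumps(row))  # one JSON object per line, as in a .jsonl file
```

Building each row from an allow-list of fields, rather than deleting known answer-key fields, means a newly added answer-key field cannot leak by default.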

What the CLI owns

Variant pairing: with_skill, without_skill, optional old_skill, and ablation:<id>.
Split discipline: tune, holdout, and holdback stay separate.
Local grading: deterministic assertions run without model calls.
Eval hygiene: leakage lint, manifest audit, trigger checks, repeated-run stats, and fixture recommendations.
Interop: Anthropic-style exports, static HTML review pages, Pi trigger evals, and Jetty runbook-mode import/export.
Judge plumbing: judge/rubric assertions can be exported or run through a user-supplied --judge-cmd; the harness does not choose a model for you.

Contents

Quick start
Installation
Manifest format
Assertions
Run output contract
Commands
Jetty adapter
Contributing

Quick start

Requires Python 3.10+ and uv. Install from GitHub first:
uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.2

Run these from a skill repo that has evals/shared-benchmark.json:

# 1. Check manifest shape and fixture paths.
skill-benchmark validate evals/shared-benchmark.json

# 2. Emit answer-key-safe task rows for a runner.
skill-benchmark prepare evals/shared-benchmark.json \
  --split tune \
  --runs-per-variant 3 \
  --out /tmp/tasks.jsonl

# 3. Run each task with your agent runner and save:
# eval-runs/latest/<case_id>/<variant>/run-<n>/output.md
# eval-runs/latest/<case_id>/<variant>/run-<n>/metadata.json

# 4. Grade saved outputs. Add --allow-scripts only if you trust repo-owned oracles.
skill-benchmark benchmark evals/shared-benchmark.json \
  --runs eval-runs/latest \
  --split tune \
  --allow-scripts \
  --out benchmark.json

# 5. Open a static review page.
skill-benchmark render-viewer \
  --benchmark benchmark.json \
  --runs eval-runs/latest \
  --out review.html
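
Step 3 is left to your runner, so the only fixed part is the directory layout. The following sketch shows one way a runner wrapper might save a result into that layout. The helper name `save_run` and the `elapsed_seconds` metadata key are assumptions made for illustration; the README does not define the contents of metadata.json here.

```python
import json
import tempfile
from pathlib import Path


def save_run(runs_root, case_id, variant, run, output_text, metadata=None):
    """Write output.md (and optional metadata.json) into the run layout."""
    run_dir = Path(runs_root) / case_id / variant / f"run-{run}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "output.md").write_text(output_text, encoding="utf-8")
    if metadata is not None:
        (run_dir / "metadata.json").write_text(
            json.dumps(metadata, indent=2), encoding="utf-8"
        )
    return run_dir


# Demo in a temporary directory so nothing is written into a real repo.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = save_run(
        Path(tmp) / "eval-runs" / "latest",
        "pos-security-meaningless-test",
        "with_skill",
        1,
        "The only auth-bypass test is too weak.",
        {"elapsed_seconds": 12.5},  # hypothetical metadata key
    )
    relative = run_dir.relative_to(tmp).as_posix()
    written = sorted(p.name for p in run_dir.iterdir())

print(relative)
print(written)
```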

Expected landmarks:

validate  -> OK: <skill-name> — <case-count> cases, <ablation-count> ablations
prepare   -> /tmp/tasks.jsonl, one JSON object per case/variant/run
benchmark -> benchmark.json with summary, results, and case_flags
viewer    -> review.html with assertion evidence and output previews

benchmark.json records one row per case/variant/run, plus aggregate pass rates, timing/token summaries, and flags for saturated, no-lift, flaky, or with-skill-failed cases.
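
To make the flag names concrete, here is a small sketch of how such flags could be derived from per-run rows. The definitions are assumptions for illustration and may differ from the harness's own rules: "saturated" means both variants pass every run, "no-lift" means with_skill does not beat without_skill, and "flaky" means a variant passes some but not all repeated runs. The row keys `case_id`, `variant`, and `passed` are also hypothetical.

```python
from collections import defaultdict


def pass_rates(rows):
    """Map (case_id, variant) to the fraction of runs that passed."""
    totals = defaultdict(lambda: [0, 0])  # [passed, total]
    for row in rows:
        bucket = totals[(row["case_id"], row["variant"])]
        bucket[0] += 1 if row["passed"] else 0
        bucket[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


def case_flags(rows):
    """Derive per-case flags from paired with_skill/without_skill rates."""
    rates = pass_rates(rows)
    flags = defaultdict(list)
    for case_id in sorted({case for case, _ in rates}):
        with_skill = rates.get((case_id, "with_skill"))
        without = rates.get((case_id, "without_skill"))
        if with_skill is None or without is None:
            continue  # unpaired case: nothing to compare
        if with_skill == 1.0 and without == 1.0:
            flags[case_id].append("saturated")
        elif with_skill <= without:
            flags[case_id].append("no-lift")
        if any(0.0 < rate < 1.0 for rate in (with_skill, without)):
            flags[case_id].append("flaky")
    return dict(flags)


def make_rows(case_id, variant, outcomes):
    return [{"case_id": case_id, "variant": variant, "passed": ok} for ok in outcomes]


rows = (
    make_rows("a", "with_skill", [True, True, True])
    + make_rows("a", "without_skill", [True, True, True])
    + make_rows("b", "with_skill", [True, True, False])
    + make_rows("b", "without_skill", [False, False, False])
    + make_rows("c", "with_skill", [False, False, False])
    + make_rows("c", "without_skill", [False, False, False])
)
result = case_flags(rows)
print(result)
```

In this sketch a saturated case says nothing about the skill, because the baseline already passes, which is why it is reported separately from no-lift.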

Installation

From GitHub

uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.2
skill-benchmark --help
skill-pi-trigger-eval --help

# One-shot without installing globally:
uvx --from git+https://github.com/adewale/skill-eval-harness.git@v0.4.2 skill-benchmark --help

The installed commands are:

Command	What it does
`skill-benchmark`	Validate manifests, prepare tasks, grade outputs, compare variants, run judges, and import/export runner formats.
`skill-pi-trigger-eval`	Run Pi without forced `--skill` and check whether the model loads the skill from stream events.

Local development

git clone https://github.com/adewale/skill-eval-harness.git
cd skill-eval-harness
uv tool install --editable .
skill-benchmark --help

Documentation map

File	Use it for
`README.md`	Manifest shape, run layout, and command contracts.
`CHANGELOG.md`	Release history and unreleased repo-surface changes.
`CONTRIBUTING.md`	Local setup, validation commands, and eval-safety rules.
`LESSONS_LEARNED.md`	Design lessons from the multi-skill saturation work.
`docs/vocabulary.md`	Glossary of harness terms: variants, splits, ablations, assertions, trace artifacts, and report flags.
`docs/evals-are-not-tests.md`	Why a skill eval is not a unit test, and what that changes about reading results.
`docs/jetty-support-spec.md`	Jetty payload/import contract and live-token unknowns.
`docs/trace-aware-eval-spec.md`	Trace artifact contract, shipped v0.4.1 runner support, process/efficiency assertions, and remaining trace work.
`docs/repo-effectiveness-audit.md`	`good-repo` audit, score, package metadata fixes, and manual GitHub settings checklist.
`TODO.md`	Remaining Jetty work: streaming/concurrency, live API validation, materialized ablations, judge export, and per-variant overrides.
`examples/adewale-workspace/`	Adewale-specific runners for Pi smoke, trigger, ablation, and aggregate reports.
`tests/test_skill_benchmark.py`	Executable examples for grading, leakage lint, script assertions, judge commands, Jetty export/import, trace artifacts, and trigger detection.

Manifest format

Each skill repo owns an evals/shared-benchmark.json manifest. Add a harness block so readers know which external harness/version to install.

{
  "version": 1,
  "skill_name": "good-pr",
  "harness": {
    "name": "skill-eval-harness",
    "url": "https://github.com/adewale/skill-eval-harness",
    "version": ">=0.4.1"
  },
  "skill_paths": ["skills/good-pr/SKILL.md"],
  "variants": ["with_skill", "without_skill"],
  "optional_variants": ["old_skill"],
  "split_policy": {
    "tune": "Visible cases used during iteration.",
    "holdout": "Hidden cases scored only at end-of-round or merge.",
    "holdback": "Examples not exposed in skill/docs/eval descriptions until after scoring."
  },
  "cases": [
    {
      "id": "pos-security-meaningless-test",
      "split": "tune",
      "kind": "pr-review",
      "domain": "pull-request-quality",
      "difficulty": "core",
      "trigger_type": "explicit",
      "success_goals": ["outcome", "style"],
      "prompt": "Security fix PR includes `expect(result).toBeDefined()` as the only auth-bypass test...",
      "files": ["fixtures/security-pr/diff.patch"],
      "expected_behavior": ["Flag the weak test and require regression proof."],
      "assertions": [
        {"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]},
        {"name": "qualitative-review", "type": "judge", "rubric": ["Specific", "maintainer-friendly"]}
      ],
      "tags": ["security", "testing"]
    }
  ],
  "ablations": [
    {
      "id": "no-regression-proof",
      "removed_component": "regression-proof requirement",
      "expected_regressions": ["Accepts weak tests that still pass without the fix"]
    }
  ]
}
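
Before running `skill-benchmark validate`, it can help to see what kind of shape checks a manifest like this invites. The sketch below is a hypothetical pre-check, not the harness's validator: the function name `check_manifest` and the specific rules (unique ids, a known split, a prompt or `prompt_ref`, and a name and type on each assertion) are assumptions drawn from the example above.

```python
ALLOWED_SPLITS = {"tune", "holdout", "holdback"}


def check_manifest(manifest):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for index, case in enumerate(manifest.get("cases", [])):
        case_id = case.get("id")
        label = case_id or f"cases[{index}]"
        if not case_id:
            problems.append(f"{label}: missing id")
        elif case_id in seen:
            problems.append(f"{label}: duplicate id")
        seen.add(case_id)
        if case.get("split") not in ALLOWED_SPLITS:
            problems.append(f"{label}: unknown split {case.get('split')!r}")
        if not (case.get("prompt") or case.get("prompt_ref")):
            problems.append(f"{label}: needs prompt or prompt_ref")
        for assertion in case.get("assertions", []):
            if not assertion.get("name") or not assertion.get("type"):
                problems.append(f"{label}: assertion needs name and type")
    return problems


manifest = {
    "version": 1,
    "skill_name": "good-pr",
    "cases": [
        {
            "id": "pos-security-meaningless-test",
            "split": "tune",
            "prompt": "Review this security fix PR.",  # placeholder prompt
            "assertions": [
                {"name": "detect-weak-test", "type": "contains_any",
                 "values": ["weak", "toBeDefined"]}
            ],
        },
        {"id": "bad-case", "split": "test"},  # deliberately invalid
    ],
}

problems = check_manifest(manifest)
for problem in problems:
    print(problem)
```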

Splits

Split	Purpose	Prompt storage
`tune`	Visible cases used while editing the skill and evals.	Inline `prompt` is fine.
`holdout`	Hidden cases scored at end-of-round or merge.	Prefer private `prompt_ref`.
`holdback`	Not shown in skill/docs/evals until after scoring; detects memorization.	Prefer private `prompt_ref` and ignored answer keys.

prepare fails when a hidden prompt is missing unless you pass --allow-missing-prompts, which is meant for dry-run planning.

Use the optional files field for fixture-backed evals. Paths are relative to the manifest's evals/ directory; validate checks them, and prepare emits them to the runner as absolute input_files.
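
The path rule can be sketched in a few lines. This is an illustration under assumptions rather than the harness's code: the helper name `resolve_input_files` is hypothetical, and it treats the directory that contains the manifest as the evals/ directory.

```python
import tempfile
from pathlib import Path


def resolve_input_files(manifest_path, files):
    """Resolve fixture paths against the manifest's directory.

    Returns (resolved, missing) as lists of absolute path strings.
    """
    base = Path(manifest_path).resolve().parent
    resolved, missing = [], []
    for relative in files:
        candidate = (base / relative).resolve()
        (resolved if candidate.is_file() else missing).append(str(candidate))
    return resolved, missing


# Demo with a throwaway evals/ directory.
with tempfile.TemporaryDirectory() as tmp:
    evals = Path(tmp) / "evals"
    fixture = evals / "fixtures" / "security-pr" / "diff.patch"
    fixture.parent.mkdir(parents=True)
    fixture.write_text("--- a/auth.js\n", encoding="utf-8")
    manifest_path = evals / "shared-benchmark.json"
    manifest_path.write_text("{}", encoding="utf-8")
    resolved, missing = resolve_input_files(
        manifest_path, ["fixtures/security-pr/diff.patch", "fixtures/nope.txt"]
    )

print(len(resolved), "resolved,", len(missing), "missing")
```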

Assertions

Objective assertion types:

Type	Checks
`contains`	One substring is present.
`contains_any`	At least one substring is present.
`contains_all`	Every listed substring is present.
`excludes_any`	No listed substring is present.
`regex`	Regex matches output.
`not_regex`	Regex does not match output.
`file_exists`	A file exists relative to the run directory.
`json_field_equals`	A JSON field equals an expected value.
`script`	Opt-in deterministic oracle command against the output directory.
`skill_invoked`	Trace/process check that the runner loaded the skill, or did not, as expected.
`command_ran` / `command_not_ran`	Trace/process checks over normalized command events.
`command_order`	Trace/process check that commands appeared in a required order.
`tool_count_le` / `no_repeated_command_loop`	Trace/process budgets for tool use and thrashing.
`total_tokens_le` / `elapsed_seconds_le` / `command_count_le`	Efficiency checks over `metrics.json`, `metadata.json`, or normalized events.
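
The string and regex rows of this table are simple enough to sketch. The code below is a minimal illustration, not the harness's grader: the `values` key matches the manifest example, but the `value` and `pattern` keys, the case-sensitive matching, and the use of `None` for unhandled types such as `judge` are all assumptions.

```python
import re

# Each check takes (assertion, output_text) and returns True or False.
CHECKS = {
    "contains": lambda a, out: a["value"] in out,  # "value" key is assumed
    "contains_any": lambda a, out: any(v in out for v in a["values"]),
    "contains_all": lambda a, out: all(v in out for v in a["values"]),
    "excludes_any": lambda a, out: not any(v in out for v in a["values"]),
    "regex": lambda a, out: re.search(a["pattern"], out) is not None,  # "pattern" key is assumed
    "not_regex": lambda a, out: re.search(a["pattern"], out) is None,
}


def grade(assertions, output):
    """Map assertion name to True/False, or None when the type is not handled here."""
    results = {}
    for assertion in assertions:
        check = CHECKS.get(assertion["type"])
        results[assertion["name"]] = check(assertion, output) if check else None
    return results


output = "The only test is `expect(result).toBeDefined()`, which is too weak to prove the fix."
assertions = [
    {"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]},
    {"name": "no-approval", "type": "excludes_any", "values": ["LGTM"]},
    {"name": "cites-line", "type": "regex", "pattern": r"line \d+"},
    {"name": "qualitative-review", "type": "judge", "rubric": ["Specific"]},
]

results = grade(assertions, output)
print(results)
```

Checks like these need no model calls, so they give the same answer every time they are rerun on a saved output.md.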

Use script when a keyword check is too weak for the property you care about. The command sees the candidate run directory, so it can inspect output.md, generated files under outputs/, or metadata. Script assertions are blocked unless you pass --allow-scripts to grade, benchmark, aggregate, or export-anthropic:

{
  "name": "oracle-pass",
  "type": "script",
