by mgechev
"Unit tests" for your agent skills
# Add to your Claude Code skills
git clone https://github.com/mgechev/skillgrade

The easiest way to evaluate your Agent Skills. It tests that AI agents correctly discover and use your skills.
See examples/ — superlint (simple) and angular-modern (TypeScript grader).

Prerequisites: Node.js 20+, Docker
npm i -g skillgrade
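Before scaffolding, it can help to confirm the prerequisites listed above, plus the SKILL.md file that the skill directory must contain. The following is a minimal sketch, not part of skillgrade: `node_version_ok` and `preflight` are hypothetical helper names, and the Docker check assumes you intend to use the docker provider.

```shell
# Hypothetical pre-flight helpers; skillgrade itself does not ship these.

# Succeeds when a version string such as "v20.11.0" has a major version >= 20.
node_version_ok() {
  major=${1#v}        # strip the leading "v"
  major=${major%%.*}  # keep only the major component
  [ "$major" -ge 20 ] 2>/dev/null
}

# Checks the skill directory (default: current directory) and local tooling.
preflight() {
  dir=${1:-.}
  [ -f "$dir/SKILL.md" ] || { echo "missing SKILL.md in $dir" >&2; return 1; }
  node_version_ok "$(node --version 2>/dev/null)" || { echo "Node.js 20+ required" >&2; return 1; }
  # Assumption: Docker is only needed when running with the docker provider.
  command -v docker >/dev/null 2>&1 || { echo "docker not found on PATH" >&2; return 1; }
}
```

Running `preflight my-skill/` prints the first failed check to stderr and returns a non-zero status, or returns zero silently when everything is in place.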
1. Initialize — go to your skill directory (must have SKILL.md) and scaffold:
cd my-skill/
GEMINI_API_KEY=your-key skillgrade init # or ANTHROPIC_API_KEY / OPENAI_API_KEY
# Use --force to overwrite an existing eval.yaml
Generates eval.yaml with AI-powered tasks and graders. Without an API key, creates a well-commented template.
2. Edit — customize eval.yaml for your skill (see eval.yaml Reference).
3. Run:
GEMINI_API_KEY=your-key skillgrade --smoke
The agent is auto-detected from your API key: GEMINI_API_KEY → Gemini, ANTHROPIC_API_KEY → Claude, OPENAI_API_KEY → Codex. Override with --agent=claude.
4. Review:
skillgrade preview # CLI report
skillgrade preview browser # web UI → http://localhost:3847
Reports are saved to $TMPDIR/skillgrade/<skill-name>/results/. Override with --output=DIR.
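The agent auto-detection described in step 3 can be pictured roughly as follows. This is only an illustrative sketch: `detect_agent` is a hypothetical function, and the precedence used when several keys are set at once (Gemini, then Claude, then Codex) is an assumption, since the README does not state it.

```shell
# Hypothetical sketch of key-based agent detection; not skillgrade's own code.
# Assumption: GEMINI_API_KEY wins over ANTHROPIC_API_KEY, which wins over OPENAI_API_KEY.
detect_agent() {
  if [ -n "${GEMINI_API_KEY:-}" ]; then
    echo gemini
  elif [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo claude
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo codex
  else
    echo "no API key set" >&2
    return 1
  fi
}
```

If several keys are exported in your shell and you are unsure which agent will be picked, passing `--agent` explicitly removes the ambiguity.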
| Flag | Trials | Use Case |
|------|--------|----------|
| --smoke | 5 | Quick capability check |
| --reliable | 15 | Reliable pass rate estimate |
| --regression | 30 | High-confidence regression detection |
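To see why the larger presets give a more trustworthy pass rate, one rough guide is the binomial standard error, sqrt(p(1-p)/n), which shrinks as the trial count n grows. The sketch below is general statistics rather than anything skillgrade is documented to compute; `pass_rate_stderr` is a hypothetical name, and the 0.8 pass rate is just an example value.

```shell
# Hypothetical helper: standard error of an observed pass rate p over n trials.
pass_rate_stderr() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", sqrt(p * (1 - p) / n) }'
}

# Compare the three preset trial counts at an example pass rate of 0.8.
for n in 5 15 30; do
  echo "trials=$n stderr=$(pass_rate_stderr 0.8 "$n")"
done
```

The uncertainty at 30 trials is well under half of what it is at 5, which is consistent with reserving --regression for decisions that need high confidence.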

| Flag | Description |
|------|-------------|
| --trials=N | Override trial count |
| --parallel=N | Run trials concurrently |
| --agent=gemini\|claude\|codex | Override agent (default: auto-detect from API key) |
| --provider=docker\|local | Override provider |
| --output=DIR | Output directory (default: $TMPDIR/skillgrade) |
| --validate | Verify graders using reference solutions |
| --ci | CI mode: exit non-zero if below threshold |
| --threshold=0.8 | Pass rate threshold for CI mode |
| --preview | Show CLI results after running |
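The --ci and --threshold flags together act as a gate on the pass rate. A minimal sketch of that decision is shown here, assuming the pass rate is simply passed trials divided by total trials and that a rate exactly equal to the threshold passes; `ci_gate` is a hypothetical helper, not skillgrade's implementation.

```shell
# Hypothetical sketch of a CI gate: succeed only when passed/trials >= threshold.
# Usage: ci_gate PASSED TRIALS [THRESHOLD]   (threshold defaults to 0.8)
ci_gate() {
  awk -v p="$1" -v n="$2" -v t="${3:-0.8}" \
    'BEGIN { exit ((n > 0 && p / n >= t) ? 0 : 1) }'
}

# Example: 4 of 5 trials passed, checked against the default threshold.
if ci_gate 4 5; then
  echo "pass rate meets threshold"
else
  echo "pass rate below threshold"
fi
```

A run with zero trials is treated as a failure in this sketch, so an empty result set cannot pass the gate by accident.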

version: "1"

# Optional: explicit path to skill directory (defaults to auto-detecting SKILL.md)
# skill: path/to/my-skill

defaults:
  agent: gemini            # gemini | claude | codex
  provider: docker         # docker | local
  trials: 5
  timeout: 300             # seconds
  threshold: 0.8           # for --ci mode
  grader_model: gemini-3-flash-preview  # default LLM grader model
  docker:
    base: node:20-slim
    setup: |               # extra commands run during image build
      apt-get update && apt-get install -y jq
  environment:             # container resource limits
    cpus: 2
    memory_mb: 2048

tasks:
  - name: fix-linting-errors
    instruction: |
      Use the superlint tool to fix coding standard violations in app.js.
    workspace: ...