From Vibes to Verdicts: A Repeatable Workflow for Testing Agent Skills with Lightweight Evals
What's the shortest path to knowing whether my AI agent skill actually improved, or whether it just started failing quietly?
Run a micro-eval: prompt → capture the trace → score with deterministic checks → lock the behavior in version control.
What This Article Answers
- Why do "vibes" fail when iterating on LLM agent skills?
- How can I turn "it feels faster" into a repeatable lab experiment?
- What exact commands and scripts (all in the source file) glue the pipeline together?
- Where do deterministic checks end and model-graded rubrics begin?
- How do I transplant the same pattern to SQL, docs, or DevOps agents?
Why "Feels Better" Is Not a Release Criterion
Summary: Subjective impressions hide regressions; small, measurable checks surface them before users do.
| Symptom | Real Example (from source) | Cheap Check That Catches It |
|---|---|---|
| Skill never triggers | "set up a quick React demo" ignored | CSV row: `should_trigger=true` |
| Extra files litter repo | `utils/` directory appears | `git status --porcelain` must be empty |
| Build broken | `npm run build` exits 1 | Run the build, assert exit code 0 |
| Token thrashing | 3× `npm install` invocations | Count `command_execution` events |
Author’s reflection: Early on I shipped a skill that “felt” twice as fast because it skipped npm install on warm caches. One week later, fresh clones broke on CI. A single checkRanNpmInstall assertion would have saved a hot-fix at 2 a.m.
The Four-Layer Eval Pyramid
Summary: Start with fast, cheap signals; add slower, deeper ones only when they prevent real pain.
- Layer 4 (slow) → Runtime smoke (curl the dev server)
- Layer 3 (medium) → Build & token budget
- Layer 2 (fast) → Deterministic command & file checks
- Layer 1 (fastest) → CSV trigger tests
Only promote a check one layer up if a production outage justifies the CPU cents.
Step-by-Step: Build the Minimal Eval Pipeline
Summary: From a blank folder to a CI gate in <30 min using only code printed in the source file.
1. Write the Skill Backwards
- Draft the Definition of Done first.
- Paste it into the bottom of `SKILL.md`.
- Example given in source:

```markdown
## Definition of done
- npm run dev starts successfully
- package.json exists
- src/components/Header.tsx and src/components/Card.tsx exist
```
2. Create the Prompts CSV
Source snippet exactly as provided:
```csv
id,should_trigger,prompt
test-01,true,"Create a demo app named `devday-demo` using the $setup-demo-app skill"
test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"
test-03,true,"Create a small demo app to showcase the Responses API"
test-04,false,"Add Tailwind styling to my existing React app"
```
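Once traces exist for every row (one JSONL file per id, captured as in the next step), a Layer-1 trigger check is just a loop comparing `should_trigger` against what actually happened. Here is a minimal sketch, assuming a triggered run leaves the skill name somewhere in the trace; the file names are illustrative, and the detection line should match however your traces actually record skill invocation.

```js
// evals/check-triggers.mjs (hypothetical Layer-1 scorer; file and CSV names are illustrative)
import { readFileSync, existsSync } from "fs";

// Naive CSV parse: fine here because id and should_trigger never contain commas.
const rows = readFileSync("evals/setup-demo-app.prompts.csv", "utf8")
  .trim()
  .split("\n")
  .slice(1) // skip header
  .map(line => {
    const [id, shouldTrigger] = line.split(",");
    return { id, shouldTrigger: shouldTrigger === "true" };
  });

let failures = 0;
for (const { id, shouldTrigger } of rows) {
  const path = `evals/artifacts/${id}.jsonl`;
  if (!existsSync(path)) continue; // no trace captured for this row yet
  // Assumption: a run that triggered the skill mentions its name somewhere in the trace.
  const triggered = readFileSync(path, "utf8").includes("setup-demo-app");
  if (triggered !== shouldTrigger) {
    failures += 1;
    console.error(`${id}: expected should_trigger=${shouldTrigger}, trace says ${triggered}`);
  }
}
process.exit(failures > 0 ? 1 : 0);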
3. Capture a Trace
Command copied from source:
```bash
codex exec --json --full-auto \
  'Use the $setup-demo-app skill to create the project in this directory.' \
  > evals/artifacts/test-01.jsonl
```
Application scenario: You just changed the skill description to be “more general.” Running the above immediately tells you if the agent still picks the skill or if recall dropped.
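The command above captures one trace at a time; to cover the whole CSV, a small driver can loop over the rows and shell out to the same command. A sketch, assuming Node's `execFileSync` and the flags shown above; the script name and CSV path are illustrative, and in practice you would likely give each run a fresh temporary directory.

```js
// evals/capture.mjs (hypothetical batch driver: one trace per CSV row)
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { execFileSync } from "child_process";

mkdirSync("evals/artifacts", { recursive: true });

const rows = readFileSync("evals/setup-demo-app.prompts.csv", "utf8")
  .trim()
  .split("\n")
  .slice(1); // skip header

for (const line of rows) {
  // Naive CSV parse: id, should_trigger, then the (possibly quoted) prompt.
  const [id, , ...rest] = line.split(",");
  const prompt = rest.join(",").replace(/^"|"$/g, "");
  // Run the agent non-interactively and keep the raw JSONL event stream.
  const stdout = execFileSync("codex", ["exec", "--json", "--full-auto", prompt], {
    encoding: "utf8",
  });
  writeFileSync(`evals/artifacts/${id}.jsonl`, stdout);
}
```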
4. Score with Node.js
Source code (abridged but complete):
```js
// evals/run.mjs
import { readFileSync } from "fs";

// Parse the JSONL trace: one event object per line.
const events = readFileSync("evals/artifacts/test-01.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map(line => JSON.parse(line));

// Did the agent run `npm install` at least once during the trace?
function checkRanNpmInstall(ev) {
  return ev.some(e =>
    e.type === "item.completed" &&
    e.item?.type === "command_execution" &&
    e.item.command.includes("npm install")
  );
}

console.log({ ranNpmInstall: checkRanNpmInstall(events) });
```
Run `node evals/run.mjs` → `{ ranNpmInstall: true }`.
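The symptom table earlier maps directly onto a few more functions in the same scorer. A sketch of those checks, assuming the scorer runs from inside the generated project and the trace event shapes match `run.mjs`; the function names are illustrative.

```js
// Illustrative additions to evals/run.mjs, covering the other rows of the symptom table
import { existsSync } from "fs";
import { execSync, spawnSync } from "child_process";

// Extra files litter repo: the working tree should be clean after the run.
function checkCleanWorkingTree() {
  return execSync("git status --porcelain", { encoding: "utf8" }).trim() === "";
}

// Build broken: `npm run build` must exit 0.
function checkBuildPasses() {
  return spawnSync("npm", ["run", "build"], { stdio: "ignore", shell: true }).status === 0;
}

// Token thrashing: count how many times the agent ran `npm install`.
function countNpmInstalls(ev) {
  return ev.filter(e =>
    e.type === "item.completed" &&
    e.item?.type === "command_execution" &&
    e.item.command.includes("npm install")
  ).length;
}

// Definition of Done: the promised files actually exist.
function checkExpectedFiles() {
  return ["package.json", "src/components/Header.tsx", "src/components/Card.tsx"]
    .every(p => existsSync(p));
}

console.log({
  cleanTree: checkCleanWorkingTree(),
  buildPasses: checkBuildPasses(),
  npmInstalls: countNpmInstalls(events), // `events` from run.mjs above
  filesPresent: checkExpectedFiles(),
});
```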
5. Add a Model-Graded Style Check
Source rubric schema (exact JSON):
```json
{
  "type": "object",
  "properties": {
    "overall_pass": { "type": "boolean" },
    "score": { "type": "integer", "minimum": 0, "maximum": 100 },
    "checks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": { "type": "string" },
          "pass": { "type": "boolean" },
          "notes": { "type": "string" }
        },
        "required": ["id", "pass", "notes"]
      }
    }
  },
  "required": ["overall_pass", "score", "checks"]
}
```
Second command (copy-paste from source):
```bash
codex exec \
  "Evaluate the demo-app repository against these requirements: vite, tailwind, structure, style." \
  --output-schema ./evals/style-rubric.schema.json \
  -o ./evals/artifacts/test-01.style.json
```
Operational example: After you tweak the skill to also offer an optional axios install, the rubric returns `"id": "style", "pass": false, "notes": "axios present but not allowed"`. The failure is clear and traceable, and you can decide either to update the rubric or to revert the skill.
Growing the Eval Garden: Real-World Extensions
Summary: The same JSONL + deterministic + rubric pattern ports to any agent task.
| Use-Case | Layer-1 Check | Layer-2 Check | Layer-3 Check |
|---|---|---|---|
| SQL helper | `SELECT` contains `LIMIT` | `EXPLAIN` shows "Index" | Query time <200 ms |
| Doc translator | Output JSON keys match input | Term glossary hit ≥90% | Native speaker review score ≥80 |
| K8s health agent | `kubectl get pods` read-only | No delete commands | `curl` service endpoint returns 200 |
Author's reflection: I once believed evals were a "nice-to-have" until a DELETE with a missing `WHERE` clause wiped an entire prod table. One deterministic check for `WHERE` presence would have cost 5 ms; the rollback cost 5 hours.
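For the SQL helper row above, the Layer-1 checks amount to one regex each. A sketch with hypothetical helper names, not taken from the source:

```js
// Hypothetical Layer-1 guards for a SQL helper skill
function selectHasLimit(sql) {
  // Non-SELECT statements pass vacuously; SELECTs must carry an explicit LIMIT.
  return /\bSELECT\b/i.test(sql) ? /\bLIMIT\s+\d+/i.test(sql) : true;
}

function destructiveWithoutWhere(sql) {
  // Flag DELETE/UPDATE statements that have no WHERE clause at all.
  return /\b(DELETE|UPDATE)\b/i.test(sql) && !/\bWHERE\b/i.test(sql);
}

console.log(selectHasLimit("SELECT * FROM users LIMIT 10"));   // true
console.log(destructiveWithoutWhere("DELETE FROM orders"));    // true -> block the query
```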
Action Checklist / Implementation Steps
- Fork the sample `SKILL.md` and append your Definition of Done.
- Create `evals/skills-name.prompts.csv` with at least one negative case.
- Save the Node.js scorer; confirm it exits 0 on green.
- Add a rubric schema and a second call for style checks.
- Gate merges on `overall_pass == true` and `score ≥ threshold` (see the sketch below).
- Every new failure becomes a new row or a new assertion; no exceptions.
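A minimal sketch of that merge gate, assuming the rubric output from step 5 sits at `evals/artifacts/test-01.style.json` and follows the schema above; the threshold value is illustrative.

```js
// evals/gate.mjs: fail CI unless the style rubric passes and clears a score threshold
import { readFileSync } from "fs";

const THRESHOLD = 80; // illustrative; tune to your tolerance for style drift

const result = JSON.parse(
  readFileSync("evals/artifacts/test-01.style.json", "utf8")
);
const failedChecks = (result.checks ?? []).filter(c => !c.pass);

if (!result.overall_pass || result.score < THRESHOLD) {
  console.error(`Style gate failed (score=${result.score}):`, failedChecks);
  process.exit(1); // non-zero exit blocks the merge in CI
}
console.log(`Style gate passed (score=${result.score})`);
```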
One-Page Overview
- Problem: Subjective vibes miss silent regressions.
- Solution: Lightweight evals → prompt → JSONL trace → deterministic checks → optional model rubric.
- Key tools: `codex exec --json`, CSV test cases, Node.js assertions, `--output-schema` for style.
- Net effect: Each PR carries proof, not promises; failures turn into new test rows; skill quality compounds like unit tests for code.
FAQ
- How many prompts do I need before I see value?
  10–20 diverse rows (explicit, implicit, negative) already catch 80% of regressions.
- Does the eval run impact production latency?
  No. Evals run offline or in CI; the skill itself stays unchanged.
- What if my trace file is gigabytes?
  Stream-filter with `jq` to keep only `item.*` events; the disk footprint drops about 90% (see the sketch after this list).
- Can I use Python instead of Node.js?
  Absolutely. Any language that reads JSONL and asserts will work.
- Is the rubric pass/fail enough for style nuances?
  Start with a boolean; when you need granularity, expand `checks[]` with per-rule scores.
- How do I prevent eval over-fitting?
  Rotate in fresh, real-world prompts weekly; retire checks that no longer correlate with user pain.
- Where do I host the CSV and traces long-term?
  Commit the CSV to the repo; store traces as CI artifacts or in cheap object storage, keeping 30 days unless regulatory needs differ.
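For reference, the stream filter mentioned in the FAQ can be a single jq call, assuming the trace uses the `item.*` event types shown in the scorer above (paths are illustrative):

```bash
# Keep only item.* events from a large trace and write a compact filtered copy
jq -c 'select(.type | startswith("item."))' \
  evals/artifacts/test-01.jsonl > evals/artifacts/test-01.filtered.jsonl
```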
