From Vibes to Verdicts: A Repeatable Workflow for Testing Agent Skills with Lightweight Evals
What's the shortest path to knowing whether my AI agent skill actually improved, or whether it just started failing quietly?
Run a micro-eval: prompt → capture the trace → score with deterministic checks → lock the behavior in version control.
What This Article Answers
- Why do "vibes" fail when iterating on LLM agent skills?
- How can I turn "it feels faster" into a repeatable lab experiment?
- What exact commands and scripts (all in the source file) glue the pipeline together?
- Where do deterministic checks end and model-graded rubrics begin?
- How do I transplant the same pattern to SQL, docs, or DevOps agents?
Why "Feels Better" Is Not a Release Criterion
Summary: Subjective impressions hide regressions; small, measurable checks surface them before users do.
| Symptom | Real Example (from source) | Cheap Check That Catches It |
|---|---|---|
| Skill never triggers | "set up a quick React demo" ignored | CSV row: `should_trigger=true` |
| Extra files litter repo | `utils/` directory appears | `git status --porcelain` must be empty |
| Build broken | `npm run build` exits 1 | Run the build, assert exit code 0 |
| Token thrashing | 3× `npm install` invocations | Count `command_execution` events |
Author’s reflection: Early on I shipped a skill that “felt” twice as fast because it skipped npm install on warm caches. One week later, fresh clones broke on CI. A single checkRanNpmInstall assertion would have saved a hot-fix at 2 a.m.
The Four-Layer Eval Pyramid
Summary: Start with fast, cheap signals; add slower, deeper ones only when they prevent real pain.
- Layer 4 (slow) → Runtime smoke (curl the dev server)
- Layer 3 (medium) → Build & token budget
- Layer 2 (fast) → Deterministic command & file checks
- Layer 1 (fastest) → CSV trigger tests
Only promote a check one layer up if a production outage justifies the CPU cents.
Step-by-Step: Build the Minimal Eval Pipeline
Summary: From a blank folder to a CI gate in <30 min using only code printed in the source file.
1. Write the Skill Backwards
- Draft the Definition of Done first.
- Paste it into the bottom of `SKILL.md`.
- Example given in source:

```markdown
## Definition of done
- npm run dev starts successfully
- package.json exists
- src/components/Header.tsx and src/components/Card.tsx exist
```
2. Create the Prompts CSV
Source snippet exactly as provided:
```csv
id,should_trigger,prompt
test-01,true,"Create a demo app named `devday-demo` using the $setup-demo-app skill"
test-02,true,"Set up a minimal React demo app with Tailwind for quick UI experiments"
test-03,true,"Create a small demo app to showcase the Responses API"
test-04,false,"Add Tailwind styling to my existing React app"
```
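Once traces exist for every row (one JSONL file per id, captured as in the next step), a Layer-1 trigger check is just a loop comparing `should_trigger` against what actually happened. Here is a minimal sketch, assuming a triggered run leaves the skill name somewhere in the trace; the file names are illustrative, and the detection line should match however your traces actually record skill invocation.

```js
// evals/check-triggers.mjs (hypothetical Layer-1 scorer; file and CSV names are illustrative)
import { readFileSync, existsSync } from "fs";

// Naive CSV parse: fine here because id and should_trigger never contain commas.
const rows = readFileSync("evals/setup-demo-app.prompts.csv", "utf8")
  .trim()
  .split("\n")
  .slice(1) // skip header
  .map(line => {
    const [id, shouldTrigger] = line.split(",");
    return { id, shouldTrigger: shouldTrigger === "true" };
  });

let failures = 0;
for (const { id, shouldTrigger } of rows) {
  const path = `evals/artifacts/${id}.jsonl`;
  if (!existsSync(path)) continue; // no trace captured for this row yet
  // Assumption: a run that triggered the skill mentions its name somewhere in the trace.
  const triggered = readFileSync(path, "utf8").includes("setup-demo-app");
  if (triggered !== shouldTrigger) {
    failures += 1;
    console.error(`${id}: expected should_trigger=${shouldTrigger}, trace says ${triggered}`);
  }
}
process.exit(failures > 0 ? 1 : 0);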
3. Capture a Trace
Command copied from source:
```bash
codex exec --json --full-auto \
  'Use the $setup-demo-app skill to create the project in this directory.' \
  > evals/artifacts/test-01.jsonl
```
Application scenario: You just changed the skill description to be “more general.” Running the above immediately tells you if the agent still picks the skill or if recall dropped.
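The command above captures one trace at a time; to cover the whole CSV, a small driver can loop over the rows and shell out to the same command. A sketch, assuming Node's `execFileSync` and the flags shown above; the script name and CSV path are illustrative, and in practice you would likely give each run a fresh temporary directory.

```js
// evals/capture.mjs (hypothetical batch driver: one trace per CSV row)
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { execFileSync } from "child_process";

mkdirSync("evals/artifacts", { recursive: true });

const rows = readFileSync("evals/setup-demo-app.prompts.csv", "utf8")
  .trim()
  .split("\n")
  .slice(1); // skip header

for (const line of rows) {
  // Naive CSV parse: id, should_trigger, then the (possibly quoted) prompt.
  const [id, , ...rest] = line.split(",");
  const prompt = rest.join(",").replace(/^"|"$/g, "");
  // Run the agent non-interactively and keep the raw JSONL event stream.
  const stdout = execFileSync("codex", ["exec", "--json", "--full-auto", prompt], {
    encoding: "utf8",
  });
  writeFileSync(`evals/artifacts/${id}.jsonl`, stdout);
}
```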
4. Score with Node.js
Source code (abridged but complete):
```js
// evals/run.mjs
import { readFileSync } from "fs";

// Parse the JSONL trace: one event object per line.
const events = readFileSync("evals/artifacts/test-01.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map(line => JSON.parse(line));

// Did the agent run `npm install` at least once during the trace?
function checkRanNpmInstall(ev) {
  return ev.some(e =>
    e.type === "item.completed" &&
    e.item?.type === "command_execution" &&
    e.item.command.includes("npm install")
  );
}

console.log({ ranNpmInstall: checkRanNpmInstall(events) });
```
Run `node evals/run.mjs` → `{ ranNpmInstall: true }`.
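The symptom table earlier maps directly onto a few more functions in the same scorer. A sketch of those checks, assuming the scorer runs from inside the generated project and the trace event shapes match `run.mjs`; the function names are illustrative.

```js
// Illustrative additions to evals/run.mjs, covering the other rows of the symptom table
import { existsSync } from "fs";
import { execSync, spawnSync } from "child_process";

// Extra files litter repo: the working tree should be clean after the run.
function checkCleanWorkingTree() {
  return execSync("git status --porcelain", { encoding: "utf8" }).trim() === "";
}

// Build broken: `npm run build` must exit 0.
function checkBuildPasses() {
  return spawnSync("npm", ["run", "build"], { stdio: "ignore", shell: true }).status === 0;
}

// Token thrashing: count how many times the agent ran `npm install`.
function countNpmInstalls(ev) {
  return ev.filter(e =>
    e.type === "item.completed" &&
    e.item?.type === "command_execution" &&
    e.item.command.includes("npm install")
  ).length;
}

// Definition of Done: the promised files actually exist.
function checkExpectedFiles() {
  return ["package.json", "src/components/Header.tsx", "src/components/Card.tsx"]
    .every(p => existsSync(p));
}

console.log({
  cleanTree: checkCleanWorkingTree(),
  buildPasses: checkBuildPasses(),
  npmInstalls: countNpmInstalls(events), // `events` from run.mjs above
  filesPresent: checkExpectedFiles(),
});
```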
5. Add a Model-Graded Style Check
Source rubric schema (exact JSON):
```json
{
  "type": "object",
  "properties": {
    "overall_pass": { "type": "boolean" },
    "score": { "type": "integer", "minimum": 0, "maximum": 100 },
    "checks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": { "type": "string" },
          "pass": { "type": "boolean" },
          "notes": { "type": "string" }
        },
        "required": ["id", "pass", "notes"]
      }
    }
  },
  "required": ["overall_pass", "score", "checks"]
}
```
Second command (copy-paste from source):
```bash
codex exec \
  "Evaluate the demo-app repository against these requirements: vite, tailwind, structure, style." \
  --output-schema ./evals/style-rubric.schema.json \
  -o ./evals/artifacts/test-01.style.json
```
Operational example: After you tweak the skill to also offer an optional axios install, the rubric returns `"id": "style", "pass": false, "notes": "axios present but not allowed"`. The failure is clear and traceable, and you can decide either to update the rubric or to revert the skill.
Growing the Eval Garden: Real-World Extensions
Summary: The same JSONL + deterministic + rubric pattern ports to any agent task.
| Use-Case | Layer-1 Check | Layer-2 Check | Layer-3 Check |
|---|---|---|---|
| SQL helper | `SELECT` contains `LIMIT` | `EXPLAIN` shows "Index" | Query time <200 ms |
| Doc translator | Output JSON keys match input | Term glossary hit ≥90% | Native speaker review score ≥80 |
| K8s health agent | `kubectl get pods` read-only | No delete commands | `curl` service endpoint returns 200 |
Author's reflection: I once believed evals were a "nice-to-have" until a DELETE with a missing `WHERE` clause wiped an entire prod table. One deterministic check for `WHERE` presence would have cost 5 ms; the rollback cost 5 hours.
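For the SQL helper row above, the Layer-1 checks amount to one regex each. A sketch with hypothetical helper names, not taken from the source:

```js
// Hypothetical Layer-1 guards for a SQL helper skill
function selectHasLimit(sql) {
  // Non-SELECT statements pass vacuously; SELECTs must carry an explicit LIMIT.
  return /\bSELECT\b/i.test(sql) ? /\bLIMIT\s+\d+/i.test(sql) : true;
}

function destructiveWithoutWhere(sql) {
  // Flag DELETE/UPDATE statements that have no WHERE clause at all.
  return /\b(DELETE|UPDATE)\b/i.test(sql) && !/\bWHERE\b/i.test(sql);
}

console.log(selectHasLimit("SELECT * FROM users LIMIT 10"));   // true
console.log(destructiveWithoutWhere("DELETE FROM orders"));    // true -> block the query
```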
Action Checklist / Implementation Steps
- Fork the sample `SKILL.md` and append your Definition of Done.
- Create `evals/skills-name.prompts.csv` with at least one negative case.
- Save the Node.js scorer; confirm it exits 0 on green.
- Add a rubric schema and a second call for style checks.
- Gate merges on `overall_pass == true` and `score ≥ threshold` (see the sketch below).
- Every new failure becomes a new row or a new assertion; no exceptions.
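A minimal sketch of that merge gate, assuming the rubric output from step 5 sits at `evals/artifacts/test-01.style.json` and follows the schema above; the threshold value is illustrative.

```js
// evals/gate.mjs: fail CI unless the style rubric passes and clears a score threshold
import { readFileSync } from "fs";

const THRESHOLD = 80; // illustrative; tune to your tolerance for style drift

const result = JSON.parse(
  readFileSync("evals/artifacts/test-01.style.json", "utf8")
);
const failedChecks = (result.checks ?? []).filter(c => !c.pass);

if (!result.overall_pass || result.score < THRESHOLD) {
  console.error(`Style gate failed (score=${result.score}):`, failedChecks);
  process.exit(1); // non-zero exit blocks the merge in CI
}
console.log(`Style gate passed (score=${result.score})`);
```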
One-Page Overview
- Problem: Subjective vibes miss silent regressions.
- Solution: Lightweight evals → prompt → JSONL trace → deterministic checks → optional model rubric.
- Key tools: `codex exec --json`, CSV test cases, Node.js assertions, `--output-schema` for style.
- Net effect: Each PR carries proof, not promises; failures turn into new test rows; skill quality compounds like unit tests for code.
FAQ
- How many prompts do I need before I see value?
  10–20 diverse rows (explicit, implicit, negative) already catch 80% of regressions.
- Does the eval run impact production latency?
  No. Evals run offline or in CI; the skill itself stays unchanged.
- What if my trace file is gigabytes?
  Stream-filter with `jq` to keep only `item.*` events; the disk footprint drops about 90% (see the sketch after this list).
- Can I use Python instead of Node.js?
  Absolutely. Any language that reads JSONL and asserts will work.
- Is the rubric pass/fail enough for style nuances?
  Start with a boolean; when you need granularity, expand `checks[]` with per-rule scores.
- How do I prevent eval over-fitting?
  Rotate in fresh, real-world prompts weekly; retire checks that no longer correlate with user pain.
- Where do I host the CSV and traces long-term?
  Commit the CSV to the repo; store traces as CI artifacts or in cheap object storage, keeping 30 days unless regulatory needs differ.
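For reference, the stream filter mentioned in the FAQ can be a single jq call, assuming the trace uses the `item.*` event types shown in the scorer above (paths are illustrative):

```bash
# Keep only item.* events from a large trace and write a compact filtered copy
jq -c 'select(.type | startswith("item."))' \
  evals/artifacts/test-01.jsonl > evals/artifacts/test-01.filtered.jsonl
```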
