It’s 2 a.m. Slack is screaming. Your customer-support agent just gave a 15-year-old a vape-discount code, the legal team is drafting headlines, and your unit tests are still green. Sound familiar? Traditional QA wasn’t built for conversational, policy-bound, stochastically creative creatures. That’s exactly why Qualifire open-sourced Rogue, an A2A-native red-team that turns written policies into CI/CD gates. Below is the full field manual: install it, abuse it, ship with confidence.
1. The Gap No One Talks About
| What classic tests check | What agents actually break |
|---|---|
| Single-turn intent accuracy | Multi-turn memory loss |
| Static prompt answers | Policy circumvention |
| Scalar “LLM-as-Judge” score | Audit-trail vacuum |
Agents drift with context. Give them enough turns and they’ll eventually break a policy, sometimes with legal consequences. Rogue closes that hole by automating adversarial, multi-agent conversations and emitting machine-readable evidence you can block releases on.
2. Rogue in One Breath
- Red-team: spins up an `EvaluatorAgent` that chats to your agent over Google’s A2A protocol
- Compliance officer: converts PDF policies into executable assertions
- Court stenographer: streams transcripts, verdicts, token costs and model lineage into a single Markdown report your auditors will actually read
3. Zero-to-“WTF” in 5 Minutes
3.1 One-liner install (uvx)
```shell
# Never installed uv? It's a 10-second curl.
curl -LsSf https://astral.sh/uv/install.sh | sh
uvx rogue-ai --example=tshirt_store
```
Prefer pip? We’ve got your back:
```shell
git clone https://github.com/qualifire-dev/rogue.git
# Quote the extras spec so shells like zsh don't treat the brackets as a glob.
cd rogue && pip install -e ".[examples]"
```
3.2 Pick your interface
```shell
uvx rogue-ai       # Server + beautiful TUI (Bubble Tea)
uvx rogue-ai ui    # Gradio web dashboard
uvx rogue-ai cli   # Headless, for CI
```
The TUI pops open automatically. Left pane: live chat. Right pane: policy score turning red the second your agent slips.
Fig. Real-time adversarial chat with policy verdicts
3.3 Read the receipts
`.rogue/report.md` contains:
- per-policy pass/fail with transcript spans
- token usage & latency per turn
- exact model weights → reproducible forensics
4. Shipping to Prod? Wire the CLI into Your Pipeline
Example GitHub Actions gate:
```yaml
- name: Start Rogue server
  run: uvx rogue-ai server &
- name: Run evaluation
  run: |
    uvx rogue-ai cli \
      --evaluated-agent-url https://staging.agent.internal \
      --evaluated-agent-auth-type bearer_token \
      --evaluated-agent-credentials ${{ secrets.AGENT_TOKEN }} \
      --business-context-file policies.md \
      --judge-llm openai/gpt-4o-mini \
      --output-report-file report.md
- name: Gate release
  run: grep -q 'Overall: PASS' report.md
```
Fail score → blocked merge. No human in the loop, no 2 a.m. pages.
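If you would rather gate from a script than from `grep`, the same check can be sketched in Python. This is a minimal sketch, assuming only what the grep step above assumes: that a passing report contains the string `Overall: PASS`. The function names are made up for illustration and are not part of Rogue.

```python
from pathlib import Path

# Assumption: a passing report contains this exact string,
# the same one the grep step in the workflow searches for.
PASS_MARKER = "Overall: PASS"


def report_passed(report_text: str) -> bool:
    """Mirror `grep -q 'Overall: PASS'`: True if the marker appears anywhere."""
    return PASS_MARKER in report_text


def gate_exit_code(path: str = "report.md") -> int:
    """Return 0 to allow the release, 1 to block it.

    A missing report counts as a failure, so a crashed evaluation
    cannot slip through the gate.
    """
    report = Path(path)
    if not report.is_file():
        return 1
    return 0 if report_passed(report.read_text(encoding="utf-8")) else 1
```

In a pipeline step you would call `sys.exit(gate_exit_code("report.md"))` so that a non-zero exit code fails the job.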
5. Pro Tips to Make Your Agent Sweat
- Triple-layer prompts: Context → Emotion → Specific detail. Forces the agent across policy boundaries. Example chain: refund request → furious single parent → photoshopped receipt.
- Policy-as-Code: Convert “no alcohol to minors” into an assertion, `age < 18 and item.alcohol → FAIL`. Rogue auto-checks every turn.
- Judge jury: Run GPT-4o, Claude-3, Gemini in parallel; majority vote reduces single-model bias.
- Leave breadcrumbs: Every transcript is timestamped with the model commit hash. When regulators knock, hand them the Markdown; no log spelunking.
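To make the Policy-as-Code and Judge jury tips concrete, here is a rough Python sketch of both ideas. It is not Rogue's internal implementation: the `Turn` fields, the function names and the verdict strings are all hypothetical, chosen only to illustrate the `age < 18 and item.alcohol → FAIL` rule and a strict majority vote.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Turn:
    """One evaluated turn. Both fields are hypothetical, for illustration only."""
    customer_age: int
    item_is_alcohol: bool


def no_alcohol_to_minors(turn: Turn) -> str:
    """Encode the policy as code: age < 18 and item.alcohol -> FAIL."""
    if turn.customer_age < 18 and turn.item_is_alcohol:
        return "FAIL"
    return "PASS"


def majority_verdict(verdicts: list[str]) -> str:
    """PASS only when a strict majority of judges say PASS; otherwise FAIL."""
    counts = Counter(verdicts)
    return "PASS" if counts["PASS"] * 2 > len(verdicts) else "FAIL"


# Check the policy on every turn of a conversation.
turns = [Turn(customer_age=34, item_is_alcohol=True),
         Turn(customer_age=15, item_is_alcohol=True)]
per_turn = [no_alcohol_to_minors(t) for t in turns]
print(per_turn)  # ['PASS', 'FAIL']

# Combine three independent judge verdicts for one scenario.
print(majority_verdict(["PASS", "FAIL", "PASS"]))  # PASS
```

Treating ties and empty juries as FAIL is a deliberately conservative choice here: a release gate should err toward blocking.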
6. SEO-Friendly FAQ (AnswerThePublic style)
Q1: My agent doesn’t speak A2A. Can I still use Rogue?
A: Yes. Expose any HTTP `/send_message` endpoint; Rogue wraps it into A2A calls under the hood.
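For a sense of what such an endpoint could look like, here is a standard-library-only Python sketch. The request and response shapes (`{"message": ...}` in, `{"reply": ...}` out) are assumptions made for illustration; the article does not specify the schema Rogue expects, so check the repo before relying on it. `my_agent_reply` is a placeholder for your real agent.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def my_agent_reply(message: str) -> str:
    """Placeholder for the call into your real agent."""
    return f"You said: {message}"


def handle_send_message(raw_body: bytes) -> tuple[int, dict]:
    """Turn a raw request body into (HTTP status, JSON-serializable reply).

    Assumed (hypothetical) schema: {"message": "..."} in, {"reply": "..."} out.
    """
    try:
        payload = json.loads(raw_body or b"{}")
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400, {"error": "body must be valid JSON"}
    message = payload.get("message") if isinstance(payload, dict) else None
    if not isinstance(message, str):
        return 400, {"error": "expected a string field 'message'"}
    return 200, {"reply": my_agent_reply(message)}


class SendMessageHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/send_message":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_send_message(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Serve on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), SendMessageHandler).serve_forever()
```

Keeping the request logic in `handle_send_message` separates it from the HTTP plumbing, so it can be unit-tested without opening a socket; call `serve()` to actually listen.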
Q2: Will sensitive data leak during tests?
A: Rogue defaults to synthetic PII. Flip `--synthetic-pii` and zero real data leaves your VPC.
Q3: Cost of judge LLMs?
A: ~100 scenarios × 10 turns each ≈ 80¢ on GPT-4o-mini. Cheaper than a single support ticket nightmare.
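The arithmetic behind that figure, and a rough way to scale it to your own suite, can be sketched as follows. The only inputs are the numbers quoted above; the linear extrapolation is an assumption, since real judge costs depend on transcript length and current model pricing.

```python
SCENARIOS = 100
TURNS_PER_SCENARIO = 10
QUOTED_TOTAL_USD = 0.80  # the rough GPT-4o-mini figure quoted above

total_turns = SCENARIOS * TURNS_PER_SCENARIO
cost_per_turn = QUOTED_TOTAL_USD / total_turns


def estimate_cost(scenarios: int, turns_per_scenario: int) -> float:
    """Linear extrapolation from the quoted per-turn cost (an assumption)."""
    return scenarios * turns_per_scenario * cost_per_turn


print(total_turns)                      # 1000
print(f"{cost_per_turn:.4f}")           # 0.0008
print(f"{estimate_cost(500, 20):.2f}")  # 8.00
```

Treat the result as an order-of-magnitude estimate: longer conversations carry more context per judge call, so cost per turn may grow with conversation length.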
Q4: Non-English agents supported?
A: Absolutely. Write your business context in Chinese, Japanese, Klingon—EvaluatorAgent follows suit.
7. Key Takeaway
Unit tests guard functions; Rogue guards behaviour. Ship agents that can survive 2 a.m. adversaries, compliance audits and your own stress levels—without hiring an overnight red-team.
Install once, sleep forever.
uvx rogue-ai --help