It’s 2 a.m. Slack is screaming. Your customer-support agent just gave a 15-year-old a vape-discount code, the legal team is drafting headlines, and your unit tests are still green. Sound familiar? Traditional QA wasn’t built for conversational, policy-bound, stochastically creative creatures. That’s exactly why Qualifire open-sourced Rogue, an A2A-native red-team tool that turns written policies into CI/CD gates. Below is the full field manual: install it, abuse it, ship with confidence.


1. The Gap No One Talks About

What classic tests check      | What agents actually break
------------------------------+----------------------------
Single-turn intent accuracy   | Multi-turn memory loss
Static prompt answers         | Policy circumvention
Scalar “LLM-as-Judge” score   | Audit-trail vacuum

Agents drift with context. Give them enough turns and they’ll misbehave in ways that carry legal risk. Rogue closes that gap by automating adversarial, multi-agent conversations and emitting machine-readable evidence you can block releases on.


2. Rogue in One Breath

  • Red-team: spins up an EvaluatorAgent that chats to your agent over Google’s A2A protocol
  • Compliance officer: converts PDF policies into executable assertions
  • Court stenographer: streams transcripts, verdicts, token costs and model lineage into a single Markdown report your auditors will actually read

3. Zero-to-“WTF” in 5 Minutes

3.1 One-liner install (uvx)

# never installed uv? 10 s curl
curl -LsSf https://astral.sh/uv/install.sh | sh
uvx rogue-ai --example=tshirt_store

Prefer pip? We’ve got your back:

git clone https://github.com/qualifire-dev/rogue.git
cd rogue && pip install -e ".[examples]"   # quotes keep zsh from globbing the brackets

3.2 Pick your interface

uvx rogue-ai          # Server + beautiful TUI (Bubble Tea)
uvx rogue-ai ui       # Gradio web dashboard
uvx rogue-ai cli      # headless for CI

The TUI pops open automatically. Left pane: live chat. Right pane: policy score turning red the second your agent slips.

[Figure: Rogue TUI showing real-time adversarial chat with policy verdicts]

3.3 Read the receipts

.rogue/report.md contains:

  • per-policy pass/fail with transcript spans
  • token usage & latency per turn
  • exact model identifiers and versions, for reproducible forensics

4. Shipping to Prod? Wire the CLI into Your Pipeline

Example GitHub Actions gate:

- name: Start Rogue server
  run: uvx rogue-ai server &

- name: Run evaluation
  run: |
    uvx rogue-ai cli \
      --evaluated-agent-url https://staging.agent.internal \
      --evaluated-agent-auth-type bearer_token \
      --evaluated-agent-credentials ${{ secrets.AGENT_TOKEN }} \
      --business-context-file policies.md \
      --judge-llm openai/gpt-4o-mini \
      --output-report-file report.md

- name: Gate release
  run: grep -q 'Overall: PASS' report.md

Fail score → blocked merge. No human in the loop, no 2 a.m. pages.
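
If a bare grep feels too thin, the same gate can be sketched in Python. This is a minimal sketch and not part of Rogue: the helper names `report_passed` and `gate_exit_code` are hypothetical, and the only thing they rely on is the 'Overall: PASS' marker that the grep step above already assumes. The one extra behaviour is that a missing report gets its own exit code, so a crashed evaluation is not mistaken for a policy failure.

```python
from pathlib import Path

PASS_MARKER = "Overall: PASS"  # same string the grep step looks for


def report_passed(report_text: str, marker: str = PASS_MARKER) -> bool:
    """Return True if any line of the report contains the pass marker."""
    return any(marker in line for line in report_text.splitlines())


def gate_exit_code(report_path: str = "report.md") -> int:
    """Hypothetical helper: 0 = pass, 1 = policy failure, 2 = no report found."""
    report = Path(report_path)
    if not report.is_file():
        return 2  # the evaluation never produced a report
    return 0 if report_passed(report.read_text(encoding="utf-8")) else 1
```

Ending the script with `sys.exit(gate_exit_code("report.md"))` (after `import sys`) would turn it into a pipeline step that fails the job on any non-zero code.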


5. Pro Tips to Make Your Agent Sweat

  1. Triple-layer prompts
    Context → Emotion → Specific detail. Forces the agent across policy boundaries.
    Example chain: refund request → furious single parent → photoshopped receipt.

  2. Policy-as-Code
    Convert “no alcohol to minors” into an assertion such as: age < 18 and item.alcohol → FAIL. Rogue auto-checks every turn.

  3. Judge jury
    Run GPT-4o, Claude-3, Gemini in parallel; majority vote reduces single-model bias.

  4. Leave breadcrumbs
    Every transcript is timestamped with model commit hash. When regulators knock, hand them the Markdown—no log spelunking.
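
Tips 2 and 3 are easier to reason about with something concrete in hand. The sketch below is illustrative only and does not use Rogue's API: the `Turn` dataclass, its fields, and both function names are assumptions made up for this example. It shows the “no alcohol to minors” rule as a plain predicate, plus a strict-majority vote over judge verdicts in which ties and empty panels resolve to FAIL (a deliberately conservative choice, not something the tool is known to do).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical per-turn facts extracted from a transcript."""
    customer_age: int
    item_is_alcohol: bool


def no_alcohol_to_minors(turn: Turn) -> str:
    """Policy-as-code: FAIL when a minor is offered an alcoholic item."""
    return "FAIL" if turn.customer_age < 18 and turn.item_is_alcohol else "PASS"


def majority_verdict(verdicts: list[str]) -> str:
    """Return the verdict held by a strict majority of judges.

    Ties, split panels, and empty panels fall back to FAIL.
    """
    if not verdicts:
        return "FAIL"
    verdict, votes = Counter(verdicts).most_common(1)[0]
    return verdict if votes > len(verdicts) / 2 else "FAIL"


# Three judges, one dissent: the majority carries.
print(majority_verdict(["PASS", "FAIL", "PASS"]))  # PASS
print(no_alcohol_to_minors(Turn(customer_age=15, item_is_alcohol=True)))  # FAIL
```

Falling back to FAIL on a tie means an even-sized panel can block a release that a single judge would have passed; an odd number of judges avoids the ambiguity.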


6. FAQ

Q1: My agent doesn’t speak A2A. Can I still use Rogue?
A: Yes. Expose any HTTP /send_message endpoint; Rogue wraps it into A2A calls under the hood.

Q2: Will sensitive data leak during tests?
A: Rogue defaults to synthetic PII. Flip --synthetic-pii and zero real data leaves your VPC.

Q3: Cost of judge LLMs?
A: ~100 scenarios × 10 turns each ≈ 80¢ on GPT-4o-mini. Cheaper than a single support ticket nightmare.

Q4: Non-English agents supported?
A: Absolutely. Write your business context in Chinese, Japanese, Klingon—EvaluatorAgent follows suit.
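
For Q3, a back-of-the-envelope calculator makes the cost claim easy to redo with your own numbers. Everything numeric below is an assumption rather than a measurement: the per-turn token counts are invented to show one combination that lands near the quoted figure, and the per-million-token prices should be replaced with your provider's current price list. The function name is hypothetical too.

```python
def judge_cost_usd(
    scenarios: int,
    turns_per_scenario: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Flat per-turn estimate of judge spend; ignores retries and caching."""
    judged_turns = scenarios * turns_per_scenario
    input_cost = judged_turns * input_tokens_per_turn / 1_000_000 * usd_per_million_input
    output_cost = judged_turns * output_tokens_per_turn / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Assumed inputs: 4,000 prompt tokens and 300 completion tokens per judged turn,
# priced at $0.15 / $0.60 per million tokens (assumed; check current pricing).
estimate = judge_cost_usd(100, 10, 4_000, 300, 0.15, 0.60)
print(f"${estimate:.2f}")  # $0.78
```

Because the judge may re-read a growing transcript on every turn, real input-token counts can climb with conversation length, so treat the flat per-turn figure as a floor.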


7. Key Takeaway

Unit tests guard functions; Rogue guards behaviour. Ship agents that can survive 2 a.m. adversaries, compliance audits and your own stress levels—without hiring an overnight red-team.

Install once, sleep forever.

uvx rogue-ai --help
