CircleGuardBench: Pioneering Benchmark for Evaluating LLM Guard System Capabilities
In the era of rapid AI development, large language models (LLMs) have become integral to numerous aspects of our lives, from intelligent assistants to content creation. However, with their widespread application comes a pressing concern about their safety and security. How can we ensure that these models do not generate harmful content and are not misused? Enter CircleGuardBench, a groundbreaking tool designed to evaluate the capabilities of LLM guard systems.
The Birth of CircleGuardBench
CircleGuardBench represents the first benchmark for assessing the protection capabilities of LLM guard systems. Traditional evaluations have often focused solely on model accuracy, neglecting other critical factors such as speed and attack robustness that are essential in real – world production environments. CircleGuardBench bridges this gap by integrating accuracy, attack robustness, and latency into a single, practical evaluation framework. This allows safety teams to select guard models that truly deliver in production settings.
What CircleGuardBench Measures
-
Harmful Content Detection: CircleGuardBench can detect harmful content across 17 critical risk categories. These categories encompass a wide range of potential threats, from violence and terrorism to self – harm and suicide, covering almost all content types that could pose risks to individuals, society, and national security. -
Jailbreak Resistance: By employing adversarial prompt variations, the tool tests the ability of guard models to resist jailbreak attacks. Jailbreak attacks refer to attempts by attackers to craft specific prompts to bypass the model’s safety filters and generate non – compliant responses. -
False Positive Rate on Safe Inputs: While blocking harmful content is crucial, minimizing false positives on safe, neutral inputs is equally important. CircleGuardBench calculates the false positive rate of models when processing safe inputs, helping to optimize their accuracy in judgment. -
Runtime Performance: In practical applications, models need to respond in real – time environments. CircleGuardBench evaluates the runtime performance of models under realistic constraints, including their processing speed and resource usage. -
Composite Score: Combining accuracy and speed, CircleGuardBench generates a composite score that reflects a model’s readiness for real – world scenarios. This score provides a comprehensive reference for users to compare different models.
Key Features of CircleGuardBench
-
Standardized Evaluation of Multiple LLMs and Guard Models: CircleGuardBench enables the standardized assessment of various LLMs and guard models. This allows for fair and accurate comparisons among different models, be they of varying architectures or developed by different teams. -
Support for Major Inference Engines: The tool supports several major inference engines, including openai_api, vllm, sglang, and transformers. This extensive compatibility means users can easily integrate CircleGuardBench into their existing workflows based on their specific needs and resources. -
Custom Taxonomy Aligned with Real – World Abuse Cases: Its custom taxonomy is closely aligned with real – world abuse cases and moderation APIs from major platforms like OpenAI and Google. This ensures that the evaluation results are highly relevant to practical applications. -
Composite Scoring System: The composite scoring system takes into account both the safety of model outputs and response times. This unique approach recognizes that an effective guard model must not only accurately identify and block harmful content but also respond promptly to meet real – time interaction demands. -
Leaderboard Generation: CircleGuardBench can generate leaderboards that display evaluation results in a clear and intuitive manner. The leaderboards include macro – averaged metrics across all metric types, as well as detailed tables for each metric type. Users can also choose to group results by harm categories or disable category grouping, and sort by various metrics such as F1 score, recall, and precision.
Getting Started with CircleGuardBench
-
Installation: Installation of CircleGuardBench is straightforward. Users can choose to install it using Poetry or pip. For a basic installation with Poetry, simply run a few commands in the command line. If additional inference engines are required, they can be specified using the “–extras” parameter. Similarly, when using pip for installation, the necessary extra dependencies can be included by specifying them in the installation command. -
Quick Start Commands: Once installed, users can quickly begin evaluating models using a series of simple commands. For example, “guardbench run [MODEL_NAME]” evaluates a specific model, “guardbench run –all” evaluates all configured models, “guardbench leaderboard” displays the evaluation results in a leaderboard format, “guardbench models” lists all configured models with detailed information, and “guardbench dataset_info” shows information about the loaded dataset. Additionally, users can use “guardbench prompts” to view available prompt templates.
Flexible Configuration to Meet Personalized Needs
-
Model Configuration: Model configuration in CircleGuardBench is managed through the “configs/models.json” file. Users can define various model parameters in this file according to their specific requirements, such as model name, type, evaluation scope, inference engine, and related parameters. -
Prompt Templates: Prompt templates are stored in the “prompts” directory in Jinja2 format. These templates provide examples of how to incorporate reasoning into prompts. Users can customize prompt templates to design prompts that better suit their particular assessment goals and strategies.
Standardized Dataset Format for Reliable Evaluation Results
CircleGuardBench employs a standardized dataset format to ensure the reliability and consistency of evaluation results. The dataset includes several key columns, each with a specific purpose:
-
prompt: The original input prompt. -
prompt_verdict: The safety verdict for the original prompt, either “safe” or “unsafe”. -
prompt_hash: A unique identifier hash for the prompt. -
default_answer: The model’s response to the original prompt. -
default_answer_verdict: The safety verdict for the model’s default response. -
jailbreaked_prompt: A modified version of the prompt designed to bypass safety filters. This column is only required if the prompt_verdict is “unsafe”. -
jailbreaked_answer: The model’s response to the jailbreaked prompt. This column is also only required if the prompt_verdict is “unsafe”. -
harm_category: The category of potential harm associated with the prompt. This column is only required if the prompt_verdict is “unsafe”.
The Power of CircleGuardBench’s Command – Line Interface
CircleGuardBench’s command – line interface offers extensive functionality for managing the evaluation process. In addition to the basic evaluation commands, users can utilize advanced options to perform in – depth analysis and customized presentation of evaluation results:
-
Sorting Leaderboard Results: Users can sort leaderboard results by different metrics using the “–sort – by” parameter. For example, they can sort by accuracy, recall, or average runtime. -
Filtering Metrics: By using the “–metric – type” parameter, users can filter results to display only specific types of metrics, such as those for default prompts or jailbreak attempts. -
Grouping by Categories: The “–use – categories” and “–no – categories” parameters allow users to choose whether to group results by harm categories, providing different perspectives on the evaluation outcomes.
In – Depth Look at the CircleGuardBench Leaderboard
The CircleGuardBench leaderboard provides a detailed overview of various models’ performance. It includes:
-
A summary table with macro – averaged metrics across all metric types. -
Detailed tables for each metric type, allowing users to examine model performance in specific scenarios. -
The option to group results by harm categories, enabling users to understand how different models perform in addressing various types of harmful content. -
Sorting functionality by various metrics, such as F1 score, recall, and precision, to help users quickly identify models that excel in particular performance aspects.
For example, a leaderboard without categories might display the overall performance of models across different metrics in a clear and intuitive manner.
The 17 Harm Categories Covered by CircleGuardBench
CircleGuardBench covers 17 categories of harmful content:
-
Violence & Terrorism -
Deception & Misinformation -
Cybercrime & Hacking -
Drugs & Substances -
Animal Abuse -
Financial Fraud -
Hate & Extremism -
Corruption & Loopholes -
Illicit Creative Content -
Academic Cheating -
Environmental Harm -
Weapons & Explosives -
Child Abuse -
Sexual Violence -
Human Trafficking -
AI Jailbreaking -
Self – Harm & Suicide
Each category has a specific definition. For instance, the “Violence & Terrorism” category includes instructions for violent crimes, sabotage, or evading law enforcement. The “Deception & Misinformation” category covers manipulation, fake news, data falsification, or emotional influence. The “AI Jailbreaking” category focuses on prompts attempting to bypass model safeguards or filter evasion.
About the Developers
CircleGuardBench is developed by White Circle, a company dedicated to AI safety and responsible AI deployment. White Circle creates tools, benchmarks, and evaluation methods to assist developers and researchers in developing safer and more reliable LLMs.
Contributions to the project are welcome. Users are encouraged to submit pull requests to contribute to the improvement of CircleGuardBench.
Conclusion
CircleGuardBench has ushered in a new era for evaluating LLM guard system capabilities. In an age where AI technology is advancing rapidly and security risks are becoming increasingly complex, it provides a comprehensive and practical tool for assessing the protection capabilities of LLMs. Its precise measurement indicators, powerful features, and flexible configuration options help researchers, developers, and businesses collaboratively build a safer and more reliable AI application environment. As AI continues to develop, CircleGuardBench is expected to play an even more significant role in ensuring AI safety and facilitating its healthy development to better serve the progress of human society.
– END –