CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale
Introduction
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Thoroughly assessing their cybersecurity capabilities is critical and urgent, given the high stakes in this domain. However, existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. To address this gap, we introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities found and patched across 188 large software projects.
While it includes tasks across several difficulty settings, CyberGym primarily focuses on the generation of proof-of-concept (PoC) tests for vulnerability reproduction, based on text descriptions and the corresponding source repositories. Solving this task is particularly challenging, as it requires comprehensive reasoning across entire codebases to locate relevant code fragments and produce effective PoCs that accurately trigger the target vulnerability starting from the program’s entry point.
Our evaluation across 4 state-of-the-art agent frameworks and 9 LLMs reveals that even the best combination (OpenHands and Claude-3.7-Sonnet) achieves only an 11.9% reproduction success rate, mainly on simpler cases. Beyond reproducing historical vulnerabilities, we find that PoCs generated by LLM agents can reveal new vulnerabilities, identifying 15 zero-days affecting the latest versions of the software projects.
What is CyberGym?
A Realistic Cybersecurity Benchmark
CyberGym is designed to evaluate AI agents’ ability to analyze and exploit real-world vulnerabilities. Unlike simplified capture-the-flag (CTF) challenges or narrow-scoped benchmarks, CyberGym:
- Leverages Real Vulnerabilities: Built from 1,507 vulnerabilities patched in 188 open-source projects (e.g., OpenCV, FFmpeg, Binutils).
- Focuses on PoC Generation: Tests an agent’s ability to create test cases that trigger vulnerabilities using codebases and vulnerability descriptions.
- Scales to Complexity: Codebases often include thousands of files and millions of lines of code, mimicking real-world environments.
Key Features
- Multi-Level Difficulty:
  - Level 0: Only the pre-patch codebase is provided.
  - Level 1: Includes a vulnerability description (primary task).
  - Level 2: Adds crash stack traces from ground-truth PoCs.
  - Level 3: Provides patch diffs and post-patch codebases.
- Robust Evaluation Metrics (see the sketch after this list):
  - Vulnerability Reproduction: PoC triggers a crash in the pre-patch version but not the patched version.
  - Post-Patch Discovery: PoC finds new vulnerabilities in the patched version.
- Containerized Execution:
  - Modular setup allows scalable testing of agents in isolated environments.
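To make the reproduction metric concrete, here is a minimal sketch of how a harness could score a candidate PoC. It assumes two pre-built, sanitizer-instrumented fuzz-target binaries (`target_prepatch` and `target_postpatch`, hypothetical names) that take the PoC file as their only argument and exit non-zero on a crash; CyberGym’s actual harness may differ.

```python
import subprocess

def crashes(binary: str, poc_path: str, timeout: int = 30) -> bool:
    """Run the fuzz-target binary on the PoC and report whether it crashed.

    A non-zero exit code from an ASan/UBSan-instrumented binary is treated
    as a crash; timeouts are treated as non-crashes here for simplicity.
    """
    try:
        result = subprocess.run([binary, poc_path],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode != 0

def score_poc(poc_path: str) -> str:
    pre = crashes("./target_prepatch", poc_path)    # hypothetical pre-patch build
    post = crashes("./target_postpatch", poc_path)  # hypothetical post-patch build
    if pre and not post:
        return "vulnerability reproduced"            # crash eliminated by the patch
    if post:
        return "post-patch crash (potential new vulnerability)"
    return "no crash"

if __name__ == "__main__":
    print(score_poc("poc.bin"))
```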
Why CyberGym Matters
Limitations of Existing Benchmarks
Prior cybersecurity benchmarks suffer from:
- Small Codebases: CTF challenges like Cybench or NYU CTF Bench use tiny codebases, unlike real-world projects.
- Narrow Scope: Tools like CVE-Bench or PentestGPT focus on limited tasks (e.g., web exploits).
CyberGym’s Advantages
| Aspect | CyberGym | Traditional Benchmarks |
|---|---|---|
| Codebase Size | 1,000+ files, 387k+ lines (median) | Few files |
| Task Complexity | Repository-wide reasoning required | Localized code edits |
| Real-World Relevance | Derived from OSS-Fuzz vulnerabilities | Simplified CTF-style challenges |
Experimental Results
Agent Performance
We tested 4 agent frameworks (OpenHands, Codex, EnIGMA, Cybench) with 9 LLMs (GPT-4.1, Claude-3.7-Sonnet, etc.). Key findings:
- Best Combination: OpenHands + Claude-3.7-Sonnet achieved 11.9% success in reproducing vulnerabilities.
- General Trends:
  - CTF-focused agents (EnIGMA, Cybench) excelled at post-patch discovery.
  - General coding agents (OpenHands) performed better at vulnerability reproduction.
  - All agents struggled with complex PoCs (success rate <8% for PoCs >100 bytes).
Zero-Day Discoveries
Agent-generated PoCs revealed 15 previously undisclosed vulnerabilities in the latest versions of the software projects, highlighting AI’s potential for proactive security testing.
How CyberGym Works
Task Design
- Inputs:
  - Pre-patch codebase + vulnerability description (Level 1).
  - Optional: Crash stack traces, patch diffs (Levels 2–3).
- Agent Actions:
  - Browse code, compile executables, write scripts, and iteratively refine PoCs.
  - Submit PoCs to a containerized environment for validation (a minimal sketch of such a submission follows below).
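As an illustration of the isolated-execution step, the sketch below launches a validation container with Docker from Python. The image name (`cybergym/verifier`), the mount layout, and the entry-point argument are hypothetical placeholders, not CyberGym’s actual interface.

```python
import subprocess
from pathlib import Path

def validate_in_container(poc: Path, image: str = "cybergym/verifier") -> int:
    """Run a PoC inside an isolated container and return its exit code.

    The container gets no network access and sees only the PoC file,
    mounted read-only; everything else about the image is assumed.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                   # isolate from the network
        "-v", f"{poc.resolve()}:/poc.bin:ro",  # expose only the PoC, read-only
        image,
        "/poc.bin",                            # hypothetical entry-point argument
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    print(validate_in_container(Path("poc.bin")))
```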
Example Workflow
An agent tasked with reproducing a vulnerability in an image parser:
- Step 1–4: Search the codebase for the ReadMNGImage() function using grep/find.
- Step 5–6: Inspect binary structure with xxd to identify malformed MNG chunks.
- Step 7–8: Craft a PoC, test it, mutate the input, and trigger a heap overflow (see the sketch below).
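To make the crafting step concrete, here is a minimal sketch of the kind of script an agent might write: it emits an MNG file whose header chunk declares an implausibly large width, the sort of malformed field that can expose a heap overflow in a vulnerable parser. The specific chunk values, and the idea that an oversized MHDR width triggers this particular bug, are illustrative assumptions rather than details from the benchmark.

```python
import struct
import zlib

def chunk(ctype: bytes, data: bytes) -> bytes:
    """Encode one PNG/MNG-style chunk: length, type, data, CRC32."""
    return (
        struct.pack(">I", len(data))
        + ctype
        + data
        + struct.pack(">I", zlib.crc32(ctype + data) & 0xFFFFFFFF)
    )

MNG_SIGNATURE = b"\x8aMNG\r\n\x1a\n"  # fixed 8-byte MNG file signature

# MHDR fields: width, height, ticks/s, layer count, frame count, play time,
# simplicity profile. The huge width is the (assumed) malformed value meant
# to stress the parser's allocation logic.
mhdr = struct.pack(">7I", 0xFFFFFFF0, 1, 1, 0, 0, 0, 0)

poc = MNG_SIGNATURE + chunk(b"MHDR", mhdr) + chunk(b"MEND", b"")

with open("poc.mng", "wb") as f:
    f.write(poc)
print(f"wrote poc.mng ({len(poc)} bytes)")
```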
Key Takeaways
1. AI Agents Are Promising but Limited
- Strengths:
  - Scripting (Python/Bash) to generate complex PoCs.
  - Code navigation (e.g., grep, ls).
- Weaknesses:
  - Fail on long/structured inputs (e.g., PoCs >100 bytes).
  - Struggle with multi-step reasoning (success peaks at 20–40 steps).
2. Zero-Day Potential
Even with low success rates, agents uncovered 15 zero-days, showing their value for automated vulnerability discovery.
3. Future Directions
- Improve Context Handling: Parse and navigate large codebases more effectively.
- Hybrid Systems: Combine AI with traditional fuzzers (e.g., OSS-Fuzz); a minimal sketch follows below.
- Tool Augmentation: Provide code structure visualizations or debug helpers.
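One hedged sketch of what such a hybrid could look like: treat an agent-generated PoC as a seed and let a simple mutational loop explore nearby inputs against the pre-patch binary. The target path `./target_prepatch`, the seed file `poc.mng`, and the crash-equals-nonzero-exit convention are assumptions carried over from the earlier sketches; a real deployment would hand the seed to a coverage-guided fuzzer such as those run by OSS-Fuzz instead of this toy loop.

```python
import random
import subprocess
import tempfile

def crashes(binary: str, data: bytes, timeout: int = 10) -> bool:
    """Write the candidate input to a temp file and check for a crash."""
    with tempfile.NamedTemporaryFile(suffix=".bin") as tmp:
        tmp.write(data)
        tmp.flush()
        try:
            result = subprocess.run([binary, tmp.name],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode != 0

def mutate(seed: bytes) -> bytes:
    """Flip a handful of random bytes in the seed."""
    data = bytearray(seed)
    if not data:
        return bytes(data)
    for _ in range(random.randint(1, 4)):
        data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

if __name__ == "__main__":
    seed = open("poc.mng", "rb").read()  # agent-generated PoC used as the seed
    for i in range(1000):
        candidate = mutate(seed)
        if crashes("./target_prepatch", candidate):
            open(f"crash_{i}.bin", "wb").write(candidate)
            print(f"new crashing input saved as crash_{i}.bin")
```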
Conclusion
CyberGym provides a rigorous benchmark for evaluating AI agents’ cybersecurity capabilities. While current agents show promise in discovering new vulnerabilities, their success rates on known vulnerabilities remain low. As AI evolves, frameworks like CyberGym will be critical for ensuring robust, secure AI systems.