★BUGFARM: How to Mass-Produce “Hard-to-Spot, Hard-to-Fix” Bugs for AI Testing★
Table of Contents
- 🍄
Quick Snapshot - 🍄
Do I Need BUGFARM? - 🍄
Inside BUGFARM: A 3-Step Walk-Through - 🍄
Hands-On Lab: 10 Minutes From Zero to First Bug - 🍄
Frequently Asked Questions - 🍄
BUGFARM vs. LEAM vs. μBERT - 🍄
Reusing the Paper’s Public Data - 🍄
Bottom Line
Quick Snapshot
BUGFARM is a training-free, language-agnostic framework that:
- Takes any code snippet you feed it.
- Figures out which statements a transformer model “cares about” the least.
- Asks a large language model (GPT-3.5 by default) to plant bugs only in those low-attention spots.
- Returns bug-injected versions that look almost identical to the original code yet break tests, making them perfect stress tests for AI-driven bug detectors and repair tools.
Do I Need BUGFARM?
Scenario | Should You Use BUGFARM? |
---|---|
You’re building a new AI bug-prediction model and need hard negatives | Yes |
You want to see if your AI repair tool has blind spots | Yes |
You only need traditional unit-test failures | No (BUGFARM targets AI models) |
You want real-world human bugs for case studies | No (BUGFARM creates synthetic, AI-style mistakes) |
Inside BUGFARM: A 3-Step Walk-Through
1. Method Extractor
- Scans any Maven-based Java project (other build systems can be added).
- Builds an Abstract Syntax Tree (AST) and pulls out every method or constructor.
- Strips doc comments but keeps signatures and bodies.
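To make the idea concrete, here is a minimal sketch of this extraction step in Python using the javalang parser; the library choice and the helper are illustrative assumptions, not BUGFARM’s actual extractor.

```python
# Rough sketch of the Method Extractor idea using the javalang parser.
# BUGFARM's real extractor may use a different AST toolchain; this only
# illustrates "parse the file, pull out every method/constructor".
import javalang

def extract_methods(java_source: str):
    """Return one record per method or constructor found in a Java file."""
    tree = javalang.parse.parse(java_source)
    records = []
    for node_type in (javalang.tree.MethodDeclaration, javalang.tree.ConstructorDeclaration):
        for _, node in tree.filter(node_type):
            records.append({
                "name": node.name,
                "start_line": node.position.line if node.position else None,
            })
    return records

if __name__ == "__main__":
    with open("sample-maven/src/Util.java") as fh:  # path taken from the lab below
        for record in extract_methods(fh.read()):
            print(record)
```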
2. Attention Analyzer
- Loads a transformer model (CodeBERT, CodeT5, or NATGEN).
- Computes attention weights for every token, then averages across layers and heads.
- Ranks statements by the fraction of “low-attention” tokens they contain.
- Marks the bottom k % (default 10 %) as Least-Attended Statements (LAS).
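This step can be approximated in a few lines with Hugging Face transformers. The snippet below is a hedged illustration of the layer/head averaging and bottom-10 % cutoff described above, not the paper’s exact aggregation code.

```python
# Illustrative attention analysis with CodeBERT: average attention over layers
# and heads, then flag the tokens that receive the least attention.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)

code = "int add(int a, int b) { int c = a + b; return c; }"
inputs = tokenizer(code, return_tensors="pt", truncation=True)

with torch.no_grad():
    attentions = model(**inputs).attentions      # one (1, heads, seq, seq) tensor per layer

avg = torch.stack(attentions).mean(dim=(0, 2))   # average over layers and heads -> (1, seq, seq)
received = avg.squeeze(0).mean(dim=0)            # attention each token receives  -> (seq,)

k = max(1, int(0.10 * received.numel()))         # bottom 10 % = default LAS threshold
low_idx = torch.argsort(received)[:k]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print("least-attended tokens:", [tokens[i] for i in low_idx.tolist()])
```

Mapping those low-attention tokens back to the statements that contain them, and then keeping the bottom k % of statements, is what yields the LAS set.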
3. Bug Generator
- Crafts a prompt containing:
  - The original method (with line numbers).
  - The instruction: “Inject N bugs by changing only the LAS lines.”
- Sends the prompt to GPT-3.5-turbo.
- Filters out:
  - Syntax errors.
  - Any patch that changes non-LAS lines.
  - Any patch that noticeably shifts attention weights.
- Keeps the rest as confirmed bugs if they fail existing unit tests.
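As a rough sketch of what the prompt-and-call part might look like (the prompt wording and helper functions below are placeholders, not BUGFARM’s exact prompt):

```python
# Hypothetical prompt construction and GPT-3.5 call for the bug-generation step.
# The prompt text below is a stand-in for the paper's actual instruction.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(numbered_method: str, las_lines: list[int], n_bugs: int = 3) -> str:
    return (
        "Here is a Java method with line numbers:\n\n"
        f"{numbered_method}\n\n"
        f"Inject {n_bugs} bugs by changing only lines {las_lines}. "
        "Return the complete modified method."
    )

def generate_variant(numbered_method: str, las_lines: list[int]) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(numbered_method, las_lines)}],
    )
    return response.choices[0].message.content
```

Each returned variant then has to survive the filters above (compiles, touches only LAS lines, leaves attention roughly unchanged) and fail at least one existing test before it counts as a confirmed bug.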
Hands-On Lab: 10 Minutes From Zero to First Bug
Option A: Docker (Recommended)
# 1. Clone the repository (link in the paper)
git clone https://github.com/projectinvestigator/BUGFARM
cd BUGFARM
# 2. Build the image
docker build -t bugfarm .
# 3. Launch a shell inside the container
docker run -it bugfarm bash
# 4. Run the pipeline
python -m src.attention_analyzer --project sample-maven
python -m src.bug_generator --method sample-maven/src/Util.java:42
On first run the script downloads ~2 GB of model weights.
Option B: Manual Install
- Install miniconda.
- conda env create -f environment.yaml && conda activate bugfarm
- Install the tokenizer tool (bash setup.sh handles this).
- Run the same Python commands as above.
When finished, check the output/bugs/ folder for JSON files containing the patched source code.
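A quick way to poke at those files from Python; no field names are assumed here, the snippet simply lists what each JSON file actually contains:

```python
# List the generated bug files and show their top-level JSON keys.
# No schema is assumed beyond "JSON files live in output/bugs/".
import json
from pathlib import Path

for path in sorted(Path("output/bugs").glob("*.json")):
    with path.open() as fh:
        bug = json.load(fh)
    print(path.name, "->", list(bug.keys()))
```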
Frequently Asked Questions
- Do these synthetic bugs look like real-world mistakes?
  They mimic AI-generated mistakes, not human ones. Their job is to test AI tools, not to replace real defect datasets.
- Can I use BUGFARM for Python or C++?
  The framework is language-agnostic; only the parser is Java-specific today. Swapping in a Python or C++ parser is possible.
- Why is the default threshold 10 %?
  Paper experiments show that 10 % LAS already raises the False-Negative Rate by up to 40 %. Higher k increases patch size with diminishing returns.
- How many variants per method?
  The default is N = 3. You can raise it, but the LAS lines often limit diversity.
- Will the tool crash my project?
  All patches must compile and fail existing tests; anything that breaks the build is discarded.
- Where are the model weights stored?
  In the Hugging Face cache, handled automatically by the setup script.
- Is a GPU required?
  Not for inference; a CPU works. A GPU (RTX 3090 tested) speeds up attention extraction to ~68 ms.
- How are the generated files named?
  {OriginalClass}_{Method}_bug{Index}.java, which is easy to script against (see the sketch after this list).
- How do you decide if a bug is “equivalent”?
  If the existing test suite still passes, it’s labeled equivalent. Manual review handles edge cases.
- Can I use BUGFARM commercially?
  The MIT license allows commercial use, but the authors caution that it’s meant for research benchmarks.
- Can I switch to GPT-4?
  Change one line in the Bug Generator config; cost rises ~10×.
- Does my private code leak to OpenAI?
  Yes, if you use the cloud API. For sensitive code, run an on-prem LLaMA or Alpaca model instead.
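Since the file names encode the class, method, and variant index, a short script can group the generated variants; the output directory below is an assumption, so adjust it to wherever your run writes the .java files.

```python
# Group generated variants by class and method using the
# {OriginalClass}_{Method}_bug{Index}.java naming scheme from the FAQ.
import re
from collections import defaultdict
from pathlib import Path

pattern = re.compile(r"(?P<cls>.+)_(?P<method>.+)_bug(?P<idx>\d+)\.java$")
variants = defaultdict(list)

for path in Path("output/bugs").glob("*.java"):   # directory is an assumption
    match = pattern.match(path.name)
    if match:
        variants[(match["cls"], match["method"])].append(int(match["idx"]))

for (cls, method), indices in sorted(variants.items()):
    print(f"{cls}.{method}: {len(indices)} variant(s)")
```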
BUGFARM vs. LEAM vs. μBERT
Metric | BUGFARM | LEAM | μBERT |
---|---|---|---|
Training required | None | 24 h on GitHub commits | None |
Avg. statements changed per bug | 3.2 | 2.7 | 2.0 |
False-Negative Rate boost vs. baseline | +40 % | baseline | baseline |
Repair success (FitRepair) | 28 % | 36 % | 49 % |
Time per method (end-to-end) | 9 s | 35 s | 28 s |
Overlap with other tools | 7.6 % LEAM, 4 % μBERT | — | — |
Takeaway: BUGFARM sacrifices some repairability to become a tougher test for AI detectors.
Reusing the Paper’s Public Data
Everything is on Zenodo under DOI 10.5281/zenodo.13886318:
File | What You’ll Find |
---|---|
mutants.zip | 434 k confirmed bugs plus original and patched source |
defect_datasets.zip | Training/testing datasets built from BugSwarm, Defects4J, RegMiner, LEAM, μBERT |
apr.zip | FitRepair patches generated for 1 000 sampled bugs |
human_study.zip | 97 human labelers’ votes on 300 survived mutants |
Unzip and feed any CSV or JSON file into your own pipeline—the schema is self-documented.
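For example, here is a hedged way to fetch one archive and peek inside; the direct file URL follows Zenodo’s usual layout for this record and should be double-checked on the record page.

```python
# Download one archive from the Zenodo record (DOI 10.5281/zenodo.13886318)
# and inspect its contents before assuming any schema.
import io
import zipfile

import pandas as pd
import requests

url = "https://zenodo.org/records/13886318/files/human_study.zip"  # verify on the record page
archive = zipfile.ZipFile(io.BytesIO(requests.get(url, timeout=120).content))
print(archive.namelist())

# Load the first CSV found, if there is one.
csv_files = [name for name in archive.namelist() if name.endswith(".csv")]
if csv_files:
    df = pd.read_csv(archive.open(csv_files[0]))
    print(df.head())
```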
Bottom Line
BUGFARM is a red-team toolkit for AI-driven software engineering:
no training, no buzzwords, just reliable, reproducible bugs that expose blind spots in modern defect detectors and repair tools.
If you’re building or benchmarking such tools, keep BUGFARM in your pocket.