★BUGFARM: How to Mass-Produce “Hard-to-Spot, Hard-to-Fix” Bugs for AI Testing★
Table of Contents
- 🍄
Quick Snapshot - 🍄
Do I Need BUGFARM? - 🍄
Inside BUGFARM: A 3-Step Walk-Through - 🍄
Hands-On Lab: 10 Minutes From Zero to First Bug - 🍄
Frequently Asked Questions - 🍄
BUGFARM vs. LEAM vs. μBERT - 🍄
Reusing the Paper’s Public Data - 🍄
Bottom Line
Quick Snapshot
BUGFARM is a training-free, language-agnostic framework that:
- Takes any code snippet you feed it.
- Figures out which statements a transformer model “cares about” the least.
- Asks a large language model (GPT-3.5 by default) to plant bugs only in those low-attention spots.
- Returns bug-injected versions that look almost identical to the original code yet break tests, making them perfect stress tests for AI-driven bug detectors and repair tools.
Do I Need BUGFARM?
Scenario | Should You Use BUGFARM? |
---|---|
You’re building a new AI bug-prediction model and need hard negatives | Yes |
You want to see if your AI repair tool has blind spots | Yes |
You only need traditional unit-test failures | No (BUGFARM targets AI models) |
You want real-world human bugs for case studies | No (BUGFARM creates synthetic, AI-style mistakes) |
Inside BUGFARM: A 3-Step Walk-Through
1. Method Extractor
- Scans any Maven-based Java project (other build systems can be added).
- Builds an Abstract Syntax Tree (AST) and pulls out every method or constructor.
- Strips doc comments but keeps signatures and bodies.
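To make the idea concrete, here is a minimal sketch of this extraction step in Python using the javalang parser; the library choice and the helper are illustrative assumptions, not BUGFARM’s actual extractor.

```python
# Rough sketch of the Method Extractor idea using the javalang parser.
# BUGFARM's real extractor may use a different AST toolchain; this only
# illustrates "parse the file, pull out every method/constructor".
import javalang

def extract_methods(java_source: str):
    """Return one record per method or constructor found in a Java file."""
    tree = javalang.parse.parse(java_source)
    records = []
    for node_type in (javalang.tree.MethodDeclaration, javalang.tree.ConstructorDeclaration):
        for _, node in tree.filter(node_type):
            records.append({
                "name": node.name,
                "start_line": node.position.line if node.position else None,
            })
    return records

if __name__ == "__main__":
    with open("sample-maven/src/Util.java") as fh:  # path taken from the lab below
        for record in extract_methods(fh.read()):
            print(record)
```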
2. Attention Analyzer
- Loads a transformer model (CodeBERT, CodeT5, or NATGEN).
- Computes attention weights for every token, then averages across layers and heads.
- Ranks statements by the fraction of “low-attention” tokens they contain.
- Marks the bottom k % (default 10 %) as Least-Attended Statements (LAS).
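This step can be approximated in a few lines with Hugging Face transformers. The snippet below is a hedged illustration of the layer/head averaging and bottom-10 % cutoff described above, not the paper’s exact aggregation code.

```python
# Illustrative attention analysis with CodeBERT: average attention over layers
# and heads, then flag the tokens that receive the least attention.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)

code = "int add(int a, int b) { int c = a + b; return c; }"
inputs = tokenizer(code, return_tensors="pt", truncation=True)

with torch.no_grad():
    attentions = model(**inputs).attentions      # one (1, heads, seq, seq) tensor per layer

avg = torch.stack(attentions).mean(dim=(0, 2))   # average over layers and heads -> (1, seq, seq)
received = avg.squeeze(0).mean(dim=0)            # attention each token receives  -> (seq,)

k = max(1, int(0.10 * received.numel()))         # bottom 10 % = default LAS threshold
low_idx = torch.argsort(received)[:k]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print("least-attended tokens:", [tokens[i] for i in low_idx.tolist()])
```

Mapping those low-attention tokens back to the statements that contain them, and then keeping the bottom k % of statements, is what yields the LAS set.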
3. Bug Generator
- Crafts a prompt containing:
  - The original method (with line numbers).
  - The instruction: “Inject N bugs by changing only the LAS lines.”
- Sends the prompt to GPT-3.5-turbo.
- Filters out:
  - Syntax errors.
  - Any patch that changes non-LAS lines.
  - Any patch that noticeably shifts attention weights.
- Keeps the rest as confirmed bugs if they fail existing unit tests.
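As a rough sketch of what the prompt-and-call part might look like (the prompt wording and helper functions below are placeholders, not BUGFARM’s exact prompt):

```python
# Hypothetical prompt construction and GPT-3.5 call for the bug-generation step.
# The prompt text below is a stand-in for the paper's actual instruction.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(numbered_method: str, las_lines: list[int], n_bugs: int = 3) -> str:
    return (
        "Here is a Java method with line numbers:\n\n"
        f"{numbered_method}\n\n"
        f"Inject {n_bugs} bugs by changing only lines {las_lines}. "
        "Return the complete modified method."
    )

def generate_variant(numbered_method: str, las_lines: list[int]) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(numbered_method, las_lines)}],
    )
    return response.choices[0].message.content
```

Each returned variant then has to survive the filters above (compiles, touches only LAS lines, leaves attention roughly unchanged) and fail at least one existing test before it counts as a confirmed bug.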
Hands-On Lab: 10 Minutes From Zero to First Bug
Option A: Docker (Recommended)
# 1. Clone the repository (link in the paper)
git clone https://github.com/projectinvestigator/BUGFARM
cd BUGFARM
# 2. Build the image
docker build -t bugfarm .
# 3. Launch a shell inside the container
docker run -it bugfarm bash
# 4. Run the pipeline
python -m src.attention_analyzer --project sample-maven
python -m src.bug_generator --method sample-maven/src/Util.java:42
On first run the script downloads ~2 GB of model weights.
Option B: Manual Install
- Install miniconda.
- conda env create -f environment.yaml && conda activate bugfarm
- Install the tokenizer tool (bash setup.sh handles this).
- Run the same Python commands as above.
When finished, check the output/bugs/ folder for JSON files containing the patched source code.
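A quick way to poke at those files from Python; no field names are assumed here, the snippet simply lists what each JSON file actually contains:

```python
# List the generated bug files and show their top-level JSON keys.
# No schema is assumed beyond "JSON files live in output/bugs/".
import json
from pathlib import Path

for path in sorted(Path("output/bugs").glob("*.json")):
    with path.open() as fh:
        bug = json.load(fh)
    print(path.name, "->", list(bug.keys()))
```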
Frequently Asked Questions
- Do these synthetic bugs look like real-world mistakes?
  They mimic AI-generated mistakes, not human ones. Their job is to test AI tools, not to replace real defect datasets.
- Can I use BUGFARM for Python or C++?
  The framework is language-agnostic; only the parser is Java-specific today. Swapping in a Python or C++ parser is possible.
- Why is the default threshold 10 %?
  Paper experiments show that 10 % LAS already raises the False-Negative Rate by up to 40 %. Higher k increases patch size with diminishing returns.
- How many variants per method?
  The default is N = 3. You can raise it, but the LAS lines often limit diversity.
- Will the tool crash my project?
  All patches must compile and fail existing tests; anything that breaks the build is discarded.
- Where are the model weights stored?
  In the Hugging Face cache, handled automatically by the setup script.
- Is a GPU required?
  Not for inference; a CPU works. A GPU (RTX 3090 tested) speeds up attention extraction to ~68 ms.
- How are the generated files named?
  {OriginalClass}_{Method}_bug{Index}.java, which is easy to script against (see the sketch after this list).
- How do you decide if a bug is “equivalent”?
  If the existing test suite still passes, it’s labeled equivalent. Manual review handles edge cases.
- Can I use BUGFARM commercially?
  The MIT license allows commercial use, but the authors caution that it’s meant for research benchmarks.
- Can I switch to GPT-4?
  Change one line in the Bug Generator config; cost rises ~10×.
- Does my private code leak to OpenAI?
  Yes, if you use the cloud API. For sensitive code, run an on-prem LLaMA or Alpaca model instead.
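Since the file names encode the class, method, and variant index, a short script can group the generated variants; the output directory below is an assumption, so adjust it to wherever your run writes the .java files.

```python
# Group generated variants by class and method using the
# {OriginalClass}_{Method}_bug{Index}.java naming scheme from the FAQ.
import re
from collections import defaultdict
from pathlib import Path

pattern = re.compile(r"(?P<cls>.+)_(?P<method>.+)_bug(?P<idx>\d+)\.java$")
variants = defaultdict(list)

for path in Path("output/bugs").glob("*.java"):   # directory is an assumption
    match = pattern.match(path.name)
    if match:
        variants[(match["cls"], match["method"])].append(int(match["idx"]))

for (cls, method), indices in sorted(variants.items()):
    print(f"{cls}.{method}: {len(indices)} variant(s)")
```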
BUGFARM vs. LEAM vs. μBERT
Metric | BUGFARM | LEAM | μBERT |
---|---|---|---|
Training required | None | 24 h on GitHub commits | None |
Avg. statements changed per bug | 3.2 | 2.7 | 2.0 |
False-Negative Rate boost vs. baseline | +40 % | baseline | baseline |
Repair success (FitRepair) | 28 % | 36 % | 49 % |
Time per method (end-to-end) | 9 s | 35 s | 28 s |
Overlap with other tools | 7.6 % LEAM, 4 % μBERT | — | — |
Takeaway: BUGFARM sacrifices some repairability to become a tougher test for AI detectors.
Reusing the Paper’s Public Data
Everything is on Zenodo under DOI 10.5281/zenodo.13886318:
File | What You’ll Find |
---|---|
mutants.zip | 434 k confirmed bugs plus original and patched source |
defect_datasets.zip | Training/testing datasets built from BugSwarm, Defects4J, RegMiner, LEAM, μBERT |
apr.zip | FitRepair patches generated for 1 000 sampled bugs |
human_study.zip | 97 human labelers’ votes on 300 survived mutants |
Unzip and feed any CSV or JSON file into your own pipeline—the schema is self-documented.
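For example, here is a hedged way to fetch one archive and peek inside; the direct file URL follows Zenodo’s usual layout for this record and should be double-checked on the record page.

```python
# Download one archive from the Zenodo record (DOI 10.5281/zenodo.13886318)
# and inspect its contents before assuming any schema.
import io
import zipfile

import pandas as pd
import requests

url = "https://zenodo.org/records/13886318/files/human_study.zip"  # verify on the record page
archive = zipfile.ZipFile(io.BytesIO(requests.get(url, timeout=120).content))
print(archive.namelist())

# Load the first CSV found, if there is one.
csv_files = [name for name in archive.namelist() if name.endswith(".csv")]
if csv_files:
    df = pd.read_csv(archive.open(csv_files[0]))
    print(df.head())
```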
Bottom Line
BUGFARM is a red-team toolkit for AI-driven software engineering:
no training, no buzzwords, just reliable, reproducible bugs that expose blind spots in modern defect detectors and repair tools.
If you’re building or benchmarking such tools, keep BUGFARM in your pocket.