
BUGFARM: How to Mass-Produce “Hard-to-Spot, Hard-to-Fix” Bugs for AI Testing


Quick Snapshot

BUGFARM is a training-free, language-agnostic framework that:

  1. Takes any code snippet you feed it.
  2. Figures out which statements a transformer model “cares about” the least.
  3. Asks a large language model (GPT-3.5 by default) to plant bugs only in those low-attention spots.
  4. Returns bug-injected versions that look almost identical to the original code yet break tests—perfect stress-tests for AI-driven bug detectors and repair tools.

Do I Need BUGFARM?

| Scenario | Should You Use BUGFARM? |
| --- | --- |
| You’re building a new AI bug-prediction model and need hard negatives | Yes |
| You want to see if your AI repair tool has blind spots | Yes |
| You only need traditional unit-test failures | No (BUGFARM targets AI models) |
| You want real-world human bugs for case studies | No (BUGFARM creates synthetic, AI-style mistakes) |

Inside BUGFARM: A 3-Step Walk-Through

1. Method Extractor

  • Scans any Maven-based Java project (other build systems can be added).
  • Builds an Abstract Syntax Tree (AST) and pulls out every method or constructor.
  • Strips docstrings but keeps signatures and bodies.
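
To make the extraction step concrete, here is a minimal Python sketch built on the third-party javalang parser. The parser choice and helper are assumptions for illustration; BUGFARM’s actual extractor is not shown in this post.

```python
# Minimal method-extractor sketch, assuming the third-party `javalang`
# parser (pip install javalang); not BUGFARM's actual implementation.
import javalang

def extract_methods(java_source: str):
    """Yield (name, AST node) for every method and constructor in a compilation unit."""
    tree = javalang.parse.parse(java_source)
    for _, node in tree.filter(javalang.tree.MethodDeclaration):
        yield node.name, node
    for _, node in tree.filter(javalang.tree.ConstructorDeclaration):
        yield node.name, node

src = """
class Util {
    Util() {}
    static int add(int a, int b) { return a + b; }
}
"""
for name, _ in extract_methods(src):
    print(name)  # prints: add, then Util
```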

2. Attention Analyzer

  • Loads a transformer model (CodeBERT, CodeT5, or NATGEN).
  • Computes attention weights for every token, then averages across layers and heads.
  • Ranks statements by the fraction of “low-attention” tokens they contain.
  • Marks the bottom k % (default 10 %) as Least-Attended Statements (LAS).
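
As a rough illustration of this step, here is a Python sketch using Hugging Face transformers with CodeBERT. The bottom-quartile cutoff for “low attention” and the line-equals-statement mapping are simplifying assumptions, not the paper’s exact definitions:

```python
# Attention-analysis sketch (assumes CodeBERT via Hugging Face and treats
# each line as one statement; BUGFARM's real analyzer may differ).
import bisect
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/codebert-base"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_attentions=True)

def least_attended_statements(code: str, k: float = 0.10) -> list[int]:
    """Rank lines by their share of low-attention tokens; return the bottom k%."""
    enc = tok(code, return_tensors="pt", truncation=True, return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        att = model(**enc).attentions           # per layer: (batch, heads, seq, seq)
    # Average over layers, heads, and query positions: attention each token receives.
    recv = torch.stack(att).mean(dim=(0, 2, 3)).squeeze(0)
    cutoff = recv.quantile(0.25).item()         # assumption: "low" = bottom quartile
    line_starts = [0]
    for line in code.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    stats = {}                                  # line index -> (tokens, low tokens)
    for (start, end), score in zip(offsets.tolist(), recv.tolist()):
        if start == end:                        # skip special tokens such as <s>
            continue
        ln = bisect.bisect_right(line_starts, start) - 1
        total, low = stats.get(ln, (0, 0))
        stats[ln] = (total + 1, low + (score < cutoff))
    ranked = sorted(stats, key=lambda ln: stats[ln][1] / stats[ln][0], reverse=True)
    return ranked[: max(1, int(len(ranked) * k))]
```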

3. Bug Generator

  • Crafts a prompt containing:
    • The original method (with line numbers).
    • The instruction: “Inject N bugs by changing only the LAS lines.”
  • Sends the prompt to GPT-3.5-turbo.
  • Filters out:
    • Syntax errors.
    • Any patch that changes non-LAS lines.
    • Any patch that noticeably shifts attention weights.
  • Keeps the rest as confirmed bugs if they fail existing unit tests.
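
A minimal sketch of the prompting step, using the official openai Python client; the exact prompt wording and the generate_bugs helper are assumptions for illustration, not BUGFARM’s real prompt:

```python
# Bug-generation prompt sketch; the wording and helper here are
# illustrative assumptions, not BUGFARM's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_bugs(method_src: str, las_lines: list[int], n: int = 3) -> str:
    numbered = "\n".join(f"{i + 1}: {line}" for i, line in enumerate(method_src.splitlines()))
    prompt = (
        f"Inject {n} bugs into the Java method below by changing ONLY "
        f"lines {sorted(las_lines)}. Return each buggy variant in full.\n\n{numbered}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Each returned variant would then pass through the filters above: it is discarded if it fails to parse, touches a non-LAS line, or noticeably shifts the model’s attention.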

Hands-On Lab: 10 Minutes From Zero to First Bug

Option A: Docker (Recommended)

# 1. Clone the repository (link in the paper)
git clone https://github.com/projectinvestigator/BUGFARM
cd BUGFARM

# 2. Build the image
docker build -t bugfarm .

# 3. Launch a shell inside the container
docker run -it bugfarm bash

# 4. Run the pipeline
python -m src.attention_analyzer --project sample-maven
python -m src.bug_generator --method sample-maven/src/Util.java:42

On first run the script downloads ~2 GB of model weights.

Option B: Manual Install

  1. Install miniconda.
  2. conda env create -f environment.yaml && conda activate bugfarm
  3. Install the tokenizer tool (bash setup.sh handles this).
  4. Same Python commands as above.

When finished, check the output/bugs/ folder for JSON files containing the patched source code.
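
The JSON layout isn’t documented in this post, so a safe first step is to list each file’s top-level fields before scripting against them; for example:

```python
# Peek at the generated bug files before building a pipeline on them
# (the JSON layout is whatever your BUGFARM version emits).
import json
from pathlib import Path

for path in Path("output/bugs").glob("*.json"):
    data = json.loads(path.read_text())
    fields = sorted(data) if isinstance(data, dict) else f"list of {len(data)} items"
    print(path.name, "->", fields)
```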


Frequently Asked Questions

  1. Do these synthetic bugs look like real-world mistakes?
    They mimic AI-generated mistakes, not human ones. Their job is to test AI tools, not to replace real defect datasets.

  2. Can I use BUGFARM for Python or C++?
    The framework is language-agnostic; only the parser is Java-specific today. Swapping in a Python or C++ parser is possible.

  3. Why is the default threshold 10 %?
    Paper experiments show that 10 % LAS already raises the False-Negative Rate by up to 40 %. Higher k increases patch size with diminishing returns.

  4. How many variants per method?
    Default is N = 3. You can raise it, but LAS lines often limit diversity.

  5. Will the tool crash my project?
    All patches must compile and fail existing tests; anything crashing the build is discarded.

  6. Where are the model weights stored?
    Hugging Face cache, handled automatically by the setup script.

  7. Is a GPU required?
    Not for inference; CPU works. GPU (RTX 3090 tested) speeds up attention extraction to ~68 ms.

  8. How are the generated files named?
    {OriginalClass}_{Method}_bug{Index}.java — easy to script against (see the sketch after this list).

  9. How do you decide if a bug is “equivalent”?
    If the existing test suite still passes, the mutant is labeled equivalent. Manual review handles edge cases.

  10. Can I use BUGFARM commercially?
    MIT license allows commercial use, but the authors caution it’s meant for research benchmarks.

  11. Can I switch to GPT-4?
    Change one line in the Bug Generator config; cost rises ~10×.

  12. Does my private code leak to OpenAI?
    Yes, if you use the cloud API. For sensitive code, run an on-prem LLaMA or Alpaca model instead.
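
As promised in question 8, here is a small hypothetical helper (not shipped with BUGFARM) for indexing generated files by their naming scheme:

```python
# Hypothetical indexer for BUGFARM output files named
# {OriginalClass}_{Method}_bug{Index}.java
import re
from pathlib import Path

PATTERN = re.compile(r"(?P<cls>\w+)_(?P<method>\w+)_bug(?P<idx>\d+)\.java")

for path in Path("output/bugs").glob("*.java"):
    m = PATTERN.fullmatch(path.name)
    if m:
        print(m["cls"], m["method"], m["idx"])
```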


BUGFARM vs. LEAM vs. μBERT

| Metric | BUGFARM | LEAM | μBERT |
| --- | --- | --- | --- |
| Training required | None | 24 h on GitHub commits | None |
| Avg. statements changed per bug | 3.2 | 2.7 | 2.0 |
| False-Negative Rate boost vs. baseline | +40 % | baseline | baseline |
| Repair success (FitRepair) | 28 % | 36 % | 49 % |
| Time per method (end-to-end) | 9 s | 35 s | 28 s |
| Overlap with other tools | 7.6 % (LEAM), 4 % (μBERT) | — | — |

Takeaway: BUGFARM sacrifices some repairability to become a tougher test for AI detectors.


Reusing the Paper’s Public Data

Everything is on Zenodo under DOI 10.5281/zenodo.13886318:

| File | What You’ll Find |
| --- | --- |
| mutants.zip | 434 k confirmed bugs plus original and patched source |
| defect_datasets.zip | Training/testing datasets built from BugSwarm, Defects4J, RegMiner, LEAM, μBERT |
| apr.zip | FitRepair patches generated for 1 000 sampled bugs |
| human_study.zip | 97 human labelers’ votes on 300 survived mutants |

Unzip and feed any CSV or JSON file into your own pipeline—the schema is self-documented.


Bottom Line

BUGFARM is a red-team toolkit for AI-driven software engineering:
no training, no buzzwords, just reliable, reproducible bugs that expose blind spots in modern defect detectors and repair tools.
If you’re building or benchmarking such tools, keep BUGFARM in your pocket.
