GEPA: Teaching Large Language Models to Learn Smarter, Not Harder


Quick takeaway
If you give a language model a few tries and let it write a short “what went wrong” note after each try, you can often beat heavyweight reinforcement-learning systems—while using up to 35 times fewer training runs.


Table of Contents

  1. Why Traditional RL Is Becoming Too Expensive
  2. The Core Insight: Words Are Data Too
  3. How GEPA Works in Three Simple Steps
  4. Real Results: Four Tasks, Two Models, Three Baselines
  5. Frequently Asked Questions
  6. Try It Yourself: A 15-Minute Walkthrough
  7. Key Takeaways and Next Steps

Why Traditional RL Is Becoming Too Expensive

Method                   Typical Rollouts Needed   Main Bottleneck
GRPO with LoRA           ~24 000                   GPU hours, API fees
Few-shot prompt tuning   2 000–7 000               Token cost, prompt length

In short, every time the model makes an attempt (a rollout), you pay for compute.
If your system also calls external tools—search engines, code runners, or private APIs—the bill grows even faster.


The Core Insight: Words Are Data Too

Traditional reinforcement learning (RL) compresses each attempt into a single scalar reward: 1 for success, 0 for failure, or a partial score somewhere in between.
GEPA keeps the entire conversation—the prompt, the reasoning steps, the tool outputs, even the compiler error messages—and asks a large language model to write a short critique in plain English.

RL Signal      GEPA Signal
reward = 0.7   “The second-hop query forgot the year. Add a date filter next time.”

Because language models already understand language, these critiques are dense, human-readable training signals.
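
As a concrete sketch of the difference, the snippet below bundles both signals into one record; the RolloutFeedback type and its field names are illustrative, not something the paper defines.

from dataclasses import dataclass

@dataclass
class RolloutFeedback:
    """One rollout's training signal (hypothetical record, for illustration only)."""
    prompt: str     # the module prompt that produced this attempt
    trace: str      # reasoning steps, tool outputs, and error messages, kept verbatim
    score: float    # the single number an RL method would keep, e.g. 0.7
    critique: str   # the plain-English note GEPA keeps as well

example = RolloutFeedback(
    prompt="Write a search query for the second hop.",
    trace="Query: 'song title' -> retrieved documents about the single, not the album.",
    score=0.7,
    critique="The second-hop query forgot the year. Add a date filter next time.",
)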


How GEPA Works in Three Simple Steps

Step 1: Treat Prompts Like Genes

  • Start with one very simple prompt per module.
  • Mutate it in two ways:

    1. Reflective Prompt Mutation—let the model read the last run and rewrite its own prompt.
    2. System-aware Merge—swap the best parts of two different “family trees” of prompts (a code sketch follows this list).
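
Below is a minimal sketch of the merge operator, assuming a candidate is simply a dict mapping module names to prompt strings and that you track per-module win counts yourself; the helper name and arguments are illustrative, not the paper's API. Reflective mutation is sketched under Step 2.

def system_aware_merge(parent_a, parent_b, wins_a, wins_b):
    """Combine two prompt lineages module by module.

    parent_a / parent_b: dicts mapping module name -> prompt string.
    wins_a / wins_b: dicts mapping module name -> how often that parent's
    prompt for the module appeared in a winning rollout (your own bookkeeping).
    """
    merged = {}
    for module in parent_a:
        # Keep each module's prompt from whichever family tree has done better on it.
        if wins_a.get(module, 0) >= wins_b.get(module, 0):
            merged[module] = parent_a[module]
        else:
            merged[module] = parent_b[module]
    return merged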

Step 2: Let the Model Reflect

After each small batch of tasks, the system:

  1. Reads the full trace (prompt → reasoning → tool calls → final score).
  2. Writes a short paragraph explaining what went right or wrong.
  3. Rewrites the prompt for the next round.

Example reflection:

“The retrieval step brought back documents about the song, but the question asked for the album. Add an instruction to target album-level documents.”
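
A minimal sketch of this reflect-and-rewrite step is shown below; llm(text) -> str stands in for whatever model client you use, and the template wording is only an assumption about what such an instruction could look like.

REFLECTION_TEMPLATE = """You are improving the prompt of one module in a larger pipeline.

Current prompt:
{prompt}

Full trace of the last minibatch (inputs, reasoning, tool outputs, final scores):
{trace}

First write a short paragraph explaining what went right or wrong.
Then write the revised prompt after the line 'NEW PROMPT:' and nothing else."""

def reflect_and_rewrite(llm, prompt, trace):
    # `llm(text) -> str` is a placeholder for your model call (hosted API or local).
    reply = llm(REFLECTION_TEMPLATE.format(prompt=prompt, trace=trace))
    # Keep only the rewritten prompt; the critique above it is the dense learning signal.
    return reply.split("NEW PROMPT:", 1)[-1].strip()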

Step 3: Keep a Pareto Front Instead of One “Best” Prompt

  • Record the highest score ever seen on each training example.
  • Only breed from prompts that are the best on at least one example (sketched in code after this list).
  • This prevents the system from getting stuck on one “pretty good” strategy.
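
Here is a minimal sketch of that selection rule. It assumes each candidate dict records its best score per training example under a 'per_example' key; both the key and the sampling details are illustrative, not the paper's exact bookkeeping.

import random
from collections import defaultdict

def select_pareto(pool):
    """Sample a parent among candidates that hold the record on >= 1 training example."""
    # Count, for each candidate, how many examples it currently holds the record on.
    wins = defaultdict(int)
    examples = {ex for cand in pool for ex in cand["per_example"]}
    for ex in examples:
        best = max(pool, key=lambda c: c["per_example"].get(ex, float("-inf")))
        wins[id(best)] += 1

    # Only candidates that are best somewhere may breed; dominated ones are dropped.
    front = [c for c in pool if wins[id(c)] > 0]
    # Sample in proportion to how many examples each front member wins,
    # so no single "pretty good" strategy can dominate the search.
    return random.choices(front, weights=[wins[id(c)] for c in front], k=1)[0]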

Real Results: Four Tasks, Two Models, Three Baselines

Tasks at a Glance

Task       What It Measures                     Train / Val / Test
HotpotQA   Multi-hop reading comprehension      150 / 300 / 300
IFBench    Exact instruction following          150 / 300 / 294
HoVer      Fact verification across documents   150 / 300 / 300
PUPA       Privacy-aware query rewriting        111 / 111 / 221

Overall Performance (Average Score)

Model          Baseline   MIPROv2   GRPO (24 k)   GEPA   GEPA + Merge
Qwen3-8B       48.9       55.1      51.1          61.3   57.6
GPT-4.1 mini   52.7       59.7      n/a           67.0   68.7

Relative to these baselines, GEPA delivered:

  • 10–20 % higher scores than GRPO
  • 35× fewer rollouts
  • Prompts 3–9× shorter than MIPROv2

Cost & Prompt-Length Comparison

Optimizer          Avg Prompt Tokens   Relative Length
MIPROv2            7 500               1× (reference)
GEPA               2 600               0.35×
GEPA (best case)   2 000               0.11×

Frequently Asked Questions

Does GEPA replace fine-tuning?

No. GEPA is ideal when rollouts are expensive. With unlimited data, full fine-tuning can still win.
Think of the two methods as complementary tools on your shelf.

Isn’t asking a model to critique its own output costly?

The reflection step calls a model of the same size as the one doing the rollouts, but the critique it produces is only a few dozen tokens long.
Because the signal is so dense, GEPA often converges in <100 training examples.

How do I drop GEPA into my existing pipeline?
  1. Start with the simplest prompt you have.
  2. Add a feedback_fn that returns one short paragraph of critique.
  3. Run the genetic loop shown in the walkthrough section.

That’s it—no extra infrastructure required.

Will my prompts become unreadably long?

Not in practice. GEPA’s critiques often shorten prompts by removing redundant instructions (see Figure 15 in the paper).

Can GEPA optimize few-shot examples too?

The current study focuses on instructions only.
The authors list few-shot example optimization as future work, but optimizing instructions alone already outperformed the prior baselines.

My system has many modules. How does GEPA handle that?

Each module’s prompt is treated as a separate gene.
The Merge step can swap individual module prompts between two high-scoring candidates.

Does GEPA overfit to the training set?

Pareto-based selection combined with a small validation set keeps the validation-test gap no larger than MIPROv2's (Figure 14).

Can I use GEPA at inference time for brand-new tasks?

Yes. Simply treat the new tasks as the “training set” and let GEPA overfit to them.
Section 6 shows promising results on NPUEval and KernelBench.

What about privacy when using external APIs?

The PUPA benchmark explicitly tests privacy vs. performance; GEPA’s rewritten prompts kept zero PII leakage while scoring above 94 %.

Where is the code?

The paper is a pre-print; no official repo yet.
All algorithms are given in pseudo-code and can be re-implemented quickly.


Try It Yourself: A 15-Minute Walkthrough

Prerequisites

  • Python 3.9+
  • OpenAI API key (or any compatible LLM)
  • Evaluation script for your task (HotpotQA, IFBench, etc.)

Three Quick Steps

Step                  Action                                      Key Parameter
① Seed prompt         One plain-sentence instruction per module   Keep it short
② Feedback function   Return a short paragraph of critique        Example below
③ Launch loop         3–10 examples per minibatch                 Budget ≈ 1/10 of MIPROv2

Minimal Feedback Example (HotpotQA second-hop)

def feedback_fn(trace, gold_docs):
    """Return one short paragraph of critique for the second-hop retrieval step."""
    retrieved = set(trace['docs'])          # documents the pipeline actually fetched
    missing = set(gold_docs) - retrieved    # gold documents the query failed to reach
    if missing:
        return f"Query missed {sorted(missing)}; add date or entity filters."
    return "Retrieval complete, no change needed."

Skeleton Loop

def run_gepa(seed_prompts, eval_fn, budget=500):
    # Each pool entry holds a full set of module prompts, its lineage, and a score.
    pool = [{'prompts': seed_prompts, 'parent': None, 'score': eval_fn(seed_prompts)}]
    for _ in range(budget):
        # Sample a parent from the Pareto front of per-example bests.
        parent = select_pareto(pool)
        # Reflective mutation: read recent traces, rewrite one module's prompt,
        # and score the child with eval_fn.
        child = reflective_mutate(parent, eval_fn)
        # Keep the child only if it beats its parent.
        if child['score'] > parent['score']:
            pool.append(child)
    return best_candidate(pool)
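
A hypothetical call, assuming my_hotpotqa_eval is your own wrapper that runs the pipeline with a given prompt set and returns an average score:

# `my_hotpotqa_eval` is a hypothetical evaluation wrapper: prompts dict in, average score out.
seed = {
    'hop1_query': "Write a search query for the question.",
    'hop2_query': "Write a follow-up query using the first hop's documents.",
}
best = run_gepa(seed, eval_fn=my_hotpotqa_eval, budget=300)
print(best['prompts'])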

Key Takeaways and Next Steps

  • Signal density matters: A short, targeted note beats thousands of scalar rewards.
  • Cost matters even more: Cutting rollouts by 35× makes otherwise impossible experiments feasible.
  • Next frontiers:

    1. Combine prompt evolution with lightweight weight updates.
    2. Use GEPA as a real-time search strategy during inference.
    3. Optimize few-shot examples alongside instructions.

If you maintain a multi-step LLM system and every API call hurts your wallet, give GEPA a spin.
It lets your model learn from its own words, saving you both time and money in the process.