GEPA: Teaching Large Language Models to Learn Smarter, Not Harder
Quick takeaway
If you give a language model a few tries and let it write a short “what went wrong” note after each try, you can often beat heavyweight reinforcement-learning systems—while using up to 35 times fewer training runs.
Table of Contents
- Why Traditional RL Is Becoming Too Expensive
- The Core Insight: Words Are Data Too
- How GEPA Works in Three Simple Steps
- Real Results: Four Tasks, Two Models, Three Baselines
- Frequently Asked Questions
- Try It Yourself: A 15-Minute Walkthrough
- Key Takeaways and Next Steps
Why Traditional RL Is Becoming Too Expensive
| Method | Typical Rollouts Needed | Main Bottleneck |
|---|---|---|
| GRPO with LoRA | ~24,000 | GPU hours, API fees |
| Few-shot prompt tuning | 2,000–7,000 | Token cost, prompt length |
In short, every time the model makes an attempt (a rollout), you pay for compute.
If your system also calls external tools—search engines, code runners, or private APIs—the bill grows even faster.
The Core Insight: Words Are Data Too
Traditional reinforcement learning (RL) squashes each attempt into a single number: success = 1, failure = 0.
GEPA keeps the entire conversation—the prompt, the reasoning steps, the tool outputs, even the compiler error messages—and asks a large language model to write a short critique in plain English.
| RL Signal | GEPA Signal |
|---|---|
| reward = 0.7 | “The second-hop query forgot the year. Add a date filter next time.” |
Because language models already understand language, these critiques are dense, human-readable training signals.
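To make the contrast concrete, here is a minimal sketch of the extra information GEPA keeps alongside the scalar reward. The `RolloutRecord` class and its field names are illustrative, not something defined in the paper:

```python
from dataclasses import dataclass

@dataclass
class RolloutRecord:
    prompt: str    # the instruction the module ran with
    trace: str     # reasoning steps, tool outputs, error messages
    reward: float  # the single number an RL method would keep, e.g. 0.7
    critique: str  # the plain-English signal GEPA keeps as well

record = RolloutRecord(
    prompt="Answer the question using the retrieved documents.",
    trace="hop-1 found docs about the song; hop-2 query omitted the year",
    reward=0.7,
    critique="The second-hop query forgot the year. Add a date filter next time.",
)
```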
How GEPA Works in Three Simple Steps
Step 1: Treat Prompts Like Genes
- Start with one very simple prompt per module.
- Mutate it in two ways:
  - Reflective Prompt Mutation: let the model read the last run and rewrite its own prompt.
  - System-aware Merge: swap the best parts of two different “family trees” of prompts (see the sketch after this list).
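Here is a minimal sketch of the "prompts as genes" idea, assuming a candidate is simply a dictionary mapping module names to prompt strings. The reflective operator is covered under Step 2; the merge operator might look like this (module names and helper names are hypothetical):

```python
# One candidate = one "genome": a prompt string per module of the pipeline.
seed_candidate = {
    "retrieve": "Write a search query that finds documents answering the question.",
    "answer": "Answer the question using only the retrieved documents.",
}

def system_aware_merge(parent_a, parent_b, modules_from_a):
    """Combine two lineages: take each module's prompt from whichever
    parent is known to perform better for that module."""
    return {
        module: (parent_a if module in modules_from_a else parent_b)[module]
        for module in parent_a
    }

# Example: keep parent_a's retrieval prompt and parent_b's answering prompt.
# merged = system_aware_merge(parent_a, parent_b, modules_from_a={"retrieve"})
```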
Step 2: Let the Model Reflect
After each small batch of tasks, the system:
- Reads the full trace (prompt → reasoning → tool calls → final score).
- Writes a short paragraph explaining what went right or wrong.
- Rewrites the prompt for the next round.
Example reflection:
“The retrieval step brought back documents about the song, but the question asked for the album. Add an instruction to target album-level documents.”
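In code, one reflection round could look like the sketch below, assuming a generic `llm(text) -> text` helper; the meta-prompt wording is an illustration, not the paper's exact prompt:

```python
def reflect_and_rewrite(old_prompt, trace, score, llm):
    """One reflection round: read the full trace, write a short critique,
    then propose a revised prompt. `llm` is any text-in/text-out callable."""
    request = (
        "You are improving the instruction below based on one attempt.\n\n"
        f"Current instruction:\n{old_prompt}\n\n"
        f"Full trace (reasoning, tool calls, final score {score}):\n{trace}\n\n"
        "Write one short paragraph on what went right or wrong, then on a new "
        "line starting with 'NEW PROMPT:' write the revised instruction."
    )
    response = llm(request)
    critique, _, new_prompt = response.partition("NEW PROMPT:")
    return critique.strip(), (new_prompt.strip() or old_prompt)
```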
Step 3: Keep a Pareto Front Instead of One “Best” Prompt
- Record the highest score ever seen on each training example.
- Only breed from prompts that are the best on at least one example.

This prevents the system from getting stuck on one “pretty good” strategy.
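A simplified sketch of this selection rule, assuming each pool entry carries a per-example score list under a hypothetical `example_scores` key (the paper's full procedure adds further details on how front members are sampled):

```python
import random

def select_pareto(pool):
    """Keep only candidates that hold the best score seen so far on at
    least one training example, then breed from a random one of them."""
    num_examples = len(pool[0]['example_scores'])
    best = [max(cand['example_scores'][j] for cand in pool)
            for j in range(num_examples)]
    front = [cand for cand in pool
             if any(cand['example_scores'][j] == best[j]
                    for j in range(num_examples))]
    return random.choice(front)
```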
Real Results: Four Tasks, Two Models, Three Baselines
Tasks at a Glance
| Task | What It Measures | Train / Val / Test |
|---|---|---|
| HotpotQA | Multi-hop reading comprehension | 150 / 300 / 300 |
| IFBench | Exact instruction following | 150 / 300 / 294 |
| HoVer | Fact verification across documents | 150 / 300 / 300 |
| PUPA | Privacy-aware query rewriting | 111 / 111 / 221 |
Overall Performance (Average Score)
| Model | Baseline | MIPROv2 | GRPO (24k) | GEPA | GEPA + Merge |
|---|---|---|---|---|---|
| Qwen3-8B | 48.9 | 55.1 | 51.1 | 61.3 | 57.6 |
| GPT-4.1 mini | 52.7 | 59.7 | — | 67.0 | 68.7 |
- 10–20% higher scores than GRPO
- 35× fewer rollouts
- Prompts 3–9× shorter than MIPROv2
Cost & Prompt-Length Comparison
| Optimizer | Avg Prompt Tokens | Relative Length |
|---|---|---|
| MIPROv2 | 7,500 | 1× |
| GEPA | 2,600 | 0.35× |
| GEPA (best case) | 2,000 | 0.11× |
Frequently Asked Questions
Does GEPA replace fine-tuning?
No. GEPA is ideal when rollouts are expensive. With unlimited data, full fine-tuning can still win.
Think of the two methods as complementary tools on your shelf.
Isn’t asking a model to critique its own output costly?
The reflection step uses the same model size as the rollout itself, but only a few dozen tokens are needed for the critique.
Because the signal is so dense, GEPA often converges in <100 training examples.
How do I drop GEPA into my existing pipeline?
- Start with the simplest prompt you have.
- Add a `feedback_fn` that returns one short paragraph of critique.
- Run the genetic loop shown in the walkthrough section.
That’s it—no extra infrastructure required.
Will my prompts become unreadably long?
Not in practice. GEPA’s critiques often shorten prompts by removing redundant instructions (see Figure 15 in the paper).
Can GEPA optimize few-shot examples too?
The current study focuses on instructions only.
The authors list few-shot tuning as future work, but instruction-tuning alone already outperformed prior baselines.
My system has many modules. How does GEPA handle that?
Each module’s prompt is treated as a separate gene.
The Merge step can swap individual module prompts between two high-scoring candidates.
Does GEPA overfit to the training set?
In the reported experiments, Pareto-based selection combined with a small validation set keeps the validation-test gap equal to or smaller than MIPROv2's (Figure 14).
Can I use GEPA at inference time for brand-new tasks?
Yes. Simply treat the new tasks as the “training set” and let GEPA overfit to them.
Section 6 shows promising results on NPUEval and KernelBench.
What about privacy when using external APIs?
The PUPA benchmark explicitly tests the privacy-performance trade-off; GEPA's rewritten prompts maintained zero PII leakage while scoring above 94%.
Where is the code?
The paper is a pre-print; no official repo yet.
All algorithms are given in pseudo-code and can be re-implemented quickly.
Try It Yourself: A 15-Minute Walkthrough
Prerequisites
- Python 3.9+
- OpenAI API key (or any compatible LLM)
- Evaluation script for your task (HotpotQA, IFBench, etc.)
Three Quick Steps
| Step | Action | Key Parameter |
|---|---|---|
| ① Seed prompt | One plain-sentence instruction per module | Keep it short |
| ② Feedback function | Return a short paragraph of critique | Example below |
| ③ Launch loop | 3–10 examples per minibatch | Budget ≈ 1/10 of MIPROv2 |
Minimal Feedback Example (HotpotQA second-hop)
```python
def feedback_fn(trace, gold_docs):
    """Critique the second retrieval hop. `gold_docs` is a set of required document titles."""
    retrieved = set(trace['docs'])
    if not gold_docs.issubset(retrieved):
        missing = gold_docs - retrieved
        return f"Query missed {missing}; add date or entity filters."
    return "Retrieval complete, no change needed."
```
Skeleton Loop
```python
def run_gepa(seed_prompts, eval_fn, budget=500):
    # Each candidate records its module prompts, its parent, and its score.
    pool = [{'prompts': seed_prompts, 'parent': None, 'score': eval_fn(seed_prompts)}]
    for _ in range(budget):
        parent = select_pareto(pool)                # pick a Pareto-front candidate
        child = reflective_mutate(parent, eval_fn)  # reflect on a minibatch, rewrite prompts
        if child['score'] > parent['score']:        # keep only improvements
            pool.append(child)
    return best_candidate(pool)                     # highest-scoring candidate in the pool
```
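A hypothetical call, assuming an evaluation helper `score_pipeline(prompts)` that runs your pipeline over a small validation split and returns its average score (neither the helper nor the module names below come from the paper):

```python
seed_prompts = {
    "retrieve": "Write a search query that finds documents answering the question.",
    "answer": "Answer the question using only the retrieved documents.",
}

best = run_gepa(seed_prompts, eval_fn=score_pipeline, budget=300)
print(best['prompts'])
```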
Key Takeaways and Next Steps
- Signal density matters: a short, targeted note beats thousands of scalar rewards.
- Cost matters even more: cutting rollouts by 35× makes otherwise impossible experiments feasible.
- Next frontiers:
  - Combine prompt evolution with lightweight weight updates.
  - Use GEPA as a real-time search strategy during inference.
  - Optimize few-shot examples alongside instructions.
If you maintain a multi-step LLM system and every API call hurts your wallet, give GEPA a spin.
It lets your model learn from its own words and saves you both time and money in the process.