
AggLM: Revolutionizing Solution Aggregation in Large Language Models with Reinforcement Learning

Exploring Solution Aggregation in Large Language Models: When Majority Voting Falls Short

Hey there, if you’re diving into the world of large language models (LLMs) and wondering how we can make them smarter at solving tough problems, you’ve come to the right place. I’ve been thinking about this a lot lately—especially how generating multiple solutions and then picking the best one can boost performance on reasoning tasks. But what if the most popular answer among those solutions isn’t the right one? That’s where things get interesting. In this post, we’ll unpack a method called AggLM, which uses reinforcement learning to train models to aggregate solutions more effectively. I’ll break it down step by step, using simple language, tables, and examples from math competitions to make it clear. Let’s get started.

Why Do We Need Better Ways to Aggregate Solutions in LLMs?

Imagine you’re tackling a tricky math problem. You ask an LLM to generate several possible solutions, hoping one of them nails it. Traditionally, you’d just go with the majority vote—the answer that shows up most often. Sounds fair, right? But here’s the catch: sometimes the correct answer is in the minority. Maybe the model got it right once but messed up the others due to some quirk in its training. Or perhaps parts of different solutions are correct, and combining them could give you the full picture.

This is a common issue in scaling up test-time compute for LLMs. By producing multiple independent solutions and aggregating them, we can improve accuracy on challenging tasks like math or code generation. But simple methods like majority voting or using reward models to rank them often miss out on those hidden gems. AggLM changes that by training a model to reason through the candidates, reconcile differences, and synthesize a better final answer. It’s like turning aggregation into a learned skill.

At a high level, AggLM takes a problem together with several candidate solutions, reasons over them, and produces a single final answer. The next section walks through the process step by step.

What Exactly Is AggLM and How Does It Work?

Let’s chat about the core idea. AggLM stands for Aggregation Language Model. The process is straightforward:

  1. Sample multiple solutions from an LLM for a given problem.
  2. Feed those solutions back to another LLM (or the same one) with an instruction to aggregate them.
  3. The aggregator reviews the solutions, corrects errors, fills gaps, or combines ideas to produce a final answer.

The key twist? We train this aggregator using reinforcement learning from verifiable rewards (RLVR). That means we can evaluate the output against known correct answers and reward the model accordingly.
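To make that loop concrete, here is a minimal sketch in Python. The `solver`, `aggregator`, and `build_prompt` callables are hypothetical stand-ins for whatever inference client and prompt builder you use; none of these names come from the paper.

```python
from typing import Callable, List


def sample_then_aggregate(
    solver: Callable[[str], str],                    # problem prompt -> one candidate solution
    aggregator: Callable[[str], str],                # aggregation prompt -> final solution
    build_prompt: Callable[[str, List[str]], str],   # fills the aggregation template
    problem: str,
    m: int = 8,                                      # candidates per problem, as in the paper's setup
) -> str:
    # Step 1: sample m independent candidate solutions.
    candidates = [solver(problem) for _ in range(m)]
    # Step 2: pack the problem and all candidates into one aggregation prompt.
    agg_prompt = build_prompt(problem, candidates)
    # Step 3: let the aggregator review, correct, and combine them into a final answer.
    return aggregator(agg_prompt)
```

A concrete prompt builder appears later in this post, right after the aggregation template.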

Problem Formulation

Suppose we have a problem x and its ground-truth solution y★. A solution model generates candidate solutions y1, …, ym, and the aggregator then produces a final answer ŷ.

  • Solutions are sampled independently: yi ~ pθ(y | x)
  • Aggregated solution: ŷ ~ pϕ(y | x, y1:m)

The models can share parameters or be separate. Training data comes from problems with verifiable solutions, like math datasets.
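If it helps to see the training signal written out, here is one way to state the objective implied by this setup; this is a sketch, not the paper's exact formulation. The answer-extraction function ans(·) is my notation, and the indicator-based reward matches the binary reward described in the next subsection.

```latex
% One way to write the aggregator's RLVR objective: maximize the expected
% binary reward over problems, sampled candidate sets, and aggregated outputs.
\[
  \max_{\phi}\;
  \mathbb{E}_{(x,\,y^\star)}\;
  \mathbb{E}_{y_{1:m}\sim p_\theta(\cdot\mid x)}\;
  \mathbb{E}_{\hat{y}\sim p_\phi(\cdot\mid x,\,y_{1:m})}
  \bigl[\, r(\hat{y},\,y^\star) \,\bigr],
  \qquad
  r(\hat{y},\,y^\star) \;=\; \mathbf{1}\!\left[\operatorname{ans}(\hat{y}) = \operatorname{ans}(y^\star)\right].
\]
```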

Training Data and Balancing Easy vs. Hard Examples

To train effectively, we create sets of solutions and label them as “easy” (majority is correct) or “hard” (majority is wrong). We balance the mix—say, 50% easy—to help the model learn both to pick obvious winners and recover rare correct answers.

We use Group-Relative Policy Optimization (GRPO) with binary rewards: 1 if correct, 0 otherwise.
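Here is a minimal sketch of that labeling and reward, assuming a hypothetical `extract_answer` helper that pulls the final answer string out of a solution; in practice you would replace the plain string comparison with a proper math-equivalence check (more on that later in this post).

```python
from collections import Counter
from typing import List


def extract_answer(solution: str) -> str:
    # Placeholder heuristic: treat the last non-empty line as the final answer.
    # A real pipeline would use a proper answer parser and equivalence checker.
    lines = [line for line in solution.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""


def label_set(candidates: List[str], ground_truth_answer: str) -> str:
    """'easy' if the majority answer matches the ground truth, else 'hard'."""
    answers = [extract_answer(c) for c in candidates]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return "easy" if majority_answer == ground_truth_answer else "hard"


def binary_reward(aggregated_solution: str, ground_truth_answer: str) -> float:
    """Verifiable reward for GRPO: 1.0 if the final answer is correct, else 0.0."""
    return 1.0 if extract_answer(aggregated_solution) == ground_truth_answer else 0.0
```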

Here’s the prompt template used for aggregation:

Given the following problem:
{problem}
and these solution attempts:
{solutions}
It is possible that any, all, or none of these solutions are correct or complete. Carefully review the provided solutions, using them as starting points—correcting mistakes, filling in gaps, and/or combining useful ideas—to produce a final, comprehensive, and correct solution to the problem.
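Filling in that template programmatically is simple; a small helper follows. The `Solution {i}:` labels used to join the attempts are my own formatting choice, not something the post specifies.

```python
from typing import List

# The aggregation prompt template quoted above, as a format string.
AGGREGATION_TEMPLATE = (
    "Given the following problem:\n{problem}\n"
    "and these solution attempts:\n{solutions}\n"
    "It is possible that any, all, or none of these solutions are correct or "
    "complete. Carefully review the provided solutions, using them as starting "
    "points—correcting mistakes, filling in gaps, and/or combining useful "
    "ideas—to produce a final, comprehensive, and correct solution to the problem."
)


def build_aggregation_prompt(problem: str, candidates: List[str]) -> str:
    # Label each attempt so the aggregator can refer to them individually.
    joined = "\n\n".join(f"Solution {i + 1}:\n{c}" for i, c in enumerate(candidates))
    return AGGREGATION_TEMPLATE.format(problem=problem, solutions=joined)
```

This function can serve as the `build_prompt` callable in the pipeline sketch shown earlier.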

How Does AggLM Compare to Other Methods?

You might be asking, “Aren’t there already ways to do this?” Absolutely. Let’s compare.

Rule-Based Voting

  • Majority Voting: Count the most frequent answer. Simple, but amplifies errors if the majority is wrong.
  • Variants: Dynamic sampling or heuristic filters.

These work reasonably well, but they fail whenever the correct solution is in the minority.
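For reference, majority voting over the extracted final answers is just a frequency count; a minimal sketch:

```python
from collections import Counter
from typing import List


def majority_vote(final_answers: List[str]) -> str:
    """Return the most frequent final answer (ties go to the first one seen)."""
    return Counter(final_answers).most_common(1)[0][0]
```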

Model-Based Selection and Aggregation

  • Reward Models: Score candidates and pick the top one (best-of-N) or weight votes.
  • Prompted Aggregation: Like Universal Self-Consistency (USC), where the LLM examines samples and chooses the most coherent.

AggLM goes further by training the aggregator with RL to synthesize, not just select. It’s similar to some concurrent work but emphasizes reasoning-oriented models and balanced training.
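To make the contrast with learned synthesis clear, here is roughly what the selection-style baselines compute. The `score` callable is a hypothetical stand-in for a reward model such as AceMath, not its real inference API.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def best_of_n(problem: str, candidates: List[str],
              score: Callable[[str, str], float]) -> str:
    """Best-of-N: return the single candidate the reward model scores highest."""
    return max(candidates, key=lambda c: score(problem, c))


def weighted_majority(problem: str, candidates: List[str],
                      final_answers: List[str],
                      score: Callable[[str, str], float]) -> str:
    """Weighted majority: sum reward-model scores per distinct final answer."""
    weights: Dict[str, float] = defaultdict(float)
    for candidate, answer in zip(candidates, final_answers):
        weights[answer] += score(problem, candidate)
    return max(weights, key=weights.get)
```

Both routines can only pick among existing candidates; they cannot correct or combine them, which is the gap AggLM's trained aggregation is meant to close.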

Experimental Setup: How Was AggLM Tested?

To see if this works, AggLM-1.7B was trained on about 40,000 math problems from DeepScaler. Solutions were sampled from Qwen3-1.7B in thinking mode (with chain-of-thought reasoning).

  • Sampled 128 solutions per problem, grouped into 16 sets of 8.
  • Trained for one epoch with GRPO, KL regularization, etc.
  • Evaluated on MathArena datasets: AIME24, AIME25, HMMT24, HMMT25 (30 problems each, high school olympiad level).

Protocol: for robustness, pass@1 is averaged across multiple solution sets per problem and multiple aggregator generations per set.
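A sketch of that averaging, assuming results are organized as a grid with one row per solution set and one column per aggregator generation (the grid layout is my assumption, not the paper's exact bookkeeping):

```python
from typing import List


def averaged_accuracy(correct: List[List[bool]]) -> float:
    """correct[s][g] = whether aggregator generation g on solution set s was correct."""
    total = sum(len(row) for row in correct)
    hits = sum(sum(row) for row in correct)
    return hits / total if total else 0.0
```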

Baselines include majority voting, best-of-N with AceMath reward models (7B and 72B), weighted majority, and prompted aggregation without RL.

Solution models tested: Qwen3-1.7B (thinking and non-thinking), Qwen3-8B (thinking).

Key Results: Does AggLM Outperform Baselines?

Yes, consistently. Let’s look at the numbers.

Aggregating from Qwen3-1.7B Thinking Mode

| Aggregation Method | Model | AIME24 | AIME25 | HMMT24 | HMMT25 |
|---|---|---|---|---|---|
| **Baselines** | | | | | |
| pass@1 | – | 50.91 | 35.68 | 22.45 | 22.84 |
| pass@8 | – | 76.48 | 61.38 | 36.67 | 44.27 |
| **Aggregation methods** | | | | | |
| Majority Voting | N/A | 67.92 | 45.89 | 29.01 | 26.72 |
| Best-of-N | AceMath-7B | 59.39 | 40.30 | 28.09 | 22.50 |
| Best-of-N | AceMath-72B | 56.64 | 40.35 | 29.58 | 21.99 |
| Weighted Majority | AceMath-7B | 64.09 | 39.49 | 25.04 | 17.71 |
| Weighted Majority | AceMath-72B | 62.34 | 38.49 | 27.62 | 17.96 |
| Prompted Aggregation | Qwen3-1.7B | 63.57 | 44.85 | 29.52 | 27.91 |
| RL-trained Aggregation (ours) | AggLM-1.7B | 70.69 | 50.00 | 33.34 | 32.07 |

AggLM-1.7B beats every baseline, lifting AIME25 from 35.68% (single-solution pass@1) to 50.00%.

Aggregating from Stronger Qwen3-8B Thinking Mode

| Aggregation Method | Model | AIME24 | AIME25 | HMMT24 | HMMT25 |
|---|---|---|---|---|---|
| **Baselines** | | | | | |
| pass@1 | – | 74.17 | 69.27 | 41.61 | 45.99 |
| pass@8 | – | 85.57 | 83.54 | 61.67 | 65.47 |
| **Aggregation methods** | | | | | |
| Majority Voting | N/A | 81.61 | 78.70 | 44.58 | 56.35 |
| Best-of-N | AceMath-7B | 78.60 | 70.89 | 37.39 | 44.17 |
| Best-of-N | AceMath-72B | 80.27 | 69.57 | 38.54 | 46.21 |
| Weighted Majority | AceMath-7B | 77.03 | 68.15 | 38.41 | 36.13 |
| Weighted Majority | AceMath-72B | 79.06 | 66.00 | 37.63 | 41.46 |
| Prompted Aggregation | Qwen3-1.7B | 79.90 | 76.73 | 48.58 | 57.63 |
| RL-trained Aggregation (ours) | AggLM-1.7B | 82.38 | 79.70 | 53.01 | 60.66 |

It generalizes to stronger models, outperforming even 72B reward models.

Aggregating from Qwen3-1.7B Non-Thinking Mode

| Aggregation Method | Model | AIME24 | AIME25 | HMMT24 | HMMT25 |
|---|---|---|---|---|---|
| **Baselines** | | | | | |
| pass@1 | – | 11.82 | 10.00 | 6.25 | 3.39 |
| pass@8 | – | 32.76 | 24.53 | 16.09 | 14.06 |
| **Aggregation methods** | | | | | |
| Majority Voting | N/A | 18.07 | 15.42 | 8.75 | 7.29 |
| Best-of-N | AceMath-7B | 23.31 | 18.40 | 7.44 | 8.92 |
| Best-of-N | AceMath-72B | 26.33 | 18.62 | 10.23 | 8.97 |
| Weighted Majority | AceMath-7B | 23.95 | 18.39 | 8.37 | 8.41 |
| Weighted Majority | AceMath-72B | 26.54 | 18.83 | 9.72 | 8.09 |
| Prompted Aggregation | Qwen3-1.7B | 28.51 | 17.79 | 16.30 | 12.08 |
| RL-trained Aggregation (ours) | AggLM-1.7B | 29.96 | 19.77 | 17.03 | 12.76 |

Even on weaker, non-reasoning outputs, AggLM shines by synthesizing corrections.

AggLM is also token-efficient: aggregating 8 solutions with it beats majority voting over 16 solutions, so it reaches higher accuracy from roughly half as many sampled solutions.

Ablations and Deeper Analysis: What Makes AggLM Tick?

You might wonder, “Is this just luck, or is there something specific driving the gains?” Let’s dig into the ablations.

Scaling with Number of Solutions

Performance improves with more candidates, and AggLM scales better than majority voting.

Effect of Majority Answer Size

Gains are largest when the majority answer is held by only a small fraction of the candidates, i.e., when the solutions disagree. That is exactly where reasoning over the whole set can recover a correct minority answer.
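One simple way to run this kind of analysis yourself is to bucket evaluation sets by the fraction of candidates that share the majority answer; a small sketch, assuming you already have the extracted final answers:

```python
from collections import Counter
from typing import List


def majority_fraction(final_answers: List[str]) -> float:
    """Fraction of candidates sharing the most common answer (e.g., 3/8 = 0.375)."""
    top_count = Counter(final_answers).most_common(1)[0][1]
    return top_count / len(final_answers)
```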

Ablating Training Mixtures

| Easy % | AIME24 | AIME25 | HMMT24 | HMMT25 |
|---|---|---|---|---|
| 0 | 64.22 | 46.06 | 27.80 | 28.73 |
| 5 | 68.93 | 48.65 | 33.31 | 31.91 |
| 10 | 69.85 | 49.60 | 33.71 | 32.31 |
| 20 | 69.72 | 49.11 | 33.74 | 31.20 |
| 50 | 70.69 | 50.00 | 33.34 | 32.07 |
| 270 | 66.20 | 46.70 | 30.01 | 28.94 |
| Untrained | 63.57 | 44.85 | 29.52 | 27.91 |

A 5-50% share of easy sets works best: with too few easy examples the reward signal becomes sparse, and with too many the hard recovery cases get diluted.

Number of Solution Sets per Problem

| #Sets | AIME24 | AIME25 | HMMT24 | HMMT25 |
|---|---|---|---|---|
| 2 | 70.27 | 49.74 | 33.42 | 31.67 |
| 4 | 70.29 | 49.08 | 33.11 | 31.34 |
| 8 | 70.37 | 50.25 | 33.16 | 31.89 |
| 16 | 70.69 | 50.00 | 33.34 | 32.07 |

More sets per problem add a small diversity benefit, but the returns diminish quickly.

Aggregation vs. Extra Data

Fine-tuning the base model on the same data doesn’t match AggLM’s gains—it’s the aggregation training that counts.

How to Implement Solution Aggregation Like AggLM

If you’re wondering how to try this yourself, here’s a step-by-step guide based on the method:

  1. Choose Your Models: Start with a base LLM like Qwen3-1.7B for generating solutions.
  2. Sample Solutions: For a problem, generate m=8 solutions at temperature 1.5.
  3. Prepare Data: Collect problems with ground-truth. Sample sets, balance easy/hard.
  4. Train Aggregator: Use GRPO with binary rewards and the aggregation prompt template shown earlier.
  5. Evaluate: Test on datasets like AIME, compute pass@1 averaged over sets.
  6. Tune Parameters: Experiment with easy% (try 50%), group size=8.

Remember to use a library such as math_verify for answer-equivalence checks rather than exact string matching.
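Here is a minimal sketch of such a check, assuming the math_verify package exposes `parse` and `verify` as described in its documentation; double-check the exact interface before relying on it.

```python
# Check whether a predicted answer is mathematically equivalent to the gold one.
from math_verify import parse, verify


def answers_match(gold_answer: str, predicted_answer: str) -> bool:
    """True if the two answers are equivalent (e.g., '1/2' vs '0.5')."""
    return verify(parse(gold_answer), parse(predicted_answer))
```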

FAQ: Answering Your Questions on Solution Aggregation and AggLM

What is solution aggregation in large language models?

It’s combining multiple generated solutions to a problem to get a better final answer. AggLM trains a model to do this via reasoning and RL.

Why doesn’t majority voting always work for LLM solutions?

Because correct answers can be minorities due to model errors. AggLM recovers them by synthesizing.

How does reinforcement learning from verifiable rewards (RLVR) help in aggregation?

It allows training on tasks with known answers, rewarding correct aggregations to learn selection and synthesis.

Can AggLM handle solutions from different models?

Yes, it generalizes to stronger (e.g., 8B) or non-thinking modes, even if trained on 1.7B thinking.

What’s the best mix of easy and hard examples for training?

Around 5-50% easy examples relative to hard ones; this teaches both skills without making rewards too sparse.

Is AggLM more efficient than generating more solutions?

Definitely. It achieves higher accuracy with fewer tokens than majority voting on larger sets.

How do reward models compare to AggLM?

Reward models (like AceMath-72B) often underperform AggLM, especially on thinking modes, as they select rather than synthesize.

What datasets are used for evaluating AggLM?

Math competitions: AIME24/25, HMMT24/25 from MathArena, with numeric answers.

Does AggLM work on non-math tasks?

The paper focuses on math, but the principle could extend to reasoning tasks with verifiable rewards.

Why balance training data in AggLM?

To teach both easy majority recovery and hard minority synthesis, avoiding under-training or sparse rewards.

Wrapping up, AggLM shows us that treating aggregation as a trainable reasoning skill can push LLMs further. It’s not just about more compute—it’s smarter compute. If you’ve got questions or want to share your experiments, drop a comment. Thanks for reading!
