Unveiling QwenLong-L1.5: A Post-Training Blueprint for Mastering Long-Context Reasoning and Memory Management
Summary
QwenLong-L1.5, built on Qwen3-30B-A3B-Thinking, excels in long-context reasoning through innovative post-training techniques. It features a data synthesis pipeline for multi-hop tasks, stabilized RL with task-balanced sampling and AEPO, and a memory framework for ultra-long inputs. Evaluations show a 9.9-point average gain, matching GPT-5 and Gemini-2.5-Pro levels.
Have you ever wondered why large language models struggle with lengthy texts, often losing track of key details across thousands of words? Picture this: you’re sifting through a massive report, needing to connect dots from scattered evidence to form a coherent answer. That’s where QwenLong-L1.5 shines. This model isn’t just about extending context windows—it’s a complete post-training recipe that equips LLMs to reason deeply over extended inputs. Built on Qwen3-30B-A3B-Thinking, it integrates smart data creation, stable reinforcement learning, and clever memory handling. In this post, we’ll break it down step by step, exploring how it achieves top-tier performance and how you can get started with it yourself.
The Rise of QwenLong-L1.5: Why Long-Context Reasoning Matters
Long-context reasoning stands as a cornerstone for modern large language models (LLMs). It powers everything from single-pass analysis to multi-turn agent systems, allowing models to weave together information from vast scopes and tackle intricate multi-hop puzzles. Yet, while pre-training and architecture tweaks have dominated research, the post-training phase has lagged behind—until now.
QwenLong-L1.5 bridges this gap. Starting from Qwen3-30B-A3B-Thinking, it adds memory mechanisms to handle tasks beyond its 256K-token window. The secret sauce? A unified post-training recipe blending data synthesis, RL optimizations, and agent designs. For instance, if you’re dealing with globally distributed evidence requiring multi-hop grounding, traditional models might falter, but QwenLong-L1.5 thrives thanks to its tailored approach.
Performance-wise, it outshines its baseline by 9.9 points on average across benchmarks, rivaling giants like GPT-5 and Gemini-2.5-Pro. These gains aren’t isolated—they spill over into math, tool use, and long dialogues, proving that strong long-context skills boost overall intelligence.

As seen in the chart, QwenLong-L1.5-30B-A3B surges ahead, especially on demanding tasks.
Crafting High-Quality Data: The Long-Context Synthesis Pipeline
Quality data fuels great models, but crafting complex long-context examples is tough. Simple “needle-in-a-haystack” or single-hop retrievals fall short for true multi-hop reasoning.
QwenLong-L1.5’s pipeline revolutionizes this. It deconstructs documents into atomic facts and relationships, then programmatically builds verifiable, multi-hop questions. This scales up high-value data, far beyond basic retrievals.
How the Pipeline Works
- Corpus Collection: Gather diverse long documents from code repos, academic papers, professional reports, literature, and dialogues, totaling 9.2 billion tokens after filtering.
- QA Synthesis: Use specialized methods for different tasks:
  - In-Depth Multi-Hop: Build knowledge graphs, sample paths, and generate questions on multi-fact, temporal, causal, and hypothetical themes.
  - Corpus-Level Numerical: Extract tables, aggregate data, and create SQL-executable queries for aggregation and calculation.
  - General Reasoning: Employ multi-agent self-evolution to iteratively ramp up difficulty in viewpoint analysis and in-context learning.
- Data Verification: Run knowledge grounding and contextual robustness checks to ensure questions rely on context, not prior knowledge.
This yields 14.1K samples, up from 1.6K in prior versions, with average inputs at 34K tokens (see Table 1).

The figure outlines the end-to-end process, emphasizing verifiable complexity.
Curious how this applies? If you’re building LLMs, this pipeline offers a blueprint for generating robust training data without human bottlenecks.
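To make the knowledge-graph step concrete, here is a purely illustrative sketch: extract (subject, relation, object) facts, chain two edges into a multi-hop path, and render a templated question whose answer is verifiable against that path. The fact schema, templates, and helper names are assumptions for illustration, not the pipeline's actual code.

```python
import random
from collections import defaultdict

# Illustrative atomic facts extracted from a document: (subject, relation, object)
facts = [
    ("Acme Corp", "acquired", "RoboWorks"),
    ("RoboWorks", "founded_in", "2015"),
    ("RoboWorks", "headquartered_in", "Berlin"),
]

# Build a tiny knowledge graph: subject -> list of (relation, object) edges
graph = defaultdict(list)
for subj, rel, obj in facts:
    graph[subj].append((rel, obj))

# Question templates keyed by the second-hop relation; the bridge entity is never named
templates = {
    "founded_in": "In which year was the company that {head} {rel1} founded?",
    "headquartered_in": "Where is the company that {head} {rel1} headquartered?",
}

def sample_two_hop_question(graph):
    """Chain two edges (head -rel1-> bridge -rel2-> answer) into one verifiable question."""
    head = random.choice([s for s in graph if any(o in graph for _, o in graph[s])])
    rel1, bridge = random.choice([(r, o) for r, o in graph[head] if o in graph])
    rel2, answer = random.choice(graph[bridge])
    return templates[rel2].format(head=head, rel1=rel1), answer

print(sample_two_hop_question(graph))
```

Because the answer is read directly off the sampled path, every synthesized question stays automatically verifiable, which is what lets the pipeline scale without human annotation.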
Stabilizing RL for Long-Context Challenges
Long-context RL is tricky: quadratic attention costs make traditional methods like PPO infeasible, and multi-task data leads to instability.
QwenLong-L1.5 uses Group Relative Policy Optimization (GRPO) as a base, sampling groups and normalizing rewards for advantages without value networks.
GRPO Basics
The objective: maximize expected reward while penalizing KL divergence from the reference policy.
Each response's advantage comes from normalizing its reward against its group's mean and standard deviation, and the loss is aggregated at the token level for stable gradients.
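As a quick, purely illustrative example (not the repo's implementation), here is how plain GRPO turns one group's scalar rewards into advantages; the function and array names are assumptions:

```python
import numpy as np

def grpo_group_advantages(rewards, eps=1e-6):
    """Plain GRPO: normalize each reward against the statistics of its own group,
    i.e. all responses sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four responses to one prompt, scored 1 (correct) or 0 (wrong)
print(grpo_group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Task-specific advantage estimation, described below, modifies exactly this normalization step.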
But for long contexts, enhancements are key.
Task-Balanced Sampling
Ensures even batch distribution across tasks like multi-hop QA and numerical reasoning.
Code snippet, as a minimal sketch (it assumes each sample carries a "domain" label; weighting and epoch handling are simplified to a single balanced batch):

```python
import random
from collections import defaultdict

class DomainSampler:
    def __init__(self, dataset, batch_size, domain_weights=None):
        # Group sample indices by their task/domain label
        self.by_domain = defaultdict(list)
        for idx, sample in enumerate(dataset):
            self.by_domain[sample["domain"]].append(idx)
        self.batch_size = batch_size

    def __iter__(self):
        # Yield one batch holding an equal share drawn from every domain
        share = max(1, self.batch_size // len(self.by_domain))
        yield [i for idxs in self.by_domain.values() for i in random.sample(idxs, min(share, len(idxs)))]
```
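Toy usage (the "domain" field is an assumption about how samples are labeled):

```python
data = ([{"domain": "multihop_qa", "id": i} for i in range(8)]
        + [{"domain": "numerical", "id": i} for i in range(8)])
for batch in DomainSampler(data, batch_size=4):
    print(batch)  # four indices, drawn evenly from both domains
```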
This counters the drift that appears when batches are dominated by a single task, since the different tasks' data cluster in distinct regions (see the UMAP plot in Figure 7).
Task-Specific Advantage Estimation
Advantages are normalized by the reward standard deviation of the task-level batch:

A_task_i = (r_task_i − mean) / std_task

where std_task is computed over all rollouts in the batch that belong to the same task.
Code (the full signature is abbreviated here; see the sketch below):

```python
def compute_grpo_task_norm_outcome_advantage(...):
    # Compute per-task reward stds over the batch and normalize the advantages
```
Ablations show +2.55 points over GRPO (Table 2).
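Here is a minimal sketch of that normalization under one plausible reading of the formula above, where both the centering and the scaling use the task-level batch statistics; the names and signature are illustrative, not the repo's actual function:

```python
import numpy as np

def task_normalized_advantages(rewards, task_ids, eps=1e-6):
    """Center and scale each reward using the statistics of all rollouts
    in the batch that share its task (illustrative helper)."""
    rewards = np.asarray(rewards, dtype=np.float32)
    task_ids = np.asarray(task_ids)
    adv = np.empty_like(rewards)
    for task in np.unique(task_ids):
        mask = task_ids == task
        adv[mask] = (rewards[mask] - rewards[mask].mean()) / (rewards[mask].std() + eps)
    return adv

# Example: a batch mixing multi-hop QA (0/1 rewards) and numerical tasks (partial credit)
print(task_normalized_advantages([1, 0, 1, 0.2, 0.8, 0.5], ["qa", "qa", "qa", "num", "num", "num"]))
```

Pooling the statistics per task keeps tasks with very different reward scales from drowning each other out in a mixed batch.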
Adaptive Entropy-Controlled Policy Optimization (AEPO)
AEPO balances exploration and exploitation by controlling how negative-advantage samples contribute to the update: when the policy entropy falls inside a configured band (entropy_low, entropy_high), their gradients are masked out, keeping training dynamics stable.
Code (see the fuller sketch below):

```python
if aepo_entropy_low < entropy_loss < aepo_entropy_high:
    # Mask out samples with negative advantages
```
AEPO yields +3.29 points (Table 5) while keeping training dynamics stable (Figure 11).
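A minimal PyTorch-flavored sketch of the masking step, following the condition stated above; the threshold names and tensor shapes are assumptions, and the real loss wiring in the trainer is more involved:

```python
import torch

def aepo_filter_advantages(advantages, entropy, entropy_low, entropy_high):
    """If the current policy entropy sits inside the configured band, drop the
    contribution of negative-advantage samples by zeroing their advantages."""
    if entropy_low < entropy < entropy_high:
        return torch.where(advantages < 0, torch.zeros_like(advantages), advantages)
    return advantages

# Example: with entropy inside the band, only positive-advantage samples push gradients
adv = torch.tensor([0.9, -0.4, 1.2, -1.1])
print(aepo_filter_advantages(adv, entropy=0.35, entropy_low=0.2, entropy_high=0.6))
```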
These RL tweaks enable progressive length scaling, crucial for real-world scalability.
Memory-Augmented Design: Tackling Ultra-Long Contexts
Even 256K windows have limits. QwenLong-L1.5’s memory agent reframes tasks as sequential decisions.
How It Operates
The query is split into a core question and answering instructions, and the document is cut into chunks that fit the window. The agent then reads chunk by chunk, updating a compressed memory and plan at each step:

Update: (memory_t, plan_t) ~ policy(memory_{t-1}, plan_{t-1}, chunk_t)
Final: answer ~ policy(final_memory, core_question, instructions)
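In code, the loop looks roughly like the sketch below. This is illustrative only: the chunking, prompt handling, and the call_policy helper are assumptions, not the released agent framework.

```python
def chunk_text(document, chunk_size=100_000):
    # Naive character-based chunking; the real agent would chunk by tokens
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def answer_with_memory(document, core_question, instructions, call_policy):
    """Sequentially read chunks, carrying a compressed memory and plan forward."""
    memory, plan = "", ""
    for chunk in chunk_text(document):
        # One policy call per chunk: refresh memory and plan given the new evidence
        memory, plan = call_policy(task="update", memory=memory, plan=plan,
                                   chunk=chunk, question=core_question)
    # Final policy call: answer from the accumulated memory alone
    return call_policy(task="answer", memory=memory,
                       question=core_question, instructions=instructions)
```

Because each step only carries the compressed memory forward, the context the model must attend to stays bounded no matter how long the input grows.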

The agent's policy is optimized with GRPO over these trajectories.
Training proceeds in stages: full-context RL, then memory-agent RL, then the models are merged with SCE, followed by a final full-context stage.
The result is a +9.48-point gain on 1M-4M-token tasks (Table 9).
Getting Started with QwenLong-L1.5
Ready to dive in? Here’s your guide.
Setup Requirements
Create environment:
```bash
conda create -n qwenlongl1_5 python=3.10
conda activate qwenlongl1_5
pip install -r requirements.txt

git clone --branch v0.4 https://github.com/volcengine/verl.git
cd verl
pip install -e .
```
Running the Model
Use Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Tongyi-Zhiwen/QwenLong-L1.5-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Prompt template
template = """Please read the following text and answer the question below.
<text>
$DOC$
</text>
$Q$
Format your response as follows: "Therefore, the answer is (insert answer here)"."""

# Add your long document and question, then fill the template
context = "YOUR LONG DOCUMENT HERE"
question = "YOUR QUESTION HERE"
prompt = template.replace("$DOC$", context.strip()).replace("$Q$", question.strip())

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=50000, temperature=0.7, top_p=0.95)

# Parse output: decode only the newly generated tokens
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
To reproduce the RL training itself, implement task-balanced sampling, task-specific advantage estimation, and AEPO on top of VeRL as detailed above.
Benchmark Breakdown: Measuring Success
QwenLong-L1.5 excels across evaluations.

Gains transfer to general tasks (Table 8), such as +3.65 points on AIME25.
On ultra-long inputs, the memory agent shines, adding +18 points on MRCR at 512K-1M tokens.
FAQ: Addressing Common Queries
What’s the difference between QwenLong-L1.5 and its baseline?
It builds on Qwen3-30B-A3B-Thinking and adds the long-context post-training recipe, yielding a 9.9-point average gain across benchmarks.
How does the data pipeline create multi-hop tasks?
By deconstructing docs into facts, building graphs, and synthesizing verifiable questions.
What makes AEPO effective?
It clips negative gradients based on entropy thresholds, balancing exploration and stability.
When to use the memory agent?
For inputs over 256K, like 1M-4M tokens, where full-context fails.
How to handle installation issues?
Ensure Python 3.10 and VeRL v0.4; check requirements.txt for dependencies.
Where to find the model?
On Hugging Face: Tongyi-Zhiwen/QwenLong-L1.5-30B-A3B.
How-To: Building Your Long-Context Pipeline
- Collect Corpus: Source diverse docs and filter for quality (82K items, 9.2B tokens).
- Synthesize QA: Use knowledge graphs for multi-hop, tables for numerical, and multi-agent self-evolution for general reasoning.
- Verify Data: Run knowledge grounding and contextual robustness checks.
- Train with RL: Implement GRPO with task-balanced sampling, task-specific advantages, and AEPO.
- Add Memory: Train the memory specialist, merge, then fine-tune.
- Evaluate: Use benchmarks like LongBench and MRCR.
Follow this to replicate gains.
Wrapping Up: The Future of Long-Context LLMs
QwenLong-L1.5 isn’t just a model—it’s a roadmap for pushing LLMs toward genuine long-range intelligence. From data pipelines to RL stability and memory agents, it tackles real challenges head-on. Whether you’re in research or application, these techniques offer lasting value. What’s your take on long-context reasoning? Share in the comments!

