What’s Hiding Inside Your LLM? A New “Bottom-Up” Perspective on Optimization
Have you ever wondered what actually happens inside a large language model like ChatGPT or DeepSeek when it generates an answer? We typically view it as a black box: question in, answer out. However, a recent study titled “Your Language Model Policy Secretly Contains Internal Policies” reveals a groundbreaking discovery: An LLM is not a single, unified policy. Instead, every internal layer and module is executing its own distinct “sub-policy,” working in concert to complete the reasoning process.
This research acts like a “neural CT scan,” providing the first clear visualization of how reasoning evolves inside the model. More importantly, based on this discovery, it proposes a novel reinforcement learning paradigm—Bottom-up Policy Optimization (BuPO). This method, which optimizes the model’s “foundational thinking,” significantly boosts performance on complex reasoning tasks.
This article will break down this complex research into an accessible guide. We’ll explore how an LLM’s “thought process” is built layer by layer and why starting optimization from the “ground up” leads to better results.
Introduction: We’ve Been Oversimplifying Large Language Models
Reinforcement Learning (RL) has been a key driver in advancing the complex reasoning capabilities of Large Language Models (LLMs). From OpenAI’s early work to the success of DeepSeek R1, Reinforcement Learning with Verifiable Rewards (RLVR) has become a powerful post-training paradigm for enhancing model policies.
Yet, until now, the vast majority of research has treated the LLM as a unified, monolithic policy. This is like only focusing on a person’s final decision while completely ignoring how different regions of their brain collaborate and gradually form that idea. Efforts have largely centered on surface-level algorithmic design—crafting better reward functions or entropy regularization.
While interpretability tools (like attention visualization) have helped demystify the “black box,” they often overlook the intrinsic nature of the language model policy and the rich information latent in the model’s internal residual stream.
This study starts by asking a more fundamental question: How does a language model’s “policy” evolve across different network layers and modules? Understanding this is crucial for enabling more targeted optimization and unraveling complex internal reasoning mechanisms.
Core Discovery: LLMs Harbor “Layered Policies” Within
The researchers’ key insight leverages two inherent properties of the Transformer architecture, allowing them to decompose the model’s policy like splitting a beam of light.
1. Theoretical Foundation: The Residual Stream & Samplable Policies
First, the residual connections in the Transformer naturally support an additive decomposition. Each layer’s output is the sum of the previous layer’s input and the transformation computed within the current layer. This means the final hidden state can be seen as the accumulation of the initial embedding and the contributions from all intermediate layers (both self-attention and feed-forward network modules).
Second, the researchers point out a critical equivalence: Any hidden state, when combined with the model’s “unembedding matrix,” can be transformed into a probability distribution over the vocabulary. This distribution is itself a “policy” that can be sampled from and optimized.
Based on these two points, they define two types of “Internal Policies”:
- Internal Layer Policy: Transforms the hidden state at the end of each layer into a probability distribution. This captures the cumulative reasoning result up to that layer.
- Internal Modular Policy: Separately transforms the outputs of the self-attention module and the feed-forward network (FFN) module within each layer into probability distributions. This isolates the specific contributions of these two core components.
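To make this concrete, here is a minimal sketch (not the authors' code) of how an internal layer policy could be materialized from a HuggingFace-style causal LM. The model name is purely illustrative, and applying the model's final norm before the unembedding matrix is an assumption about the decoding path, not something specified by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # illustrative choice; any HF-style causal LM exposes the same pieces
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

@torch.no_grad()
def internal_layer_policy(prompt: str, layer: int) -> torch.Tensor:
    """Project the residual-stream state after `layer` into a distribution over the vocabulary."""
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    h = hidden_states[layer][:, -1, :]            # last-token state after `layer` (index 0 = embeddings)
    h = model.model.norm(h)                       # assumption: apply the final norm before unembedding
    logits = model.lm_head(h)                     # the unembedding matrix maps hidden states to vocab logits
    return torch.softmax(logits.float(), dim=-1)  # a samplable internal policy

pi_6 = internal_layer_policy("2 + 2 =", layer=6)
print(tokenizer.decode(pi_6.argmax(dim=-1)))      # most likely next token under the layer-6 policy
```

The internal modular policy is obtained the same way, except the projected vector is the output of a single self-attention or FFN module rather than the full residual-stream state.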
2. Entropy: Measuring the “Uncertainty” of Internal Policies
To analyze the behavior of these internal policies, the researchers use entropy as their primary metric. In information theory and reinforcement learning, entropy measures the disorder or uncertainty of a probability distribution. High entropy means the policy is widely “exploring” various possible choices. Low entropy means the policy is highly certain, concentrating on exploiting a few options.
By calculating the entropy of each internal policy at every layer and module, a detailed map of the model’s “thinking process” is drawn.
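As a small illustration, the entropy of any such internal policy is just the Shannon entropy of its distribution; the sketch below assumes a probability tensor like the one produced by the previous snippet.

```python
import torch

def policy_entropy(p: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of a policy (a probability distribution over the vocabulary).

    High entropy: the policy still spreads mass over many candidate tokens (exploring).
    Low entropy:  the policy has essentially committed to a few tokens (exploiting).
    """
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

# Example usage with the layer-6 policy from the previous sketch:
# H6 = policy_entropy(pi_6)
# Sweeping this over every layer and module yields the "entropy map" discussed below.
```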
A Striking Difference: Qwen’s “Progressive Reasoning” vs. Llama’s “Last-Minute Decision”
The study systematically analyzed popular model series like Qwen and Llama, uncovering critical architectural differences beneath a universal pattern.
1. The Universal Pattern: From Exploration to Convergence
All models exhibit a consistent internal reasoning structure:
- Early Layers (Bottom): Maintain high entropy. Like a brain first approaching a problem, they broadly explore the solution space, keeping multiple possibilities open.
- Top Layers: Entropy converges to near zero. The model has made its final prediction with high certainty.
2. The Critical Divergence: Drastically Different Convergence Patterns
Despite the shared trend, the pace and manner of convergence differ dramatically between model families. This is one of the study’s most striking findings.
- The Llama Series: Exhibits a pattern of “Sudden Convergence.” Throughout most intermediate layers, the model’s prediction space remains relatively uncertain. Only in the final two or three layers does entropy plummet, rapidly locking in an answer. This suggests Llama performs its key decision integration very late in the reasoning process.
- The Qwen Series (especially Qwen3): Shows a pattern of “Progressive Contraction,” with a reasoning process that is more staged and reminiscent of human cognition.
  - Even more remarkably, this staging is particularly clear in the Feed-Forward Network (FFN) module. The study introduces an “Entropy Change” metric that measures how much a module increases or decreases uncertainty.
  - In Qwen3, the entropy change of the FFN module reveals a distinct “Exploration-Integration-Convergence” triad:
    - Exploration: In lower layers (e.g., the first ~6 layers), entropy change is positive. The FFN actively expands the exploration space.
    - Integration: Across a wide swath of middle layers (e.g., layers 7-26), entropy change hovers around zero. Here, the FFN primarily retrieves and integrates the parametric knowledge stored within it (FFNs are widely viewed as the model’s “knowledge memory”), rather than drastically shifting direction.
    - Convergence: In higher layers (e.g., layer 27+), entropy change turns negative. The FFN progressively narrows down the possibilities, converging toward the final answer.

Figure 1: Qwen3 exhibits a clear “Exploration-Integration-Convergence” triad in its Feed-Forward Network module.
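For readers who want to see what the metric looks like in code, here is a hedged sketch. It assumes you can capture the residual-stream state entering a module and the state after the module's output has been added back (e.g., via forward hooks); the `final_norm` and `lm_head` arguments follow HuggingFace conventions and are assumptions, not the paper's implementation.

```python
import torch

def module_entropy_change(h_before: torch.Tensor,
                          h_after: torch.Tensor,
                          final_norm: torch.nn.Module,
                          lm_head: torch.nn.Module) -> torch.Tensor:
    """Entropy change introduced by one module (e.g., a layer's FFN).

    h_before: residual-stream state entering the module, shape [batch, hidden]
    h_after:  residual-stream state after the module's output is added back
    Positive  -> the module expanded uncertainty (exploration).
    Near zero -> it mostly integrated knowledge without shifting direction.
    Negative  -> it narrowed the prediction space (convergence).
    """
    def entropy(h: torch.Tensor) -> torch.Tensor:
        p = torch.softmax(lm_head(final_norm(h)).float(), dim=-1)
        return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

    return entropy(h_after) - entropy(h_before)
```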
In simple terms, you could imagine:
- Llama is like a student who spends most of the exam ruminating and only scribbles down the answer at the very last minute before the bell rings.
- Qwen3 is like a methodical problem-solver: first brainstorming all possible approaches (Exploration), then applying known formulas and knowledge (Integration), and finally narrowing down step by step to the unique solution (Convergence).
This structured, progressive reasoning pattern may be an intrinsic reason behind the Qwen3 series’ strong ability to absorb knowledge during post-training.
From Insight to Method: Bottom-Up Policy Optimization (BuPO)
The discoveries above led to a profound implication: If reasoning emerges progressively from the bottom up, could optimization also adopt a bottom-up perspective?
1. Experimental Validation: What Happens When We Optimize an Internal Policy?
To test this idea, the researchers first ran an experiment: Directly applying reinforcement learning optimization to the internal policy of a specific middle layer (they call this InterGRPO), instead of optimizing the final output policy.
The results were both expected and revealing:
- Optimizing only an internal policy eventually leads to model collapse, because it misaligns with the overall objective.
- However, the optimization process induced a significant “feature refinement” phenomenon: when the policy of a lower layer (e.g., layer 6) was optimized, that layer’s hidden-state representations became more similar to those of higher, even final, layers.
What does this mean? It means that through lower-layer alignment, the optimization forces the network’s bottom layers to preemptively capture the high-level reasoning information needed later in the process, laying a more robust foundation for subsequent “thinking.” It’s like considering the structural needs of the upper floors while still pouring the foundation.
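One simple way to quantify this “feature refinement” effect is to compare representations across depths. The sketch below uses cosine similarity between last-token residual-stream states, which is an assumption about the similarity measure rather than the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def layer_similarity(hidden_states: tuple, low: int, high: int) -> float:
    """Cosine similarity between the last-token hidden states of a lower and a higher layer.

    `hidden_states` is the tuple returned by a HF model called with output_hidden_states=True.
    If optimizing the layer-`low` internal policy raises this value (e.g., low=6 vs. the final layer),
    the lower layer is capturing high-level reasoning information earlier: the "feature refinement" effect.
    """
    h_low = hidden_states[low][:, -1, :].float()
    h_high = hidden_states[high][:, -1, :].float()
    return F.cosine_similarity(h_low, h_high, dim=-1).mean().item()
```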
2. Introducing BuPO: A Novel Two-Phase Optimization Paradigm
Based on this key insight, the researchers formally proposed Bottom-up Policy Optimization (BuPO).
The core idea of BuPO is straightforward: Optimize in two phases: first “shape the foundation,” then “build the top.”
- Phase 1: Internal Policy Alignment. In the early stages of training (e.g., the first 30 steps), select a lower layer that is still in its “Exploration” phase (e.g., layer 6 in Qwen3) and optimize its internal layer policy, π_Layer^6. The goal of this phase is to reconstruct and strengthen the model’s foundational reasoning capabilities.
- Phase 2: Holistic Policy Optimization. For the remainder of training, switch back to the conventional approach and optimize the language model’s final output policy, π_θ. The model now has a reinforced, more expressive “bottom-up thought process,” allowing higher-level reasoning to proceed more efficiently on this improved foundation.

Figure 2: Schematic of the Bottom-up Policy Optimization (BuPO) training process.
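To make the two-phase schedule concrete, here is a minimal, hedged sketch of the control flow only. The RL objective itself (GRPO-style advantages, clipping, KL terms) is abstracted behind a placeholder, module names follow HuggingFace conventions, and none of this is the authors' implementation.

```python
import torch

def bupo_policy_for_step(model, hidden_states, step: int,
                         align_steps: int = 30, align_layer: int = 6) -> torch.Tensor:
    """Return the token distribution the RL loss is computed on at this training step.

    Phase 1 (step < align_steps): optimize the internal layer policy of a lower, still-exploratory
    layer (e.g., layer 6 in Qwen3) to reshape the model's foundational reasoning.
    Phase 2 (afterwards):         optimize the final output policy, as in standard RLVR training.
    """
    if step < align_steps:
        h = hidden_states[align_layer]           # residual-stream state after the chosen lower layer
    else:
        h = hidden_states[-1]                    # final hidden state -> the usual LM policy
    logits = model.lm_head(model.model.norm(h))  # assumed HF-style unembedding path
    return torch.log_softmax(logits.float(), dim=-1)

# Inside a (schematic) training loop:
# log_probs = bupo_policy_for_step(model, outputs.hidden_states, step)
# loss = rl_loss(log_probs, sampled_tokens, advantages)  # placeholder for a GRPO-style objective
# loss.backward(); optimizer.step()
```

The only moving part is which hidden state the policy is read from; everything else in the RL pipeline stays the same.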
Does It Work? Let the Data Speak
The theory is compelling, but empirical results are the ultimate test. The researchers evaluated BuPO on several challenging mathematical reasoning benchmarks, including MATH, AMC, and AIME.
Baseline Methods: Included standard PPO, Reinforce++, RLOO, and the strong contemporary baseline GRPO.
Results: BuPO achieved consistent performance improvements across almost all models and datasets.
| Model | Method | AMC23 (Avg@16) | MATH500 (Avg@16) | AIME24 (Avg@32) | AIME25 (Avg@32) | Average |
|---|---|---|---|---|---|---|
| Qwen3-4B | GRPO (Baseline) | 76.88 | 82.41 | 32.19 | 28.85 | 55.08 |
| Qwen3-4B | BuPO (Opt. Layer 6) | 81.09 (+4.21) | 84.90 (+2.49) | 36.88 (+4.69) | 31.15 (+2.30) | 58.51 (+3.43) |
| Qwen3-8B | GRPO (Baseline) | 85.94 | 88.05 | 49.48 | 33.54 | 64.23 |
| Qwen3-8B | BuPO (Opt. Layer 6) | 89.22 (+3.28) | 87.76 | 54.06 (+4.58) | 34.38 (+0.84) | 66.36 (+2.13) |
| Llama-OctoThinker-8B | GRPO (Baseline) | 34.84 | 56.89 | 2.50 | 2.19 | 24.11 |
| Llama-OctoThinker-8B | BuPO (Opt. Layer 31) | 37.66 (+2.82) | 62.05 (+5.16) | 4.69 (+2.19) | 6.77 (+4.58) | 27.79 (+3.68) |
Table 1: Performance of BuPO across different models and mathematical reasoning benchmarks (Avg@K, higher is better).
The table clearly shows that the bottom-up optimization strategy led to significant gains for both Qwen and Llama series models on complex reasoning tasks. Improvements were particularly notable on the more difficult AIME competition problems.
Deep Dive: Why Does BuPO Work?
1. Training Dynamics: Healthier Exploration
Analysis of the training process revealed that during BuPO’s first phase (optimizing the internal layer policy), the entropy of the overall model policy stays at a more stable, healthier level of exploration. This avoids a common problem in traditional RL training, where policy entropy collapses to zero too quickly (leading to mode collapse and a lack of diversity). Moderate low-level exploration provides richer “cognitive material” for subsequent learning.
2. The Golden Rule: Moderate “Foundation Shaping”
The study found that the number of early internal policy optimization steps needs careful calibration. It’s like kneading dough—too little time is ineffective, too long overworks it.
- Moderate Optimization (e.g., 30 steps): Effectively boosts final performance.
- Excessive Optimization (e.g., 70 steps): Causes model collapse and a sharp performance decline.
This confirms that BuPO’s core is “rebuilding the foundation,” not “replacing the whole.” Only moderate intervention maximizes the benefit.
Frequently Asked Questions (FAQ)
Q1: What is the main contribution of this research?
A1: There are three primary contributions: 1) The first formal definition and decomposition of a language model’s internal policies. 2) Using entropy analysis to reveal the distinct yet stable internal reasoning patterns of different model families (like Qwen vs. Llama). 3) Proposing the innovative Bottom-up Policy Optimization paradigm based on these findings, which effectively enhances model performance.
Q2: How is an “Internal Policy” different from the previously known “Logit Lens”?
A2: They are related but have different perspectives. The Logit Lens primarily decodes intermediate layer states into the most probable discrete tokens, to observe “what word the model wants to output at a middle layer.” The internal policy perspective treats it as a complete, samplable probability distribution and analyzes its entropy and dynamics like a policy in reinforcement learning, focusing more on understanding “how the uncertainty of the decision process evolves.”
Q3: What are the practical implications of this discovery for developers or practitioners?
A3: First, it deepens our understanding of how LLMs work. Second, BuPO provides a more effective tool for fine-tuning LLMs, especially for complex reasoning tasks like mathematics or coding. In the future, model architects might design more efficient architectures inspired by Qwen’s progressive structure.
Q4: Is BuPO suitable for all models? How do I choose which layer to optimize?
A4: The research shows models with clear progressive reasoning structures (like Qwen3) benefit most. Choosing the layer is guided by the internal entropy analysis: typically, select the last layer that is still in an exploration phase, showing positive entropy change (for Qwen3, this is layer 6). A preliminary internal entropy analysis of the target model is recommended to identify this.
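As a small illustration of that selection heuristic, one possible way to operationalize it (an assumption for illustration, not a prescription from the paper) is to take the last layer in the initial run of positive FFN entropy change:

```python
def pick_alignment_layer(ffn_entropy_change: list) -> int:
    """Pick the last layer of the initial 'exploration' run, i.e., the last consecutive
    layer from the bottom whose mean FFN entropy change is still positive.

    ffn_entropy_change[i] is the average entropy change of layer i's FFN over a probe set.
    For a Qwen3-like profile this lands around layer 6, matching the layer used by BuPO.
    """
    layer = 0
    for i, delta in enumerate(ffn_entropy_change):
        if delta > 0:
            layer = i
        else:
            break  # the first non-positive value ends the initial exploration phase
    return layer

# Example with made-up numbers: positive at the bottom, ~0 in the middle, negative at the top
print(pick_alignment_layer([0.4, 0.3, 0.3, 0.2, 0.1, 0.1, 0.05, -0.01, 0.0, -0.2]))  # -> 6
```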
Q5: What are the limitations of this study?
A5: Current experiments are mainly focused on the Qwen and Llama families. The internal patterns of other architectures (e.g., GPT, Gemini) remain to be verified. Also, the optimal internal layer and training steps may vary depending on the model and specific task, requiring some experimentation to determine.
Conclusion and Future Directions
This research is akin to a “neuroscientific” exploration of large language models. It tells us that an LLM’s “thinking” is not a chaotic blur but follows specific, interpretable patterns across its network layers. The progressive, structured reasoning pattern exhibited by Qwen3 bears a fascinating resemblance to human problem-solving cognition, which may be an intrinsic factor in its strong performance on complex tasks.
More importantly, the study doesn’t stop at discovery. The proposal of Bottom-up Policy Optimization successfully translates understanding of internal mechanisms into tangible algorithmic improvement, completing the loop of “knowing what it is, knowing why it is, and using that knowledge to make it better.”
Looking ahead, we can anticipate more innovations grounded in a deeper understanding of model internals:
- Smarter Architecture Design: Directly designing models with desirable internal reasoning patterns.
- More Granular Optimization: Customizing training for different layers and modules based on their specific characteristics.
- More General Interpretability Frameworks: The internal policy perspective could become a new standard tool for understanding model behavior.
The black box of large language models is gradually being opened. Research like this, which pairs deep analysis with practical innovation, matters not only for the forefront of technology but also for how we collaborate more effectively and transparently with increasingly intelligent AI systems.

