DeepConf: Slash LLM Compute Costs 85% While Boosting Reasoning Accuracy

高效码农

2 months ago

DeepConf: Enhancing LLM Reasoning Efficiency Through Confidence-Based Filtering

Figure 1: DeepConf system overview showing parallel thinking with confidence filtering

The Challenge of Efficient LLM Reasoning

Large language models (LLMs) have revolutionized complex reasoning tasks, but their computational demands present significant barriers to practical deployment. Traditional methods like majority voting improve accuracy by generating multiple reasoning paths, but suffer from:

Diminishing returns: Adding more reasoning paths yields smaller accuracy improvements
Linear cost scaling: Each additional path increases compute requirements proportionally
Quality blindness: All reasoning paths receive equal consideration regardless of quality

This article explores DeepConf, a novel approach that leverages internal confidence signals to optimize both reasoning quality and computational efficiency.

Understanding Model Confidence Metrics

DeepConf relies on three key confidence measurements to evaluate reasoning path quality:

1. Token-Level Confidence Indicators

Metric	Calculation	Interpretation
Token Entropy	$H_{i} = - \sum_{j} P_{i} (j) lo g P_{i} (j)$	Measures prediction uncertainty at each step
Token Confidence	$C_{i} = - k 1 \sum_{j = 1 k} lo g P_{i} (j)$	Quantifies model certainty for top-k token predictions

Figure 2: Confidence distribution comparison between correct (blue) and incorrect (red) reasoning paths

2. Trace-Level Confidence Aggregations

Metric	Description	Use Case
Average Trace Confidence	Mean of all token confidences	Global quality assessment
Group Confidence	Sliding window average of token confidences	Localized quality monitoring
Bottom 10% Group Confidence	Mean of least confident segments	Identifies critical reasoning failures
Tail Confidence	Confidence of final reasoning segment	Captures conclusion reliability

DeepConf Methodology

The system operates in two complementary modes:

1. Offline Mode: Post-Processing Optimization

Process Workflow:

Generate complete reasoning traces (4,096 paths per problem)
Calculate confidence metrics for each trace
Apply filtering and weighted voting

Key Techniques:

Confidence-Weighted Voting: Assigns higher weights to high-confidence traces
Top-η% Filtering: Retains only highest-confidence traces (η=10% or 90%)

Figure 3: Offline confidence analysis and filtering architecture

2. Online Mode: Real-Time Generation Control

Implementation Strategy:

Warmup Phase: Generate initial batch (N=16 traces) to establish confidence baseline
Dynamic Termination:
- Monitors group confidence during generation
- Stops trace generation when confidence drops below threshold
Adaptive Sampling: Adjusts total traces based on answer consensus

Algorithm Features:

Early stopping prevents wasting compute on low-quality paths
Maintains accuracy while reducing token generation by up to 84.7%
Two variants balance efficiency/accuracy tradeoffs

Figure 4: Real-time confidence monitoring during trace generation

Experimental Results

Performance Summary (AIME 2025 Benchmark)

Model	Method	Accuracy	Token Reduction
GPT-OSS-120B	DeepConf@512	99.9%	84.7%
Qwen3-32B	Majority Voting	80.1%	Baseline
DeepSeek-8B	DeepConf-low	86.4%	69.0%

Figure 7: Token efficiency gains while maintaining accuracy

Key Findings:

Accuracy Improvements:
- 99.9% on AIME 2025 with GPT-OSS-120B (vs 97.0% baseline)
- 9.38 percentage point gain on HMMT 2025 with DeepSeek-8B
Efficiency Gains:
- 84.7% token reduction on AIME 2025 (GPT-OSS-120B)
- 77.9% reduction on AIME 24 (DeepSeek-8B)
Model Scalability:
- Consistent improvements across 8B-120B parameter models
- Effective on both mathematical and STEM reasoning tasks

Implementation Details

Configuration Parameters

Parameter	DeepConf-low	DeepConf-high
Filtering Threshold (η)	10%	90%
Consensus Threshold (τ)	0.95	0.95
Warmup Traces (N_init)	16	16
Group Window Size	2048 tokens	2048 tokens

Integration with vLLM

Minimal code modifications required:

Add confidence tracking to LogprobsProcessor
Insert early-stop check in output processor
Enable via API parameters:

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve: x²+5x+6=0"}],
    extra_body={
        "enable_conf": True,
        "window_size": 2048,
        "threshold": 17
    }
)

Practical Applications

Recommended Use Cases

Application	Configuration	Expected Benefit
Real-time Tutoring	DeepConf-low	60-80% compute savings
Technical Documentation	DeepConf-high	<5% accuracy impact
Research Assistance	Offline mode	Max accuracy priority
Code Generation	1024-token window	Balance speed/quality

Future Directions

Training Integration: Incorporate confidence signals into RL training
Error Detection: Address overconfidence on incorrect paths
Multi-Modal Extension: Apply to image-text reasoning tasks
Confidence Calibration: Develop better uncertainty quantification

FAQ

Q: Does DeepConf require model retraining?
A: No, it works with existing models through inference-time modifications.

Q: How to choose window size?
A: Use 2048 tokens for math problems, 1024 for code generation.

Q: Is this compatible with Chinese text?
A: Yes, confidence metrics are language-agnostic.

Q: What’s the typical speed-accuracy tradeoff?
A: DeepConf-low reduces tokens by ~70% with accuracy gains in most cases.

Figure 10: Performance on graduate-level STEM questions

Based on 2508.15260v1.pdf. Project page: jiaweizzhao.github.io/deepconf