Site icon Efficient Coder

DeepConf: Slash LLM Compute Costs 85% While Boosting Reasoning Accuracy

DeepConf: Enhancing LLM Reasoning Efficiency Through Confidence-Based Filtering


Figure 1: DeepConf system overview showing parallel thinking with confidence filtering

The Challenge of Efficient LLM Reasoning

Large language models (LLMs) have revolutionized complex reasoning tasks, but their computational demands present significant barriers to practical deployment. Traditional methods like majority voting improve accuracy by generating multiple reasoning paths, but suffer from:

  • Diminishing returns: Adding more reasoning paths yields smaller accuracy improvements
  • Linear cost scaling: Each additional path increases compute requirements proportionally
  • Quality blindness: All reasoning paths receive equal consideration regardless of quality

This article explores DeepConf, a novel approach that leverages internal confidence signals to optimize both reasoning quality and computational efficiency.

Understanding Model Confidence Metrics

DeepConf relies on three key confidence measurements to evaluate reasoning path quality:

1. Token-Level Confidence Indicators

Metric Calculation Interpretation
Token Entropy Measures prediction uncertainty at each step
Token Confidence Quantifies model certainty for top-k token predictions


Figure 2: Confidence distribution comparison between correct (blue) and incorrect (red) reasoning paths

2. Trace-Level Confidence Aggregations

Metric Description Use Case
Average Trace Confidence Mean of all token confidences Global quality assessment
Group Confidence Sliding window average of token confidences Localized quality monitoring
Bottom 10% Group Confidence Mean of least confident segments Identifies critical reasoning failures
Tail Confidence Confidence of final reasoning segment Captures conclusion reliability

DeepConf Methodology

The system operates in two complementary modes:

1. Offline Mode: Post-Processing Optimization

Process Workflow:

  1. Generate complete reasoning traces (4,096 paths per problem)
  2. Calculate confidence metrics for each trace
  3. Apply filtering and weighted voting

Key Techniques:

  • Confidence-Weighted Voting: Assigns higher weights to high-confidence traces
  • Top-η% Filtering: Retains only highest-confidence traces (η=10% or 90%)


Figure 3: Offline confidence analysis and filtering architecture

2. Online Mode: Real-Time Generation Control

Implementation Strategy:

  1. Warmup Phase: Generate initial batch (N=16 traces) to establish confidence baseline
  2. Dynamic Termination:
    • Monitors group confidence during generation
    • Stops trace generation when confidence drops below threshold
  3. Adaptive Sampling: Adjusts total traces based on answer consensus

Algorithm Features:

  • Early stopping prevents wasting compute on low-quality paths
  • Maintains accuracy while reducing token generation by up to 84.7%
  • Two variants balance efficiency/accuracy tradeoffs


Figure 4: Real-time confidence monitoring during trace generation

Experimental Results

Performance Summary (AIME 2025 Benchmark)

Model Method Accuracy Token Reduction
GPT-OSS-120B DeepConf@512 99.9% 84.7%
Qwen3-32B Majority Voting 80.1% Baseline
DeepSeek-8B DeepConf-low 86.4% 69.0%


Figure 7: Token efficiency gains while maintaining accuracy

Key Findings:

  1. Accuracy Improvements:

    • 99.9% on AIME 2025 with GPT-OSS-120B (vs 97.0% baseline)
    • 9.38 percentage point gain on HMMT 2025 with DeepSeek-8B
  2. Efficiency Gains:

    • 84.7% token reduction on AIME 2025 (GPT-OSS-120B)
    • 77.9% reduction on AIME 24 (DeepSeek-8B)
  3. Model Scalability:

    • Consistent improvements across 8B-120B parameter models
    • Effective on both mathematical and STEM reasoning tasks

Implementation Details

Configuration Parameters

Parameter DeepConf-low DeepConf-high
Filtering Threshold (η) 10% 90%
Consensus Threshold (τ) 0.95 0.95
Warmup Traces (N_init) 16 16
Group Window Size 2048 tokens 2048 tokens

Integration with vLLM

Minimal code modifications required:

  1. Add confidence tracking to LogprobsProcessor
  2. Insert early-stop check in output processor
  3. Enable via API parameters:
response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve: x²+5x+6=0"}],
    extra_body={
        "enable_conf": True,
        "window_size": 2048,
        "threshold": 17
    }
)

Practical Applications

Recommended Use Cases

Application Configuration Expected Benefit
Real-time Tutoring DeepConf-low 60-80% compute savings
Technical Documentation DeepConf-high <5% accuracy impact
Research Assistance Offline mode Max accuracy priority
Code Generation 1024-token window Balance speed/quality

Future Directions

  1. Training Integration: Incorporate confidence signals into RL training
  2. Error Detection: Address overconfidence on incorrect paths
  3. Multi-Modal Extension: Apply to image-text reasoning tasks
  4. Confidence Calibration: Develop better uncertainty quantification

FAQ

Q: Does DeepConf require model retraining?
A: No, it works with existing models through inference-time modifications.

Q: How to choose window size?
A: Use 2048 tokens for math problems, 1024 for code generation.

Q: Is this compatible with Chinese text?
A: Yes, confidence metrics are language-agnostic.

Q: What’s the typical speed-accuracy tradeoff?
A: DeepConf-low reduces tokens by ~70% with accuracy gains in most cases.


Figure 10: Performance on graduate-level STEM questions


Based on 2508.15260v1.pdf. Project page: jiaweizzhao.github.io/deepconf

Exit mobile version