Qwen3-235B-A22B-Thinking-2507: The Open-Source Reasoning Model That Actually Outperforms GPT on Math and Code
A plain-English, no-hype guide for developers, researchers, and technical product managers who want to understand what this 235-billion-parameter reasoning engine can—and cannot—do.
Table of Contents
- What Exactly Is Qwen3-235B-A22B-Thinking-2507?
- Three Months of Improvements: Quality, Depth, Length
- Model Specs at a Glance
- Benchmark Results in Plain Numbers
- Getting Started: Zero-to-First-Inference Tutorial
- Deployment Recipes: SGLang, vLLM, and Local Tools
- Turning the Model into an Agent
- Best-Practice Settings: Temperature, Context, and Output Length
- Frequently Asked Questions
What Exactly Is Qwen3-235B-A22B-Thinking-2507?
Think of Qwen3-235B-A22B-Thinking-2507 as a specialized “reasoning engine” built on top of the Qwen3 235-billion-parameter Mixture-of-Experts (MoE) architecture.
- 235B = 235 billion total parameters
- A22B = only 22 billion are activated during each forward pass
- Thinking = the model is always in reasoning mode; it can’t be switched off
- 2507 = the July 2025 checkpoint
In short, it is an open-source model that tries to match or exceed the reasoning power of proprietary systems such as OpenAI o3 and Gemini 2.5 Pro, while keeping the inference cost within reach of a well-equipped on-prem GPU cluster.
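To get a feel for what “22 B activated” means in practice, a rough back-of-envelope memory estimate helps (a sketch only; the figures assume bf16 weights and ignore KV cache and activation overhead):

```python
# Rough back-of-envelope memory estimate (approximation, bf16 weights only).
TOTAL_PARAMS = 235e9    # all 128 experts must be resident in GPU memory
ACTIVE_PARAMS = 22e9    # parameters actually exercised per token
BYTES_PER_PARAM = 2     # bf16/fp16

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9

print(f"Weights that must be loaded: ~{weights_gb:.0f} GB")  # ~470 GB
print(f"Weights touched per token:   ~{active_gb:.0f} GB")   # ~44 GB
```

This is why the prerequisites below call for an 8×A100 80 GB class machine: the full 235 B weights have to live somewhere, even though only 22 B of them do work on any given token.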
Three Months of Improvements: Quality, Depth, Length
The Qwen team compressed their summer sprint into three headline upgrades:
Dimension | Previous Version | July 2025 (2507) | What You Will Notice |
---|---|---|---|
Math (AIME25) | 81.5 % | 92.3 % | Fewer wrong final answers on competition-level problems |
Code (LiveCodeBench v6) | 55.7 % | 74.1 % | Longer, compilable functions with fewer manual fixes |
Context Window | 128 K | 262 K tokens ≈ 210 k Chinese characters | You can feed an entire technical report plus references in one go |
Reasoning Length | 32 K output limit | 82 K usable tokens | The model shows its scratch-work instead of skipping steps |
If your previous pain point was “It stops mid-solution” or “It jumps over key derivations”, the 2507 checkpoint should feel like a different species.
Model Specs at a Glance
Item | Value | Plain-English Note |
---|---|---|
Architecture | Causal decoder-only MoE | Same family as GPT, but with expert routing |
Total Parameters | 235 B | Stored on disk |
Activated Parameters | 22 B | Loaded into GPU memory per token |
Layers | 94 | “Processing stations” stacked on top of each other |
Attention Heads (GQA) | 64 query + 4 key/value | Grouped-query attention saves memory |
Experts | 128 total | Like 128 specialized sub-models |
Activated Experts | 8 per token | Keeps throughput high and cost low |
Context Length | 262 144 tokens natively | Enough for a 200-page PDF |
Modes | Thinking mode only | You cannot turn off internal reasoning |
One caveat: the chat template automatically injects a hidden `<think>` token, so the model output starts with the reasoning trace and ends with `</think>`. There is no opening tag in the visible text.
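If you want to verify these numbers against the checkpoint you actually downloaded, the model configuration can be inspected without loading any weights (a small sketch; the field names follow the Hugging Face Qwen3-MoE config and may differ slightly between transformers versions):

```python
from transformers import AutoConfig

# Loads only the JSON config (a few KB), not the ~470 GB of weights.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")

print(cfg.num_hidden_layers)        # expected: 94
print(cfg.num_attention_heads)      # expected: 64 query heads
print(cfg.num_key_value_heads)      # expected: 4 (grouped-query attention)
print(cfg.num_experts)              # expected: 128
print(cfg.num_experts_per_tok)      # expected: 8
print(cfg.max_position_embeddings)  # expected: 262144
```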
Benchmark Results in Plain Numbers
Below are the headline figures across knowledge, reasoning, coding, alignment, agent, and multilingual tasks. Bold = best score in the row.
Category | Task | 2507 | DeepSeek-R1 | OpenAI o3 | Gemini-2.5 Pro |
---|---|---|---|---|---|
Knowledge | MMLU-Pro | 84.4 | 85.0 | **85.9** | 85.6 |
Knowledge | SuperGPQA | **64.9** | 61.7 | — | 62.3 |
Reasoning | AIME25 | **92.3** | 87.5 | 88.9* | 88.0 |
Reasoning | HMMT25 | **83.9** | 79.4 | 77.5 | 82.5 |
Coding | LiveCodeBench v6 | **74.1** | 68.7 | 58.6 | 72.5 |
Coding | CFEval (points) | **2 134** | 2 099 | 2 043 | 2 001 |
Alignment | IFEval | 87.8 | 79.1 | **92.1** | 90.8 |
Agent | BFCL-v3 | 71.9 | 63.8 | **72.4** | 67.2 |
Multilingual | MultiIF | **80.6** | 63.5 | 80.3 | 77.8 |
* OpenAI o3 used high-reasoning effort for starred scores.
Take-away:
- Math & code = 2507 leads the open-source pack.
- Knowledge QA = differences are within error bars; choose by cost and latency.
- Agent tasks = on par with GPT-4-class models, but still behind o3 on airline/telecom tool use.
Getting Started: Zero-to-First-Inference Tutorial
1. Prerequisites
- Python ≥ 3.9
- 8×A100 80 GB or equivalent (4×A100 40 GB works with reduced context)
- transformers ≥ 4.51.0 (earlier versions throw an error)
2. Install
```bash
# accelerate is required for device_map="auto" in the script below
pip install -U "transformers>=4.51.0" accelerate torch
```
3. Minimal Inference Script
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Explain quantum computing in 200 words."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=2048)

# Split reasoning and final answer
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
try:
    idx = len(output_ids) - output_ids[::-1].index(151668)  # token id of </think>
except ValueError:
    idx = 0
reasoning = tokenizer.decode(output_ids[:idx], skip_special_tokens=True)
answer = tokenizer.decode(output_ids[idx:], skip_special_tokens=True)

print("Reasoning:", reasoning)
print("Answer:", answer)
```
You will see two blocks: the hidden scratchpad and the user-ready response.
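If you would rather watch the scratchpad appear token by token instead of waiting for the whole generation, the built-in `TextStreamer` from transformers drops straight into the same script (a minimal sketch reusing `model`, `tokenizer`, and `model_inputs` from above):

```python
from transformers import TextStreamer

# Streams tokens to stdout as they are produced; the reasoning trace comes
# first, and the </think> marker shows where the final answer begins.
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**model_inputs, max_new_tokens=2048, streamer=streamer)
```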
Deployment Recipes: SGLang, vLLM, and Local Tools
Production: SGLang
```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --tp 8 \
  --context-length 262144 \
  --reasoning-parser deepseek-r1
```
- `--tp 8` = tensor parallelism across 8 GPUs
- `--reasoning-parser` hides the scratchpad from clients automatically
Alternative: vLLM
```bash
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```
If you hit out-of-memory errors, drop `--max-model-len` to 131072, but go no lower than 81920; shorter contexts truncate long chains of thought.
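Both servers expose an OpenAI-compatible API, so a quick smoke test needs nothing beyond the standard openai Python client (a minimal sketch; it assumes vLLM's default port 8000 and a dummy API key, so adjust the base URL and model name to match your deployment):

```python
from openai import OpenAI

# Point the standard OpenAI client at the self-hosted endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[{"role": "user", "content": "What is 17 * 24? Show your work."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
)

# With --reasoning-parser enabled, the scratchpad is stripped out and only
# the user-ready answer lands in message.content.
print(resp.choices[0].message.content)
```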
Local Desktop
- Ollama, LMStudio, llama.cpp, MLX-LM, and KTransformers all ship with 2507 support.
- Quantized GGUF models are available, but expect a measurable drop in math accuracy.
Turning the Model into an Agent
The official helper library is Qwen-Agent. It hides tool-calling templates and parsers so you can focus on business logic.
```python
from qwen_agent.agents import Assistant

# Option 1: DashScope API
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Option 2: Self-hosted OpenAI-compatible endpoint
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#     'model_server': 'http://localhost:8000/v1',
#     'api_key': 'EMPTY',
#     'generate_cfg': {'thought_in_content': True},
# }

tools = [
    {'mcpServers': {
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user',
             'content': 'Visit https://qwenlm.github.io/blog/ and summarize the latest updates.'}]

for rsp in bot.run(messages):
    pass
print(rsp)
```
Key points
- MCP servers give you time, fetch, and other micro-tools without extra code.
- Do not include the scratchpad `<think>…</think>` in the conversation history; Qwen-Agent handles that automatically.
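If you manage multi-turn history yourself instead of going through Qwen-Agent, a small helper that removes the scratchpad before an assistant turn is appended back to the history is all you need (a minimal sketch; `strip_think` and its regex are illustrative, not part of any Qwen library):

```python
import re

def strip_think(text: str) -> str:
    """Remove the <think>…</think> scratchpad from an assistant reply."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # The chat template omits the opening tag, so also handle a bare </think>.
    return text.split("</think>")[-1].strip()

history = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
raw_reply = "...long scratchpad...</think>\nAssume sqrt(2) = p/q in lowest terms ..."
history.append({"role": "assistant", "content": strip_think(raw_reply)})
```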
Best-Practice Settings: Temperature, Context, and Output Length
Scenario | Recommended Settings | Rationale |
---|---|---|
General chat | Temp 0.6, TopP 0.95, TopK 20, MinP 0 | Balanced creativity and coherence |
Math competitions | Temp 0.3 | Lower randomness, reproducible answers |
Long-form writing | max_new_tokens=81920 | Enough runway for step-by-step exposition |
Multi-turn dialogue | Exclude `<think>` blocks from history | Saves tokens and latency |
Repetition loops | Raise presence_penalty 0.5–1.5 | Breaks loops without hurting factual recall |
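Applied to the transformers script from earlier, the general-chat row maps onto `generate()` arguments roughly like this (a sketch; the values simply mirror the table, and `min_p` requires a reasonably recent transformers release):

```python
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=81920,  # long runway for scratchpad plus answer
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```

Note that presence_penalty itself is a server-side knob exposed by vLLM and SGLang; when running through transformers directly, repetition_penalty is the closest equivalent.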
Frequently Asked Questions
Q1: Can I run this on 2×RTX 4090 24 GB?
Not at full precision. You would need 4-bit or 3-bit GGUF, and math accuracy drops. For serious use, stick to ≥ 4×A100 40 GB.
Q2: Why does the output lack an opening `<think>` tag?
The chat template injects it before generation starts; the model only needs to emit `</think>` to mark the end of its scratchpad.
Q3: Is the model suitable for casual chatbots?
Technically yes, but economically no: every generated token pays for 22 B activated parameters, so reserve it for tasks that justify the cost.
Q4: How do I reproduce the benchmark scores exactly?
- Math/code tasks: set max_tokens=81920
- All others: max_tokens=32768
- Temperature 0.6, TopP 0.95
- Add the system prompt: Please reason step by step, and put your final answer within \boxed{}.
Q5: Does the 262 K context mean I can feed a whole code repo?
Yes, but remember that attention scales quadratically. On 8×A100 80 GB you can comfortably run 100 K input + 20 K output.
Final Thoughts
If your current pain points are
- Multi-step derivations the model skips,
- Code templates that never compile on the first try, or
- Long documents that exceed 32 K tokens and get truncated,
then Qwen3-235B-A22B-Thinking-2507 is arguably the first open-source model that handles all three reliably enough for day-to-day use.
The trade-off is straightforward: more GPU budget, more electricity, but fewer human hours spent in loops of prompt engineering and manual debugging.
Choose wisely, benchmark on your own data, and share your findings—the open-source community moves fastest when we all publish real numbers instead of hype.