Qwen3-235B-A22B-Thinking-2507: The Open-Source Reasoning Model That Actually Outperforms GPT on Math and Code
A plain-English, no-hype guide for developers, researchers, and technical product managers who want to understand what this 235-billion-parameter reasoning engine can—and cannot—do.
Table of Contents
- What Exactly Is Qwen3-235B-A22B-Thinking-2507?
- Three Months of Improvements: Quality, Depth, Length
- Model Specs at a Glance
- Benchmark Results in Plain Numbers
- Getting Started: Zero-to-First-Inference Tutorial
- Deployment Recipes: SGLang, vLLM, and Local Tools
- Turning the Model into an Agent
- Best-Practice Settings: Temperature, Context, and Output Length
- Frequently Asked Questions
What Exactly Is Qwen3-235B-A22B-Thinking-2507?
Think of Qwen3-235B-A22B-Thinking-2507 as a specialized “reasoning engine” built on top of the Qwen3 235-billion-parameter Mixture-of-Experts (MoE) architecture.
- 235B = 235 billion total parameters
- A22B = only 22 billion are activated during each forward pass
- Thinking = the model is always in reasoning mode; it can’t be switched off
- 2507 = the July 2025 checkpoint
In short, it is an open-source model that tries to match or exceed the reasoning power of proprietary systems such as OpenAI o3 and Gemini 2.5 Pro, while keeping the inference cost within reach of a well-equipped on-prem GPU cluster.
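To get a feel for what “22 B activated” means in practice, a rough back-of-envelope memory estimate helps (a sketch only; the figures assume bf16 weights and ignore KV cache and activation overhead):

```python
# Rough back-of-envelope memory estimate (approximation, bf16 weights only).
TOTAL_PARAMS = 235e9    # all 128 experts must be resident in GPU memory
ACTIVE_PARAMS = 22e9    # parameters actually exercised per token
BYTES_PER_PARAM = 2     # bf16/fp16

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9

print(f"Weights that must be loaded: ~{weights_gb:.0f} GB")  # ~470 GB
print(f"Weights touched per token:   ~{active_gb:.0f} GB")   # ~44 GB
```

This is why the prerequisites below call for an 8×A100 80 GB class machine: the full 235 B weights have to live somewhere, even though only 22 B of them do work on any given token.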
Three Months of Improvements: Quality, Depth, Length
The Qwen team compressed their summer sprint into three headline upgrades:
Dimension | Previous Version | July 2025 (2507) | What You Will Notice |
---|---|---|---|
Math (AIME25) | 81.5 % | 92.3 % | Fewer wrong final answers on competition-level problems |
Code (LiveCodeBench v6) | 55.7 % | 74.1 % | Longer, compilable functions with fewer manual fixes |
Context Window | 128 K | 262 K tokens ≈ 210 k Chinese characters | You can feed an entire technical report plus references in one go |
Reasoning Length | 32 K output limit | 82 K usable tokens | The model shows its scratch-work instead of skipping steps |
If your previous pain point was “It stops mid-solution” or “It jumps over key derivations”, the 2507 checkpoint should feel like a different species.
Model Specs at a Glance
Item | Value | Plain-English Note |
---|---|---|
Architecture | Causal decoder-only MoE | Same family as GPT, but with expert routing |
Total Parameters | 235 B | Stored on disk |
Activated Parameters | 22 B | Loaded into GPU memory per token |
Layers | 94 | “Processing stations” stacked on top of each other |
Attention Heads (GQA) | 64 query + 4 key/value | Grouped-query attention saves memory |
Experts | 128 total | Like 128 specialized sub-models |
Activated Experts | 8 per token | Keeps throughput high and cost low |
Context Length | 262 144 tokens natively | Enough for a 200-page PDF |
Modes | Thinking mode only | You cannot turn off internal reasoning |
One caveat: the chat template automatically injects a hidden `<think>` token, so the model output starts with the reasoning trace and ends with `</think>`. There is no opening tag in the visible text.
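If you want to verify these numbers against the checkpoint you actually downloaded, the model configuration can be inspected without loading any weights (a small sketch; the field names follow the Hugging Face Qwen3-MoE config and may differ slightly between transformers versions):

```python
from transformers import AutoConfig

# Loads only the JSON config (a few KB), not the ~470 GB of weights.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")

print(cfg.num_hidden_layers)        # expected: 94
print(cfg.num_attention_heads)      # expected: 64 query heads
print(cfg.num_key_value_heads)      # expected: 4 (grouped-query attention)
print(cfg.num_experts)              # expected: 128
print(cfg.num_experts_per_tok)      # expected: 8
print(cfg.max_position_embeddings)  # expected: 262144
```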
Benchmark Results in Plain Numbers
Below are the headline figures across knowledge, reasoning, coding, alignment, agent, and multilingual tasks. Bold = best score in the row.
Category | Task | 2507 | DeepSeek-R1 | OpenAI o3 | Gemini-2.5 Pro |
---|---|---|---|---|---|
Knowledge | MMLU-Pro | 84.4 | 85.0 | **85.9** | 85.6 |
Knowledge | SuperGPQA | **64.9** | 61.7 | — | 62.3 |
Reasoning | AIME25 | **92.3** | 87.5 | 88.9* | 88.0 |
Reasoning | HMMT25 | **83.9** | 79.4 | 77.5 | 82.5 |
Coding | LiveCodeBench v6 | **74.1** | 68.7 | 58.6 | 72.5 |
Coding | CFEval (points) | **2 134** | 2 099 | 2 043 | 2 001 |
Alignment | IFEval | 87.8 | 79.1 | **92.1** | 90.8 |
Agent | BFCL-v3 | 71.9 | 63.8 | **72.4** | 67.2 |
Multilingual | MultiIF | **80.6** | 63.5 | 80.3 | 77.8 |
* OpenAI o3 used high-reasoning effort for starred scores.
Take-away:
- Math & code = 2507 leads the open-source pack.
- Knowledge QA = differences are within error bars; choose by cost and latency.
- Agent tasks = on par with GPT-4-class models, but still behind o3 on airline/telecom tool use.
Getting Started: Zero-to-First-Inference Tutorial
1. Prerequisites
- Python ≥ 3.9
- 8×A100 80 GB or equivalent (4×A100 40 GB works with reduced context)
- transformers ≥ 4.51.0 (earlier versions throw an error)
2. Install
```bash
# accelerate is required for device_map="auto" in the script below
pip install -U "transformers>=4.51.0" accelerate torch
```
3. Minimal Inference Script
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Explain quantum computing in 200 words."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=2048)

# Split reasoning and final answer
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
try:
    idx = len(output_ids) - output_ids[::-1].index(151668)  # token id of </think>
except ValueError:
    idx = 0
reasoning = tokenizer.decode(output_ids[:idx], skip_special_tokens=True)
answer = tokenizer.decode(output_ids[idx:], skip_special_tokens=True)

print("Reasoning:", reasoning)
print("Answer:", answer)
```
You will see two blocks: the hidden scratchpad and the user-ready response.
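If you would rather watch the scratchpad appear token by token instead of waiting for the whole generation, the built-in `TextStreamer` from transformers drops straight into the same script (a minimal sketch reusing `model`, `tokenizer`, and `model_inputs` from above):

```python
from transformers import TextStreamer

# Streams tokens to stdout as they are produced; the reasoning trace comes
# first, and the </think> marker shows where the final answer begins.
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**model_inputs, max_new_tokens=2048, streamer=streamer)
```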
Deployment Recipes: SGLang, vLLM, and Local Tools
Production: SGLang
```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --tp 8 \
  --context-length 262144 \
  --reasoning-parser deepseek-r1
```
- `--tp 8` = tensor parallelism across 8 GPUs
- `--reasoning-parser` hides the scratchpad from clients automatically
Alternative: vLLM
```bash
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```
If you hit out-of-memory errors, drop `--max-model-len` to 131072, but go no lower than 81920; shorter contexts truncate long chains of thought.
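Both servers expose an OpenAI-compatible API, so a quick smoke test needs nothing beyond the standard openai Python client (a minimal sketch; it assumes vLLM's default port 8000 and a dummy API key, so adjust the base URL and model name to match your deployment):

```python
from openai import OpenAI

# Point the standard OpenAI client at the self-hosted endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[{"role": "user", "content": "What is 17 * 24? Show your work."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
)

# With --reasoning-parser enabled, the scratchpad is stripped out and only
# the user-ready answer lands in message.content.
print(resp.choices[0].message.content)
```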
Local Desktop
- Ollama, LMStudio, llama.cpp, MLX-LM, and KTransformers all ship with 2507 support.
- Quantized GGUF models are available, but expect a measurable drop in math accuracy.
Turning the Model into an Agent
The official helper library is Qwen-Agent. It hides tool-calling templates and parsers so you can focus on business logic.
```python
from qwen_agent.agents import Assistant

# Option 1: DashScope API
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Option 2: Self-hosted OpenAI-compatible endpoint
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#     'model_server': 'http://localhost:8000/v1',
#     'api_key': 'EMPTY',
#     'generate_cfg': {'thought_in_content': True},
# }

tools = [
    {'mcpServers': {
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user',
             'content': 'Visit https://qwenlm.github.io/blog/ and summarize the latest updates.'}]

for rsp in bot.run(messages):
    pass
print(rsp)
```
Key points
- MCP servers give you time, fetch, and other micro-tools without extra code.
- Do not include the scratchpad `<think>…</think>` in the conversation history; Qwen-Agent handles that automatically.
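If you manage multi-turn history yourself instead of going through Qwen-Agent, a small helper that removes the scratchpad before an assistant turn is appended back to the history is all you need (a minimal sketch; `strip_think` and its regex are illustrative, not part of any Qwen library):

```python
import re

def strip_think(text: str) -> str:
    """Remove the <think>…</think> scratchpad from an assistant reply."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # The chat template omits the opening tag, so also handle a bare </think>.
    return text.split("</think>")[-1].strip()

history = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
raw_reply = "...long scratchpad...</think>\nAssume sqrt(2) = p/q in lowest terms ..."
history.append({"role": "assistant", "content": strip_think(raw_reply)})
```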
Best-Practice Settings: Temperature, Context, and Output Length
Scenario | Recommended Settings | Rationale |
---|---|---|
General chat | Temp 0.6, TopP 0.95, TopK 20, MinP 0 | Balanced creativity and coherence |
Math competitions | Temp 0.3 | Lower randomness, reproducible answers |
Long-form writing | max_new_tokens=81920 | Enough runway for step-by-step exposition |
Multi-turn dialogue | Exclude `<think>` blocks from history | Saves tokens and latency |
Repetition loops | Raise presence_penalty 0.5–1.5 | Breaks loops without hurting factual recall |
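Applied to the transformers script from earlier, the general-chat row maps onto `generate()` arguments roughly like this (a sketch; the values simply mirror the table, and `min_p` requires a reasonably recent transformers release):

```python
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=81920,  # long runway for scratchpad plus answer
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
)
```

Note that presence_penalty itself is a server-side knob exposed by vLLM and SGLang; when running through transformers directly, repetition_penalty is the closest equivalent.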
Frequently Asked Questions
Q1: Can I run this on 2×RTX 4090 24 GB?
Not at full precision. You would need 4-bit or 3-bit GGUF, and math accuracy drops. For serious use, stick to ≥ 4×A100 40 GB.
Q2: Why does the output lack an opening `<think>` tag?
The chat template injects it before generation starts; the model only needs to emit `</think>` to mark the end of its scratchpad.
Q3: Is the model suitable for casual chatbots?
Technically yes, but economically no: every generated token pays for 22 B activated parameters, so reserve it for tasks that justify the cost.
Q4: How do I reproduce the benchmark scores exactly?
- Math/code tasks: set max_tokens=81920
- All others: max_tokens=32768
- Temperature 0.6, TopP 0.95
- Add the system prompt: Please reason step by step, and put your final answer within \boxed{}.
Q5: Does the 262 K context mean I can feed a whole code repo?
Yes, but remember that attention scales quadratically. On 8×A100 80 GB you can comfortably run 100 K input + 20 K output.
Final Thoughts
If your current pain points are
- Multi-step derivations the model skips,
- Code templates that never compile on the first try, or
- Long documents that exceed 32 K tokens and get truncated,
then Qwen3-235B-A22B-Thinking-2507 is arguably the first open-source model that handles all three reliably enough for day-to-day use.
The trade-off is straightforward: more GPU budget, more electricity, but fewer human hours spent in loops of prompt engineering and manual debugging.
Choose wisely, benchmark on your own data, and share your findings—the open-source community moves fastest when we all publish real numbers instead of hype.