Qwen3-235B-A22B-Thinking-2507: The Open-Source Reasoning Model That Actually Outperforms GPT on Math and Code
A plain-English, no-hype guide for developers, researchers, and technical product managers who want to understand what this 235-billion-parameter reasoning engine can—and cannot—do.
Table of Contents
- What Exactly Is Qwen3-235B-A22B-Thinking-2507?
- Three Months of Improvements: Quality, Depth, Length
- Model Specs at a Glance
- Benchmark Results in Plain Numbers
- Getting Started: Zero-to-First-Inference Tutorial
- Deployment Recipes: SGLang, vLLM, and Local Tools
- Turning the Model into an Agent
- Best-Practice Settings: Temperature, Context, and Output Length
- Frequently Asked Questions
- Final Thoughts
What Exactly Is Qwen3-235B-A22B-Thinking-2507?
Think of Qwen3-235B-A22B-Thinking-2507 as a specialized “reasoning engine” built on top of the Qwen3 235-billion-parameter Mixture-of-Experts (MoE) architecture.
- 235B = 235 billion total parameters
- A22B = only 22 billion are activated during each forward pass
- Thinking = the model is always in reasoning mode; it can’t be switched off
- 2507 = the July 2025 checkpoint
In short, it is an open-source model that tries to match or exceed the reasoning power of proprietary systems such as OpenAI o3 and Gemini 2.5 Pro, while keeping the inference cost within reach of a well-equipped on-prem GPU cluster.
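One way to make the hardware implications concrete is a back-of-envelope memory estimate. This is a rough sketch that only counts the weights in bfloat16; KV cache, activations, and framework overhead come on top, which is why the tutorial below assumes an A100-class cluster:
# Back-of-envelope: all MoE expert weights must sit in GPU memory,
# even though only ~22B parameters are used per token.
total_params = 235e9
bytes_per_param = 2                                # bfloat16
weight_gb = total_params * bytes_per_param / 1e9
print(f"Weights alone: {weight_gb:.0f} GB")        # ~470 GB
print(f"8 x A100 80 GB: {8 * 80} GB total HBM")    # 640 GB, before KV cache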
Three Months of Improvements: Quality, Depth, Length
The Qwen team compressed their summer sprint into three headline upgrades: higher reasoning quality, deeper multi-step derivations, and longer usable outputs.
If your previous pain point was “It stops mid-solution” or “It jumps over key derivations”, the 2507 checkpoint should feel like a different species.
Model Specs at a Glance
- Parameters: 235 B total, ~22 B activated per forward pass
- Architecture: Mixture-of-Experts transformer, 94 layers, 128 experts with 8 active per token
- Attention: grouped-query attention with 64 query heads and 4 key/value heads
- Native context length: 262,144 tokens
- Mode: thinking only; every generation starts inside a reasoning trace
One caveat: the chat template automatically appends a hidden <think> token to every prompt, so the visible output begins with the reasoning trace and closes it with </think> before the final answer. There is no opening tag in the visible text.
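You can see this for yourself by rendering the template without generating anything. A minimal sketch, assuming only that the tokenizer can be downloaded; the printed prompt should end with an assistant turn that already opens the scratchpad:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")
text = tok.apply_chat_template(
    [{"role": "user", "content": "Hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(text[-60:]))   # expect the rendered prompt to end with "<think>\n"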
Benchmark Results in Plain Numbers
Below are the headline figures across knowledge, reasoning, coding, alignment, agent, and multilingual tasks. Bold = best score in the row.
* OpenAI o3 used high-reasoning effort for starred scores.
Take-aways:
- Math & code: 2507 leads the open-source pack.
- Knowledge QA: differences are within error bars; choose by cost and latency.
- Agent tasks: on par with GPT-4-class models, but still behind o3 on airline/telecom tool use.
Getting Started: Zero-to-First-Inference Tutorial
1. Prerequisites
- Python ≥ 3.9
- 8×A100 80 GB or equivalent (4×A100 40 GB works with reduced context)
- transformers ≥ 4.51.0 (earlier versions throw an error)
2. Install
pip install -U transformers torch
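If you want to pin the minimum version explicitly, a slightly more defensive install looks like this; accelerate is included because the device_map="auto" call in the script below relies on it:
pip install -U "transformers>=4.51.0" accelerate torch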
3. Minimal Inference Script
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",    # load in the checkpoint's native dtype (bfloat16)
    device_map="auto"      # shard the weights across all visible GPUs
)

prompt = "Explain quantum computing in 200 words."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True   # appends the hidden <think> opener
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# 2048 tokens is enough for a quick demo; use 32768+ for real reasoning
# workloads (see the benchmark settings in the FAQ).
generated_ids = model.generate(**model_inputs, max_new_tokens=2048)

# Split reasoning and final answer on the </think> token (id 151668)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
try:
    idx = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    idx = 0   # no </think> found: treat everything as the answer

reasoning = tokenizer.decode(output_ids[:idx], skip_special_tokens=True)
answer = tokenizer.decode(output_ids[idx:], skip_special_tokens=True)
print("Reasoning:", reasoning)
print("Answer:", answer)
You will see two blocks: the hidden scratchpad and the user-ready response.
Deployment Recipes: SGLang, vLLM, and Local Tools
Production: SGLang
python -m sglang.launch_server \
--model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
--tp 8 \
--context-length 262144 \
--reasoning-parser deepseek-r1
- --tp 8 = tensor parallelism across 8 GPUs
- --reasoning-parser hides the scratchpad from clients automatically
Alternative: vLLM
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--enable-reasoning \
--reasoning-parser deepseek_r1
If you hit out-of-memory errors, drop --max-model-len to 131072, but go no lower than 81920; a shorter context truncates long chains of thought.
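Both servers expose an OpenAI-compatible API, so once one of them is running you can query it with the standard openai client. A minimal sketch, assuming the vLLM command above on its default port 8000; the reasoning_content field is the non-standard attribute the reasoning parser adds for the stripped-out scratchpad:
from openai import OpenAI

# Assumes the vLLM server from above, listening on localhost:8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[{"role": "user", "content": "Explain quantum computing in 200 words."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

msg = resp.choices[0].message
print("Answer:", msg.content)
# reasoning_content is added by the reasoning parser; it is not part of the
# standard OpenAI schema, so guard the access.
print("Reasoning:", getattr(msg, "reasoning_content", None))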
Local Desktop
- Ollama, LMStudio, llama.cpp, MLX-LM, and KTransformers all ship with 2507 support.
- Quantized GGUF models are available, but expect a measurable drop in math accuracy.
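For a quick local test with llama.cpp, the invocation looks roughly like this. The GGUF filename is purely illustrative (use whatever quantization you actually downloaded), and -ngl 99 simply offloads as many layers as fit on your GPUs:
./llama-cli -m Qwen3-235B-A22B-Thinking-2507-Q4_K_M.gguf \
  -c 32768 --temp 0.6 --top-p 0.95 -ngl 99 -cnv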
Turning the Model into an Agent
The official helper library is Qwen-Agent. It hides tool-calling templates and parsers so you can focus on business logic.
from qwen_agent.agents import Assistant

# Option 1: DashScope API
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Option 2: Self-hosted OpenAI-compatible endpoint
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#     'model_server': 'http://localhost:8000/v1',
#     'api_key': 'EMPTY',
#     'generate_cfg': {'thought_in_content': True},
# }

tools = [
    {'mcpServers': {            # MCP tool servers launched on demand via uvx
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',         # built-in Python sandbox
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user',
             'content': 'Visit https://qwenlm.github.io/blog/ and summarize the latest updates.'}]

# bot.run() streams intermediate message lists; keep only the final one
for rsp in bot.run(messages):
    pass
print(rsp)
Key points
- MCP servers give you time, fetch, and other micro-tools without extra code.
- Do not include the scratchpad <think>…</think> in the conversation history; Qwen-Agent handles that automatically.
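If you only want the final user-facing reply rather than the whole message list, something like the following works with the loop above. This is a sketch that assumes rsp is the last list of plain-dict messages yielded by bot.run(), which is the default format when you pass dict messages in:
# Sketch: grab the last assistant message with non-empty content.
final_reply = next(
    (m['content'] for m in reversed(rsp)
     if m.get('role') == 'assistant' and m.get('content')),
    None,
)
print(final_reply)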
Best-Practice Settings: Temperature, Context, and Output Length
- Sampling: Temperature 0.6, TopP 0.95; avoid greedy decoding, which tends to cause endless repetition in long reasoning traces.
- Context: up to 262,144 tokens natively; leave headroom for the output when sizing your prompts.
- Output length: 32,768 tokens for typical tasks, 81,920 for hard math and coding problems so the chain of thought is not cut off.
Frequently Asked Questions
Q1: Can I run this on 2×RTX 4090 24 GB?
Not at full precision. You would need 4-bit or 3-bit GGUF, and math accuracy drops. For serious use, stick to ≥ 4×A100 40 GB.
Q2: Why does the output lack an opening <think> tag?
The chat template injects it before generation starts, so the model only needs to emit </think> to mark the end of its scratchpad.
Q3: Is the model suitable for casual chatbots?
Technically yes, but economically no. Every generated token activates 22 billion parameters; reserve the model for tasks that justify the cost.
Q4: How do I reproduce the benchmark scores exactly?
- Math/code tasks: set max_tokens=81920
- All other tasks: set max_tokens=32768
- Temperature 0.6, TopP 0.95
- Add the system prompt: Please reason step by step, and put your final answer within \boxed{}.
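As a concrete sketch, here is how those settings map onto the transformers script from the tutorial. model, tokenizer, and model_inputs are the variables defined there, the sampling keywords are standard generate() arguments, and the \boxed{} instruction would go in as an extra system message before the user turn:
# Benchmark-style generation settings applied to the earlier script.
generated_ids = model.generate(
    **model_inputs,
    do_sample=True,          # greedy decoding is not recommended for thinking models
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=81920,    # math/code setting; use 32768 for everything else
)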
Q5: Does the 262 K context mean I can feed a whole code repo?
Yes, but remember that attention scales quadratically. On 8×A100 80 GB you can comfortably run 100 K input + 20 K output.
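To put a number on that, here is a rough KV-cache estimate for exactly such a 100 K-input plus 20 K-output request. It assumes the attention shape from the spec list above and a bf16 cache; paging overhead and activation buffers come on top:
# Rough KV-cache size for a 120K-token request (K and V, bf16 = 2 bytes each).
layers, kv_heads, head_dim = 94, 4, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2
tokens = 100_000 + 20_000
print(f"KV cache: {bytes_per_token * tokens / 1e9:.1f} GB")   # ~23 GB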
Final Thoughts
If your current pain points are
- multi-step derivations the model skips,
- code templates that never compile on the first try, or
- long documents that exceed 32 K tokens and get truncated,
then Qwen3-235B-A22B-Thinking-2507 is arguably the first open-source model that pushes all three issues above the 90 % usability mark.
The trade-off is straightforward: more GPU budget, more electricity, but fewer human hours spent in loops of prompt engineering and manual debugging.
Choose wisely, benchmark on your own data, and share your findings—the open-source community moves fastest when we all publish real numbers instead of hype.