Qwen3-235B-A22B-Thinking-2507: The Open-Source Reasoning Model That Actually Outperforms GPT on Math and Code

A plain-English, no-hype guide for developers, researchers, and technical product managers who want to understand what this 235-billion-parameter reasoning engine can—and cannot—do.


Table of Contents

  1. What Exactly Is Qwen3-235B-A22B-Thinking-2507?
  2. Three Months of Improvements: Quality, Depth, Length
  3. Model Specs at a Glance
  4. Benchmark Results in Plain Numbers
  5. Getting Started: Zero-to-First-Inference Tutorial
  6. Deployment Recipes: SGLang, vLLM, and Local Tools
  7. Turning the Model into an Agent
  8. Best-Practice Settings: Temperature, Context, and Output Length
  9. Frequently Asked Questions

What Exactly Is Qwen3-235B-A22B-Thinking-2507?

Think of Qwen3-235B-A22B-Thinking-2507 as a specialized “reasoning engine” built on top of the Qwen3 235-billion-parameter Mixture-of-Experts (MoE) architecture.

  • 235B = 235 billion total parameters
  • A22B = only 22 billion are activated during each forward pass
  • Thinking = the model is always in reasoning mode; it can’t be switched off
  • 2507 = the July 2025 checkpoint

In short, it is an open-source model that tries to match or exceed the reasoning power of proprietary systems such as OpenAI o3 and Gemini 2.5 Pro, while keeping the inference cost within reach of a well-equipped on-prem GPU cluster.
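
To make the 235 B total vs. 22 B activated distinction concrete, here is a toy sketch of top-k expert routing. It uses tiny dimensions and a naive loop and is not the actual Qwen3 implementation; the only point is that a router scores all 128 experts for each token but runs just 8 of them, so only a fraction of the stored parameters does work on any given token.

import torch

# Toy MoE layer: 128 experts, 8 activated per token (illustrative, not Qwen3 code).
num_experts, top_k, hidden = 128, 8, 16

router = torch.nn.Linear(hidden, num_experts, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(hidden, hidden) for _ in range(num_experts)
)

def moe_forward(x):                        # x: [tokens, hidden]
    scores = router(x)                     # score every expert for every token
    weights, idx = scores.topk(top_k, dim=-1)
    weights = weights.softmax(dim=-1)      # mix only the chosen experts
    out = torch.zeros_like(x)
    for t in range(x.size(0)):             # plain loop for clarity, not speed
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_forward(torch.randn(4, hidden)).shape)  # torch.Size([4, 16])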


Three Months of Improvements: Quality, Depth, Length

The Qwen team compressed their summer sprint into three headline upgrades:

Dimension | Previous Version | July 2025 (2507) | What You Will Notice
Math (AIME25) | 81.5 % | 92.3 % | Fewer wrong final answers on competition-level problems
Code (LiveCodeBench v6) | 55.7 % | 74.1 % | Longer, compilable functions with fewer manual fixes
Context Window | 128 K tokens | 262 K tokens (≈ 210 k Chinese characters) | You can feed an entire technical report plus references in one go
Reasoning Length | 32 K output limit | 82 K usable output tokens | The model shows its scratch-work instead of skipping steps

If your previous pain point was “It stops mid-solution” or “It jumps over key derivations”, the 2507 checkpoint should feel like a different species.


Model Specs at a Glance

Item | Value | Plain-English Note
Architecture | Causal decoder-only MoE | Same family as GPT, but with expert routing
Total Parameters | 235 B | Stored on disk
Activated Parameters | 22 B | Used in compute for each token (all 235 B must still be stored)
Layers | 94 | "Processing stations" stacked on top of each other
Attention Heads (GQA) | 64 query + 4 key/value | Grouped-query attention saves memory
Experts | 128 total | Like 128 specialized sub-models
Activated Experts | 8 per token | Keeps throughput high and cost low
Context Length | 262,144 tokens natively | Enough for a 200-page PDF
Modes | Thinking mode only | You cannot turn off internal reasoning

One caveat: the chat template appends an opening <think> tag to the prompt automatically, so the model's output begins directly with the reasoning trace and only emits the closing </think>. You will not see an opening tag in the generated text.
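
You can check this yourself by rendering the template and printing it. The snippet below is a sketch; the tokens shown in the comment are what the standard Qwen chat markup would suggest and may differ slightly between template revisions.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)
# Expected tail (illustrative): ...<|im_start|>assistant\n<think>\n
# i.e. generation starts already inside the scratchpad, so the model only needs to close it.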


Benchmark Results in Plain Numbers

Below are the headline figures across knowledge, reasoning, coding, alignment, agent, and multilingual tasks.

Category | Task | 2507 | DeepSeek-R1 | OpenAI o3 | Gemini-2.5 Pro
Knowledge | MMLU-Pro | 84.4 | 85.0 | 85.9 | 85.6
Knowledge | SuperGPQA | 64.9 | 61.7 | – | 62.3
Reasoning | AIME25 | 92.3 | 87.5 | 88.9* | 88.0
Reasoning | HMMT25 | 83.9 | 79.4 | 77.5 | 82.5
Coding | LiveCodeBench v6 | 74.1 | 68.7 | 58.6 | 72.5
Coding | CFEval (points) | 2 134 | 2 099 | 2 043 | 2 001
Alignment | IFEval | 87.8 | 79.1 | 92.1 | 90.8
Agent | BFCL-v3 | 71.9 | 63.8 | 72.4 | 67.2
Multilingual | MultiIF | 80.6 | 63.5 | 80.3 | 77.8

* OpenAI o3 used high-reasoning effort for starred scores.

Take-away:

  • Math & code = 2507 leads the open-source pack.
  • Knowledge QA = differences are within error bars; choose by cost and latency.
  • Agent tasks = on par with GPT-4-class models, but still behind o3 on airline/telecom tool use.

Getting Started: Zero-to-First-Inference Tutorial

1. Prerequisites

  • Python ≥ 3.9
  • 8×A100 80 GB or equivalent (4×A100 40 GB works with reduced context)
  • transformers ≥ 4.51.0 (earlier versions fail with KeyError: 'qwen3_moe')

2. Install

pip install -U transformers torch
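
Before pulling the full checkpoint, a quick sanity check of the environment saves time (the version floor comes from the prerequisites above):

import torch
import transformers

# transformers < 4.51.0 cannot load the qwen3_moe architecture
print("transformers:", transformers.__version__)
print("CUDA GPUs visible:", torch.cuda.device_count())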

3. Minimal Inference Script

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Explain quantum computing in 200 words."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=2048)  # demo value; use 32768+ in practice so long reasoning traces are not cut off

# Split reasoning and final answer
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
try:
    idx = len(output_ids) - output_ids[::-1].index(151668)  # token id of </think>
except ValueError:
    idx = 0

reasoning = tokenizer.decode(output_ids[:idx], skip_special_tokens=True)
answer = tokenizer.decode(output_ids[idx:], skip_special_tokens=True)

print("Reasoning:", reasoning)
print("Answer:", answer)

You will see two blocks: the hidden scratchpad and the user-ready response.


Deployment Recipes: SGLang, vLLM, and Local Tools

Production: SGLang

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --tp 8 \
  --context-length 262144 \
  --reasoning-parser deepseek-r1

  • --tp 8 = tensor parallelism across 8 GPUs
  • --reasoning-parser hides the scratchpad from clients automatically

Alternative: vLLM

vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1

If you hit out-of-memory errors, drop --max-model-len to 131072 but no lower than 81920; shorter context truncates long chains of thought.
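
Both servers expose an OpenAI-compatible endpoint, so any standard client can call the model. The sketch below assumes the vLLM default port 8000 (SGLang defaults to 30000); adjust the base URL and model name to match your deployment.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)
print(resp.choices[0].message.content)  # with a reasoning parser enabled, the scratchpad stays out of content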

Local Desktop

  • Ollama, LMStudio, llama.cpp, MLX-LM, and KTransformers all ship with 2507 support.
  • Quantized GGUF models are available, but expect a measurable drop in math accuracy.

Turning the Model into an Agent

The official helper library is Qwen-Agent. It hides tool-calling templates and parsers so you can focus on business logic.

from qwen_agent.agents import Assistant

# Option 1: DashScope API
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Option 2: Self-hosted OpenAI-compatible endpoint
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#     'model_server': 'http://localhost:8000/v1',
#     'api_key': 'EMPTY',
#     'generate_cfg': {'thought_in_content': True},
# }

tools = [
    {'mcpServers': {
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user',
             'content': 'Visit https://qwenlm.github.io/blog/ and summarize the latest updates.'}]
for rsp in bot.run(messages):
    pass
print(rsp)

Key points

  • MCP servers give you time, fetch, and other micro-tools without extra code.
  • Do not include the scratchpad <think>…</think> in the conversation history; Qwen-Agent handles that automatically (a manual helper is sketched below).
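
If you manage conversation history yourself instead of going through Qwen-Agent, a small helper along these lines (our own sketch, not part of any Qwen library) keeps the scratchpad out of later turns:

import re

def strip_think(assistant_text: str) -> str:
    # Keep only what follows the closing </think> tag.
    return re.sub(r"^.*?</think>", "", assistant_text, flags=re.DOTALL).lstrip()

raw = "Let me work through the question step by step... </think>\nHere is the final answer."
print(strip_think(raw))  # -> "Here is the final answer."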

Best-Practice Settings: Temperature, Context, and Output Length

Scenario | Recommended Settings | Rationale
General chat | Temp 0.6, TopP 0.95, TopK 20, MinP 0 | Balanced creativity and coherence
Math competitions | Temp 0.3 | Lower randomness, reproducible answers
Long-form writing | max_new_tokens=81920 | Enough runway for step-by-step exposition
Multi-turn dialogue | Exclude <think> blocks from history | Saves tokens and latency
Repetition loops | Raise presence_penalty to 0.5–1.5 | Breaks loops without hurting factual recall
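
Applied to the Transformers tutorial above, the general-chat row translates roughly into the generation config below; model and model_inputs are the objects from that script, and presence_penalty is a server-side option in vLLM/SGLang rather than a generate() argument.

from transformers import GenerationConfig

gen_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    max_new_tokens=81920,  # generous runway; 32768 is enough for routine queries
)
generated_ids = model.generate(**model_inputs, generation_config=gen_cfg)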

Frequently Asked Questions

Q1: Can I run this on 2×RTX 4090 24 GB?
Not at full precision. You would need 4-bit or 3-bit GGUF, and math accuracy drops. For serious use, stick to ≥ 4×A100 40 GB.

Q2: Why does the output lack an opening <think> tag?
The chat template injects it before generation starts. The model only needs to emit </think> to mark the end of its scratchpad.

Q3: Is the model suitable for casual chatbots?
Technically yes, but economically no. Every generated token runs 22 B parameters' worth of compute; reserve it for tasks that justify the cost.

Q4: How do I reproduce the benchmark scores exactly?

  • Math/code tasks: set max_tokens=81920
  • All others: max_tokens=32768
  • Temperature 0.6, TopP 0.95
  • For math problems, add this instruction to the prompt:

    Please reason step by step, and put your final answer within \boxed{}.
    

Q5: Does the 262 K context mean I can feed a whole code repo?
Yes, but remember that attention scales quadratically. On 8×A100 80 GB you can comfortably run 100 K input + 20 K output.
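
As a rough feasibility check before pasting a repository into the prompt, you can count tokens with the model's own tokenizer; the repository path below is hypothetical.

from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Thinking-2507")

total = 0
for path in Path("./my_repo").rglob("*.py"):          # hypothetical repo location
    total += len(tokenizer.encode(path.read_text(errors="ignore")))

print(f"{total:,} input tokens (native limit: 262,144)")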


Final Thoughts

If your current pain points are

  • Multi-step derivations the model skips,
  • Code templates that never compile on the first try, or
  • Long documents that exceed 32 K tokens and get truncated,

then Qwen3-235B-A22B-Thinking-2507 is arguably the first open-source model that handles all three convincingly.
The trade-off is straightforward: more GPU budget, more electricity, but fewer human hours spent in loops of prompt engineering and manual debugging.

Choose wisely, benchmark on your own data, and share your findings—the open-source community moves fastest when we all publish real numbers instead of hype.