DeepSeek-V3.1: A Friendly, No-Jargon Guide for First-Time Users

Written by an Engineer Who Still Reads Manuals First

If you have ever unboxed a new laptop and reached for the quick-start card before pressing the power button, treat this article the same way.
Below you will find nothing more—and nothing less—than the official DeepSeek-V3.1 documentation, rewritten in plain English for curious readers who have at least a junior-college background but do not live inside research papers.

1. What Exactly Is DeepSeek-V3.1?

DeepSeek-V3.1 is one neural network that can behave like two different assistants:

Non-Thinking Mode – gives quick, direct answers (think of a helpful customer-service rep).
Thinking Mode – shows its scratch-work before answering (think of a student who hands in both the exam and the draft paper).

You switch between the two by changing a short text template. There is nothing extra to download, and the per-token price is the same in both modes.

2. New Pricing (Effective September 6, 2025, 00:00 Beijing Time)

Usage Event	Cost per Million Tokens
Input with cache hit	0.5 CNY
Input with cache miss	4 CNY
Output (any case)	12 CNY

Quick mental math

A 3,000-character Chinese article ≈ 4 k tokens.
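250 such articles add up to about 1 million tokens, so sending them costs about 4 CNY on a cache miss, and receiving the same volume of output would cost about 12 CNY.
When the cache hits, the input cost drops to one-eighth (0.5 CNY instead of 4 CNY per million tokens).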
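
The arithmetic can be checked with a few lines of Python. This is only a sketch: the prices come from the table above, while the function name `estimate_cost_cny` and the `cache_hit_ratio` parameter are made up for illustration and are not part of any DeepSeek SDK.

```python
# Prices in CNY per million tokens, copied from the pricing table.
PRICE_INPUT_CACHE_HIT = 0.5
PRICE_INPUT_CACHE_MISS = 4.0
PRICE_OUTPUT = 12.0


def estimate_cost_cny(input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Estimate the bill in CNY.

    cache_hit_ratio is the fraction of input tokens served from cache (0.0 to 1.0).
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        hit_tokens * PRICE_INPUT_CACHE_HIT
        + miss_tokens * PRICE_INPUT_CACHE_MISS
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Example: 2 million input tokens, half of them cached, 500k output tokens.
print(estimate_cost_cny(2_000_000, 500_000, cache_hit_ratio=0.5))
```

Output tokens dominate the bill in this example, which is why long Thinking-mode answers are worth budgeting for.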

3. Performance at a Glance

The table below condenses the official benchmark sheet so you can see where the model shines without decoding acronyms.

Task Area	Benchmark Example	Non-Thinking	Thinking	Prior V3	R1-0528
General knowledge	MMLU-Redux (%)	91.8	93.7	90.5	93.4
Graduate-level reasoning	GPQA-Diamond (%)	74.9	80.1	68.4	81.0
Math competition	AIME 2024 (%)	66.3	93.1	59.4	91.4
Code generation	LiveCodeBench (%)	56.4	74.8	43.0	73.3
Software-engineering agent	SWE Verified (%)	66.0	—	45.4	44.6
Web-search agent	BrowseComp (%)	—	30.0	—	8.9

“—” means the test was not run or not applicable.

Take-away: If your task involves multi-step reasoning, code, or advanced math, switch to Thinking mode. In the table above, the gain ranges from about 2 points on MMLU-Redux and 5 on GPQA-Diamond to about 18 on LiveCodeBench and 27 on AIME 2024.

4. How the Two Modes Work

DeepSeek-V3.1 reads special tokens that act like stage directions.

4.1 Non-Thinking Mode Templates

Single-turn conversation

<｜begin▁of▁sentence｜>{system prompt}<｜User｜>{query}<｜Assistant｜></think>

The extra </think> token tells the model, “Skip the scratch-work.”

Multi-turn conversation

<｜begin▁of▁sentence｜>{system prompt}<｜User｜>{query}<｜Assistant｜></think>{response}<｜end▁of▁sentence｜>...<｜User｜>{query}<｜Assistant｜></think>{response}

Each time the user adds a new query, repeat the entire history and append the same closing tag.

4.2 Thinking Mode Templates

Single-turn

<｜begin▁of▁sentence｜>{system prompt}<｜User｜>{query}<｜Assistant｜><think>

The <think> token invites the model to draft its reasoning first.

Multi-turn

Same history structure as above, but the final turn uses <think> instead of </think>.
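
To make the templates concrete, here is a minimal sketch that assembles the prompt string by hand. The helper name `build_prompt` and its arguments are hypothetical; in practice the tokenizer's chat template normally does this for you, so treat the code as an illustration of the layout rather than as the official implementation.

```python
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"


def build_prompt(system_prompt, history, query, thinking=False):
    """Assemble a prompt following the templates in sections 4.1 and 4.2.

    history is a list of (user_query, assistant_response) pairs from earlier turns.
    """
    parts = [BOS, system_prompt]
    for past_query, past_response in history:
        # Earlier turns are always recorded with </think> before the response.
        parts.append(f"{USER}{past_query}{ASSISTANT}</think>{past_response}{EOS}")
    # Only the final turn decides the mode.
    parts.append(f"{USER}{query}{ASSISTANT}")
    parts.append("<think>" if thinking else "</think>")
    return "".join(parts)


history = [("Hi", "Hello! How can I help?")]
print(build_prompt("You are a helpful assistant.", history, "What is 17 * 23?", thinking=True))
```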

5. Calling External Tools

DeepSeek-V3.1 can “pick up the phone” and call functions you provide.

5.1 ToolCall (Non-Thinking Mode Only)

Describe the tool inside the system prompt:

## Tools

You have access to the following tools:

### get_weather
Description: Returns current weather
Parameters: {"location": {"type": "string"}}

The model answers with a precise syntax:

<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>get_weather<｜tool▁sep｜>{"location": "Beijing"}<｜tool▁call▁end｜><｜tool▁calls▁end｜>

Your code parses the JSON, runs the function, and feeds the result back into the chat history.
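
The parse-and-dispatch step might look like the sketch below. It assumes the model output follows the syntax shown above exactly; `parse_tool_calls`, the `TOOLS` registry, and the stub `get_weather` function are illustrative names, not part of any DeepSeek library.

```python
import json
import re

# Matches one tool call: name, separator, JSON arguments.
TOOL_CALL_PATTERN = re.compile(
    r"<｜tool▁call▁begin｜>(?P<name>.*?)<｜tool▁sep｜>(?P<args>.*?)<｜tool▁call▁end｜>",
    re.DOTALL,
)


def parse_tool_calls(model_output):
    """Return a list of (tool_name, arguments_dict) found in the model output."""
    calls = []
    for match in TOOL_CALL_PATTERN.finditer(model_output):
        name = match.group("name").strip()
        arguments = json.loads(match.group("args"))
        calls.append((name, arguments))
    return calls


def get_weather(location):
    # Placeholder; a real tool would query a weather service here.
    return f"Weather for {location}: (stub)"


# Registry that maps tool names to Python callables.
TOOLS = {"get_weather": get_weather}

output = (
    "<｜tool▁calls▁begin｜><｜tool▁call▁begin｜>get_weather<｜tool▁sep｜>"
    '{"location": "Beijing"}<｜tool▁call▁end｜><｜tool▁calls▁end｜>'
)
for name, arguments in parse_tool_calls(output):
    result = TOOLS[name](**arguments)
    print(result)
```

A production version would also handle unknown tool names and malformed JSON instead of letting the exceptions propagate.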

5.2 Search-Agent (Thinking Mode)

For tasks that require multiple searches—e.g., “Compare the 2025 GDP forecasts from three sources”—use the Search-Agent flow:

Model proposes a search query.
You run the search and return the top snippets.
Model reviews, refines the query, or writes the final answer.

Example trajectories are provided in

assets/search_tool_trajectory.html
assets/search_python_tool_trajectory.html
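
The propose, search, review loop described in this section can be sketched in a few lines. Everything here is an assumption made for illustration: `run_search_agent`, the `("search", ...)` / `("answer", ...)` return convention, and the two stand-in functions are not the format used in the official trajectories.

```python
def run_search_agent(question, ask_model, run_search, max_rounds=5):
    """Alternate between model and search until the model gives a final answer.

    ask_model(question, notes) returns ("search", query) or ("answer", text).
    run_search(query) returns a list of snippet strings.
    """
    notes = []  # snippets gathered so far, shown to the model on every round
    for _ in range(max_rounds):
        action, payload = ask_model(question, notes)
        if action == "answer":
            return payload
        notes.extend(run_search(payload))
    return "No final answer within the round limit."


# Stand-ins so the sketch runs without a model or a search engine.
def fake_model(question, notes):
    if len(notes) < 2:
        return ("search", f"{question} source {len(notes) + 1}")
    return ("answer", f"Answer based on {len(notes)} snippets.")


def fake_search(query):
    return [f"snippet for: {query}"]


print(run_search_agent("2025 GDP forecast", fake_model, fake_search))
```

The `max_rounds` cap is a design choice worth keeping in any real loop, since each extra round adds both latency and billed tokens.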

6. Running the Model Locally

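The model architecture is identical to DeepSeek-V3, so the local-deployment instructions in the DeepSeek-V3 repository apply to V3.1 as well. Only the prompt templates from section 4 are specific to V3.1.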

6.1 Hardware Checklist

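Precision	Approx. Memory for Weights	Notes
FP8 (official)	~670 GB	Highest speed
BF16	~1.3 TB	Fallback if FP8 not supported
8-bit quantized	~670 GB	Community scripts via bitsandbytes
4-bit quantized	~340 GB	Experimental; quality may drop

These are weights-only estimates for the 671B-parameter model (parameter count × bytes per parameter). The KV cache and activations need additional memory, so plan for a multi-GPU server rather than a single card.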

6.2 Minimal Python Example

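from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-V3.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the layers across whatever GPUs are available.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum tunneling like I'm five."}
]

# thinking=True ends the prompt with <think>; use thinking=False for Non-Thinking mode.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    thinking=True,
    add_generation_prompt=True
)

# The template already contains the begin-of-sentence token, so don't add a second one.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# Thinking mode spends tokens on reasoning first; raise max_new_tokens if the answer is cut off.
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))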
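
In Thinking mode the generated text contains the scratch-work followed by the final answer. Assuming you have the newly generated text as a string and that the `</think>` marker survives decoding (this depends on the tokenizer settings, so check your own output), a small hypothetical helper such as `split_reasoning` can separate the two parts:

```python
def split_reasoning(generated_text):
    """Split Thinking-mode output into (reasoning, answer).

    If no </think> marker is present, the whole text is treated as the answer.
    """
    marker = "</think>"
    if marker not in generated_text:
        return "", generated_text.strip()
    reasoning, answer = generated_text.split(marker, 1)
    return reasoning.strip(), answer.strip()


sample = "The child needs a simple picture.</think>Imagine a ball that rolls through a hill."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```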

7. Frequently Asked Questions (FAQ)

Q1: Can I download only the weights and skip the transformers library?
Yes, but you must parse tokenizer_config.json and implement the inference loop yourself. Not recommended for beginners.

Q2: Does Thinking mode cost more?
Yes. You are billed by the token, and the scratch-work adds tokens. Whether the extra cost is justified depends on the task—math proofs and multi-file code edits usually see large accuracy gains.

Q3: How does the cache hit work?
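The platform matches the beginning of your prompt against earlier requests, character for character (including spaces and line breaks). The shared prefix is billed at the cache-hit rate; everything from the first difference onward counts as a miss. Keep stable material such as the system prompt at the start and put the changing part at the end.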

Q4: Is a commercial license required?
No. Weights are released under the MIT License. You may use, modify, or redistribute them commercially as long as you keep the license file.

Q5: Are there community channels?
Yes. The official repo footer contains Discord and WeChat QR codes.

8. Quick Reference Card

What You Need	Where to Find It
Model weights	Hugging Face repo
Source code	DeepSeek-V3 GitHub
License	MIT, included in the repo
Contact email	service@deepseek.com

9. One-Page Decision Tree

Task type?
├─ Simple Q&A → Non-Thinking mode
├─ Multi-step math / code → Thinking mode
├─ Needs web search → Thinking mode + Search-Agent
└─ Needs external API → Non-Thinking mode + ToolCall
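
The same tree can be written as a small function if you want to route requests in code. The name `choose_setup`, its boolean flags, and the order in which the branches are checked are assumptions made for this sketch; the tree itself does not say which branch wins when several apply.

```python
def choose_setup(multi_step=False, needs_web_search=False, needs_external_api=False):
    """Map task traits to the setup suggested by the decision tree."""
    # Assumption: search and tool needs take priority over plain reasoning depth.
    if needs_web_search:
        return "Thinking mode + Search-Agent"
    if needs_external_api:
        return "Non-Thinking mode + ToolCall"
    if multi_step:
        return "Thinking mode"
    return "Non-Thinking mode"


print(choose_setup())                       # simple Q&A
print(choose_setup(multi_step=True))        # multi-step math or code
print(choose_setup(needs_web_search=True))  # needs web search
```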

10. Final Word

DeepSeek-V3.1 is not magic. It is a well-documented, openly licensed tool that lets you choose between speed and depth without juggling two separate models.
If you have read this far, the next step is concrete: head to the Hugging Face page, download the files, and run the short Python snippet from section 6.2.
The best way to understand a new model is to make it say “Hello, world.”
