The Complete Guide to Running and Fine-Tuning OpenAI’s gpt-oss Models with Unsloth
You might wonder: How can I run billion-parameter open-source models efficiently? OpenAI’s newly released gpt-oss series combined with Unsloth’s toolchain enables high-performance inference and fine-tuning on consumer hardware.
What Are gpt-oss Models?
In August 2025, OpenAI open-sourced two breakthrough language models: gpt-oss-120b and gpt-oss-20b. Both models feature:
- Apache 2.0 license for commercial use
- 128k context window for long-form reasoning
- State-of-the-art performance in reasoning, tool use, and agentic tasks
Key Model Specifications
Model | Parameters | Performance Benchmark | Core Strengths |
---|---|---|---|
gpt-oss-20b | 20 billion | Matches o3-mini | Tool calling, chain-of-thought reasoning |
gpt-oss-120b | 120 billion | Rivals o4-mini | Complex problem solving, multi-task handling |
These models utilize a Mixture-of-Experts (MoE) architecture:
- gpt-oss-20b: activates 4 of 32 experts per token
- gpt-oss-120b: activates 4 of 128 experts per token (a toy routing sketch follows)
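To make this routing concrete, here is a toy sketch of top-4 expert selection. It is purely illustrative: the layer sizes are made up and real gpt-oss experts are gated MLPs, but the select-then-combine mechanics are the same idea.

```python
import torch

# Toy mixture-of-experts layer: every token routes to its top-4 of 32 experts.
# Shapes and expert definitions are illustrative, not the gpt-oss implementation.
hidden_size, num_experts, top_k = 64, 32, 4
router = torch.nn.Linear(hidden_size, num_experts)
experts = torch.nn.ModuleList(
    torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)
)

def moe_forward(x):                    # x: [tokens, hidden_size]
    weights, idx = router(x).topk(top_k, dim=-1)
    weights = weights.softmax(dim=-1)  # normalize over the 4 selected experts
    out = torch.zeros_like(x)
    for t in range(x.size(0)):         # naive per-token loop for readability
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])
    return out

print(moe_forward(torch.randn(3, hidden_size)).shape)  # torch.Size([3, 64])
```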
Why Unsloth Is Essential
You might ask: What hardware is needed to run such massive models? The Unsloth toolchain solves three critical challenges through technical innovation:
Critical Compatibility Fixes
- Chat Template Alignment: the original Harmony tokenization conflicted with common Jinja chat templates (a quick way to inspect your local template follows this list).

  # Common incorrect rendering
  <|start|>assistant<|message|>Thoughts...

  # Unsloth-corrected version
  <|start|>assistant<|channel|>analysis<|message|>Thoughts...

- Precision Optimization: fixed BF16 overflow issues on T4 GPUs.
- MoE Memory Management: a layer-wise loading strategy for the MoE expert weights.
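To check which template your local checkpoint actually ships, you can render a short conversation with Transformers' `apply_chat_template` and look for the channel tags. This assumes the `unsloth/gpt-oss-20b` repository bundles the corrected Jinja template:

```python
from transformers import AutoTokenizer

# Load the tokenizer together with its bundled chat template.
tokenizer = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")

messages = [{"role": "user", "content": "Hello"}]

# Render to text (not token ids) and append the assistant header,
# then inspect whether the Harmony channel tag appears.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
print("Has channel tag:", "<|channel|>" in prompt)
```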
Performance Breakthroughs
- Inference: 6+ tokens/sec (gpt-oss-20b)
- Fine-tuning memory: 70% reduction
- Context length: up to 10x longer contexts supported
Running gpt-oss Models
Environment Setup
# Install core dependencies
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
Running gpt-oss-20b (20B Parameters)
Hardware Requirements
Configuration | Memory | Example Hardware |
---|---|---|
Minimum | 14GB unified (VRAM+RAM) | T4 GPU + 32GB RAM |
Recommended | 24GB+ VRAM | RTX 4090 |
Three Deployment Methods
1. Docker Deployment
# Requires Docker Model Runner
docker model pull hf.co/unsloth/gpt-oss-20b-GGUF:F16
2. Local Execution via llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j
./llama.cpp/llama-cli \
-hf unsloth/gpt-oss-20b-GGUF:F16 \
--jinja -ngl 99 --threads -1 --ctx-size 16384 \
--temp 1.0 --top-p 1.0 --top-k 0
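If you would rather call the model through an API than the CLI, llama.cpp also ships llama-server, which exposes an OpenAI-compatible endpoint (for example, started with the same `-hf unsloth/gpt-oss-20b-GGUF:F16` flag plus `--port 8080`). Assuming such a server is running locally, a minimal Python client sketch looks like this; the `model` string is a placeholder that llama-server largely ignores:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local llama-server endpoint.
# llama-server does not validate the API key, so any placeholder works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder name
    messages=[{"role": "user", "content": "Explain MoE routing in one sentence."}],
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```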
3. Free Google Colab Notebook
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_MXFP4_(20B-Inference.ipynb)
Running gpt-oss-120b (120B Parameters)
Hardware Requirements
Configuration | Memory | Example Hardware |
---|---|---|
Minimum | 66GB unified memory | A100 80GB + 128GB RAM |
Recommended | 80GB+ VRAM | Dual A100 GPUs |
Optimization Techniques
# The -ot flag keeps the MoE expert layers in CPU RAM instead of VRAM
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 --ctx-size 16384 --n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 1.0 --top-p 1.0 --top-k 0
Memory Tiering Strategy
Resource Type | Function | Optimization Tip |
---|---|---|
VRAM | Core layers | Keep first 5 layers |
RAM | MoE experts | Control with regex patterns |
SSD | Large context | Enable 4-bit KV caching |
# Custom layer offloading
-ot "\.(6|7|8|9|[0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
Fine-Tuning gpt-oss Models
Environment Setup
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
Fine-Tuning Efficiency Comparison
Method | VRAM Usage | Training Speed | Compatibility |
---|---|---|---|
Standard BF16 | ≥65GB | 1.0x | Moderate |
Unsloth MXFP4 | 14GB (20B) / 65GB (120B) | 1.5x | Excellent |
Step-by-Step Fine-Tuning Guide
1. Dataset Preparation Principles
- Maintain ~75% reasoning-focused data (math proofs, logical analysis)
- Include ≤25% direct Q&A data
- A reasonable starting dataset (a formatting sketch follows this list):

from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst2")
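Conversations also need to be rendered into plain text with the model's chat template before training. A minimal sketch, assuming each example has already been flattened into a `messages` list of role/content dicts (oasst2 ships as a message tree, so that flattening step is yours) and that `tokenizer` is the one loaded in the next step:

```python
# Hypothetical formatting step: render each conversation with the chat template.
# Assumes a "messages" field like [{"role": "user", ...}, {"role": "assistant", ...}].
def to_text(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False, add_generation_prompt=False
        )
    }

dataset = dataset.map(to_text)
```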
2. Critical Configuration Parameters
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)
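As a quick sanity check, you can confirm that only the LoRA adapter weights are trainable; this assumes the returned model exposes PEFT's usual helper:

```python
# Reports how many trainable (LoRA) parameters sit on top of the frozen base model.
model.print_trainable_parameters()
```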
3. Launching Training
import torch
import transformers

trainer = transformers.Trainer(
    model = model,
    train_dataset = dataset,  # expects a tokenized split (input_ids), not raw text
    args = transformers.TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 42,
    ),
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm = False),
)
trainer.train()
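Once training finishes, you can switch the model into Unsloth's inference mode and generate with the recommended sampling settings. A short sketch; the prompt and token budget are placeholders:

```python
from unsloth import FastLanguageModel

# Enable Unsloth's faster inference path for the fine-tuned model.
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Summarize what LoRA fine-tuning changes."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Official gpt-oss sampling recommendation: temperature=1.0, top_p=1.0, top_k disabled.
outputs = model.generate(
    inputs, max_new_tokens=256, do_sample=True,
    temperature=1.0, top_p=1.0, top_k=0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```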
Best Practices & Pro Tips
Optimal Inference Parameters
# Official recommendations
temperature=1.0
top_p=1.0
top_k=0
Conversation Template Standard
<|start|>system<|message|>You are ChatGPT...<|end|>
<|start|>user<|message|>Hello<|end|>
<|start|>assistant<|channel|>final<|message|>Hi there!<|end|>
Advanced Harmony Features
from unsloth_zoo import encode_conversations_with_harmony

messages = [
    {"role": "user", "content": "What's the current temperature in San Francisco?"},
    {"role": "assistant", "thinking": "Need to use weather API tool"},
]

encoded = encode_conversations_with_harmony(
    messages,
    reasoning_effort = "high",  # Options: low/medium/high
    developer_instructions = "You're a helpful AI assistant",
)
Frequently Asked Questions
Q: Can consumer GPUs run gpt-oss-120b?
A: With Unsloth’s layer offloading, RTX 4090 (24GB VRAM) + 128GB RAM can run the 120B model.
Q: Does fine-tuning reduce reasoning capabilities?
A: Maintaining 75% reasoning-focused data preserves original capabilities.
Q: How to choose quantization methods?
A:
- Prioritize Unsloth's MXFP4
- Use NF4 for older GPUs like the T4
- Choose F16 for the highest precision
Q: Why do MoE layers need special handling?
A: Experts constitute 95% of parameters. Tiered loading saves >60% memory.
Q: How much data is needed for fine-tuning?
A: Domain adaptation requires ~1,000 quality samples; task-specific tuning needs 10,000+.
Conclusion
OpenAI’s gpt-oss release marks a democratization milestone in large language models. Through Unsloth’s innovations:
- Consumer hardware can now run 20B- and 120B-parameter models
- Fine-tuning runs roughly 1.5x faster with up to 70% less memory
- Original reasoning capabilities and tool-use features remain intact
Technology Insight: The fusion of MoE architecture with efficient inference is bridging the gap between AI researchers and developers. Within six months, we’ll see more open models deployed on edge devices.
```mermaid
flowchart LR
    A[Raw gpt-oss Models] --> B{Unsloth Optimization}
    B --> C[Precision Fixes]
    B --> D[Memory Management]
    B --> E[Template Alignment]
    C --> F[Run Anywhere]
    D --> G[Efficient Tuning]
    E --> H[Accurate Reasoning]
```
Get Started Resources:
- https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B-Fine-tuning.ipynb)
- https://huggingface.co/unsloth
- https://docs.unsloth.ai/basics/gpt-oss-how-to-run-and-fine-tune