The Complete Guide to Running and Fine-Tuning OpenAI’s gpt-oss Models with Unsloth
You might wonder: How can I run billion-parameter open-source models efficiently? OpenAI’s newly released gpt-oss series combined with Unsloth’s toolchain enables high-performance inference and fine-tuning on consumer hardware.
What Are gpt-oss Models?
In August 2025, OpenAI open-sourced two breakthrough language models: gpt-oss-120b and gpt-oss-20b. Both models feature:
- Apache 2.0 license for commercial use
- 128k context window for long-form reasoning
- State-of-the-art performance in reasoning, tool use, and agentic tasks
Key Model Specifications
Model | Parameters | Performance Benchmark | Core Strengths |
---|---|---|---|
gpt-oss-20b | 20 billion | Matches o3-mini | Tool calling, chain-of-thought reasoning |
gpt-oss-120b | 120 billion | Rivals o4-mini | Complex problem solving, multi-task handling |
These models utilize a Mixture-of-Experts (MoE) architecture:
- gpt-oss-20b: activates 4 of 32 experts per token
- gpt-oss-120b: activates 4 of 128 experts per token (a toy routing sketch follows)
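To make this routing concrete, here is a toy sketch of top-4 expert selection. It is purely illustrative: the layer sizes are made up and real gpt-oss experts are gated MLPs, but the select-then-combine mechanics are the same idea.

```python
import torch

# Toy mixture-of-experts layer: every token routes to its top-4 of 32 experts.
# Shapes and expert definitions are illustrative, not the gpt-oss implementation.
hidden_size, num_experts, top_k = 64, 32, 4
router = torch.nn.Linear(hidden_size, num_experts)
experts = torch.nn.ModuleList(
    torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)
)

def moe_forward(x):                    # x: [tokens, hidden_size]
    weights, idx = router(x).topk(top_k, dim=-1)
    weights = weights.softmax(dim=-1)  # normalize over the 4 selected experts
    out = torch.zeros_like(x)
    for t in range(x.size(0)):         # naive per-token loop for readability
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])
    return out

print(moe_forward(torch.randn(3, hidden_size)).shape)  # torch.Size([3, 64])
```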
Why Unsloth Is Essential
You might ask: What hardware is needed to run such massive models? The Unsloth toolchain solves three critical challenges through technical innovation:
Critical Compatibility Fixes
- Chat Template Alignment: the original Harmony tokenization conflicted with common Jinja chat templates (a quick way to inspect your local template follows this list).

  # Common incorrect rendering
  <|start|>assistant<|message|>Thoughts...

  # Unsloth-corrected version
  <|start|>assistant<|channel|>analysis<|message|>Thoughts...

- Precision Optimization: fixed BF16 overflow issues on T4 GPUs.
- MoE Memory Management: a layer-wise loading strategy for the MoE expert weights.
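To check which template your local checkpoint actually ships, you can render a short conversation with Transformers' `apply_chat_template` and look for the channel tags. This assumes the `unsloth/gpt-oss-20b` repository bundles the corrected Jinja template:

```python
from transformers import AutoTokenizer

# Load the tokenizer together with its bundled chat template.
tokenizer = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")

messages = [{"role": "user", "content": "Hello"}]

# Render to text (not token ids) and append the assistant header,
# then inspect whether the Harmony channel tag appears.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
print("Has channel tag:", "<|channel|>" in prompt)
```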
Performance Breakthroughs
- Inference: 6+ tokens/sec (gpt-oss-20b)
- Fine-tuning memory: 70% reduction
- Context length: up to 10x longer contexts supported
Running gpt-oss Models
Environment Setup
# Install core dependencies
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
Running gpt-oss-20b (20B Parameters)
Hardware Requirements
Configuration | Memory | Example Hardware |
---|---|---|
Minimum | 14GB unified (VRAM+RAM) | T4 GPU + 32GB RAM |
Recommended | 24GB+ VRAM | RTX 4090 |
Three Deployment Methods
1. Docker Deployment
# Requires Docker Model Runner
docker model pull hf.co/unsloth/gpt-oss-20b-GGUF:F16
2. Local Execution via llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j
./llama.cpp/llama-cli \
-hf unsloth/gpt-oss-20b-GGUF:F16 \
--jinja -ngl 99 --threads -1 --ctx-size 16384 \
--temp 1.0 --top-p 1.0 --top-k 0
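If you would rather call the model through an API than the CLI, llama.cpp also ships llama-server, which exposes an OpenAI-compatible endpoint (for example, started with the same `-hf unsloth/gpt-oss-20b-GGUF:F16` flag plus `--port 8080`). Assuming such a server is running locally, a minimal Python client sketch looks like this; the `model` string is a placeholder that llama-server largely ignores:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local llama-server endpoint.
# llama-server does not validate the API key, so any placeholder works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder name
    messages=[{"role": "user", "content": "Explain MoE routing in one sentence."}],
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)
```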
3. Free Google Colab Notebook
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_MXFP4_(20B-Inference.ipynb)
Running gpt-oss-120b (120B Parameters)
Hardware Requirements
Configuration | Memory | Example Hardware |
---|---|---|
Minimum | 66GB unified memory | A100 80GB + 128GB RAM |
Recommended | 80GB+ VRAM | Dual A100 GPUs |
Optimization Techniques
# The -ot flag keeps the MoE expert layers in CPU RAM instead of VRAM
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 --ctx-size 16384 --n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 1.0 --top-p 1.0 --top-k 0
Memory Tiering Strategy
Resource Type | Function | Optimization Tip |
---|---|---|
VRAM | Core layers | Keep first 5 layers |
RAM | MoE experts | Control with regex patterns |
SSD | Large context | Enable 4-bit KV caching |
# Custom layer offloading
-ot "\.(6|7|8|9|[0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
Fine-Tuning gpt-oss Models
Environment Setup
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
Fine-Tuning Efficiency Comparison
Method | VRAM Usage | Training Speed | Compatibility |
---|---|---|---|
Standard BF16 | ≥65GB | 1.0x | Moderate |
Unsloth MXFP4 | 14GB (20B) / 65GB (120B) | 1.5x | Excellent |
Step-by-Step Fine-Tuning Guide
1. Dataset Preparation Principles
- Maintain ~75% reasoning-focused data (math proofs, logical analysis)
- Include ≤25% direct Q&A data
- A reasonable starting dataset (a formatting sketch follows this list):

from datasets import load_dataset
dataset = load_dataset("OpenAssistant/oasst2")
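Conversations also need to be rendered into plain text with the model's chat template before training. A minimal sketch, assuming each example has already been flattened into a `messages` list of role/content dicts (oasst2 ships as a message tree, so that flattening step is yours) and that `tokenizer` is the one loaded in the next step:

```python
# Hypothetical formatting step: render each conversation with the chat template.
# Assumes a "messages" field like [{"role": "user", ...}, {"role": "assistant", ...}].
def to_text(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False, add_generation_prompt=False
        )
    }

dataset = dataset.map(to_text)
```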
2. Critical Configuration Parameters
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)
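As a quick sanity check, you can confirm that only the LoRA adapter weights are trainable; this assumes the returned model exposes PEFT's usual helper:

```python
# Reports how many trainable (LoRA) parameters sit on top of the frozen base model.
model.print_trainable_parameters()
```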
3. Launching Training
import torch
import transformers

trainer = transformers.Trainer(
    model = model,
    train_dataset = dataset,  # expects a tokenized split (input_ids), not raw text
    args = transformers.TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 42,
    ),
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm = False),
)
trainer.train()
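Once training finishes, you can switch the model into Unsloth's inference mode and generate with the recommended sampling settings. A short sketch; the prompt and token budget are placeholders:

```python
from unsloth import FastLanguageModel

# Enable Unsloth's faster inference path for the fine-tuned model.
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Summarize what LoRA fine-tuning changes."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Official gpt-oss sampling recommendation: temperature=1.0, top_p=1.0, top_k disabled.
outputs = model.generate(
    inputs, max_new_tokens=256, do_sample=True,
    temperature=1.0, top_p=1.0, top_k=0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```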
Best Practices & Pro Tips
Optimal Inference Parameters
# Official recommendations
temperature=1.0
top_p=1.0
top_k=0
Conversation Template Standard
<|start|>system<|message|>You are ChatGPT...<|end|>
<|start|>user<|message|>Hello<|end|>
<|start|>assistant<|channel|>final<|message|>Hi there!<|end|>
Advanced Harmony Features
from unsloth_zoo import encode_conversations_with_harmony

messages = [
    {"role": "user", "content": "What's the current temperature in San Francisco?"},
    {"role": "assistant", "thinking": "Need to use weather API tool"},
]

encoded = encode_conversations_with_harmony(
    messages,
    reasoning_effort = "high",  # Options: low/medium/high
    developer_instructions = "You're a helpful AI assistant",
)
Frequently Asked Questions
Q: Can consumer GPUs run gpt-oss-120b?
A: With Unsloth’s layer offloading, RTX 4090 (24GB VRAM) + 128GB RAM can run the 120B model.
Q: Does fine-tuning reduce reasoning capabilities?
A: Maintaining 75% reasoning-focused data preserves original capabilities.
Q: How to choose quantization methods?
A:
- Prioritize Unsloth's MXFP4
- Use NF4 for older GPUs like the T4
- Choose F16 for the highest precision
Q: Why do MoE layers need special handling?
A: Experts constitute 95% of parameters. Tiered loading saves >60% memory.
Q: How much data is needed for fine-tuning?
A: Domain adaptation requires ~1,000 quality samples; task-specific tuning needs 10,000+.
Conclusion
OpenAI’s gpt-oss release marks a democratization milestone in large language models. Through Unsloth’s innovations:
- Consumer hardware can now run 20B- and 120B-parameter models
- Fine-tuning runs roughly 1.5x faster with up to 70% less memory
- Original reasoning capabilities and tool-use features remain intact
Technology Insight: The fusion of MoE architecture with efficient inference is bridging the gap between AI researchers and developers. Within six months, we’ll see more open models deployed on edge devices.
```mermaid
flowchart LR
    A[Raw gpt-oss Models] --> B{Unsloth Optimization}
    B --> C[Precision Fixes]
    B --> D[Memory Management]
    B --> E[Template Alignment]
    C --> F[Run Anywhere]
    D --> G[Efficient Tuning]
    E --> H[Accurate Reasoning]
```
Get Started Resources:
- https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B-Fine-tuning.ipynb)
- https://huggingface.co/unsloth
- https://docs.unsloth.ai/basics/gpt-oss-how-to-run-and-fine-tune