The Complete Guide to Running and Fine-Tuning OpenAI’s gpt-oss Models with Unsloth

You might wonder: How can I run billion-parameter open-source models efficiently? OpenAI’s newly released gpt-oss series combined with Unsloth’s toolchain enables high-performance inference and fine-tuning on consumer hardware.

What Are gpt-oss Models?

In August 2025, OpenAI open-sourced two breakthrough language models: gpt-oss-120b and gpt-oss-20b. Both models feature:

  • Apache 2.0 license for commercial use
  • 128k context window for long-form reasoning
  • State-of-the-art performance in reasoning, tool use, and agentic tasks

Key Model Specifications

Model          Parameters    Performance Benchmark   Core Strengths
gpt-oss-20b    20 billion    Matches o3-mini         Tool calling, chain-of-thought reasoning
gpt-oss-120b   120 billion   Rivals o4-mini          Complex problem solving, multi-task handling

These models utilize a Mixture-of-Experts (MoE) architecture:

  • gpt-oss-20b: Activates 4 of 32 experts per token
  • gpt-oss-120b: Activates 4 of 128 experts per token
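
To make those expert counts concrete, here is a toy sketch of top-4 routing: a router scores every expert for the current token, only the four highest-scoring experts run, and their outputs are combined using the routing weights. The hidden width and the router module below are illustrative placeholders, not the actual gpt-oss implementation.

import torch

# Toy router for gpt-oss-20b's layout: 32 experts, 4 active per token.
# Dimensions are placeholders; this sketches the idea, not the real model.
hidden = torch.randn(1, 2880)             # one token's hidden state
router = torch.nn.Linear(2880, 32)        # one score per expert

probs = router(hidden).softmax(dim=-1)
weights, expert_ids = torch.topk(probs, k=4)   # only these 4 experts execute
# Per-token compute scales with the 4 active experts, not all 32 (or 128 for the
# 120B model), which is why total parameter count overstates the inference cost.
print(expert_ids.tolist(), weights.tolist())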

Why Unsloth Is Essential

You might ask: What hardware is needed to run such massive models? The Unsloth toolchain solves three critical challenges through technical innovation:

Critical Compatibility Fixes

  1. Chat Template Alignment
    The original Harmony tokenization conflicted with common Jinja chat templates (a quick template check follows this list):

    # Common incorrect rendering
    <|start|>assistant<|message|>Thoughts...
    
    # Unsloth-corrected version
    <|start|>assistant<|channel|>analysis<|message|>Thoughts...
    
  2. Precision Optimization
    Fixed float16 overflow issues on GPUs without native BF16 support (e.g. the T4)
  3. MoE Memory Management
    Layer-wise loading that keeps core layers in VRAM and offloads MoE expert weights to system RAM
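
A quick way to confirm which rendering you are getting (point 1 above) is to apply the checkpoint's chat template directly and inspect the tags. This is a minimal sketch assuming the fixed unsloth/gpt-oss-20b tokenizer; the exact rendered string depends on the template shipped with the repo.

from transformers import AutoTokenizer

# Render a single user turn with the repo's chat template and inspect the special tags.
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)  # look for the <|start|>assistant / channel tags discussed above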

Performance Breakthroughs

  • Inference: 6+ tokens/sec (gpt-oss-20b)
  • Fine-tuning memory: 70% reduction
  • Context length: up to 10x longer contexts supported during fine-tuning

Running gpt-oss Models

Environment Setup

# Install core dependencies
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y

Running gpt-oss-20b (20B Parameters)

Hardware Requirements

Configuration   Memory                      Example Hardware
Minimum         14GB unified (VRAM + RAM)   T4 GPU + 32GB RAM
Recommended     24GB+ VRAM                  RTX 4090
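
As a rough sanity check on the 14GB figure: weight memory scales with bits per parameter, and MXFP4 stores roughly 4.25 bits per weight once block scales are included. The numbers below are back-of-the-envelope assumptions that ignore the KV cache and runtime overhead.

# Back-of-the-envelope weight footprint for a 20B-parameter model.
params = 20e9
for name, bits_per_param in [("F16", 16.0), ("MXFP4 (approx.)", 4.25)]:
    gigabytes = params * bits_per_param / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB of weights")
# F16 lands around 40 GB, MXFP4 around 10-11 GB; adding KV cache and overhead
# is what pushes the practical minimum to roughly 14 GB of unified memory.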

Three Deployment Methods

1. Ollama Deployment

ollama run hf.co/unsloth/gpt-oss-20b-GGUF:F16

2. Local Execution via llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j
# Copy the built binaries up one level so the ./llama.cpp/llama-cli path below works
cp llama.cpp/build/bin/llama-* llama.cpp/

./llama.cpp/llama-cli \
  -hf unsloth/gpt-oss-20b-GGUF:F16 \
  --jinja -ngl 99 --threads -1 --ctx-size 16384 \
  --temp 1.0 --top-p 1.0 --top-k 0

3. Free Google Colab Notebook
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_MXFP4_(20B)-Inference.ipynb

Running gpt-oss-120b (120B Parameters)

Hardware Requirements

Configuration   Memory                  Example Hardware
Minimum         66GB unified memory     A100 80GB + 128GB RAM
Recommended     80GB+ VRAM              Dual A100 GPUs

Optimization Techniques

# Offload the MoE expert layers to CPU RAM so the rest fits in VRAM
./llama.cpp/llama-cli \
  --model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
  --threads -1 --ctx-size 16384 --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --temp 1.0 --top-p 1.0 --top-k 0

Memory Tiering Strategy

Resource Type   Function         Optimization Tip
VRAM            Core layers      Keep the first 5 layers on the GPU
RAM             MoE experts      Control with -ot regex patterns
SSD             Large contexts   Enable 4-bit KV caching

# Custom layer offloading
-ot "\.(6|7|8|9|[0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
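
If you are unsure which tensors a given -ot pattern will capture, you can dry-run the regex against representative GGUF tensor names before launching llama.cpp. The names below are illustrative, and llama.cpp uses its own regex engine, but the basic syntax behaves the same here.

import re

# The custom pattern above: offload FFN experts of layers 6 and beyond to CPU.
pattern = re.compile(r"\.(6|7|8|9|[0-9][0-9])\.ffn_(gate|up|down)_exps.")

tensor_names = [
    "blk.3.ffn_gate_exps.weight",   # early layer: not matched, stays on GPU
    "blk.7.ffn_up_exps.weight",     # layer 7: matched, offloaded to CPU
    "blk.24.ffn_down_exps.weight",  # layer 24: matched, offloaded to CPU
    "blk.24.attn_q.weight",         # attention tensor: never matched
]
for name in tensor_names:
    print(name, "-> CPU" if pattern.search(name) else "-> GPU")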

Fine-Tuning gpt-oss Models

Environment Setup

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

Fine-Tuning Efficiency Comparison

Method          VRAM Usage                 Training Speed   Compatibility
Standard BF16   ≥65GB                      1.0x             Moderate
Unsloth MXFP4   14GB (20B) / 65GB (120B)   1.5x             Excellent

Step-by-Step Fine-Tuning Guide

1. Dataset Preparation Principles

  • Maintain 75% reasoning-focused data (math proofs, logical analysis)
  • Include ≤25% direct Q&A data
  • Recommended dataset mix:

    from datasets import load_dataset
    dataset = load_dataset("OpenAssistant/oasst2")
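
If you want to enforce that 75/25 split programmatically, datasets.interleave_datasets can sample from two corpora by probability. A minimal sketch: the reasoning dataset name below is a placeholder you would swap for your own math/logic corpus, and both datasets must share the same column layout (e.g. a single text field).

from datasets import load_dataset, interleave_datasets

# Placeholder reasoning corpus: substitute a real math/logic dataset of your choice.
reasoning = load_dataset("your-org/your-reasoning-dataset", split="train")
chat = load_dataset("OpenAssistant/oasst2", split="train")

# Sample roughly 75% reasoning and 25% direct Q&A, matching the guideline above.
mixed = interleave_datasets(
    [reasoning, chat],
    probabilities=[0.75, 0.25],
    seed=42,
)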
    

2. Critical Configuration Parameters

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)
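
Because get_peft_model wraps the model with LoRA adapters, only a small fraction of the weights will actually be updated. Assuming the standard PEFT interface is exposed, you can confirm this with one call:

# Expect only the LoRA matrices (a tiny fraction of the 20B weights) to be trainable.
model.print_trainable_parameters()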

3. Launching Training

import torch
import transformers

trainer = transformers.Trainer(
    model = model,
    train_dataset = dataset,
    args = transformers.TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 42,
    ),
    # Note: the dataset must already be tokenized (contain input_ids) for this collator.
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
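
Once training finishes, persist the adapters and tokenizer so they can be reloaded later. This is a minimal sketch using the standard save_pretrained calls; Unsloth also provides merged and GGUF export helpers whose exact names vary by version, so consult the docs linked at the end for those.

# Save only the LoRA adapters plus the tokenizer (small compared with the base weights).
model.save_pretrained("gpt-oss-20b-lora")
tokenizer.save_pretrained("gpt-oss-20b-lora")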

Best Practices & Pro Tips

Optimal Inference Parameters

# Official recommendations
temperature=1.0
top_p=1.0
top_k=0
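
The same settings carry over when testing a checkpoint in Python. A minimal sketch reusing the model and tokenizer loaded in the fine-tuning section (note that sampling parameters only take effect with do_sample=True):

# Generate with the officially recommended sampling parameters.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 12 * 13?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
    top_k=0,   # 0 disables top-k filtering in transformers
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:]))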

Conversation Template Standard

<|start|>system<|message|>You are ChatGPT...<|end|>
<|start|>user<|message|>Hello<|end|>
<|start|>assistant<|channel|>final<|message|>Hi there!<|end|>

Advanced Harmony Features

from unsloth_zoo import encode_conversations_with_harmony

messages = [
    {"role": "user", "content": "What's the current temperature in San Francisco?"},
    {"role": "assistant", "thinking": "Need to use weather API tool"},
]

encoded = encode_conversations_with_harmony(
    messages,
    reasoning_effort = "high",  # Options: low/medium/high
    developer_instructions = "You're a helpful AI assistant"
)

Frequently Asked Questions

Q: Can consumer GPUs run gpt-oss-120b?
A: With Unsloth’s layer offloading, RTX 4090 (24GB VRAM) + 128GB RAM can run the 120B model.

Q: Does fine-tuning reduce reasoning capabilities?
A: Maintaining 75% reasoning-focused data preserves original capabilities.

Q: How to choose quantization methods?
A:

  • Prioritize Unsloth’s MXFP4
  • Use NF4 for older GPUs like T4
  • Choose F16 for highest precision

Q: Why do MoE layers need special handling?
A: Experts constitute 95% of parameters. Tiered loading saves >60% memory.

Q: How much data is needed for fine-tuning?
A: Domain adaptation requires ~1,000 quality samples; task-specific tuning needs 10,000+.

Conclusion

OpenAI’s gpt-oss release marks a democratization milestone in large language models. Through Unsloth’s innovations:

  1. Consumer hardware can now run models with tens of billions of parameters
  2. Fine-tuning runs roughly 1.5x faster with about 70% less memory
  3. Original reasoning capabilities and tool-use features remain intact

Technology Insight: The fusion of MoE architecture with efficient inference is bridging the gap between AI researchers and developers. Within six months, we’ll see more open models deployed on edge devices.

flowchart LR
    A[Raw gpt-oss Models] --> B{Unsloth Optimization}
    B --> C[Precision Fixes]
    B --> D[Memory Management]
    B --> E[Template Alignment]
    C --> F[Run Anywhere]
    D --> G[Efficient Tuning]
    E --> H[Accurate Reasoning]

Get Started Resources:

  • https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb
  • https://huggingface.co/unsloth
  • https://docs.unsloth.ai/basics/gpt-oss-how-to-run-and-fine-tune