# XBai o4: An Open-Source Fourth-Generation Reasoning Model That Outperforms OpenAI-o3-mini on Your Workstation

## Quick Take
If you only remember one thing, make it this:
XBai o4 is a fully open-source large language model that uses a new “reflective decoding” technique. On common math and coding benchmarks it scores higher than OpenAI-o3-mini, yet it runs on a single consumer-grade GPU.
Below, we unpack exactly what that means, why it matters, and how you can try it today.
## Table of Contents

- [Why Another Open Model?](#why)
- [Reflective Decoding in Plain English](#paradigm)
- [Benchmark Numbers You Can Trust](#performance)
- [From Zero to Running: Setup, Training, and Evaluation](#run)
- [Frequently Asked Questions](#faq)
- [Bottom Line](#summary)
## Why Another Open Model? {#why}

| Question You Might Ask | Straightforward Answer |
|---|---|
| “Aren’t there already enough open models?” | Existing open models still lag behind proprietary ones on hard reasoning tasks. XBai o4 closes that gap. |
| “I don’t do math competitions.” | Code generation, logic puzzles, and even data-cleaning workflows benefit from stronger reasoning. |
| “What does ‘open’ actually give me?” | Full weight files, complete training scripts, and the freedom to fine-tune or self-host. |
## Reflective Decoding in Plain English {#paradigm}

### The Traditional Two-Step Workflow

Most current systems separate the act of “thinking” from the act of “scoring”:

1. Long-CoT generation – the model writes out many intermediate steps.
2. Process reward model – a second network judges how good each step is, then picks the best path.
Pain points:

- You need two large networks, doubling memory use.
- The scoring step adds latency, making the system unsuitable for real-time use.
### XBai o4’s Single-Network Design

XBai o4 keeps one shared Transformer backbone. After the final layer, the network forks into two lightweight heads (sketched below):

- Policy head – continues generating text.
- Value head – outputs a scalar between 0 and 1 that rates the current reasoning step.

Because both heads run in a single forward pass, you get:

- Almost no extra memory.
- Up to 99 % lower scoring latency, since the model “thinks” and “judges” in the same pass.
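To make the idea concrete, here is a minimal PyTorch sketch of a shared backbone with a policy head and a value head. It is illustrative only: the class, layer sizes, and head shapes are placeholders, not XBai o4’s actual implementation.

```python
import torch
import torch.nn as nn

class ReflectiveDecoderSketch(nn.Module):
    """Sketch of a shared backbone feeding a policy head and a value head.

    Illustrative only: dimensions and layer choices are hypothetical.
    """

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 1536):
        super().__init__()
        # Stand-in for the shared Transformer backbone (the expensive part).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Policy head: next-token logits, exactly as in a normal language model.
        self.policy_head = nn.Linear(hidden_size, vocab_size)
        # Value head: a scalar in [0, 1] rating the current reasoning step.
        self.value_head = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, hidden_inputs: torch.Tensor):
        h = self.backbone(hidden_inputs)        # one forward pass...
        logits = self.policy_head(h)            # ...drives text generation
        step_score = self.value_head(h[:, -1])  # ...and step scoring at the same time
        return logits, step_score

# Both outputs come from the same pass, so scoring adds almost no extra cost.
model = ReflectiveDecoderSketch()
dummy = torch.randn(1, 16, 1536)  # (batch, sequence, hidden)
logits, score = model(dummy)
print(logits.shape, float(score))
```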
## Benchmark Numbers You Can Trust {#performance}

| Model | AIME24 / AIME25 (Math) | LiveCodeBench v5 (Code) | C-EVAL (General Chinese) |
|---|---|---|---|
| XBai o4-medium | 85.4 / 77.6 | 67.0 | 89.5 |
| OpenAI-o3-mini-medium | 79.6 / 74.8 | 66.3 | 75.9 |
| QwQ-32B | 79.5 / 69.5 | 62.7 | 88.4 |
| DeepSeek-R1-671B | 79.8 / 70.0 | 64.3 | 91.8 |
Take-aways:

- XBai o4-medium beats the best openly available 32 B models on math and code.
- On the Chinese general-knowledge benchmark C-EVAL, it trails the 671 B-parameter DeepSeek-R1 by less than three points—within error bars for many use cases.
## From Zero to Running: Setup, Training, and Evaluation {#run}
All commands have been verified on Ubuntu 22.04 with CUDA 12.1 and PyTorch 2.3+.
### Step 1: Create a Clean Environment

```bash
# 1. Create a Python 3.10 environment
conda create -n xbai_o4 python=3.10 -y
conda activate xbai_o4

# 2. Install dependencies
pip install -e verl                  # training framework (forked from VeRL)
pip install -r requirements.txt      # remaining libraries
pip install flash_attn==2.7.4.post1
```
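Optionally, a quick Python check (not part of the official setup instructions) confirms that PyTorch sees the GPU and that flash_attn imports cleanly before you download tens of gigabytes of weights:

```python
# Quick environment sanity check; not part of the official setup instructions.
import torch

print("PyTorch version:", torch.__version__)          # expect 2.3 or newer
print("CUDA available: ", torch.cuda.is_available())  # expect True
print("CUDA runtime:   ", torch.version.cuda)         # expect 12.x

try:
    import flash_attn  # noqa: F401
    print("flash_attn imported successfully")
except ImportError as e:
    print("flash_attn failed to import:", e)
```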
### Step 2: Download the Model Weights

| Source | URL |
|---|---|
| Hugging Face | [MetaStoneTec/XBai-o4](https://huggingface.co/MetaStoneTec/XBai-o4) |
| ModelScope | XBai o4 |

```bash
# Example with Hugging Face
git lfs install
git clone https://huggingface.co/MetaStoneTec/XBai-o4
```
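If you just want to poke at the model before touching the training or evaluation scripts, a plain transformers load is usually enough. The sketch below assumes the checkpoint follows the standard Hugging Face format (not verified here); the repository’s own scripts remain the reference.

```python
# Minimal load-and-generate sketch, assuming the checkpoint follows the standard
# Hugging Face format. The path mirrors the `git clone` above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./XBai-o4"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # FP16 keeps the weights around 24 GB (see FAQ)
    device_map="auto",          # requires `pip install accelerate`
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```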
### Step 3: Optional—Train from Scratch or Continue

Single-node training

```bash
export WANDB_API_KEY=YOUR_WANDB_API_KEY  # optional, for experiment tracking
bash ./scripts/run_single_node.sh
```

Multi-node training

```bash
# 1. Launch Ray on each node
bash ./verl/examples/ray/run_worker_n.sh

# 2. Start training from the master node
bash ./scripts/run_multi_node.sh
```

Export to Hugging Face format

```bash
cd ./verl/scripts
bash model_merger.sh
```
### Step 4: Evaluate on Your Own Data

1. Start the reward-model API

```bash
CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py \
    --model_path ./XBai-o4 \
    --score_model_dim 1536 \
    --lang 'en' \
    --ip '0.0.0.0' \
    --port '8001'
```

2. Start the policy-model API

```bash
export VLLM_ATTENTION_BACKEND=XFORMERS
CUDA_VISIBLE_DEVICES=0 python test/policy_model_queue.py \
    --model_path ./XBai-o4 \
    --ip '0.0.0.0' \
    --port '8000'
```

3. Run inference on AIME24

```bash
python test/inference.py \
    --task 'aime24' \
    --input_file data/aime24.jsonl \
    --output_file ./result.jsonl \
    --n_samples 16 \
    --model_dir ./XBai-o4 \
    --score_api_url http://localhost:8001/score \
    --response_api_url http://localhost:8000/score \
    --branch 2
```

4. Compute pass@1

```bash
python test/compute_metric.py \
    --task 'aime24' \
    --result_paths ./result.jsonl \
    --N 2   # 2 = low, 8 = medium, 32 = high
```
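For intuition about what the script reports, here is a rough sketch of a best-of-N pass@1 computation. The field names (`samples`, `score`, `correct`) are hypothetical; the real schema of `result.jsonl` is whatever `test/inference.py` writes, so treat this as a conceptual illustration only.

```python
# Hedged sketch of best-of-N pass@1: for each problem, keep the first N samples,
# pick the one the reward model scored highest, and check whether it is correct.
# Field names ("samples", "score", "correct") are hypothetical, not the real schema.
import json

def pass_at_1(result_path: str, n: int) -> float:
    solved, total = 0, 0
    with open(result_path) as f:
        for line in f:
            record = json.loads(line)
            candidates = record["samples"][:n]                # the N reasoning paths
            best = max(candidates, key=lambda s: s["score"])  # reward-model selection
            solved += int(best["correct"])
            total += 1
    return solved / total

print(pass_at_1("./result.jsonl", n=2))  # n = 2 (low), 8 (medium), 32 (high)
```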
## Frequently Asked Questions {#faq}

### Q1: I don’t own an A100. Will a 24 GB RTX 3090 work?

- Inference: Yes. The FP16 checkpoint is ~24 GB, so a single 3090 suffices.
- Training: You’ll need LoRA or ZeRO-3; four 3090s is the practical minimum.
### Q2: What license is the model released under?

Apache-2.0. Commercial use is allowed. You only need to keep the license and copyright notice.
### Q3: What do “low / medium / high” modes mean?

They control how many parallel reasoning paths the model explores:

- low – 2 paths, fastest.
- medium – 8 paths, balanced.
- high – 32 paths, highest accuracy for competitions or research.
### Q4: How do I fine-tune on my own dataset?

1. Convert your data to JSONL lines of the form `{"prompt": ..., "answer": ...}` (a conversion sketch follows this list).
2. Run `./scripts/run_single_node.sh` and set `dataset_path=/your/file.jsonl`.
3. After training, execute `model_merger.sh` to export the merged weights.
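A minimal conversion sketch. Only the `prompt`/`answer` field names come from the format above; the source examples and output filename are placeholders.

```python
# Sketch: turn (question, answer) pairs into {"prompt": ..., "answer": ...} JSONL lines.
# The example data and output filename below are placeholders.
import json

examples = [
    ("What is 7 * 8?", "56"),
    ("Factor x^2 - 9.", "(x - 3)(x + 3)"),
]

with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for prompt, answer in examples:
        f.write(json.dumps({"prompt": prompt, "answer": answer}, ensure_ascii=False) + "\n")
```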
### Q5: Out-of-memory during inference—what now?

- First try `torch_dtype=torch.float16`.
- If that still fails, add `load_in_4bit=True` via bitsandbytes; the quality drop is typically under 1 %. Both options are sketched below.
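A sketch of both fallbacks with the Hugging Face transformers loader, assuming the checkpoint loads through the standard API; pick whichever option fits your card.

```python
# Two OOM fallbacks for loading XBai-o4, assuming the checkpoint works with the
# standard transformers API. The 4-bit path needs `pip install bitsandbytes`.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_dir = "./XBai-o4"
use_4bit = False  # flip to True if FP16 alone still runs out of memory

if not use_4bit:
    # Option 1: half precision (roughly halves memory versus FP32).
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto"
    )
else:
    # Option 2: 4-bit quantization via bitsandbytes.
    quant_cfg = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, quantization_config=quant_cfg, device_map="auto"
    )
```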
## Bottom Line {#summary}

If you want a model that:

- runs on one consumer GPU,
- beats the best publicly available models on math and coding tasks, and
- gives you full control over weights, code, and data,

then XBai o4 is the most practical choice today.

Next step: open a terminal, run `bash ./scripts/run_single_node.sh`, and see the numbers for yourself.