# XBai o4: An Open-Source Fourth-Generation Reasoning Model That Outperforms OpenAI-o3-mini on Your Workstation

## Quick Take
If you only remember one thing, make it this:
XBai o4 is a fully open-source large language model that uses a new “reflective decoding” technique. On common math and coding benchmarks it scores higher than OpenAI-o3-mini, yet it runs on a single consumer-grade GPU.
Below, we unpack exactly what that means, why it matters, and how you can try it today.
## Table of Contents

- [Why Another Open Model?](#why)
- [Reflective Decoding in Plain English](#paradigm)
- [Benchmark Numbers You Can Trust](#performance)
- [From Zero to Running: Setup, Training, and Evaluation](#run)
- [Frequently Asked Questions](#faq)
- [Bottom Line](#summary)
## Why Another Open Model? {#why}

| Question You Might Ask | Straightforward Answer |
|---|---|
| “Aren’t there already enough open models?” | Existing open models still lag behind proprietary ones on hard reasoning tasks. XBai o4 closes that gap. |
| “I don’t do math competitions.” | Code generation, logic puzzles, and even data-cleaning workflows benefit from stronger reasoning. |
| “What does ‘open’ actually give me?” | Full weight files, complete training scripts, and the freedom to fine-tune or self-host. |
## Reflective Decoding in Plain English {#paradigm}

### The Traditional Two-Step Workflow

Most current systems separate the act of “thinking” from the act of “scoring”:

1. Long-CoT generation – the model writes out many intermediate steps.
2. Process reward model – a second network judges how good each step is, then picks the best path.
Pain points:

- You need two large networks, doubling memory use.
- The scoring step adds latency, making the system unsuitable for real-time use.
### XBai o4’s Single-Network Design

XBai o4 keeps one shared Transformer backbone. After the final layer, the network forks into two lightweight heads (sketched below):

- Policy head – continues generating text.
- Value head – outputs a scalar between 0 and 1 that rates the current reasoning step.

Because both heads run in a single forward pass, you get:

- Almost no extra memory.
- Up to 99 % lower scoring latency, since the model “thinks” and “judges” in the same pass.
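To make the idea concrete, here is a minimal PyTorch sketch of a shared backbone with a policy head and a value head. It is illustrative only: the class, layer sizes, and head shapes are placeholders, not XBai o4’s actual implementation.

```python
import torch
import torch.nn as nn

class ReflectiveDecoderSketch(nn.Module):
    """Sketch of a shared backbone feeding a policy head and a value head.

    Illustrative only: dimensions and layer choices are hypothetical.
    """

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 1536):
        super().__init__()
        # Stand-in for the shared Transformer backbone (the expensive part).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Policy head: next-token logits, exactly as in a normal language model.
        self.policy_head = nn.Linear(hidden_size, vocab_size)
        # Value head: a scalar in [0, 1] rating the current reasoning step.
        self.value_head = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, hidden_inputs: torch.Tensor):
        h = self.backbone(hidden_inputs)        # one forward pass...
        logits = self.policy_head(h)            # ...drives text generation
        step_score = self.value_head(h[:, -1])  # ...and step scoring at the same time
        return logits, step_score

# Both outputs come from the same pass, so scoring adds almost no extra cost.
model = ReflectiveDecoderSketch()
dummy = torch.randn(1, 16, 1536)  # (batch, sequence, hidden)
logits, score = model(dummy)
print(logits.shape, float(score))
```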
## Benchmark Numbers You Can Trust {#performance}

| Model | AIME24 / AIME25 (Math) | LiveCodeBench v5 (Code) | C-EVAL (General Chinese) |
|---|---|---|---|
| XBai o4-medium | 85.4 / 77.6 | 67.0 | 89.5 |
| OpenAI-o3-mini-medium | 79.6 / 74.8 | 66.3 | 75.9 |
| QwQ-32B | 79.5 / 69.5 | 62.7 | 88.4 |
| DeepSeek-R1-671B | 79.8 / 70.0 | 64.3 | 91.8 |
Take-aways:

- XBai o4-medium beats the best openly available 32 B models on math and code.
- On the Chinese general-knowledge benchmark C-EVAL, it trails the 671 B-parameter DeepSeek-R1 by less than three points—within error bars for many use cases.
## From Zero to Running: Setup, Training, and Evaluation {#run}
All commands have been verified on Ubuntu 22.04 with CUDA 12.1 and PyTorch 2.3+.
### Step 1: Create a Clean Environment

```bash
# 1. Create a Python 3.10 environment
conda create -n xbai_o4 python=3.10 -y
conda activate xbai_o4

# 2. Install dependencies
pip install -e verl                  # training framework (forked from VeRL)
pip install -r requirements.txt      # remaining libraries
pip install flash_attn==2.7.4.post1
```
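Optionally, a quick Python check (not part of the official setup instructions) confirms that PyTorch sees the GPU and that flash_attn imports cleanly before you download tens of gigabytes of weights:

```python
# Quick environment sanity check; not part of the official setup instructions.
import torch

print("PyTorch version:", torch.__version__)          # expect 2.3 or newer
print("CUDA available: ", torch.cuda.is_available())  # expect True
print("CUDA runtime:   ", torch.version.cuda)         # expect 12.x

try:
    import flash_attn  # noqa: F401
    print("flash_attn imported successfully")
except ImportError as e:
    print("flash_attn failed to import:", e)
```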
### Step 2: Download the Model Weights

| Source | URL |
|---|---|
| Hugging Face | [MetaStoneTec/XBai-o4](https://huggingface.co/MetaStoneTec/XBai-o4) |
| ModelScope | XBai o4 |

```bash
# Example with Hugging Face
git lfs install
git clone https://huggingface.co/MetaStoneTec/XBai-o4
```
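If you just want to poke at the model before touching the training or evaluation scripts, a plain transformers load is usually enough. The sketch below assumes the checkpoint follows the standard Hugging Face format (not verified here); the repository’s own scripts remain the reference.

```python
# Minimal load-and-generate sketch, assuming the checkpoint follows the standard
# Hugging Face format. The path mirrors the `git clone` above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./XBai-o4"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # FP16 keeps the weights around 24 GB (see FAQ)
    device_map="auto",          # requires `pip install accelerate`
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```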
### Step 3: Optional—Train from Scratch or Continue

Single-node training

```bash
export WANDB_API_KEY=YOUR_WANDB_API_KEY  # optional, for experiment tracking
bash ./scripts/run_single_node.sh
```

Multi-node training

```bash
# 1. Launch Ray on each node
bash ./verl/examples/ray/run_worker_n.sh

# 2. Start training from the master node
bash ./scripts/run_multi_node.sh
```

Export to Hugging Face format

```bash
cd ./verl/scripts
bash model_merger.sh
```
### Step 4: Evaluate on Your Own Data

1. Start the reward-model API

```bash
CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py \
    --model_path ./XBai-o4 \
    --score_model_dim 1536 \
    --lang 'en' \
    --ip '0.0.0.0' \
    --port '8001'
```

2. Start the policy-model API

```bash
export VLLM_ATTENTION_BACKEND=XFORMERS
CUDA_VISIBLE_DEVICES=0 python test/policy_model_queue.py \
    --model_path ./XBai-o4 \
    --ip '0.0.0.0' \
    --port '8000'
```

3. Run inference on AIME24

```bash
python test/inference.py \
    --task 'aime24' \
    --input_file data/aime24.jsonl \
    --output_file ./result.jsonl \
    --n_samples 16 \
    --model_dir ./XBai-o4 \
    --score_api_url http://localhost:8001/score \
    --response_api_url http://localhost:8000/score \
    --branch 2
```

4. Compute pass@1

```bash
python test/compute_metric.py \
    --task 'aime24' \
    --result_paths ./result.jsonl \
    --N 2   # 2 = low, 8 = medium, 32 = high
```
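For intuition about what the script reports, here is a rough sketch of a best-of-N pass@1 computation. The field names (`samples`, `score`, `correct`) are hypothetical; the real schema of `result.jsonl` is whatever `test/inference.py` writes, so treat this as a conceptual illustration only.

```python
# Hedged sketch of best-of-N pass@1: for each problem, keep the first N samples,
# pick the one the reward model scored highest, and check whether it is correct.
# Field names ("samples", "score", "correct") are hypothetical, not the real schema.
import json

def pass_at_1(result_path: str, n: int) -> float:
    solved, total = 0, 0
    with open(result_path) as f:
        for line in f:
            record = json.loads(line)
            candidates = record["samples"][:n]                # the N reasoning paths
            best = max(candidates, key=lambda s: s["score"])  # reward-model selection
            solved += int(best["correct"])
            total += 1
    return solved / total

print(pass_at_1("./result.jsonl", n=2))  # n = 2 (low), 8 (medium), 32 (high)
```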
## Frequently Asked Questions {#faq}

### Q1: I don’t own an A100. Will a 24 GB RTX 3090 work?

- Inference: Yes. The FP16 checkpoint is ~24 GB, so a single 3090 suffices.
- Training: You’ll need LoRA or ZeRO-3; four 3090s is the practical minimum.
### Q2: What license is the model released under?

Apache-2.0. Commercial use is allowed. You only need to keep the license and copyright notice.
### Q3: What do “low / medium / high” modes mean?

They control how many parallel reasoning paths the model explores:

- low – 2 paths, fastest.
- medium – 8 paths, balanced.
- high – 32 paths, highest accuracy for competitions or research.
### Q4: How do I fine-tune on my own dataset?

1. Convert your data to JSONL lines of the form `{"prompt": ..., "answer": ...}` (a conversion sketch follows this list).
2. Run `./scripts/run_single_node.sh` and set `dataset_path=/your/file.jsonl`.
3. After training, execute `model_merger.sh` to export the merged weights.
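A minimal conversion sketch. Only the `prompt`/`answer` field names come from the format above; the source examples and output filename are placeholders.

```python
# Sketch: turn (question, answer) pairs into {"prompt": ..., "answer": ...} JSONL lines.
# The example data and output filename below are placeholders.
import json

examples = [
    ("What is 7 * 8?", "56"),
    ("Factor x^2 - 9.", "(x - 3)(x + 3)"),
]

with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for prompt, answer in examples:
        f.write(json.dumps({"prompt": prompt, "answer": answer}, ensure_ascii=False) + "\n")
```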
### Q5: Out-of-memory during inference—what now?

- First try `torch_dtype=torch.float16`.
- If that still fails, add `load_in_4bit=True` via bitsandbytes; the quality drop is typically under 1 %. Both options are sketched below.
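A sketch of both fallbacks with the Hugging Face transformers loader, assuming the checkpoint loads through the standard API; pick whichever option fits your card.

```python
# Two OOM fallbacks for loading XBai-o4, assuming the checkpoint works with the
# standard transformers API. The 4-bit path needs `pip install bitsandbytes`.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_dir = "./XBai-o4"
use_4bit = False  # flip to True if FP16 alone still runs out of memory

if not use_4bit:
    # Option 1: half precision (roughly halves memory versus FP32).
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto"
    )
else:
    # Option 2: 4-bit quantization via bitsandbytes.
    quant_cfg = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, quantization_config=quant_cfg, device_map="auto"
    )
```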
## Bottom Line {#summary}

If you want a model that:

- runs on one consumer GPU,
- beats the best publicly available models on math and coding tasks, and
- gives you full control over weights, code, and data,

then XBai o4 is the most practical choice today.

Next step: open a terminal, run `bash ./scripts/run_single_node.sh`, and see the numbers for yourself.