15 M QA Pairs, 8 B Parameters, One Belief: Clean Data Is the Final Lever – Inside Bee-8B

A short tweet started the buzz.
An engineer benchmarked InternVL3.5-8B (semi-open) against Bee-8B (fully open) on ChartQA. Bee won, 86.7 to 86.3.
His follow-up: “Bee did it with data, not dollars.”
30 k likes later, the community is asking: can a data-centric pipeline really outrun the parameter arms race?
This post answers that question—step by step, number by number.


The Three Reefs Sinking Open-Source MLLMs

Problem | Typical symptom | Root cause
Noisy data | Hallucinates “oranges” when asked to solve a math function | 24 M public pairs carry 18 % image-instruction mismatches
One-shot datasets | Only JSON released, no cleaning code | Reproducibility dies
Scarce long reasoning | < 1 % multi-step samples | Models fail on geometry, charts, professional exams

Bee-8B’s solution is almost old-school: wash the data until it shines, then feed the model. Below we open the full laundry.


Honey-Data-15M – How 15 M Clean Samples Are Made

1. Feed-stock – 24 M image-text pairs

40+ open corpora (LLaVA-OneVision, PixMo, MAmmoth-VL …).
Deduplication: perceptual hash (image) + SimHash (text). A pair is dropped only when both hashes collide, protecting near-duplicates that carry different semantics.
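
A minimal sketch of that dual-hash rule. The exact hash functions and thresholds aren't published in this post, so the 8×8 average hash, the 64-bit SimHash, and the distance cut-offs below are stand-in assumptions:

import hashlib
from PIL import Image

def avg_hash(img: Image.Image, size: int = 8) -> int:
    """Tiny perceptual hash: grayscale, downscale, threshold each pixel at the mean."""
    px = list(img.convert("L").resize((size, size)).getdata())
    mean = sum(px) / len(px)
    bits = 0
    for p in px:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def simhash(text: str, bits: int = 64) -> int:
    """Classic SimHash over whitespace tokens."""
    acc = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(acc) if v > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def is_duplicate(pair_a, pair_b, img_thr=4, txt_thr=3) -> bool:
    """Drop a pair only when BOTH the image hash and the text hash collide."""
    img_dup = hamming(avg_hash(pair_a["image"]), avg_hash(pair_b["image"])) <= img_thr
    txt_dup = hamming(simhash(pair_a["text"]), simhash(pair_b["text"])) <= txt_thr
    return img_dup and txt_dup

The AND condition is what protects near-duplicates with different semantics: a re-used stock image paired with a new instruction survives the sweep.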

2. Scrubbing – rule & model filters

Filter | Example rejection | % removed
Rule | 26 × 27 px traffic-sign image | 4.2 %
Qwen2.5-VL-72B judge | “Solve the function” vs. orange photo | 18 %
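
The model filter is easy to approximate against any OpenAI-compatible endpoint serving the judge. The endpoint URL, prompt wording, and yes/no protocol below are my assumptions, not the released HoneyPipe code:

import base64
from openai import OpenAI

# Assumes the judge model is served behind an OpenAI-compatible API (e.g. via vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

JUDGE_PROMPT = (
    "Does the instruction below actually match the image? "
    "Answer with a single word: YES or NO.\n\nInstruction: {instruction}"
)

def image_matches_instruction(image_path: str, instruction: str) -> bool:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": JUDGE_PROMPT.format(instruction=instruction)},
            ],
        }],
        max_tokens=3,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# A mismatched pair like "Solve the function" + a photo of oranges should come back False.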

3. Dual-level Chain-of-Thought enrichment

  • Short CoT – 12.2 M samples, 2-4 steps, fast reasoning
  • Long CoT – 2.7 M samples, 10+ steps, <think> tag, math & chart heavy
    Router: fail fidelity check in short path → auto-promote to long path.

[Figure] HoneyPipe dual-level routing: failed short-CoT samples enter the long-CoT loop.
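
In code, the routing rule amounts to a fidelity-gated fallback. The generator and verifier callables below are placeholders for whatever models you plug in, and dropping samples that fail both paths is my assumption:

def enrich(sample, generate_short_cot, generate_long_cot, is_faithful):
    """Dual-level CoT routing: try the cheap short path first,
    promote to the long <think> path only when the short answer fails the fidelity check."""
    short = generate_short_cot(sample)            # 2-4 reasoning steps
    if is_faithful(sample, short):                # e.g. answer matches ground truth / judge agrees
        return {**sample, "cot": short, "level": "short"}
    long = generate_long_cot(sample)              # 10+ steps wrapped in <think> ... </think>
    if is_faithful(sample, long):
        return {**sample, "cot": long, "level": "long"}
    return None                                    # assumption: drop samples that fail both paths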

4. Quality ablation – numbers that talk

MathVista score:
Raw data 63.2 → Clean only 73.5 → Clean + short CoT 78.8 → Clean + dual CoT 83.1 (+19.9 pts).
Every jump is statistically significant (p < 0.01, bootstrap 5 k).
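
For readers who want to reproduce the significance claim, a paired bootstrap over per-sample scores is enough (5 k resamples as stated above; the per-sample score arrays are whatever your eval harness emits):

import random

def paired_bootstrap_p(scores_a, scores_b, n_boot=5000, seed=0):
    """One-sided p-value that system B beats system A, via paired bootstrap
    over per-sample correctness (0/1 or graded) on the same benchmark items."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_boot   # small value -> B's gain is unlikely to be resampling noise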


Bee-8B Model – Brewing Beer from Honey

1. Architecture snapshot

  • LLM backbone – Qwen3-8B (8.0 B active)
  • Vision encoder – SigLIP2-so400m-384, native 384²
  • Projector – 2-layer MLP + GELU, < 0.1 B params (sketched below)
    Total 8.3 B → single A100-80G, 16 k tokens, bf16.
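
A plausible shape for that projector. The hidden sizes are my guesses from SigLIP2-so400m (1152) and Qwen3-8B (4096), not confirmed configs:

import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP with GELU that maps vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):      # (batch, num_patches, vision_dim)
        return self.net(vision_tokens)     # (batch, num_patches, llm_dim)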

2. Five-stage training recipe

Stage | Data | Goal | LR | Epochs | Hours
MLP warmup | 1 M caption pairs | Vision-language glue | 1e-3 | 1 | 2
Full align | 14 M VL + 1.4 M text | Keep language alive | 4e-5 | 1 | 18
SFT | 15 M Honey | Inject dual CoT | 5e-5 | 1 | 36
Refine | 1 M curated subset | Topic rebalance | 3e-5 → 5e-6 | 1 | 4
RL (GRPO) | 50 K prompts | Format & accuracy | 2e-6 | – | 8

Key tricks

  • Packed sequences (16 k) raise GPU utilisation +17 %.
  • RL reward – 0.2 format (must output \boxed{}) + 0.8 accuracy (sketched below).
  • Checkpoint averaging (EMA 0.05) gives +0.6 pts on MMMU.
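
The reward is simple enough to write down. How accuracy is scored (exact match vs. a math verifier) is not stated, so the exact-string comparison below is an assumption:

import re

def grpo_reward(response: str, gold_answer: str) -> float:
    """Weighted RL reward: 0.2 for respecting the \\boxed{} output format,
    0.8 for the boxed answer matching the reference."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    format_ok = 1.0 if m else 0.0
    accuracy = 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0
    return 0.2 * format_ok + 0.8 * accuracy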

Benchmark Battle – Punching Above Its Weight

Task | Bee-8B | InternVL3.5-8B | Δ
ChartQA | 86.7 | 86.3 | +0.4
MathVerse | 67.0 | 61.5 | +5.5
CharXiv-RQ | 57.3 | 45.4 | +11.9
MMMU-Pro | 50.7 | 47.1 | +3.6

Why it wins

  • Charts – long CoT breaks “read bar → sum → compare” into 9 explicit steps, slashing mental-math errors.
  • Geometry – quick recall of “the median to the hypotenuse equals half the hypotenuse” locks in AB = 10, then Pythagoras finishes the job.
  • OCR – 211 K cleaned K12 print samples cut character-error rate by 23 %.

Quick-Start – From pip install to First Inference

1. Install

pip install transformers==4.46.0 accelerate flash-attn --no-build-isolation
# torch ≥ 2.2 recommended

2. Minimal inference (24 GB GPU)

from PIL import Image
import requests, torch
from transformers import AutoProcessor, AutoModel

model_id = "Open-Bee/Bee-8B-RL"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True).cuda().eval()

# Replace with your own image URL or a local path.
image = Image.open(requests.get("https://your/image.jpg", stream=True).raw)
text = processor.apply_chat_template(
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "How many monitors?"}]}],
        add_generation_prompt=True,
        enable_thinking=True)        # False switches to short CoT
inputs = processor(images=image, text=text, return_tensors="pt").to("cuda")
# do_sample=True is needed for temperature to take effect.
out = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

3. High-throughput serving with vLLM (≥ 0.11.1)

vllm serve Open-Bee/Bee-8B-RL \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --served-model-name bee-8b-rl

2×A100-80G, 1860 tokens/s (batch = 64, 16 k context)
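
Once the server is up it speaks the OpenAI chat API. The default port 8000 and the placeholder image URL below are assumptions; adjust them to your deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="bee-8b-rl",                      # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://your/image.jpg"}},
            {"type": "text", "text": "How many monitors?"},
        ],
    }],
    max_tokens=1024,
    temperature=0.6,
)
print(resp.choices[0].message.content)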


FAQ – What Everyone Asks

Q: Is Honey-Data-15M commercially usable?
A: Yes. All sources are MIT / Apache. Only 0.8 M CC-BY-NC STEM samples need removal.

Q: How much slower is long CoT?
A: Generation is roughly 38 % slower for a 2 k-token response, but math benchmarks gain +9.2 pts. Flip enable_thinking at will.

Q: Can I keep fine-tuning?
A: Sure. Stage-3 (pre-RL) weights are released. LR 3e-6, 1 epoch on Honey-Data-1M is the sweet spot.


Take-away – Data First, Params Second

Bee-8B hands the community three open gifts:

  1. A wash-rinse-repeat pipeline you can fork tomorrow.
  2. 15 M spotless samples with chain-of-thought baked in.
  3. An 8 B SOTA proving that clean data > more parameters.

The race is no longer “who has 100 B?” but “who has the cleanest 10 M?”
Go build your own honey – and let the bees keep buzzing.