15 M QA Pairs, 8 B Parameters, One Belief: Clean Data Is the Final Lever – Inside Bee-8B

A short tweet started the buzz.
An engineer benchmarked InternVL3.5-8B (semi-open) against Bee-8B (fully open) on ChartQA. Bee won, 86.7 to 86.3.
His follow-up: “Bee did it with data, not dollars.”
30 k likes later, the community is asking: can a data-centric pipeline really outrun the parameter arms race?
This post answers that question—step by step, number by number.


The Three Reefs Sinking Open-Source MLLMs

Problem | Typical symptom | Root cause
Noisy data | Hallucinates “oranges” when asked to solve a math function | 24 M public pairs carry 18 % image-instruction mismatches
One-shot datasets | Only JSON released, no cleaning code | Reproducibility dies
Scarce long reasoning | < 1 % multi-step samples | Models fail on geometry, charts, professional exams

Bee-8B’s solution is almost old-school: wash the data until it shines, then feed the model. Below we open the full laundry.


Honey-Data-15M – How 15 M Clean Samples Are Made

1. Feed-stock – 24 M image-text pairs

40+ open corpora (LLaVA-OneVision, PixMo, MAmmoth-VL …).
Deduplication: perceptual hash (image) + SimHash (text). A pair is dropped only when both hashes collide, protecting near-duplicates that carry different semantics.
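
A minimal sketch of that dual-hash rule. The exact hash functions and thresholds aren't published in this post, so the 8×8 average hash, the 64-bit SimHash, and the distance cut-offs below are stand-in assumptions:

import hashlib
from PIL import Image

def avg_hash(img: Image.Image, size: int = 8) -> int:
    """Tiny perceptual hash: grayscale, downscale, threshold each pixel at the mean."""
    px = list(img.convert("L").resize((size, size)).getdata())
    mean = sum(px) / len(px)
    bits = 0
    for p in px:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def simhash(text: str, bits: int = 64) -> int:
    """Classic SimHash over whitespace tokens."""
    acc = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(acc) if v > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def is_duplicate(pair_a, pair_b, img_thr=4, txt_thr=3) -> bool:
    """Drop a pair only when BOTH the image hash and the text hash collide."""
    img_dup = hamming(avg_hash(pair_a["image"]), avg_hash(pair_b["image"])) <= img_thr
    txt_dup = hamming(simhash(pair_a["text"]), simhash(pair_b["text"])) <= txt_thr
    return img_dup and txt_dup

The AND condition is what protects near-duplicates with different semantics: a re-used stock image paired with a new instruction survives the sweep.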

2. Scrubbing – rule & model filters

Filter | Example rejection | % removed
Rule | 26 × 27 px traffic-sign image | 4.2 %
Qwen2.5-VL-72B judge | “Solve the function” vs. orange photo | 18 %
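
The model filter is easy to approximate against any OpenAI-compatible endpoint serving the judge. The endpoint URL, prompt wording, and yes/no protocol below are my assumptions, not the released HoneyPipe code:

import base64
from openai import OpenAI

# Assumes the judge model is served behind an OpenAI-compatible API (e.g. via vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

JUDGE_PROMPT = (
    "Does the instruction below actually match the image? "
    "Answer with a single word: YES or NO.\n\nInstruction: {instruction}"
)

def image_matches_instruction(image_path: str, instruction: str) -> bool:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": JUDGE_PROMPT.format(instruction=instruction)},
            ],
        }],
        max_tokens=3,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# A mismatched pair like "Solve the function" + a photo of oranges should come back False.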

3. Dual-level Chain-of-Thought enrichment

  • Short CoT – 12.2 M samples, 2-4 steps, fast reasoning
  • Long CoT – 2.7 M samples, 10+ steps, <think> tag, math & chart heavy
    Router: fail fidelity check in short path → auto-promote to long path.

[Figure] HoneyPipe dual-level routing: failed short-CoT samples enter the long-CoT loop.
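
In code, the routing rule amounts to a fidelity-gated fallback. The generator and verifier callables below are placeholders for whatever models you plug in, and dropping samples that fail both paths is my assumption:

def enrich(sample, generate_short_cot, generate_long_cot, is_faithful):
    """Dual-level CoT routing: try the cheap short path first,
    promote to the long <think> path only when the short answer fails the fidelity check."""
    short = generate_short_cot(sample)            # 2-4 reasoning steps
    if is_faithful(sample, short):                # e.g. answer matches ground truth / judge agrees
        return {**sample, "cot": short, "level": "short"}
    long = generate_long_cot(sample)              # 10+ steps wrapped in <think> ... </think>
    if is_faithful(sample, long):
        return {**sample, "cot": long, "level": "long"}
    return None                                    # assumption: drop samples that fail both paths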

4. Quality ablation – numbers that talk

MathVista score:
Raw data 63.2 → Clean only 73.5 → Clean + short CoT 78.8 → Clean + dual CoT 83.1 (+19.9 pts).
Every jump is statistically significant (p < 0.01, bootstrap 5 k).
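
For readers who want to reproduce the significance claim, a paired bootstrap over per-sample scores is enough (5 k resamples as stated above; the per-sample score arrays are whatever your eval harness emits):

import random

def paired_bootstrap_p(scores_a, scores_b, n_boot=5000, seed=0):
    """One-sided p-value that system B beats system A, via paired bootstrap
    over per-sample correctness (0/1 or graded) on the same benchmark items."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_boot   # small value -> B's gain is unlikely to be resampling noise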


Bee-8B Model – Brewing Beer from Honey

1. Architecture snapshot

  • LLM backbone – Qwen3-8B (8.0 B active)
  • Vision encoder – SigLIP2-so400m-384, native 384²
  • Projector – 2-layer MLP + GELU, < 0.1 B params (sketched below)
    Total 8.3 B → single A100-80G, 16 k tokens, bf16.
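
A plausible shape for that projector. The hidden sizes are my guesses from SigLIP2-so400m (1152) and Qwen3-8B (4096), not confirmed configs:

import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP with GELU that maps vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):      # (batch, num_patches, vision_dim)
        return self.net(vision_tokens)     # (batch, num_patches, llm_dim)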

2. Five-stage training recipe

Stage | Data | Goal | LR | Epochs | Hours
MLP warmup | 1 M caption pairs | Vision-language glue | 1e-3 | 1 | 2
Full align | 14 M VL + 1.4 M text | Keep language alive | 4e-5 | 1 | 18
SFT | 15 M Honey | Inject dual CoT | 5e-5 | 1 | 36
Refine | 1 M curated subset | Topic rebalance | 3e-5 → 5e-6 | 1 | 4
RL (GRPO) | 50 K prompts | Format & accuracy | 2e-6 | – | 8

Key tricks

  • Packed sequences (16 k) raise GPU utilisation +17 %.
  • RL reward – 0.2 format (must output \boxed{}) + 0.8 accuracy (sketched below).
  • Checkpoint averaging (EMA 0.05) gives +0.6 pts on MMMU.
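
The reward is simple enough to write down. How accuracy is scored (exact match vs. a math verifier) is not stated, so the exact-string comparison below is an assumption:

import re

def grpo_reward(response: str, gold_answer: str) -> float:
    """Weighted RL reward: 0.2 for respecting the \\boxed{} output format,
    0.8 for the boxed answer matching the reference."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    format_ok = 1.0 if m else 0.0
    accuracy = 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0
    return 0.2 * format_ok + 0.8 * accuracy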

Benchmark Battle – Punching Above Its Weight

Task | Bee-8B | InternVL3.5-8B | Δ
ChartQA | 86.7 | 86.3 | +0.4
MathVerse | 67.0 | 61.5 | +5.5
CharXiv-RQ | 57.3 | 45.4 | +11.9
MMMU-Pro | 50.7 | 47.1 | +3.6

Why it wins

  • Charts – long CoT breaks “read bar → sum → compare” into 9 explicit steps, slashing mental-math errors.
  • Geometry – quick recall of “the median to the hypotenuse equals half the hypotenuse” locks in AB = 10, then Pythagoras finishes the job.
  • OCR – 211 K cleaned K12 print samples cut character-error rate by 23 %.

Quick-Start – From pip install to First Inference

1. Install

pip install transformers==4.46.0 accelerate flash-attn --no-build-isolation
# torch ≥ 2.2 recommended

2. Minimal inference (24 GB GPU)

from PIL import Image
import requests, torch
from transformers import AutoProcessor, AutoModel

model_id = "Open-Bee/Bee-8B-RL"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True).cuda().eval()

# Replace with your own image URL or a local path.
image = Image.open(requests.get("https://your/image.jpg", stream=True).raw)
text = processor.apply_chat_template(
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "How many monitors?"}]}],
        add_generation_prompt=True,
        enable_thinking=True)        # False switches to short CoT
inputs = processor(images=image, text=text, return_tensors="pt").to("cuda")
# do_sample=True is needed for temperature to take effect.
out = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

3. High-throughput serving with vLLM (≥ 0.11.1)

vllm serve Open-Bee/Bee-8B-RL \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --served-model-name bee-8b-rl

2×A100-80G, 1860 tokens/s (batch = 64, 16 k context)
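
Once the server is up it speaks the OpenAI chat API. The default port 8000 and the placeholder image URL below are assumptions; adjust them to your deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="bee-8b-rl",                      # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://your/image.jpg"}},
            {"type": "text", "text": "How many monitors?"},
        ],
    }],
    max_tokens=1024,
    temperature=0.6,
)
print(resp.choices[0].message.content)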


FAQ – What Everyone Asks

Q: Is Honey-Data-15M commercially usable?
A: Yes. All sources are MIT / Apache. Only 0.8 M CC-BY-NC STEM samples need removal.

Q: How much slower is long CoT?
A: Generation is roughly 38 % slower for a 2 k-token response, but math benchmarks gain +9.2 pts. Flip enable_thinking at will.

Q: Can I keep fine-tuning?
A: Sure. Stage-3 (pre-RL) weights are released. LR 3e-6, 1 epoch on Honey-Data-1M is the sweet spot.


Take-away – Data First, Params Second

Bee-8B hands the community three open gifts:

  1. A wash-rinse-repeat pipeline you can fork tomorrow.
  2. 15 M spotless samples with chain-of-thought baked in.
  3. An 8 B SOTA proving that clean data > more parameters.

The race is no longer “who has 100 B?” but “who has the cleanest 10 M?”
Go build your own honey – and let the bees keep buzzing.