15 M QA Pairs, 8 B Parameters, One Belief: Clean Data Is the Final Lever – Inside Bee-8B
A short tweet started the buzz.
An engineer benchmarked InternVL3.5-8B (semi-open) against Bee-8B (fully open) on ChartQA. Bee won, 86.7 vs 86.3.
His follow-up: “Bee did it with data, not dollars.”
30 k likes later, the community is asking: can a data-centric pipeline really outrun the parameter arms race?
This post answers that question—step by step, number by number.
The Three Reefs Sinking Open-Source MLLMs
| Problem | Typical Symptom | Root Cause |
|---|---|---|
| Noisy data | Hallucinates “oranges” when asked to solve a math function | 24 M public pairs carry 18 % image-instruction mismatches |
| One-shot datasets | Only JSON is released, no cleaning code | Reproducibility dies |
| Scarce long reasoning | <1 % multi-step samples | Models fail on geometry, charts, professional exams |
Bee-8B’s solution is almost old-school: wash the data until it shines, then feed the model. Below we open the full laundry.
Honey-Data-15M – How 15 M Clean Samples Are Made
1. Feed-stock – 24 M image-text pairs
- 40+ open corpora (LLaVA-OneVision, PixMo, MAmmoth-VL …).
- Deduplication: perceptual hash (image) + SimHash (text). A pair is dropped only when both hashes collide, protecting near-duplicates that carry different semantics (a minimal sketch follows).
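To make that rule concrete, here is a minimal sketch of the both-hashes-must-collide test, assuming the off-the-shelf imagehash and simhash packages and illustrative distance thresholds; the pipeline's exact implementation may differ.

```python
# Hypothetical sketch of the "drop only when BOTH hashes collide" rule.
# The imagehash / simhash packages and the thresholds are assumptions.
import imagehash
from PIL import Image
from simhash import Simhash

def is_duplicate(img_a: Image.Image, img_b: Image.Image,
                 text_a: str, text_b: str,
                 phash_max_dist: int = 4, simhash_max_dist: int = 3) -> bool:
    """Return True only if BOTH the images and the instructions collide."""
    # Perceptual hash on images: small Hamming distance = near-identical pixels.
    img_collision = (imagehash.phash(img_a) - imagehash.phash(img_b)) <= phash_max_dist
    # SimHash on instruction text: small distance = near-identical wording.
    txt_collision = Simhash(text_a).distance(Simhash(text_b)) <= simhash_max_dist
    # A pair colliding on only one side is kept: a near-duplicate image paired
    # with a genuinely different question carries different semantics.
    return img_collision and txt_collision
```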
2. Scrubbing – rule & model filters
| Filter | Example rejection | % removed |
|---|---|---|
| Rule | 26 × 27 px traffic-sign image | 4.2 % |
| Qwen2.5-VL-72B judge | “Solve the function” vs orange photo | 18 % |
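A hedged sketch of how the two passes could be wired together; `judge` stands in for a Qwen2.5-VL-72B call, and the prompt and size threshold are assumptions, not the released rules:

```python
# Minimal sketch of the rule pass and the model-judge pass.
from PIL import Image

MIN_SIDE = 28  # assumption: rejects tiny images such as the 26 x 27 px traffic sign

def rule_filter(image: Image.Image) -> bool:
    """Cheap deterministic checks, run before any model is involved."""
    w, h = image.size
    return min(w, h) >= MIN_SIDE

JUDGE_PROMPT = (
    "Does the instruction match the image content? "
    "Answer strictly with YES or NO.\nInstruction: {instruction}"
)

def model_filter(judge, image: Image.Image, instruction: str) -> bool:
    """`judge` is a hypothetical callable wrapping the Qwen2.5-VL-72B judge."""
    verdict = judge(image=image, prompt=JUDGE_PROMPT.format(instruction=instruction))
    return verdict.strip().upper().startswith("YES")
```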
3. Dual-level Chain-of-Thought enrichment
- Short CoT – 12.2 M samples, 2-4 steps, fast reasoning
- Long CoT – 2.7 M samples, 10+ steps, wrapped in a <think> tag, math & chart heavy

Router: samples that fail the fidelity check on the short path are auto-promoted to the long path (a sketch follows the figure below).
Figure: HoneyPipe dual-level routing. Failed short-CoT samples enter the long-CoT loop.
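The routing itself reduces to a few lines. In the sketch below, `gen_short_cot`, `gen_long_cot`, and `fidelity_check` are hypothetical stand-ins for the pipeline's generation and verification stages:

```python
# Minimal sketch of the dual-level routing logic (names are assumptions).
def enrich(sample, gen_short_cot, gen_long_cot, fidelity_check):
    """Short path first; only failures pay the long-CoT cost."""
    short = gen_short_cot(sample)             # 2-4 step rationale
    if fidelity_check(sample, short):
        return {"cot": short, "level": "short"}
    # Auto-promotion: a failed short answer is regenerated with a
    # 10+ step <think>-style rationale, then re-checked.
    long = gen_long_cot(sample)
    if fidelity_check(sample, long):
        return {"cot": long, "level": "long"}
    return None  # still unfaithful -> dropped from Honey-Data-15M
```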
4. Quality ablation – numbers that talk
MathVista score:
Raw data 63.2 → Clean only 73.5 → Clean + short CoT 78.8 → Clean + dual CoT 83.1 (+19.9 pts).
Every jump is statistically significant (p < 0.01, 5 000-resample bootstrap).
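For readers who want to run the same kind of check on their own ablations, a generic paired-bootstrap test over per-sample correctness looks roughly like this (an illustration of the method, not the authors' exact protocol):

```python
# Illustrative bootstrap test for a score gap, assuming paired 0/1
# correctness arrays over the same benchmark items.
import numpy as np

def bootstrap_p_value(correct_a, correct_b, n_resamples=5_000, seed=0):
    """Approximate one-sided p-value that model B's accuracy exceeds model A's."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, n)           # resample benchmark items with replacement
        diffs[i] = b[idx].mean() - a[idx].mean()
    return (diffs <= 0).mean()                # fraction of resamples showing no gain
```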
Bee-8B Model – Brewing Beer from Honey
1. Architecture snapshot
- LLM backbone – Qwen3-8B (8.0 B active)
- Vision encoder – SigLIP2-so400m-384, native 384² input
- Projector – 2-layer MLP + GELU, < 0.1 B params

Total 8.3 B parameters → a single A100-80G, 16 k tokens, bf16.
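A minimal sketch of what a 2-layer MLP + GELU projector looks like in PyTorch; the hidden sizes are assumptions chosen for illustration, not the released configuration:

```python
# Sketch of a vision-to-LLM projector (dimensions are assumptions).
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # lift vision tokens to LLM width
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # second layer, still well under 0.1 B params
        )

    def forward(self, vision_tokens):         # (batch, n_patches, vision_dim)
        return self.proj(vision_tokens)       # (batch, n_patches, llm_dim)
```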
2. Five-stage training recipe
| Stage | Data | Goal | LR | Epochs | Hours |
|---|---|---|---|---|---|
| MLP warmup | 1 M caption pairs | Vision-language glue | 1e-3 | 1 | 2 |
| Full align | 14 M VL + 1.4 M text | Keep language alive | 4e-5 | 1 | 18 |
| SFT | 15 M Honey | Inject dual CoT | 5e-5 | 1 | 36 |
| Refine | 1 M curated subset | Topic rebalance | 3e-5 → 5e-6 | 1 | 4 |
| RL (GRPO) | 50 K prompts | Format & accuracy | 2e-6 | – | 8 |
Key tricks
- Packed sequences (16 k) raise GPU utilisation by 17 %.
- RL reward – 0.2 for format (the answer must appear in \boxed{}) + 0.8 for accuracy; see the sketch below.
- Checkpoint averaging (EMA 0.05) gives +0.6 pts on MMMU.
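The composite reward is simple enough to sketch directly; the answer extraction and exact-match comparison below are assumptions rather than the released scorer:

```python
# Sketch of a 0.2 format + 0.8 accuracy reward, as described above.
import re

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def reward(completion: str, reference: str) -> float:
    match = BOXED.search(completion)
    format_reward = 0.2 if match else 0.0            # reward well-formed \boxed{} output
    if match and match.group(1).strip() == reference.strip():
        accuracy_reward = 0.8                         # exact-match correctness (assumption)
    else:
        accuracy_reward = 0.0
    return format_reward + accuracy_reward
```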
Benchmark Battle – Punching Above Its Weight
| Task | Bee-8B | InternVL3.5-8B | Δ |
|---|---|---|---|
| ChartQA | 86.7 | 86.3 | +0.4 |
| MathVerse | 67.0 | 61.5 | +5.5 |
| CharXiv-RQ | 57.3 | 45.4 | +11.9 |
| MMMU-Pro | 50.7 | 47.1 | +3.6 |
Why it wins
- Charts – long CoT breaks “read bar → sum → compare” into 9 explicit steps, slashing mental-math errors.
- Geometry – quick recall that the median to the hypotenuse equals half the hypotenuse locks AB = 10, then Pythagoras finishes the job (a worked instance follows below).
- OCR – 211 K cleaned K12 print samples cut the character-error rate by 23 %.
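As a worked instance of that geometry shortcut (the concrete side lengths are assumptions, chosen so that AB = 10):

```latex
% Right triangle ABC with the right angle at C; M is the midpoint of the
% hypotenuse AB, with CM = 5 and AC = 6 (illustrative values).
\[
AB = 2\,CM = 2 \times 5 = 10, \qquad
BC = \sqrt{AB^2 - AC^2} = \sqrt{10^2 - 6^2} = \sqrt{64} = 8 .
\]
```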
Quick-Start – From pip install to First Inference
1. Install
```bash
pip install transformers==4.46.0 accelerate flash-attn --no-build-isolation
# torch ≥ 2.2 recommended
```
2. Minimal inference (24 GB GPU)
```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model_id = "Open-Bee/Bee-8B-RL"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

# Any RGB image works; swap in your own URL or local path.
image = Image.open(requests.get("https://your/image.jpg", stream=True).raw)

# Build the chat prompt; enable_thinking toggles long vs short CoT.
text = processor.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many monitors?"},
    ]}],
    add_generation_prompt=True,
    enable_thinking=True,  # False for short CoT
)

inputs = processor(images=image, text=text, return_tensors="pt").to("cuda")
# do_sample=True is needed for temperature to take effect.
out = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(processor.decode(out[0], skip_special_tokens=True))
```
3. High-throughput serving with vLLM (≥ 0.11.1)
```bash
vllm serve Open-Bee/Bee-8B-RL \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --served-model-name bee-8b-rl
```
Throughput on 2×A100-80G: 1 860 tokens/s (batch = 64, 16 k context).
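Once the server is running it exposes an OpenAI-compatible endpoint, so a client call can look like the sketch below; the default port 8000 and the image URL are assumptions:

```python
# Query the vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="bee-8b-rl",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://your/image.jpg"}},
            {"type": "text", "text": "How many monitors?"},
        ],
    }],
    max_tokens=1024,
    temperature=0.6,
)
print(response.choices[0].message.content)
```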
FAQ – What Everyone Asks
Q: Is Honey-Data-15M commercially usable?
A: Yes. All sources are MIT / Apache. Only 0.8 M CC-BY-NC STEM samples need removal.
Q: How much slower is long CoT?
A: Roughly 38 % longer latency at 2 k tokens, but math benchmarks gain +9.2 pts. Flip enable_thinking at will.
Q: Can I keep fine-tuning?
A: Sure. Stage-3 (pre-RL) weights are released. LR 3e-6, 1 epoch on Honey-Data-1M is the sweet spot.
Take-away – Data First, Params Second
Bee-8B hands the community three open gifts:
- A wash-rinse-repeat pipeline you can fork tomorrow.
- 15 M spotless samples with chain-of-thought baked in.
- An 8 B SOTA proving that clean data > more parameters.
The race is no longer “who has 100 B?” but “who has the cleanest 10 M?”
Go build your own honey – and let the bees keep buzzing.