Open-Source Large Language Models: The 2025 Buyer’s Guide
A plain-language, data-only handbook for junior college graduates and busy practitioners
Table of Contents
- Why bother choosing the model yourself?
- Four size buckets that make sense
- Giant models (>150 B): when you need the brain
- Mid-size models (40–150 B): the sweet spot for most teams
- Small models (4–40 B): run on one gaming GPU
- Tiny models (≤4 B): laptops, phones, and Raspberry Pi
- One mega-table: parameters, context length, price, and download link
- FAQ: answers we hear every week
- 60-second decision checklist
1. Why bother choosing the model yourself?
- Open-source weights mean you can download, fine-tune, and host on-prem.
- Price spreads are wild: two 70 B models can differ by 15× in cost per million tokens.
- Context length decides whether your chatbot “remembers” an entire legal contract or forgets after two pages.
2. Four size buckets that make sense
Bucket | Total params | Active params* | Typical use case | Minimum hardware |
---|---|---|---|---|
Giant | >150 B | 20–40 B | research, complex reasoning | 8×A100 or 4×H100 |
Mid | 40–150 B | 10–40 B | company chatbot, code completion | 1×A100 or 2×RTX 4090 |
Small | 4–40 B | 2–30 B | local dev box, light API | RTX 4090 24 GB, M2 Ultra |
Tiny | ≤4 B | 0.5–4 B | mobile, edge IoT, Raspberry Pi | CPU with 8 GB RAM |
* Active params = weights actually used during inference. Smaller active count = less VRAM. For example, DeepSeek R1 has 685 B total parameters, but only 37 B are active per token, so inference memory and speed track the 37 B figure.
3. Giant models (>150 B): when you need the brain
3.1 Top performers (July 2025 data)
Model | Total / Active Params | Intelligence Score* | Context Length | Price per 1 M tokens | Hugging Face |
---|---|---|---|---|---|
Qwen3 235B A22B 2507 (Reasoning) | 235 B / 22 B | 69 | 132 k | $2.6 | Link |
DeepSeek R1 0528 | 685 B / 37 B | 68 | 128 k | $1.0 | Link |
GLM-4.5 | 355 B / 32 B | 66 | 128 k | — | Link |
MiniMax M1 80k | 456 B / 45.9 B | 63 | 1 M | $0.8 | Link |
* Intelligence Score is a composite of MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500 benchmarks.
3.2 Quick take-aways
- Smartest: Qwen3 235B 2507 (Reasoning).
- Longest memory: MiniMax M1 80k handles a million-token window, enough for entire books.
- Cheapest per token: DeepSeek R1 0528 at $1.0 per million tokens.
4. Mid-size models (40–150 B): the sweet spot for most teams
4.1 Highest intelligence in this bracket
Model | Total / Active Params | Intelligence Score | Context | Price | Notes |
---|---|---|---|---|---|
Llama Nemotron Super 49B v1.5 (Reasoning) | 49 B / 49 B | 64 | 128 k | — | Dense, fast inference |
DeepSeek R1 Distill Llama 70B | 70 B / 70 B | 48 | 128 k | $0.8 | Distilled, great value |
Llama 4 Scout | 109 B / 17 B | 43 | 10 M | $0.2 | Ultra-long context, lowest price |
4.2 How to pick
- Single 24 GB GPU → Llama Nemotron Super 49B v1.5 (with int4 quantization the weights come to roughly 24 GB).
- Need 10 M context → Llama 4 Scout for whole-document Q&A.
- Budget tight → DeepSeek R1 Distill Llama 70B, still 70 B-class at under $1 per million tokens.
5. Small models (4–40 B): run on one gaming GPU
5.1 Ceiling of this tier
Model | Total / Active Params | Intelligence Score | Context | Price | Typical GPU |
---|---|---|---|---|---|
EXAONE 4.0 32B (Reasoning) | 32 B / 32 B | 64 | 131 k | $1.0 | RTX 4090 24 GB |
Qwen3 32B (Reasoning) | 32.8 B / 32.8 B | 59 | 128 k | $2.6 | RTX 4090 24 GB |
QwQ-32B | 32.8 B / 32.8 B | 58 | 131 k | $0.5 | RTX 4090 24 GB |
5.2 Real-world numbers
- Indie developer: RTX 4090 + QwQ-32B. Local code completion costs $0.50 per million tokens.
- SME chatbot: one A6000 48 GB can serve four concurrent Qwen3 32B sessions at <$0.01 per turn (the sketch below shows the arithmetic).
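Per-turn figures like these follow directly from the per-million-token price. A minimal sketch of the arithmetic (the 1,000-token turn length is an assumption for illustration):

```python
def cost_per_turn(tokens: int, price_per_million_usd: float) -> float:
    """Dollar cost of one chat turn at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd

print(cost_per_turn(1_000, 2.6))  # Qwen3 32B at $2.6/M -> $0.0026, well under a cent
print(cost_per_turn(1_000, 0.5))  # QwQ-32B at $0.5/M   -> $0.0005
```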
6. Tiny models (≤4 B): laptops, phones, and Raspberry Pi
Model | Total / Active Params | Intelligence Score | Context | Price | Notes |
---|---|---|---|---|---|
Qwen3 1.7B (Reasoning) | 2.03 B / 2.03 B | 38 | 32 k | $0.4 | Runs on Jetson Orin Nano |
Phi-4 Mini Instruct | 3.84 B / 3.84 B | 26 | 128 k | — | Microsoft’s tiny powerhouse, CPU real-time |
Llama 3.2 3B | 3 B / 3 B | 20 | 128 k | $0.0 | Free, mobile-friendly |
> Tested: Phi-4 Mini int4 on an M1 Max yields 12 tokens/s; translating a 300-word email finishes in under three seconds.
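If you want to reproduce a tokens-per-second figure like this on your own hardware, here is a rough timing sketch. It assumes a model and tokenizer already loaded via Hugging Face transformers (as in the install snippet in Section 9); the helper name is ours:

```python
import time

def tokens_per_second(model, tok, prompt: str, max_new_tokens: int = 128) -> float:
    """Rough decode throughput: generated tokens divided by wall-clock seconds."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    generated = out.shape[-1] - inputs["input_ids"].shape[-1]
    return generated / elapsed
```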
7. One mega-table: parameters, context length, price, and download link
Bucket | Model | Total Params | Active Params | Context Length | Price ($/1 M tokens) | Hugging Face |
---|---|---|---|---|---|---|
Giant | Qwen3 235B A22B 2507 (R) | 235 B | 22 B | 132 k | 2.6 | Link |
Giant | DeepSeek R1 0528 | 685 B | 37 B | 128 k | 1.0 | Link |
Mid | Llama Nemotron 49B v1.5 (R) | 49 B | 49 B | 128 k | — | Link |
Mid | Llama 4 Scout | 109 B | 17 B | 10 M | 0.2 | Link |
Small | EXAONE 4.0 32B (R) | 32 B | 32 B | 131 k | 1.0 | Link |
Small | QwQ-32B | 32.8 B | 32.8 B | 131 k | 0.5 | Link |
Tiny | Qwen3 1.7B (R) | 2.03 B | 2.03 B | 32 k | 0.4 | Link |
Tiny | Llama 3.2 3B | 3 B | 3 B | 128 k | 0.0 | Link |
8. FAQ: answers we hear every week
Q1. Is the difference between Intelligence 60 and 70 noticeable?
In academic benchmarks like MMLU-Pro and GPQA, yes. In daily chatbots or translation, most users won’t feel it.
Q2. Is a 1 M token context just marketing?
No. MiniMax has demonstrated a 300,000-character novel summarized in one pass with under 10 s latency, so the million-token window is usable in practice.
Q3. How do I estimate GPU memory?
Rule of thumb:
VRAM ≈ active params × 2 bytes (fp16)
Example: QwQ-32B (32.8 B active) → about 66 GB in fp16; int4 quantization (≈0.5 bytes per weight) shrinks the weights to roughly 16–17 GB, which fits on a 24 GB RTX 4090.
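The same rule of thumb in code; a minimal sketch (the function name and the 0.5-byte int4 figure are our assumptions, and it counts weights only, ignoring KV-cache and activation overhead):

```python
def vram_gb(active_params_billion: float, bytes_per_weight: float = 2.0) -> float:
    """Weight memory in GB: 1 B parameters at 1 byte each is roughly 1 GB."""
    return active_params_billion * bytes_per_weight

print(vram_gb(32.8))       # fp16 -> ~65.6 GB
print(vram_gb(32.8, 0.5))  # int4 -> ~16.4 GB, fits a 24 GB RTX 4090
```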
Q4. What does a single request actually cost?
Very little. At QwQ-32B’s $0.5 per million tokens, a 1,600-token request comes to about $0.0008, and even a 10,000-token conversation is only about half a cent.
Q5. Do distilled models lose quality?
DeepSeek’s report shows R1 Distill Llama 70B drops only 2 % on HumanEval coding tasks while cutting cost by 10×.
9. 60-second decision checklist
Step 1: Match the task
- Research / complex reasoning → Giant
- Company chatbot / code completion → Mid
- Local IDE plugin → Small
- Mobile app → Tiny
Step 2: Check the wallet
Budget per 1 M tokens | Recommended bucket | Example model |
---|---|---|
< $0.5 | Tiny | Llama 3.2 3B
$1 | Small | QwQ-32B
$2 | Mid | DeepSeek R1 Distill Llama 70B
$3 | Giant | DeepSeek R1 0528
Step 3: Match the hardware
- 8 GB VRAM → Tiny model
- 16–24 GB → Small model
- 48 GB+ → Mid model
- Multi-GPU 80 GB → Giant model
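The same hardware check as code; a minimal sketch where the thresholds mirror the bullets above and the function name is just for illustration:

```python
def pick_bucket(vram_gb: float) -> str:
    """Map available GPU memory to the size buckets used in this guide."""
    if vram_gb >= 160:   # multiple 80 GB cards
        return "Giant"
    if vram_gb >= 48:
        return "Mid"
    if vram_gb >= 16:
        return "Small"
    return "Tiny"        # 8 GB VRAM or CPU-only

print(pick_bucket(24))   # -> "Small"
```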
Step 4: One-line install
```bash
pip install transformers torch
```

```python
# Example: load QwQ-32B on a consumer GPU
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B",
    torch_dtype="auto",
    device_map="auto",
)
```
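To sanity-check the install, a short chat-style generation; the prompt and token budget below are just illustrative:

```python
messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```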
Closing thoughts
There’s no “best” model—only the best fit for task, budget, and hardware. Bookmark this page and share the mega-table the next time someone claims “bigger is always better.”