Train a Pocket-Size Language Model End-to-End: The llm-madness Handbook
A laptop-friendly pipeline that takes you from raw text to a working GPT in one afternoon—no cloud credits, no PhD required.
Quick-Fire Answers to the Three Questions Everyone Asks
| Question | One-Sentence Reply |
|---|---|
| What does it actually do? | It chains “raw txt → tokenizer → training → visual inspection” on a single machine and leaves you with a reproducible run folder. |
| How high is the hardware barrier? | Eight gigabytes of VRAM is enough for a 30-million-parameter model; CPU-only mode is also supported (just slower). |
| Why bother when giant models exist? | You can see every attention head, tweak the vocabulary, and debug the loss curve in real time—things closed APIs will never show you. |
Table of Contents
1. 30-Second Overview
2. Install & Smoke-Test
3. Feed It Text: Creating a Dataset Snapshot
4. Train the Tokenizer: BPE Demystified
5. Design the Network: How Many Layers Is “Enough”?
6. Launch Training: Watch the Browser Plot Live
7. Inspect Results: Loss, Perplexity, Attention Heat-Maps
8. Troubleshooting Cheat-Sheet: 20 Common Errors
9. After the Finish Line: Shrinking, Quantising, and Porting to Mobile
1. 30-Second Overview
Screenshot taken from the built-in Web UI—no extra packages to install.
| Stage | Output Artifact | Key Script |
|---|---|---|
| ① Dataset snapshot | `data/<name>.manifest` + SHA-256 | `datasets/manifest.py` |
| ② Tokenizer | `runs/tokenizer/<id>/vocab.json` | `tokenizer.py` |
| ③ Pre-tokenised corpus | `runs/tokens/<id>/train.bin` | `stages/tokenize_dataset.py` |
| ④ Model training | `runs/train/<id>/checkpt.pt` + `run.json` | `stages/train.py` |
| ⑤ Visual inspection | Browser-based loss curves & attention maps | `scripts/web_ui.py` |
2. Install & Smoke-Test
```bash
# 1. Clone
git clone https://github.com/<user>/llm-madness.git
cd llm-madness

# 2. Dependencies (Python 3.9+)
pip install -r requirements.txt

# 3. Quick smoke test
python -c "import llm_madness, torch; print('Device:', torch.device('cuda' if torch.cuda.is_available() else 'cpu'))"
```
Windows tip: if the `torch` install fails, grab the correct wheel URL from pytorch.org first, then rerun the requirements file.
3. Feed It Text: Creating a Dataset Snapshot
3.1 Raw-text Rules
- UTF-8 encoding only
- One paragraph per line or one sentence per line (both work)
- No extra punctuation clean-up needed; the pipeline uses regex pre-tokenisation
3.2 Two Equally Valid Ways
| Style | Command | Good for |
|---|---|---|
| GUI | `python -m scripts.web_ui` → tab “Datasets” → “Create Snapshot” | Mouse lovers |
| CLI | `python -m scripts.dataset_snapshot --input data/my_corpus.txt --split 0.9` | Script lovers |
You will get:
- `my_corpus.manifest` (JSON lineage)
- `my_corpus.sha256` (checksum for reproducibility)
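If you are curious what the snapshot step boils down to, here is a minimal stand-alone sketch of the same idea: hash the corpus and record its lineage in a small JSON file. The file layout and field names below are illustrative, not the exact schema that `datasets/manifest.py` writes.

```python
import hashlib
import json
from pathlib import Path

def snapshot(corpus: Path) -> None:
    """Write <corpus>.sha256 and <corpus>.manifest next to the raw text file.

    Illustrative only: the real manifest schema lives in datasets/manifest.py.
    """
    data = corpus.read_bytes()
    digest = hashlib.sha256(data).hexdigest()

    # Checksum file: lets you verify later runs used the exact same bytes.
    corpus.with_suffix(".sha256").write_text(digest + "\n")

    # Lineage manifest: enough metadata to reproduce the split.
    manifest = {
        "source": corpus.name,
        "bytes": len(data),
        "sha256": digest,
        "train_split": 0.9,   # mirrors --split 0.9 from the CLI example
    }
    corpus.with_suffix(".manifest").write_text(json.dumps(manifest, indent=2))

snapshot(Path("data/my_corpus.txt"))
```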
4. Train the Tokenizer: BPE Demystified
Byte-Pair Encoding starts with single bytes and repeatedly merges the most frequent adjacent pair into a new symbol.
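To make that merge loop concrete, here is a toy, from-scratch version of the core BPE step (count adjacent pairs, merge the most frequent one, repeat). It is a teaching sketch, not the tokenizer the repo ships in `tokenizer.py`.

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Toy BPE: repeatedly replace the most frequent adjacent pair with a new id."""
    ids = list(text.encode("utf-8"))          # start from raw bytes (0..255)
    merges = []
    next_id = 256                             # new symbols get ids above the byte range
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))    # count every adjacent pair
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]     # most frequent pair wins
        merges.append(best)
        # Rewrite the sequence with the new symbol in place of the pair.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges

print(bpe_merges("low lower lowest", 5))
```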
Why care?
- Works for any Unicode text (Chinese, code, emoji) because bytes are universal
- Vocabulary size is a dial you can turn: 16 k, 32 k, 50 k … you decide
4.1 Hyper-parameters in One Glance
| Setting | Starter Value | Meaning |
|---|---|---|
| `vocab_size` | `32000` | Stop merging when the table reaches this count |
| `pretokenizer` | `"digits"` | Split `1234` into `1 2 3 4` to avoid endless numeric tails |
| `regex_pattern` | `\w+ \| \s+` | Splits text into word-character and whitespace runs before merging |
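To see what those two settings do in practice, a small illustration (the regex is the one from the table, written without the display spaces around the alternation; the digit behaviour follows its description):

```python
import re

text = "Invoice 1234 totals 56 dollars"

# The regex pre-tokeniser chops the text into word and whitespace chunks.
chunks = re.findall(r"\w+|\s+", text)
print(chunks)        # ['Invoice', ' ', '1234', ' ', 'totals', ' ', '56', ' ', 'dollars']

# With the "digits" pre-tokeniser, numbers are further split into single digits,
# so BPE never has to learn a separate symbol for every number it meets.
digit_split = [" ".join(c) if c.isdigit() else c for c in chunks]
print(digit_split)   # ['Invoice', ' ', '1 2 3 4', ' ', 'totals', ' ', '5 6', ' ', 'dollars']
```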
4.2 Train Command
```bash
python -m scripts.train_tokenizer \
  --config configs/tokenizer/default__v002.json \
  --input data/my_corpus.txt \
  --set vocab_size=32000
```
Outputs land in `runs/tokenizer/<id>/`.
Pro tip: Train two vocab sizes (16 k vs 32 k), open both in the Web UI comparator, and keep whichever gives the lower bytes-per-token ratio.
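The bytes-per-token ratio is just the UTF-8 size of a held-out sample divided by the number of tokens the tokenizer turns it into. A minimal sketch of the comparison, assuming each trained tokenizer exposes a hypothetical `encode(text) -> list[int]` helper (the Web UI comparator does this for you):

```python
def bytes_per_token(text: str, encode) -> float:
    """UTF-8 bytes of the text divided by the number of tokens it encodes to."""
    tokens = encode(text)
    return len(text.encode("utf-8")) / max(len(tokens), 1)

# Hypothetical usage with two trained tokenizers:
# sample = open("data/my_corpus.txt", encoding="utf-8").read()[:100_000]
# print(bytes_per_token(sample, tok_16k.encode))
# print(bytes_per_token(sample, tok_32k.encode))
```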
5. Design the Network: How Many Layers Is “Enough”?
The model is a straight GPT stack:
Token Embedding → Transformer Blocks → LayerNorm → Linear → Logits
Each block = Multi-Head Self-Attention + FFN.
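As a reference point, a single block in this kind of stack looks roughly like the following PyTorch sketch (pre-norm attention plus an MLP with residual connections; the repo's actual module names and defaults may differ):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT block: LayerNorm → causal self-attention → LayerNorm → FFN, with residuals."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x

x = torch.randn(2, 16, 768)          # (batch, sequence, embedding)
print(Block(dim=768, heads=8)(x).shape)
```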
5.1 Rule-of-Thumb Sizing Table
| Param Count | Layers | Heads | Embed Dim | VRAM (fp32) |
|---|---|---|---|---|
| 10 M | 6 | 6 | 512 | ~2 GB |
| 30 M | 8 | 8 | 768 | ~4 GB |
| 100 M | 12 | 12 | 1024 | ~8 GB |
Mixed-precision (fp16/bfloat16) cuts usage almost in half.
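The saving comes from running the forward and backward passes in 16-bit while keeping master weights in fp32. In plain PyTorch that is a `torch.autocast` context, shown here as a generic sketch rather than the repo's exact training loop:

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # only needed for fp16; bfloat16 can skip it

def train_step(model, batch, targets, optimizer, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Activations are computed in half precision, parameters stay fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(batch)
        loss = loss_fn(logits, targets)
    scaler.scale(loss).backward()       # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```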
5.2 Modern “Tiny-Stack” Flags
Add these lines to any config JSON for free-ish quality gains:
"use_rmsnorm": true,
"use_swiglu": true,
"use_rope": true
- RMSNorm saves ~5 % VRAM
- SwiGLU gains ~1 perplexity point (both are sketched below)
- RoPE helps extrapolate to longer sequences (but disables the absolute position embedding)
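For reference, here is what the first two flags typically switch in, written as generic PyTorch modules (the repo's own implementations may differ in details such as bias handling or hidden size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm without mean-centring or bias: scale by the root-mean-square."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated FFN: silu(x W_gate) * (x W_up), projected back down with W_down."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```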
6. Launch Training: Watch the Browser Plot Live
Single-shot pipeline (if you love defaults):
```bash
python -m scripts.pipeline --config configs/pipeline.json
```
Manual step-by-step (if you love control):
```bash
python -m scripts.train_model \
  --config configs/training/moderntiny__v007.json \
  --data data/my_corpus.txt \
  --set max_iters=50000
```
Under the hood the script:
- Locates the newest tokenizer and tokenised `.bin`
- Saves a checkpoint every 100 iterations
- Writes 10 fresh samples every 500 iterations
- Appends loss, perplexity, and learning rate to `logs.jsonl` (see the snippet below for reading it back)
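Because `logs.jsonl` is plain JSON Lines, you can poke at it without the Web UI. A minimal sketch, assuming one JSON object per line with fields along the lines of `loss` and `perplexity` (the exact field names and path may differ in your run folder):

```python
import json
from pathlib import Path

log_path = Path("runs/train/<id>/logs.jsonl")   # substitute your run id

records = [json.loads(line) for line in log_path.read_text().splitlines() if line.strip()]
last = records[-1]
print(f"iterations logged: {len(records)}")
print(f"latest loss:       {last.get('loss')}")
print(f"latest perplexity: {last.get('perplexity')}")
```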
6.1 Real-Time Dashboard
```bash
python -m scripts.web_ui
# Surf to http://localhost:8080
# Choose “Runs” → latest ID → “Loss Curve” tab
```
Loss drops in real time; attention heat-maps update on refresh.
7. Inspect Results: Loss, Perplexity, Attention Heat-Maps
| Metric | How to Read It | Typical Range (30 M, Chinese) |
|---|---|---|
| Loss | Should slope down, then plateau | 2.8–3.2 |
| Perplexity | `exp(loss)` | 16–24 |
| Attention | Diagonal stripe = causal; vertical stripe = copy previous token | Visual sanity check |
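The perplexity band in the table is just the loss band passed through the exponential: exp(2.8) ≈ 16.4 and exp(3.2) ≈ 24.5, which is where the 16–24 range comes from. The same check in Python:

```python
import math
print(math.exp(2.8), math.exp(3.2))   # ≈ 16.44 and 24.53
```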
If the loss spikes and then keeps climbing, the learning rate is too high: halve it and resume from the last checkpoint.
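A minimal resume sketch, assuming the checkpoint is a standard `torch.save` dict with hypothetical `"model"`, `"optimizer"`, and `"iter"` keys (the actual layout of `checkpt.pt` and the repo's own resume mechanism may differ):

```python
import torch

def resume_with_halved_lr(model, optimizer, path: str) -> int:
    """Load a checkpoint, halve the learning rate, and return the iteration to resume from.

    The dict keys here ("model", "optimizer", "iter") are assumptions; adjust them
    to whatever checkpt.pt actually stores.
    """
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    for group in optimizer.param_groups:
        group["lr"] *= 0.5
    return ckpt.get("iter", 0)
```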
8. Troubleshooting Cheat-Sheet: 20 Common Errors
| Symptom | Root Cause | Quick Fix |
|---|---|---|
| CUDA out of memory | Batch size too big | Edit config: `"batch_size": 32` |
| Stuck at “Loading tokens” | Corrupted cache | Delete `runs/tokens/*/train.bin`, regenerate |
| Garbled Chinese output | Input not UTF-8 | `iconv -f gbk -t utf-8 old.txt > new.txt` |
| Samples repeat forever | `temperature = 0` | Set `"sample_temperature": 0.8` |
| Web UI shows no runs | Symlink broken | `ln -s runs/train/<id> runs/train/latest` |
9. After the Finish Line: Shrinking, Quantising, and Porting to Mobile
llm-madness stops at a fp32 checkpoint. To go smaller:
- Dynamic quantisation (`torch.quantization`) → int8, 4× smaller, <1 % PPL loss
- QAT (Quantisation-Aware Training) → continue training with fake-quant layers to recover accuracy
- ONNX export → MLC-LLM → run at 30 ms/token on an Android phone
All three routes start from that fp32 checkpoint and use standard PyTorch tooling; no secret sauce required.
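The first route is a few lines of stock PyTorch. A minimal sketch, using a stand-in model here because loading the real checkpoint depends on the repo's model class:

```python
import torch
import torch.nn as nn

# Stand-in for the trained network; load your real fp32 checkpoint instead.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()

# Dynamic quantisation: weights of Linear layers become int8, activations stay float.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

print(f"fp32 weights: {size_mb(model):.1f} MB (int8 version is roughly 4x smaller on disk)")
```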
Closing Thoughts: A Tiny Model Is a Microscope, Not a Toy
When you can see why the attention map lights up at “Covid-19” and “virus” in the same sentence, the maths in hundred-billion-parameter papers suddenly feels concrete. llm-madness is built to give you that X-ray vision: fast, local, and fully open. Enjoy the experiments, open pull requests, and remember to share what you break (and fix) along the way.

