Train a Pocket-Size Language Model End-to-End: The llm-madness Handbook

A laptop-friendly pipeline that takes you from raw text to a working GPT in one afternoon—no cloud credits, no PhD required.


Quick-Fire Answers to the Three Questions Everyone Asks

| Question | One-Sentence Reply |
| --- | --- |
| What does it actually do? | It chains “raw txt → tokenizer → training → visual inspection” on a single machine and leaves you with a reproducible run folder. |
| How high is the hardware barrier? | Eight gigabytes of VRAM is enough for a 30-million-parameter model; CPU-only mode is also supported (just slower). |
| Why bother when giant models exist? | You can see every attention head, tweak the vocabulary, and debug the loss curve in real time—things closed APIs will never show you. |

Table of Contents

  1. 30-Second Overview
  2. Install & Smoke-Test
  3. Feed It Text: Creating a Dataset Snapshot
  4. Train the Tokenizer: BPE Demystified
  5. Design the Network: How Many Layers Is “Enough”?
  6. Launch Training: Watch the Browser Plot Live
  7. Inspect Results: Loss, Perplexity, Attention Heat-Maps
  8. Troubleshooting Cheat-Sheet: 20 Common Errors
  9. After the Finish Line: Shrinking, Quantising, and Porting to Mobile

1. 30-Second Overview

[Figure: end-to-end pipeline screenshot, taken from the built-in Web UI—no extra packages to install.]

| Stage | Output Artifact | Key Script |
| --- | --- | --- |
| ① Dataset snapshot | `data/<name>.manifest` + SHA-256 | `datasets/manifest.py` |
| ② Tokenizer | `runs/tokenizer/<id>/vocab.json` | `tokenizer.py` |
| ③ Pre-tokenised corpus | `runs/tokens/<id>/train.bin` | `stages/tokenize_dataset.py` |
| ④ Model training | `runs/train/<id>/checkpt.pt` + `run.json` | `stages/train.py` |
| ⑤ Visual inspection | Browser-based loss curves & attention maps | `scripts/web_ui.py` |

2. Install & Smoke-Test

# 1. Clone
git clone https://github.com/<user>/llm-madness.git
cd llm-madness

# 2. Dependencies (Python 3.9+)
pip install -r requirements.txt

# 3. Quick smoke test
python -c "import llm_madness, torch; print('Device:', torch.device('cuda' if torch.cuda.is_available() else 'cpu'))"

Windows tip: If torch fails to install, grab the correct wheel URL from pytorch.org first, then rerun pip install -r requirements.txt.


3. Feed It Text: Creating a Dataset Snapshot

3.1 Raw-text Rules

  • UTF-8 encoding only
  • One paragraph per line or one sentence per line—both work
  • No extra punctuation clean-up needed; the pipeline uses regex pre-tokenisation

3.2 Two Equally Valid Ways

| Style | Command | Good for |
| --- | --- | --- |
| GUI | `python -m scripts.web_ui` → tab “Datasets” → “Create Snapshot” | Mouse lovers |
| CLI | `python -m scripts.dataset_snapshot --input data/my_corpus.txt --split 0.9` | Script lovers |

You will get:

  • my_corpus.manifest (JSON lineage)
  • my_corpus.sha256 (checksum for reproducibility)
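
For the curious, the checksum side of that reproducibility story is plain hashlib. The sketch below is illustrative only, and the manifest field names are assumptions, not the repo's exact schema:

import hashlib
import json
import pathlib

def sha256_of(path: str) -> str:
    # Stream the file in 1 MB chunks so large corpora don't need to fit in RAM.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

corpus = "data/my_corpus.txt"
manifest = {
    "source": corpus,                              # illustrative field names
    "sha256": sha256_of(corpus),
    "bytes": pathlib.Path(corpus).stat().st_size,
    "split": {"train": 0.9, "val": 0.1},
}
print(json.dumps(manifest, indent=2))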

4. Train the Tokenizer: BPE Demystified

Byte-Pair Encoding starts with single bytes and repeatedly merges the most frequent adjacent pair into a new symbol.
Why care?

  1. Works for any Unicode text—Chinese, code, emoji—because bytes are universal
  2. Vocabulary size is a dial you can turn: 16 k, 32 k, 50 k … you decide
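
To make the merge loop concrete, here is a toy character-level version; the real trainer works on raw bytes and honours the pre-tokeniser, but the algorithm is the same idea:

from collections import Counter

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    # Replace every occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: words as tuples of characters (single bytes in the real thing).
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(8):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
    print(f"merge {step}: {pair[0]} + {pair[1]} -> {pair[0] + pair[1]}")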

4.1 Hyper-parameters in One Glance

| Setting | Starter Value | Meaning |
| --- | --- | --- |
| vocab_size | 32000 | Stop merging once the vocabulary reaches this count |
| pretokenizer | "digits" | Split 1234 into 1 2 3 4 to avoid endless numeric tails |
| regex_pattern | `\w+\|\s+` | Pre-split the text into word and whitespace runs before any merging |

4.2 Train Command

python -m scripts.train_tokenizer \
  --config configs/tokenizer/default__v002.json \
  --input data/my_corpus.txt \
  --set vocab_size=32000

Outputs land in runs/tokenizer/<id>/.
Pro tip: Train two vocab sizes (16 k vs 32 k), open both in the Web UI comparator, and keep whichever gives the lower bytes-per-token ratio.
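
If you prefer to script that comparison, bytes-per-token is simply UTF-8 bytes divided by token count. The two lambdas below are crude stand-ins; swap in the encode functions of your actual 16 k and 32 k tokenizers:

def bytes_per_token(text: str, encode) -> float:
    # Compression ratio: raw UTF-8 bytes per emitted token (higher is better).
    return len(text.encode("utf-8")) / len(encode(text))

sample = open("data/my_corpus.txt", encoding="utf-8").read()[:1_000_000]

byte_level = lambda t: list(t.encode("utf-8"))   # lower bound: exactly 1.0
whitespace = lambda t: t.split()                 # crude reference point

print("byte-level :", bytes_per_token(sample, byte_level))
print("whitespace :", bytes_per_token(sample, whitespace))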


5. Design the Network: How Many Layers Is “Enough”?

The model is a straight GPT stack:
Token Embedding → Transformer Blocks → LayerNorm → Linear → Logits
Each block = Multi-Head Self-Attention + FFN.
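
To make the stack less abstract, here is a minimal pre-norm block in PyTorch; it mirrors the diagram above but is only a sketch, not the repo's exact module layout:

import torch
import torch.nn as nn

class Block(nn.Module):
    # One pre-norm transformer block: causal self-attention followed by an FFN.
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean mask: True above the diagonal means "may not attend to the future".
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.ln2(x))

x = torch.randn(2, 16, 512)          # (batch, sequence, embedding)
print(Block(512, 8)(x).shape)        # torch.Size([2, 16, 512])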

5.1 Rule-of-Thumb Sizing Table

| Param Count | Layers | Heads | Embed Dim | VRAM (fp32) |
| --- | --- | --- | --- | --- |
| 10 M | 6 | 6 | 512 | ~2 GB |
| 30 M | 8 | 8 | 768 | ~4 GB |
| 100 M | 12 | 12 | 1024 | ~8 GB |

Mixed-precision (fp16/bfloat16) cuts usage almost in half.
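
Enabling it is one context manager in PyTorch. The tiny stand-in model below just shows the mechanics; bfloat16 is used because, unlike fp16, it needs no gradient scaler (assumes a CUDA device):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(32, 768, device="cuda")

# Activations are computed in bfloat16 (half the memory); master weights stay fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()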

5.2 Modern “Tiny-Stack” Flags

Add these lines to any config JSON for free-ish quality gains:

"use_rmsnorm": true,
"use_swiglu": true,
"use_rope": true
  • RMSNorm saves 5 % VRAM
  • SwiGLU gains ~1 perplexity point
  • RoPE helps extrapolate to longer sequences (but disables absolute pos_emb)
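
For a flavour of what the first flag swaps in, RMSNorm fits in a few lines; compared with LayerNorm it drops the mean subtraction and the bias term, which is where the small saving comes from (a sketch, not the repo's exact module):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Normalise by the root-mean-square of the features; learned scale only, no bias.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

print(RMSNorm(768)(torch.randn(2, 16, 768)).shape)   # torch.Size([2, 16, 768])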

6. Launch Training: Watch the Browser Plot Live

Single-shot pipeline (if you love defaults):

python -m scripts.pipeline --config configs/pipeline.json

Manual step-by-step (if you love control):

python -m scripts.train_model \
  --config configs/training/moderntiny__v007.json \
  --data data/my_corpus.txt \
  --set max_iters=50000

Under the hood the script:

  1. Locates the newest tokenizer and tokenised .bin
  2. Saves a checkpoint every 100 iterations
  3. Writes 10 fresh samples every 500 iterations
  4. Appends loss, perplexity, learning-rate to logs.jsonl
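
Because logs.jsonl is plain JSON Lines, you can also post-process it outside the Web UI. The key names below are assumptions, so adjust them to whatever your run actually writes:

import json

iters, losses = [], []
with open("runs/train/<id>/logs.jsonl") as f:      # replace <id> with your run ID
    for line in f:
        record = json.loads(line)
        iters.append(record.get("iter"))           # assumed key names: "iter", "loss"
        losses.append(record.get("loss"))

print(f"{len(iters)} logged steps, latest loss: {losses[-1]}")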

6.1 Real-Time Dashboard

python -m scripts.web_ui
# Surf to http://localhost:8080
# Choose “Runs” → latest ID → “Loss Curve” tab

[Figure: training UI—loss drops in real time; attention heat-maps update on refresh.]


7. Inspect Results: Loss, Perplexity, Attention Heat-Maps

| Metric | How to Read It | Typical Range (30 M, Chinese) |
| --- | --- | --- |
| Loss | Should slope down, then plateau | 2.8–3.2 |
| Perplexity | exp(loss) | 16–24 |
| Attention | Diagonal stripe = causal; vertical stripe = copy previous token | Visual sanity check |
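
The perplexity row is just the loss row passed through exp(), so you can sanity-check one against the other:

import math

for loss in (2.8, 3.0, 3.2):
    print(f"loss {loss:.1f} -> perplexity {math.exp(loss):.1f}")
# loss 2.8 -> perplexity 16.4
# loss 3.0 -> perplexity 20.1
# loss 3.2 -> perplexity 24.5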

If the loss spikes and then skyrockets, the learning rate is too high—halve it and resume from the last checkpoint.


8. Troubleshooting Cheat-Sheet: 20 Common Errors

| Symptom | Root Cause | Quick Fix |
| --- | --- | --- |
| CUDA out of memory | Batch size too big | Edit config: `"batch_size": 32` |
| Stuck at “Loading tokens” | Corrupted cache | Delete `runs/tokens/*/train.bin`, then regenerate |
| Garbled Chinese output | Input not UTF-8 | `iconv -f gbk -t utf-8 old.txt > new.txt` |
| Samples repeat forever | temperature = 0 | Set `"sample_temperature": 0.8` |
| Web UI shows no runs | Symlink broken | `ln -s runs/train/<id> runs/train/latest` |

9. After the Finish Line: Shrinking, Quantising, and Porting to Mobile

llm-madness stops at a fp32 checkpoint. To go smaller:

  1. Dynamic quantisation (torch.quantization) → int8, 4× smaller, <1 % PPL loss
  2. QAT (Quantisation Aware Training) → continue training with fake-quant layers, recover accuracy
  3. ONNX export → MLC-LLM → run at 30 ms/token on an Android phone

The repo leaves you at step 0 (fp32 checkpoint); all three routes use standard PyTorch tooling, no secret sauce required.
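
Route 1 is the quickest win. Here is a minimal sketch with PyTorch's built-in dynamic quantisation, using a stand-in model in place of your loaded fp32 checkpoint:

import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for a loaded fp32 checkpoint; swap in your own model object.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()

# Linear weights become int8; activations stay fp32 and are quantised on the fly,
# so no calibration data is needed.
quantised = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m) -> float:
    # Serialise the state dict to a temp file to measure the on-disk footprint.
    fd, path = tempfile.mkstemp(suffix=".pt")
    os.close(fd)
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"fp32: {size_mb(model):.1f} MB  ->  int8: {size_mb(quantised):.1f} MB")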


Closing Thoughts: A Tiny Model Is a Microscope, Not a Toy

When you can see why the attention map lights up at “Covid-19” and “virus” in the same sentence, the maths in hundred-billion-parameter papers suddenly feels concrete. llm-madness is built to give you that X-ray vision—fast, local, and fully open. Enjoy the experiments, send in your pull requests, and remember to share what you break (and fix) along the way.