Train a Pocket-Size Language Model End-to-End: The llm-madness Handbook
A laptop-friendly pipeline that takes you from raw text to a working GPT in one afternoon—no cloud credits, no PhD required.
Quick-Fire Answers to the Three Questions Everyone Asks
| Question | One-Sentence Reply |
|---|---|
| What does it actually do? | It chains “raw txt → tokenizer → training → visual inspection” on a single machine and leaves you with a reproducible run folder. |
| How high is the hardware barrier? | Eight gigabytes of VRAM is enough for a 30-million-parameter model; CPU-only mode is also supported (just slower). |
| Why bother when giant models exist? | You can see every attention head, tweak the vocabulary, and debug the loss curve in real time—things closed APIs will never show you. |
Table of Contents
1. 30-Second Overview
2. Install & Smoke-Test
3. Feed It Text: Creating a Dataset Snapshot
4. Train the Tokenizer: BPE Demystified
5. Design the Network: How Many Layers Is “Enough”?
6. Launch Training: Watch the Browser Plot Live
7. Inspect Results: Loss, Perplexity, Attention Heat-Maps
8. Troubleshooting Cheat-Sheet: 20 Common Errors
9. After the Finish Line: Shrinking, Quantising, and Porting to Mobile
1. 30-Second Overview
Screenshot taken from the built-in Web UI—no extra packages to install.
| Stage | Output Artifact | Key Script |
|---|---|---|
| ① Dataset snapshot | `data/<name>.manifest` + SHA-256 | `datasets/manifest.py` |
| ② Tokenizer | `runs/tokenizer/<id>/vocab.json` | `tokenizer.py` |
| ③ Pre-tokenised corpus | `runs/tokens/<id>/train.bin` | `stages/tokenize_dataset.py` |
| ④ Model training | `runs/train/<id>/checkpt.pt` + `run.json` | `stages/train.py` |
| ⑤ Visual inspection | Browser-based loss curves & attention maps | `scripts/web_ui.py` |
2. Install & Smoke-Test
```bash
# 1. Clone
git clone https://github.com/<user>/llm-madness.git
cd llm-madness

# 2. Dependencies (Python 3.9+)
pip install -r requirements.txt

# 3. Quick smoke test
python -c "import llm_madness, torch; print('Device:', torch.device('cuda' if torch.cuda.is_available() else 'cpu'))"
```
Windows tip: if the `torch` install fails, grab the correct wheel URL from pytorch.org first, then rerun the requirements file.
3. Feed It Text: Creating a Dataset Snapshot
3.1 Raw-text Rules
- UTF-8 encoding only
- One paragraph per line or one sentence per line (both work)
- No extra punctuation clean-up needed; the pipeline uses regex pre-tokenisation
3.2 Two Equally Valid Ways
| Style | Command | Good for |
|---|---|---|
| GUI | `python -m scripts.web_ui` → tab “Datasets” → “Create Snapshot” | Mouse lovers |
| CLI | `python -m scripts.dataset_snapshot --input data/my_corpus.txt --split 0.9` | Script lovers |
You will get:
- `my_corpus.manifest` (JSON lineage)
- `my_corpus.sha256` (checksum for reproducibility)
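If you are curious what the snapshot step boils down to, here is a minimal stand-alone sketch of the same idea: hash the corpus and record its lineage in a small JSON file. The file layout and field names below are illustrative, not the exact schema that `datasets/manifest.py` writes.

```python
import hashlib
import json
from pathlib import Path

def snapshot(corpus: Path) -> None:
    """Write <corpus>.sha256 and <corpus>.manifest next to the raw text file.

    Illustrative only: the real manifest schema lives in datasets/manifest.py.
    """
    data = corpus.read_bytes()
    digest = hashlib.sha256(data).hexdigest()

    # Checksum file: lets you verify later runs used the exact same bytes.
    corpus.with_suffix(".sha256").write_text(digest + "\n")

    # Lineage manifest: enough metadata to reproduce the split.
    manifest = {
        "source": corpus.name,
        "bytes": len(data),
        "sha256": digest,
        "train_split": 0.9,   # mirrors --split 0.9 from the CLI example
    }
    corpus.with_suffix(".manifest").write_text(json.dumps(manifest, indent=2))

snapshot(Path("data/my_corpus.txt"))
```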
4. Train the Tokenizer: BPE Demystified
Byte-Pair Encoding starts with single bytes and repeatedly merges the most frequent adjacent pair into a new symbol.
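To make that merge loop concrete, here is a toy, from-scratch version of the core BPE step (count adjacent pairs, merge the most frequent one, repeat). It is a teaching sketch, not the tokenizer the repo ships in `tokenizer.py`.

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Toy BPE: repeatedly replace the most frequent adjacent pair with a new id."""
    ids = list(text.encode("utf-8"))          # start from raw bytes (0..255)
    merges = []
    next_id = 256                             # new symbols get ids above the byte range
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))    # count every adjacent pair
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]     # most frequent pair wins
        merges.append(best)
        # Rewrite the sequence with the new symbol in place of the pair.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges

print(bpe_merges("low lower lowest", 5))
```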
Why care?
- Works for any Unicode text (Chinese, code, emoji) because bytes are universal
- Vocabulary size is a dial you can turn: 16 k, 32 k, 50 k … you decide
4.1 Hyper-parameters in One Glance
| Setting | Starter Value | Meaning |
|---|---|---|
| `vocab_size` | `32000` | Stop merging when the table reaches this count |
| `pretokenizer` | `"digits"` | Split `1234` into `1 2 3 4` to avoid endless numeric tails |
| `regex_pattern` | `\w+ \| \s+` | Splits text into word-character and whitespace runs before merging |
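To see what those two settings do in practice, a small illustration (the regex is the one from the table, written without the display spaces around the alternation; the digit behaviour follows its description):

```python
import re

text = "Invoice 1234 totals 56 dollars"

# The regex pre-tokeniser chops the text into word and whitespace chunks.
chunks = re.findall(r"\w+|\s+", text)
print(chunks)        # ['Invoice', ' ', '1234', ' ', 'totals', ' ', '56', ' ', 'dollars']

# With the "digits" pre-tokeniser, numbers are further split into single digits,
# so BPE never has to learn a separate symbol for every number it meets.
digit_split = [" ".join(c) if c.isdigit() else c for c in chunks]
print(digit_split)   # ['Invoice', ' ', '1 2 3 4', ' ', 'totals', ' ', '5 6', ' ', 'dollars']
```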
4.2 Train Command
```bash
python -m scripts.train_tokenizer \
  --config configs/tokenizer/default__v002.json \
  --input data/my_corpus.txt \
  --set vocab_size=32000
```
Outputs land in `runs/tokenizer/<id>/`.
Pro tip: Train two vocab sizes (16 k vs 32 k), open both in the Web UI comparator, and keep whichever gives the lower bytes-per-token ratio.
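The bytes-per-token ratio is just the UTF-8 size of a held-out sample divided by the number of tokens the tokenizer turns it into. A minimal sketch of the comparison, assuming each trained tokenizer exposes a hypothetical `encode(text) -> list[int]` helper (the Web UI comparator does this for you):

```python
def bytes_per_token(text: str, encode) -> float:
    """UTF-8 bytes of the text divided by the number of tokens it encodes to."""
    tokens = encode(text)
    return len(text.encode("utf-8")) / max(len(tokens), 1)

# Hypothetical usage with two trained tokenizers:
# sample = open("data/my_corpus.txt", encoding="utf-8").read()[:100_000]
# print(bytes_per_token(sample, tok_16k.encode))
# print(bytes_per_token(sample, tok_32k.encode))
```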
5. Design the Network: How Many Layers Is “Enough”?
The model is a straight GPT stack:
Token Embedding → Transformer Blocks → LayerNorm → Linear → Logits
Each block = Multi-Head Self-Attention + FFN.
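As a reference point, a single block in this kind of stack looks roughly like the following PyTorch sketch (pre-norm attention plus an MLP with residual connections; the repo's actual module names and defaults may differ):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT block: LayerNorm → causal self-attention → LayerNorm → FFN, with residuals."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x

x = torch.randn(2, 16, 768)          # (batch, sequence, embedding)
print(Block(dim=768, heads=8)(x).shape)
```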
5.1 Rule-of-Thumb Sizing Table
| Param Count | Layers | Heads | Embed Dim | VRAM (fp32) |
|---|---|---|---|---|
| 10 M | 6 | 6 | 512 | ~2 GB |
| 30 M | 8 | 8 | 768 | ~4 GB |
| 100 M | 12 | 12 | 1024 | ~8 GB |
Mixed-precision (fp16/bfloat16) cuts usage almost in half.
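The saving comes from running the forward and backward passes in 16-bit while keeping master weights in fp32. In plain PyTorch that is a `torch.autocast` context, shown here as a generic sketch rather than the repo's exact training loop:

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # only needed for fp16; bfloat16 can skip it

def train_step(model, batch, targets, optimizer, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Activations are computed in half precision, parameters stay fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(batch)
        loss = loss_fn(logits, targets)
    scaler.scale(loss).backward()       # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```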
5.2 Modern “Tiny-Stack” Flags
Add these lines to any config JSON for free-ish quality gains:
"use_rmsnorm": true,
"use_swiglu": true,
"use_rope": true
- RMSNorm saves ~5 % VRAM
- SwiGLU gains ~1 perplexity point (both are sketched below)
- RoPE helps extrapolate to longer sequences (but disables the absolute position embedding)
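For reference, here is what the first two flags typically switch in, written as generic PyTorch modules (the repo's own implementations may differ in details such as bias handling or hidden size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm without mean-centring or bias: scale by the root-mean-square."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated FFN: silu(x W_gate) * (x W_up), projected back down with W_down."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```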
6. Launch Training: Watch the Browser Plot Live
Single-shot pipeline (if you love defaults):
```bash
python -m scripts.pipeline --config configs/pipeline.json
```
Manual step-by-step (if you love control):
```bash
python -m scripts.train_model \
  --config configs/training/moderntiny__v007.json \
  --data data/my_corpus.txt \
  --set max_iters=50000
```
Under the hood the script:
- Locates the newest tokenizer and tokenised `.bin`
- Saves a checkpoint every 100 iterations
- Writes 10 fresh samples every 500 iterations
- Appends loss, perplexity, and learning rate to `logs.jsonl` (see the snippet below for reading it back)
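Because `logs.jsonl` is plain JSON Lines, you can poke at it without the Web UI. A minimal sketch, assuming one JSON object per line with fields along the lines of `loss` and `perplexity` (the exact field names and path may differ in your run folder):

```python
import json
from pathlib import Path

log_path = Path("runs/train/<id>/logs.jsonl")   # substitute your run id

records = [json.loads(line) for line in log_path.read_text().splitlines() if line.strip()]
last = records[-1]
print(f"iterations logged: {len(records)}")
print(f"latest loss:       {last.get('loss')}")
print(f"latest perplexity: {last.get('perplexity')}")
```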
6.1 Real-Time Dashboard
```bash
python -m scripts.web_ui
# Surf to http://localhost:8080
# Choose “Runs” → latest ID → “Loss Curve” tab
```
Loss drops in real time; attention heat-maps update on refresh.
7. Inspect Results: Loss, Perplexity, Attention Heat-Maps
| Metric | How to Read It | Typical Range (30 M, Chinese) |
|---|---|---|
| Loss | Should slope down, then plateau | 2.8–3.2 |
| Perplexity | `exp(loss)` | 16–24 |
| Attention | Diagonal stripe = causal; vertical stripe = copy previous token | Visual sanity check |
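The perplexity band in the table is just the loss band passed through the exponential: exp(2.8) ≈ 16.4 and exp(3.2) ≈ 24.5, which is where the 16–24 range comes from. The same check in Python:

```python
import math
print(math.exp(2.8), math.exp(3.2))   # ≈ 16.44 and 24.53
```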
If the loss spikes and then keeps climbing, the learning rate is too high: halve it and resume from the last checkpoint.
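A minimal resume sketch, assuming the checkpoint is a standard `torch.save` dict with hypothetical `"model"`, `"optimizer"`, and `"iter"` keys (the actual layout of `checkpt.pt` and the repo's own resume mechanism may differ):

```python
import torch

def resume_with_halved_lr(model, optimizer, path: str) -> int:
    """Load a checkpoint, halve the learning rate, and return the iteration to resume from.

    The dict keys here ("model", "optimizer", "iter") are assumptions; adjust them
    to whatever checkpt.pt actually stores.
    """
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    for group in optimizer.param_groups:
        group["lr"] *= 0.5
    return ckpt.get("iter", 0)
```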
8. Troubleshooting Cheat-Sheet: 20 Common Errors
| Symptom | Root Cause | Quick Fix |
|---|---|---|
| CUDA out of memory | Batch size too big | Edit config: `"batch_size": 32` |
| Stuck at “Loading tokens” | Corrupted cache | Delete `runs/tokens/*/train.bin`, regenerate |
| Garbled Chinese output | Input not UTF-8 | `iconv -f gbk -t utf-8 old.txt > new.txt` |
| Samples repeat forever | `temperature = 0` | Set `"sample_temperature": 0.8` |
| Web UI shows no runs | Symlink broken | `ln -s runs/train/<id> runs/train/latest` |
9. After the Finish Line: Shrinking, Quantising, and Porting to Mobile
llm-madness stops at a fp32 checkpoint. To go smaller:
- Dynamic quantisation (`torch.quantization`) → int8, 4× smaller, <1 % PPL loss
- QAT (Quantisation-Aware Training) → continue training with fake-quant layers to recover accuracy
- ONNX export → MLC-LLM → run at 30 ms/token on an Android phone
All three routes start from that fp32 checkpoint and use standard PyTorch tooling; no secret sauce required.
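The first route is a few lines of stock PyTorch. A minimal sketch, using a stand-in model here because loading the real checkpoint depends on the repo's model class:

```python
import torch
import torch.nn as nn

# Stand-in for the trained network; load your real fp32 checkpoint instead.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()

# Dynamic quantisation: weights of Linear layers become int8, activations stay float.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

print(f"fp32 weights: {size_mb(model):.1f} MB (int8 version is roughly 4x smaller on disk)")
```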
Closing Thoughts: A Tiny Model Is a Microscope, Not a Toy
When you can see why the attention map lights up at “Covid-19” and “virus” in the same sentence, the maths in hundred-billion-parameter papers suddenly feels concrete. llm-madness is built to give you that X-ray vision: fast, local, and fully open. Enjoy the experiments, open pull requests, and remember to share what you break (and fix) along the way.

