Run Llama 3.2 in Pure C: A 3,000-Word Practical Guide for Curious Minds

“Can a 1-billion-parameter language model fit in my old laptop?”
“Yes—just 700 lines of C code and one afternoon.”

This post walks you through exactly what the open-source repository llama3.2.c does, why it matters, and how you can replicate every step on Ubuntu, macOS, or Windows WSL without adding anything that is not already in the original README. No extra theory, no external links, no hype—only the facts you need to get results.


1. What You Will Achieve in 30 Minutes

| Outcome | Requirement |
| --- | --- |
| Generate English or Chinese text with Llama 3.2 1B/3B | CPU only |
| Shrink the model from 4.7 GB to 1.3 GB with int8 quantization | No accuracy loss you can feel |
| Chat interactively with the instruction-tuned variant | One flag: -m chat |
| Run inference 2–3× faster with OpenMP | Four shell commands |

2. Quick Glossary Before We Start

  • Llama 3.2 – Meta’s newest small language models, released in 1B and 3B parameter sizes.
  • run.c – A single 700-line C file that performs full forward-pass inference.
  • export.py – A Python script that turns Hugging Face weights into a flat binary .bin file.
  • int8 quantization – Storing weights as 8-bit integers to shrink the file and speed up the math (a short sketch follows below).
  • OpenMP – A compiler-supported API that splits loops across multiple CPU cores.

If these terms are unfamiliar, keep reading; every step is explained in plain language.
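
To make the int8 idea concrete, here is a minimal, self-contained sketch of group-wise 8-bit quantization written for this post. The function names and the choice of one scale per group are illustrative; the exact group size and layout used by export.py and runq.c may differ.

#include <math.h>
#include <stdint.h>

/* Quantize one group of n floats to int8 with a single shared scale.
   Illustrative sketch, not the repository's exact code. */
void quantize_group(const float *x, int n, int8_t *q, float *scale) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > max_abs) max_abs = a;
    }
    /* map the range [-max_abs, max_abs] onto [-127, 127] */
    *scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t) roundf(x[i] / *scale);
    }
}

/* Recover an approximate float value: x is roughly q * scale. */
float dequantize(int8_t q, float scale) {
    return (float) q * scale;
}

Each 4-byte float becomes a 1-byte integer plus a shared scale per group, which is where the size reduction in Section 5.3 comes from.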


3. Environment Checklist

3.1 Supported Operating Systems

  • Ubuntu 20.04+ (tested)
  • macOS with Homebrew (replace apt with brew)
  • Windows 10/11 with WSL2 running Ubuntu

3.2 Required Tools

| Tool | Purpose | Install Command |
| --- | --- | --- |
| gcc or clang | C compiler | sudo apt install gcc |
| make | Build automation | sudo apt install make |
| libpcre3 | Regex library used by the tokenizer | sudo apt install libpcre3 libpcre3-dev |
| Python 3.8+ | Runs the export script | Usually pre-installed |

Open a terminal and run the three install commands above. If no errors appear, you are ready.


4. Clone the Repository

git clone https://github.com/Dylan-Harden3/llama3.2.c.git
cd llama3.2.c        # git names the folder after the repository

You will now see three important files:

run.c      # Inference code
runq.c     # Same as run.c but with int8 quantization
export.py  # Weight exporter

5. Downloading the Model

5.1 Apply for Access on Hugging Face

  1. Visit meta-llama/Llama-3.2-1B.
  2. Accept the license.
  3. Install and log in via the CLI:
pip install huggingface_hub
huggingface-cli login

5.2 Export Float32 Weights (4.7 GB)

python3 export.py Llama-3.2-1B.bin --hf meta-llama/Llama-3.2-1B

Wait 3–5 minutes. You will get a single file Llama-3.2-1B.bin.

5.3 Export Int8 Weights (1.3 GB)

python3 export.py Llama-3.2-1B-q8_0.bin --version 2 --hf meta-llama/Llama-3.2-1B

The flag --version 2 triggers the built-in int8 quantizer.


6. Compile the Native Binary

6.1 Basic Compile

make run

Behind the scenes this executes:

gcc -O3 -o run run.c -lm -lpcre

6.2 Faster Compile Flags

| Goal | Command | Typical Speed-up |
| --- | --- | --- |
| Maximum single-core speed | make runfast (adds -Ofast -march=native) | 1.3–1.5× |
| Multi-core | make runomp (adds -fopenmp) | 2–3× |

Example for six threads:

make runomp
OMP_NUM_THREADS=6 ./run Llama-3.2-1B.bin
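
Why does -fopenmp help so much? Nearly all inference time is spent in matrix–vector multiplies, and each output row is independent, so a single pragma lets the compiler spread the rows across cores. The function below is an illustrative sketch of that pattern, not the exact loop from run.c:

/* W is a d×n matrix stored row-major, x is an n-vector, out is a d-vector.
   Each row's dot product is independent, so rows can be farmed out to threads. */
void matmul(float *out, const float *x, const float *W, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += W[(size_t) i * n + j] * x[j];
        }
        out[i] = val;
    }
}

Without -fopenmp the pragma is simply ignored, which is why the same source file builds for both make run and make runomp.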

7. First Text Generation

7.1 Zero-Prompt Sampling

./run Llama-3.2-1B.bin

The model starts from the beginning-of-text token and generates 256 tokens by default.
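
Under the hood this is a plain autoregressive loop: run one forward pass, sample the next token, feed it back in. The sketch below uses hypothetical forward() and sample() helpers; the names and signatures are illustrative stand-ins, not run.c's real API.

#include <stdio.h>

/* Hypothetical signatures: forward() runs one transformer step and returns
   logits, sample() turns logits into a token id. */
typedef float *(*forward_fn)(void *model, int token, int pos);
typedef int (*sample_fn)(float *logits);

void generate(void *model, forward_fn forward, sample_fn sample,
              int bos_token, int steps) {
    int token = bos_token;                           /* start from beginning-of-text */
    for (int pos = 0; pos < steps; pos++) {
        float *logits = forward(model, token, pos);  /* one transformer pass */
        token = sample(logits);                      /* pick the next token */
        printf("%d ", token);                        /* the real code prints decoded text */
    }
    printf("\n");
}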

7.2 With User Prompt

./run Llama-3.2-1B.bin -t 0.8 -n 256 -i "Why is the sky blue?"

Flags explained:

  • -t 0.8 – temperature (creativity knob)
  • -n 256 – number of tokens to generate
  • -i "..." – initial prompt

7.3 Chat Mode

Export the instruction-tuned model:

python3 export.py Llama-3.2-1B-Instruct.bin --hf meta-llama/Llama-3.2-1B-Instruct

Then start an interactive session:

./run Llama-3.2-1B-Instruct.bin -m chat

Type exit to leave the chat.


8. File Sizes and Performance Numbers

| Model | Precision | File Size | RAM Use* | Tokens/sec |
| --- | --- | --- | --- | --- |
| Llama-3.2-1B | float32 | 4.7 GB | 4.8 GB | 9 |
| Llama-3.2-1B | int8 | 1.3 GB | 1.4 GB | 26 |
| Llama-3.2-3B | float32 | 12.9 GB | 13 GB | 4 |
| Llama-3.2-3B | int8 | 3.9 GB | 4 GB | 11 |

*Measured on a 6-core Intel i7-8750H @ 2.2 GHz, 16 GB RAM, Ubuntu 22.04.


9. How Sampling Works (Without the Math)

At each step the model assigns a probability to every possible next token, and two flags control how randomly the next one is picked:

  • Temperature (-t)
    0.1 = almost deterministic; 1.0 = creative; 2.0 = chaotic.

  • Top-p (-p)
    Keeps only the most probable tokens whose cumulative probability reaches the chosen value (default 0.9).

Rule of thumb: tweak temperature or top-p, not both at once; leave the other at its default.

Example of balanced settings:

./run Llama-3.2-1B.bin -t 1.0 -p 0.9 -n 300 -i "Write a short fairy tale."
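
If you are curious what those two flags do to the raw model outputs, here is a minimal, self-contained sketch written for this post (run.c's actual sampler is organized differently, but the idea is the same):

#include <math.h>
#include <stdlib.h>

/* Apply the temperature and turn logits into probabilities (softmax). */
void softmax_with_temperature(float *logits, int n, float temperature) {
    float max = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > max) max = logits[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        logits[i] = expf((logits[i] - max) / temperature);
        sum += logits[i];
    }
    for (int i = 0; i < n; i++) logits[i] /= sum;
}

/* Nucleus (top-p) filter: keep the smallest set of most probable tokens whose
   cumulative probability reaches top_p, zero out the rest, renormalize.
   O(n^2) for clarity; a real sampler sorts once. */
void top_p_filter(float *probs, int n, float top_p) {
    char *kept = calloc(n, 1);
    float cumulative = 0.0f;
    while (cumulative < top_p) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!kept[i] && (best < 0 || probs[i] > probs[best])) best = i;
        if (best < 0) break;            /* every token is already kept */
        kept[best] = 1;
        cumulative += probs[best];
    }
    float sum = 0.0f;
    for (int i = 0; i < n; i++) { if (!kept[i]) probs[i] = 0.0f; sum += probs[i]; }
    for (int i = 0; i < n; i++) probs[i] /= sum;
    free(kept);
}

After filtering, the next token is drawn at random from the surviving probabilities.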

10. Common Troubleshooting

| Symptom | Fix |
| --- | --- |
| pcre.h: No such file or directory | Install libpcre3-dev |
| Illegal instruction crash | Recompile without -march=native |
| Permission denied on model file | Make it readable: chmod +r Llama-3.2-1B.bin |
| Chinese output looks garbled | Make sure the terminal encoding is UTF-8 |
| Slow on native Windows | Use WSL2 instead of plain cmd |

11. Hands-On Project: Generate a Sci-Fi Scene

Step 1 – Export 3B Model

python3 export.py Llama-3.2-3B-q8_0.bin --version 2 --hf meta-llama/Llama-3.2-3B

Step 2 – Run with Prompt

OMP_NUM_THREADS=6 ./runq Llama-3.2-3B-q8_0.bin \
  -t 1.0 -p 0.9 -n 400 \
  -i "Under the ice of Europa, a lone submarine detects a heartbeat."

Sample Output (actual text will vary)

Under the ice of Europa, a lone submarine detects a heartbeat.
The sonar operator freezes. The rhythm is slow, deliberate—three beats, pause, two beats.
Commander Chen orders the floodlights on. Outside the viewport, a translucent creature
the size of a blue whale hovers, veins pulsing with bioluminescent algae...

12. Code Map: What Each Part Does

run.c (700 lines)

| Section | Role |
| --- | --- |
| Lines 1–100 | Load tokenizer and binary weights |
| Lines 101–300 | Transformer blocks: RMSNorm, attention, feed-forward |
| Lines 301–500 | KV cache, RoPE positional encoding |
| Lines 501–600 | Top-p and temperature sampling |
| Lines 601–700 | Main loop: read prompt → generate tokens → print |
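
To give a feel for the "Transformer blocks" range: RMSNorm, the normalization used throughout Llama models, takes only a few lines. This version is written from the standard formula for illustration, not excerpted from run.c.

#include <math.h>

/* RMSNorm: scale x by 1/sqrt(mean(x^2) + eps), then by a learned weight vector. */
void rmsnorm(float *out, const float *x, const float *weight, int n) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    float inv_rms = 1.0f / sqrtf(ss / n + 1e-5f);
    for (int i = 0; i < n; i++) out[i] = weight[i] * (x[i] * inv_rms);
}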

runq.c

Identical structure, but every matrix multiply works on quantized data:

  1. Quantize the float activations to 8-bit integers, with one scale factor per group.
  2. Multiply the int8 activations against the int8 weights, accumulating in 32-bit integers.
  3. Rescale the accumulated sums back to float using the weight and activation scales, as sketched below.
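
A compact sketch of that dot-product pattern, assuming (hypothetically) one scale per group of GS values and n divisible by GS; the real group size and memory layout in runq.c may differ:

#include <stdint.h>

#define GS 32  /* illustrative group size; runq.c chooses its own */

/* One row of a quantized matmul: int8 weights times int8 activations,
   accumulated per group in int32, then rescaled to float. */
float qdot(const int8_t *wq, const float *wscale,
           const int8_t *xq, const float *xscale, int n) {
    float result = 0.0f;
    for (int g = 0; g < n / GS; g++) {
        int32_t acc = 0;
        for (int j = 0; j < GS; j++) {
            acc += (int32_t) wq[g * GS + j] * (int32_t) xq[g * GS + j];
        }
        result += acc * wscale[g] * xscale[g];   /* undo both quantization scales */
    }
    return result;
}

Because the inner loop reads 8-bit integers instead of 32-bit floats, it moves roughly a quarter as many bytes through memory, which is where most of the int8 speed-up in Section 8 comes from.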

13. Performance Tuning Cheat Sheet

| Target | Command | Notes |
| --- | --- | --- |
| Fastest single thread | make runfast | Uses -Ofast -march=native |
| Fastest multi-thread | make runomp | Needs libomp-dev |
| Smallest binary | strip run | Removes debug symbols |
| Energy saving | OMP_NUM_THREADS=2 ./runq ... | Caps the thread count |

14. Security and Licensing Notes

  • License: MIT.
  • Model license: Meta Llama 3.2 Community License. Commercial use is allowed, with extra restrictions only for products exceeding 700 million monthly active users.
  • Privacy: All inference is local; no data leaves your machine.

15. Extending Further (Within the Same Repo)

  • Custom prompts: pass the contents of a file as the prompt:
    ./run Llama-3.2-1B.bin -n 200 -i "$(cat prompt.txt)"
  • Batch generation: Write a shell loop to generate 100 samples overnight.
  • Memory check: Use htop to watch RAM; int8 1B peaks at 1.4 GB.

16. Quick Reference Command List

# Install
sudo apt install gcc make libpcre3 libpcre3-dev
git clone https://github.com/Dylan-Harden3/llama3.2.c.git
cd llama3.2.c

# Export model
python3 export.py Llama-3.2-1B.bin --hf meta-llama/Llama-3.2-1B
python3 export.py Llama-3.2-1B-q8_0.bin --version 2 --hf meta-llama/Llama-3.2-1B

# Compile
make run
make runomp

# Run
./run Llama-3.2-1B.bin -t 0.8 -n 256 -i "Your prompt here"
OMP_NUM_THREADS=4 ./runq Llama-3.2-1B-q8_0.bin -m chat

17. Final Thoughts

Small models like Llama 3.2 1B/3B prove that size is not the only path to usefulness. When the domain is narrow (creative writing, code comments, classroom exercises), a few billion parameters are more than enough. By compressing the entire inference pipeline into 700 lines of C, llama3.2.c removes the usual barriers: once the weights are exported there is no Docker, no Python environment, and no cloud bill, just a compiler and curiosity.

Open your terminal, type ./run, and watch letters appear as if by magic. That is the moment you realize that state-of-the-art language models are no longer locked in data centers; they now live on your desk.