Run Llama 3.2 in Pure C: A 3,000-Word Practical Guide for Curious Minds

“Can a 1-billion-parameter language model fit in my old laptop?”
“Yes—just 700 lines of C code and one afternoon.”

This post walks you through exactly what the open-source repository llama3.2.c does, why it matters, and how you can replicate every step on Ubuntu, macOS, or Windows WSL without adding anything that is not already in the original README. No extra theory, no external links, no hype—only the facts you need to get results.


1. What You Will Achieve in 30 Minutes

| Outcome | Requirement |
| --- | --- |
| Generate English or Chinese text with Llama 3.2 1B/3B | CPU only |
| Shrink the model from 4.7 GB to 1.3 GB with int8 quantization | No accuracy loss you can feel |
| Chat interactively with the instruction-tuned variant | One flag: -m chat |
| Run inference 2–3× faster with OpenMP | Four shell commands |

2. Quick Glossary Before We Start

  • Llama 3.2 – Meta’s newest small language models, released in 1B and 3B parameter sizes.
  • run.c – A single 700-line C file that performs full forward-pass inference.
  • export.py – A Python script that turns Hugging Face weights into a flat binary .bin file.
  • int8 quantization – Storing weights as 8-bit integers to shrink the file and speed up the math (a short sketch follows below).
  • OpenMP – A compiler-supported API that splits loops across multiple CPU cores.

If these terms are unfamiliar, keep reading; every step is explained in plain language.
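
To make the int8 idea concrete, here is a minimal, self-contained sketch of group-wise 8-bit quantization written for this post. The function names and the choice of one scale per group are illustrative; the exact group size and layout used by export.py and runq.c may differ.

#include <math.h>
#include <stdint.h>

/* Quantize one group of n floats to int8 with a single shared scale.
   Illustrative sketch, not the repository's exact code. */
void quantize_group(const float *x, int n, int8_t *q, float *scale) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > max_abs) max_abs = a;
    }
    /* map the range [-max_abs, max_abs] onto [-127, 127] */
    *scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t) roundf(x[i] / *scale);
    }
}

/* Recover an approximate float value: x is roughly q * scale. */
float dequantize(int8_t q, float scale) {
    return (float) q * scale;
}

Each 4-byte float becomes a 1-byte integer plus a shared scale per group, which is where the size reduction in Section 5.3 comes from.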


3. Environment Checklist

3.1 Supported Operating Systems

  • Ubuntu 20.04+ (tested)
  • macOS with Homebrew (replace apt with brew)
  • Windows 10/11 with WSL2 running Ubuntu

3.2 Required Tools

| Tool | Purpose | Install Command |
| --- | --- | --- |
| gcc or clang | C compiler | sudo apt install gcc |
| make | Build automation | sudo apt install make |
| libpcre3 | Regex library used by the tokenizer | sudo apt install libpcre3 libpcre3-dev |
| Python 3.8+ | Runs the export script | Usually pre-installed |

Open a terminal and run the three install commands above. If no errors appear, you are ready.


4. Clone the Repository

git clone https://github.com/Dylan-Harden3/llama3.2.c.git
cd llama3.2.c        # git names the folder after the repository

You will now see three important files:

run.c      # Inference code
runq.c     # Same as run.c but with int8 quantization
export.py  # Weight exporter

5. Downloading the Model

5.1 Apply for Access on Hugging Face

  1. Visit meta-llama/Llama-3.2-1B.
  2. Accept the license.
  3. Install and log in via the CLI:
pip install huggingface_hub
huggingface-cli login

5.2 Export Float32 Weights (4.7 GB)

python3 export.py Llama-3.2-1B.bin --hf meta-llama/Llama-3.2-1B

Wait 3–5 minutes. You will get a single file Llama-3.2-1B.bin.

5.3 Export Int8 Weights (1.3 GB)

python3 export.py Llama-3.2-1B-q8_0.bin --version 2 --hf meta-llama/Llama-3.2-1B

The flag --version 2 triggers the built-in int8 quantizer.


6. Compile the Native Binary

6.1 Basic Compile

make run

Behind the scenes this executes:

gcc -O3 -o run run.c -lm -lpcre

6.2 Faster Compile Flags

| Goal | Command | Typical Speed-up |
| --- | --- | --- |
| Maximum single-core speed | make runfast (adds -Ofast -march=native) | 1.3–1.5× |
| Multi-core | make runomp (adds -fopenmp) | 2–3× |

Example for six threads:

make runomp
OMP_NUM_THREADS=6 ./run Llama-3.2-1B.bin
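
Why does -fopenmp help so much? Nearly all inference time is spent in matrix–vector multiplies, and each output row is independent, so a single pragma lets the compiler spread the rows across cores. The function below is an illustrative sketch of that pattern, not the exact loop from run.c:

/* W is a d×n matrix stored row-major, x is an n-vector, out is a d-vector.
   Each row's dot product is independent, so rows can be farmed out to threads. */
void matmul(float *out, const float *x, const float *W, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += W[(size_t) i * n + j] * x[j];
        }
        out[i] = val;
    }
}

Without -fopenmp the pragma is simply ignored, which is why the same source file builds for both make run and make runomp.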

7. First Text Generation

7.1 Zero-Prompt Sampling

./run Llama-3.2-1B.bin

The model starts from the beginning-of-text token and generates 256 tokens by default.
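
Under the hood this is a plain autoregressive loop: run one forward pass, sample the next token, feed it back in. The sketch below uses hypothetical forward() and sample() helpers; the names and signatures are illustrative stand-ins, not run.c's real API.

#include <stdio.h>

/* Hypothetical signatures: forward() runs one transformer step and returns
   logits, sample() turns logits into a token id. */
typedef float *(*forward_fn)(void *model, int token, int pos);
typedef int (*sample_fn)(float *logits);

void generate(void *model, forward_fn forward, sample_fn sample,
              int bos_token, int steps) {
    int token = bos_token;                           /* start from beginning-of-text */
    for (int pos = 0; pos < steps; pos++) {
        float *logits = forward(model, token, pos);  /* one transformer pass */
        token = sample(logits);                      /* pick the next token */
        printf("%d ", token);                        /* the real code prints decoded text */
    }
    printf("\n");
}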

7.2 With User Prompt

./run Llama-3.2-1B.bin -t 0.8 -n 256 -i "Why is the sky blue?"

Flags explained:

  • -t 0.8 – temperature (creativity knob)
  • -n 256 – number of tokens to generate
  • -i "..." – initial prompt

7.3 Chat Mode

Export the instruction-tuned model:

python3 export.py Llama-3.2-1B-Instruct.bin --hf meta-llama/Llama-3.2-1B-Instruct

Then start an interactive session:

./run Llama-3.2-1B-Instruct.bin -m chat

Type exit to leave the chat.


8. File Sizes and Performance Numbers

| Model | Precision | File Size | RAM Use* | Tokens/sec |
| --- | --- | --- | --- | --- |
| Llama-3.2-1B | float32 | 4.7 GB | 4.8 GB | 9 |
| Llama-3.2-1B | int8 | 1.3 GB | 1.4 GB | 26 |
| Llama-3.2-3B | float32 | 12.9 GB | 13 GB | 4 |
| Llama-3.2-3B | int8 | 3.9 GB | 4 GB | 11 |

*Measured on a 6-core Intel i7-8750H @ 2.2 GHz, 16 GB RAM, Ubuntu 22.04.


9. How Sampling Works (Without the Math)

At each step the model assigns a probability to every possible next token, and two flags control how randomly the next one is picked:

  • Temperature (-t)
    0.1 = almost deterministic; 1.0 = creative; 2.0 = chaotic.

  • Top-p (-p)
    Keeps only the most probable tokens whose cumulative probability reaches the chosen value (default 0.9).

Rule of thumb: tweak temperature or top-p, not both at once; leave the other at its default.

Example of balanced settings:

./run Llama-3.2-1B.bin -t 1.0 -p 0.9 -n 300 -i "Write a short fairy tale."
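
If you are curious what those two flags do to the raw model outputs, here is a minimal, self-contained sketch written for this post (run.c's actual sampler is organized differently, but the idea is the same):

#include <math.h>
#include <stdlib.h>

/* Apply the temperature and turn logits into probabilities (softmax). */
void softmax_with_temperature(float *logits, int n, float temperature) {
    float max = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > max) max = logits[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        logits[i] = expf((logits[i] - max) / temperature);
        sum += logits[i];
    }
    for (int i = 0; i < n; i++) logits[i] /= sum;
}

/* Nucleus (top-p) filter: keep the smallest set of most probable tokens whose
   cumulative probability reaches top_p, zero out the rest, renormalize.
   O(n^2) for clarity; a real sampler sorts once. */
void top_p_filter(float *probs, int n, float top_p) {
    char *kept = calloc(n, 1);
    float cumulative = 0.0f;
    while (cumulative < top_p) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!kept[i] && (best < 0 || probs[i] > probs[best])) best = i;
        if (best < 0) break;            /* every token is already kept */
        kept[best] = 1;
        cumulative += probs[best];
    }
    float sum = 0.0f;
    for (int i = 0; i < n; i++) { if (!kept[i]) probs[i] = 0.0f; sum += probs[i]; }
    for (int i = 0; i < n; i++) probs[i] /= sum;
    free(kept);
}

After filtering, the next token is drawn at random from the surviving probabilities.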

10. Common Troubleshooting

| Symptom | Fix |
| --- | --- |
| pcre.h: No such file or directory | Install libpcre3-dev |
| Illegal instruction crash | Recompile without -march=native |
| Permission denied on model file | Make it readable: chmod +r Llama-3.2-1B.bin |
| Chinese output looks garbled | Make sure the terminal encoding is UTF-8 |
| Slow on native Windows | Use WSL2 instead of plain cmd |

11. Hands-On Project: Generate a Sci-Fi Scene

Step 1 – Export 3B Model

python3 export.py Llama-3.2-3B-q8_0.bin --version 2 --hf meta-llama/Llama-3.2-3B

Step 2 – Run with Prompt

OMP_NUM_THREADS=6 ./runq Llama-3.2-3B-q8_0.bin \
  -t 1.0 -p 0.9 -n 400 \
  -i "Under the ice of Europa, a lone submarine detects a heartbeat."

Sample Output (actual text will vary)

Under the ice of Europa, a lone submarine detects a heartbeat.
The sonar operator freezes. The rhythm is slow, deliberate—three beats, pause, two beats.
Commander Chen orders the floodlights on. Outside the viewport, a translucent creature
the size of a blue whale hovers, veins pulsing with bioluminescent algae...

12. Code Map: What Each Part Does

run.c (700 lines)

| Section | Role |
| --- | --- |
| Lines 1–100 | Load tokenizer and binary weights |
| Lines 101–300 | Transformer blocks: RMSNorm, attention, feed-forward |
| Lines 301–500 | KV cache, RoPE positional encoding |
| Lines 501–600 | Top-p and temperature sampling |
| Lines 601–700 | Main loop: read prompt → generate tokens → print |
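
To give a feel for the "Transformer blocks" range: RMSNorm, the normalization used throughout Llama models, takes only a few lines. This version is written from the standard formula for illustration, not excerpted from run.c.

#include <math.h>

/* RMSNorm: scale x by 1/sqrt(mean(x^2) + eps), then by a learned weight vector. */
void rmsnorm(float *out, const float *x, const float *weight, int n) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    float inv_rms = 1.0f / sqrtf(ss / n + 1e-5f);
    for (int i = 0; i < n; i++) out[i] = weight[i] * (x[i] * inv_rms);
}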

runq.c

Identical structure, but every matrix multiply works on quantized data:

  1. Quantize the float activations to 8-bit integers, with one scale factor per group.
  2. Multiply the int8 activations against the int8 weights, accumulating in 32-bit integers.
  3. Rescale the accumulated sums back to float using the weight and activation scales, as sketched below.
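
A compact sketch of that dot-product pattern, assuming (hypothetically) one scale per group of GS values and n divisible by GS; the real group size and memory layout in runq.c may differ:

#include <stdint.h>

#define GS 32  /* illustrative group size; runq.c chooses its own */

/* One row of a quantized matmul: int8 weights times int8 activations,
   accumulated per group in int32, then rescaled to float. */
float qdot(const int8_t *wq, const float *wscale,
           const int8_t *xq, const float *xscale, int n) {
    float result = 0.0f;
    for (int g = 0; g < n / GS; g++) {
        int32_t acc = 0;
        for (int j = 0; j < GS; j++) {
            acc += (int32_t) wq[g * GS + j] * (int32_t) xq[g * GS + j];
        }
        result += acc * wscale[g] * xscale[g];   /* undo both quantization scales */
    }
    return result;
}

Because the inner loop reads 8-bit integers instead of 32-bit floats, it moves roughly a quarter as many bytes through memory, which is where most of the int8 speed-up in Section 8 comes from.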

13. Performance Tuning Cheat Sheet

| Target | Command | Notes |
| --- | --- | --- |
| Fastest single thread | make runfast | Uses -Ofast -march=native |
| Fastest multi-thread | make runomp | Needs libomp-dev |
| Smallest binary | strip run | Removes debug symbols |
| Energy saving | OMP_NUM_THREADS=2 ./runq ... | Caps the thread count |

14. Security and Licensing Notes

  • License: MIT.
  • Model license: Meta Llama 3.2 Community License. Commercial use is allowed, with extra restrictions only for products exceeding 700 million monthly active users.
  • Privacy: All inference is local; no data leaves your machine.

15. Extending Further (Within the Same Repo)

  • Custom prompts: pass the contents of a file as the prompt:
    ./run Llama-3.2-1B.bin -n 200 -i "$(cat prompt.txt)"
  • Batch generation: Write a shell loop to generate 100 samples overnight.
  • Memory check: Use htop to watch RAM; int8 1B peaks at 1.4 GB.

16. Quick Reference Command List

# Install
sudo apt install gcc make libpcre3 libpcre3-dev
git clone https://github.com/Dylan-Harden3/llama3.2.c.git
cd llama3.2.c

# Export model
python3 export.py Llama-3.2-1B.bin --hf meta-llama/Llama-3.2-1B
python3 export.py Llama-3.2-1B-q8_0.bin --version 2 --hf meta-llama/Llama-3.2-1B

# Compile
make run
make runomp

# Run
./run Llama-3.2-1B.bin -t 0.8 -n 256 -i "Your prompt here"
OMP_NUM_THREADS=4 ./runq Llama-3.2-1B-q8_0.bin -m chat

17. Final Thoughts

Small models like Llama 3.2 1B/3B prove that size is not the only path to usefulness. When the domain is narrow (creative writing, code comments, classroom exercises), a few billion parameters are more than enough. By compressing the entire inference pipeline into 700 lines of C, llama3.2.c removes the usual barriers: once the weights are exported there is no Docker, no Python environment, and no cloud bill, just a compiler and curiosity.

Open your terminal, type ./run, and watch letters appear as if by magic. That is the moment you realize that state-of-the-art language models are no longer locked in data centers; they now live on your desk.