# qwen600.cu: Building a Minimal CUDA Inference Engine for Qwen3-0.6B

This project began as a simple curiosity: while studying **CUDA programming** and **GPGPU concepts**, I wondered—what if I built an inference engine for a language model completely from scratch?
I chose the [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) model, a compact yet capable LLM that runs smoothly on an **RTX 3050 with 8GB VRAM**. The intention was, and still is, to create an **educational program** that allows deeper learning about **transformer models** while simultaneously practicing CUDA development.
The result is a **static inference engine** for the Qwen3-0.6B instruct model in **bf16 precision**. Measured in tokens per second (see the benchmarking section below), it outperforms some existing solutions:
- About **8.5% faster** than [llama.cpp](https://github.com/ggml-org/llama.cpp)
- About **292% faster** than Hugging Face Transformers with flash-attn
---
## What qwen600 Includes
- Single-batch inference engine
- Static constants for compile-time optimization
- Pure CUDA C/C++ (no Python dependencies except for tokenizer conversion)
- Minimal external libraries (cuBLAS, CUB, standard IO)
- Efficient memory pipeline (see the sketch after this list):
  - Memory-mapped files (mmap)
  - A single device memory block holding all weights
  - Asynchronous host-to-device copies
- Zero-cost, pointer-based GPU weight management
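
The pipeline can be pictured with a short, hedged sketch. The loader below is illustrative only (the `load_checkpoint` name, the `Weights` struct, and its fields are hypothetical, not qwen600's actual API): the checkpoint is mmap'd on the host, one device block is allocated for everything, the bytes are copied with `cudaMemcpyAsync`, and per-tensor pointers are then just offsets into that single allocation, so no further allocations or copies are needed.

```c
// Illustrative sketch (hypothetical names), not qwen600's actual code.
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    __nv_bfloat16 *token_emb;      // first tensor in the block
    __nv_bfloat16 *layer_weights;  // everything after the embedding table
} Weights;

static void load_checkpoint(const char *path, Weights *w,
                            size_t emb_bytes, cudaStream_t stream)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* host side: a zero-copy, read-only view of the file
     * (a real loader would skip the safetensors header first) */
    void *host = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* device side: a single contiguous allocation for every weight tensor */
    char *dev = NULL;
    cudaMalloc((void **)&dev, st.st_size);
    cudaMemcpyAsync(dev, host, st.st_size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    /* "zero-cost" weight management: further tensors are just offsets */
    w->token_emb     = (__nv_bfloat16 *)dev;
    w->layer_weights = (__nv_bfloat16 *)(dev + emb_bytes);

    munmap(host, st.st_size);
    close(fd);
}
```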
---
## Inspirations Behind qwen600
This project builds upon ideas from several open-source implementations:
- [llama.cpp - ggml](https://github.com/ggml-org/llama.cpp)
- [llama2.c - Andrej Karpathy](https://github.com/karpathy/llama2.c)
- [LLMs-from-scratch - Sebastian Raschka](https://github.com/rasbt/LLMs-from-scratch)
- [qwen3.c - Adrian Cable](https://github.com/adriancable/qwen3.c)

---
## Design Philosophy
The design follows the principles of the [suckless philosophy](https://suckless.org/philosophy/):
- Simplicity over feature bloat
- Minimal dependencies
- Direct configuration in `config.h` (a sketch follows below)
- Focus on readability and performance, not abstraction layers

This makes qwen600 **minimalist, transparent, and efficient**.
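
For illustration, a compile-time configuration in this spirit could look like the sketch below. The macro names are hypothetical; the values follow the published Qwen3-0.6B configuration, and baking them in is what lets the compiler specialize kernels for this one model instead of reading shapes from the checkpoint at runtime.

```c
// Hypothetical config.h sketch; constant names are illustrative, values
// follow the published Qwen3-0.6B configuration. Fixing shapes at build
// time lets nvcc unroll loops, size shared memory statically, and drop
// the runtime shape checks a generic engine needs.
#pragma once

#define DIM          1024       // hidden size
#define HIDDEN_DIM   3072       // FFN intermediate size
#define N_LAYERS     28         // transformer blocks
#define N_HEADS      16         // query heads
#define N_KV_HEADS   8          // KV heads (grouped-query attention)
#define HEAD_DIM     128        // per-head dimension
#define VOCAB_SIZE   151936
#define MAX_SEQ_LEN  40960      // maximum context length
#define ROPE_THETA   1000000.0f
```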
---
## Getting Started
### Step 1: Model Setup
Download the [Qwen3-0.6B model](https://huggingface.co/Qwen/Qwen3-0.6B). For reference, Hugging Face provides a guide to [cloning repositories](https://huggingface.co/docs/hub/en/repositories-getting-started).
After downloading, verify the weights file (`model.safetensors`) with a checksum:
```bash
sha256sum <model_dir>/<safetensors-file-name>
```

Expected result:

```
f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b
```

Clone this repository:

```bash
git clone https://github.com/yassa9/qwen600
cd qwen600
```

Convert the Hugging Face tokenizer into qwen600's format:

```bash
python export.py <model_dir>
```

This generates the required template files, including the critical `tokenizer.bin`.
### Step 2: Build the Engine

Requirements:
- CUDA toolkit with `nvcc`
- cuBLAS and CUB

Compile with:

```bash
mkdir build && cd build
cmake .. && make -j$(nproc)
```

That's it: no bulky dependencies, no extra layers.
### Step 3: Running the Model

Inside `qwen600/build`, run:

```bash
./qwen600
```

This prints the command-line usage manual:

```
usage: ./qwen600 <model_dir> [options]
example: ./qwen600 <model_dir> -r 1
```
#### Arguments

| Flag | Type | Description |
|---|---|---|
| `-r` | int | Reasoning mode: 0 = off (default), 1 = on |
| `-s` | int | Random seed |
| `-k` | int | Top-k sampling (default 20) |
| `-t` | float | Temperature (default 0.6) |
| `-p` | float | Top-p (nucleus) sampling (default 0.95) |
| `-i` | string | Input prompt |
| `-y` | string | System prompt for chat mode (optional) |
Example run:

```bash
./qwen600 <model_dir> -r 1 -t 0.65 -p 0.9 -k 20
```

Or simply:

```bash
./qwen600 <model_dir> -r 1
```
### Recommended Settings

Based on the official Hugging Face model card (a sketch of how these sampling parameters interact follows this list):

- Reasoning mode (`enable_thinking=True`):
  - Temperature: 0.6
  - TopP: 0.95
  - TopK: 20
  - Avoid greedy decoding to prevent repetition and performance loss
- Non-reasoning mode (`enable_thinking=False`):
  - Temperature: 0.7
  - TopP: 0.8
  - TopK: 20
  - MinP: 0
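
To make the sampling flags concrete, here is a hedged host-side sketch of how temperature, top-k, and top-p commonly interact; it is not qwen600's actual sampler (the `sample` function and its RNG are illustrative), but the `-t`, `-k`, and `-p` flags play these roles.

```cpp
// Illustrative sampler: temperature scaling, then a top-k cut, then a
// top-p (nucleus) cut, then a draw from the renormalized candidates.
// Assumes temperature > 0 and 0 < top_k <= logits.size().
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

int sample(const std::vector<float>& logits, float temperature,
           int top_k, float top_p, unsigned long long* rng_state) {
    const int n = (int)logits.size();

    // softmax with temperature
    std::vector<float> probs(n);
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.f;
    for (int i = 0; i < n; ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;

    // keep only the top_k most probable token ids
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // within the top-k, keep the smallest prefix whose mass reaches top_p
    float cum = 0.f;
    int cutoff = top_k;
    for (int i = 0; i < top_k; ++i) {
        cum += probs[idx[i]];
        if (cum >= top_p) { cutoff = i + 1; break; }
    }

    // draw uniformly in [0, cum) and walk the truncated distribution
    *rng_state = *rng_state * 6364136223846793005ULL + 1442695040888963407ULL;
    float r = (float)(*rng_state >> 33) / 2147483648.0f * cum;
    float acc = 0.f;
    for (int i = 0; i < cutoff; ++i) {
        acc += probs[idx[i]];
        if (r < acc) return idx[i];
    }
    return idx[cutoff - 1];
}
```

Greedy decoding corresponds to always taking the single highest-probability token, which is exactly what the model card advises against in reasoning mode.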
## Experiments

### Without Reasoning Mode

```bash
./qwen600 <model_dir> -r 0
```

Input:

```
>> what is capital of Greece ?
```

Output:

```
The capital of Greece is Athens
[231.71 tk/s, 19 tokens in 0.08s]
```

### With Reasoning Mode Enabled

```bash
./qwen600 <model_dir> -r 1
```

Input:

```
>> what are llms used for ?
```
The engine first prints a structured reasoning trace, then a final, comprehensive answer covering Large Language Model (LLM) applications across industries such as customer service, education, healthcare, and creative fields, followed by the performance metrics (tokens per second and generation time).
## Benchmarking

All benchmarks were performed on the same hardware setup:

- GPU: RTX 3050 8GB, CUDA 13.0
- CPU: AMD Ryzen 5 3500
- RAM: 16GB
- OS: Void Linux

Benchmark configuration:

- Test prompt: `what are llms used for ?`
- Mode: reasoning enabled
- Temperature: 0 (greedy decoding for consistency)

Each result is an average of 5 runs.

| Inference Engine | Tokens/sec |
|---|---|
| Hugging Face + flash-attn | 29.57 |
| llama.cpp | 107.19 |
| qwen600 | 116.15 |
While the project is educational, these results highlight the benefits of static compile-time optimizations and CUDA-level improvements.
## Development Roadmap

Current progress:

- [x] Fused RMSNorm kernel (sketched below)
- [x] Skip connections fused with cuBLAS
- [ ] Fix softmax kernel and dispatcher
- [ ] Explore pre-computed RoPE values
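
For reference, the first roadmap item can be illustrated with a minimal fused RMSNorm kernel: the sum-of-squares reduction and the normalize-and-scale pass share a single launch instead of two. The kernel below is a sketch under simplifying assumptions (one block, one token row, bf16 data with fp32 accumulation), not the kernel shipped in qwen600.

```cuda
// Minimal fused RMSNorm sketch: one thread block normalizes one row.
#include <cuda_bf16.h>

template <int BLOCK>
__global__ void rmsnorm_kernel(__nv_bfloat16* out,
                               const __nv_bfloat16* x,
                               const __nv_bfloat16* weight,
                               int dim, float eps) {
    __shared__ float partial[BLOCK];

    // each thread accumulates squares over a strided slice of the row
    float ss = 0.f;
    for (int i = threadIdx.x; i < dim; i += BLOCK) {
        float v = __bfloat162float(x[i]);
        ss += v * v;
    }
    partial[threadIdx.x] = ss;
    __syncthreads();

    // tree reduction in shared memory (BLOCK must be a power of two)
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    float inv_rms = rsqrtf(partial[0] / dim + eps);

    // normalize and scale in the same launch
    for (int i = threadIdx.x; i < dim; i += BLOCK) {
        float v = __bfloat162float(x[i]) * inv_rms * __bfloat162float(weight[i]);
        out[i] = __float2bfloat16(v);
    }
}
// example launch: rmsnorm_kernel<256><<<1, 256>>>(d_out, d_x, d_w, 1024, 1e-6f);
```

A production kernel would typically use warp shuffles or CUB's block reduction rather than the shared-memory tree shown here.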
## FAQ

**Q: Who is this project for?**
A: Developers and learners who want to understand the inner workings of CUDA and LLM inference engines. It is not designed for production deployment.

**Q: Why not just use existing frameworks?**
A: The goal is educational: to practice CUDA, minimize dependencies, and understand every part of the pipeline.

**Q: Does it support multiple GPUs?**
A: Currently, no. The implementation targets a single GPU.

**Q: Why does it perform better than llama.cpp in some cases?**
A: Optimizations such as static compilation and kernel fusion provide measurable gains on this single-model, single-batch workload (one example of such fusion is sketched below).
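
One concrete instance of such fusion, mirroring the "skip connections fused with cuBLAS" roadmap item, is folding the residual add into the output projection: with `beta = 1`, cuBLAS computes `C = alpha*A*B + beta*C`, so the projection result lands directly on the residual stream without a separate element-wise kernel. The call below is a hedged sketch (buffer names and shapes are illustrative, not qwen600's actual code) assuming bf16 tensors with fp32 accumulation.

```c
// Sketch: fuse the residual add into the output projection via beta = 1.
// The residual buffer is read and written in place: residual = W*x + residual.
#include <cublas_v2.h>
#include <cuda_bf16.h>

void proj_add_residual(cublasHandle_t handle,
                       const __nv_bfloat16* W,  // [dim x dim], column-major
                       const __nv_bfloat16* x,  // [dim] input activations
                       __nv_bfloat16* residual, // [dim] updated in place
                       int dim) {
    const float alpha = 1.0f, beta = 1.0f;  // beta = 1 keeps the residual
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 dim, 1, dim,                // (dim x dim) * (dim x 1)
                 &alpha,
                 W, CUDA_R_16BF, dim,
                 x, CUDA_R_16BF, dim,
                 &beta,
                 residual, CUDA_R_16BF, dim,
                 CUBLAS_COMPUTE_32F,          // accumulate in fp32
                 CUBLAS_GEMM_DEFAULT);
}
```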