
# qwen600.cu: Building a Minimal CUDA Inference Engine for Qwen3-0.6B

![Project Banner](https://github.com/yassa9/qwen600/raw/main/assets/banner.png)

This project began as a simple curiosity: while studying **CUDA programming** and **GPGPU concepts**, I wondered—what if I built an inference engine for a language model completely from scratch?  

I chose the [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) model, a compact yet capable LLM that runs smoothly on an **RTX 3050 with 8GB VRAM**. The intention was, and still is, to create an **educational program** that enables deeper learning about **transformer models** while practicing CUDA development.

The result is a **static inference engine** for the Qwen3-0.6B instruct model in **bf16 precision**. Benchmarks show that it outperforms some existing solutions:

- About **8.5% faster** than [llama.cpp](https://github.com/ggml-org/llama.cpp)
- About **292% faster** than Hugging Face with flash-attn

Both figures are measured in tokens per second; the full setup is described in the benchmarking section below.

---

## What qwen600 Includes

- Single-batch inference engine
- Static constants for compile-time optimization
- Pure CUDA C/C++ (no Python dependencies except for tokenizer conversion)
- Minimal external libraries (cuBLAS, CUB, standard IO)
- Efficient memory pipeline (sketched after this list):
  - Memory-mapped weight files (`mmap`)
  - A single contiguous GPU memory block
  - Asynchronous host-to-device copies
- Zero-cost, pointer-based GPU weight management
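
To make the memory pipeline above concrete, here is a minimal sketch of the general pattern: map the weight file into host memory with `mmap`, allocate one contiguous device block, copy the bytes across asynchronously, and hand out typed pointers at fixed offsets. The offsets and the `__nv_bfloat16` view shown here are illustrative assumptions, not qwen600's actual loader, which has to parse the safetensors header.

```cuda
// Sketch: mmap -> one device block -> pointer slicing (illustrative, not the project's code).
#include <cuda_bf16.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <weights-file>\n", argv[0]); return 1; }

    // 1) Map the weight file into host address space (no read() copies).
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    void *host = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (host == MAP_FAILED) { perror("mmap"); return 1; }

    // 2) One contiguous device allocation for all weights.
    void *device = nullptr;
    cudaMalloc(&device, st.st_size);

    // 3) Asynchronous host-to-device copy on a dedicated stream.
    //    (For a copy that truly overlaps with compute, the host buffer would need
    //    to be pinned, e.g. via cudaHostRegister; for a one-shot load this is enough.)
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(device, host, st.st_size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // 4) "Zero-cost" weight management: each tensor is just a typed pointer
    //    at a fixed byte offset inside the single device block.
    size_t embedding_offset = 0;  // illustrative offset
    __nv_bfloat16 *tok_embeddings =
        reinterpret_cast<__nv_bfloat16 *>(static_cast<char *>(device) + embedding_offset);
    (void)tok_embeddings;         // consumed by later kernels

    cudaStreamDestroy(stream);
    cudaFree(device);
    munmap(host, st.st_size);
    close(fd);
    return 0;
}
```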

---

## Inspirations Behind qwen600

This project builds upon ideas from several open-source implementations:

- [llama.cpp - ggml](https://github.com/ggml-org/llama.cpp)  
- [llama2.c - Andrej Karpathy](https://github.com/karpathy/llama2.c)  
- [LLMs-from-scratch - Sebastian Raschka](https://github.com/rasbt/LLMs-from-scratch)  
- [qwen3.c - Adrian Cable](https://github.com/adriancable/qwen3.c)  

![Architecture](https://github.com/yassa9/qwen600/raw/main/assets/arch.png)

---

## Design Philosophy

The design follows the principles of the [suckless philosophy](https://suckless.org/philosophy/):  

- Simplicity over feature bloat  
- Minimal dependencies  
- Direct configuration in `config.h` (see the sketch below)  
- Focus on readability and performance, not abstraction layers  

This makes qwen600 **minimalist, transparent, and efficient**.
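
As an illustration of the `config.h` approach, the sketch below shows how model dimensions can be baked in as compile-time constants so buffer sizes and kernel launch shapes are fixed at build time. The constant names and the sequence length are my own; the dimension values reflect my reading of the published Qwen3-0.6B configuration and should be checked against the model's `config.json` and the project's actual header.

```cpp
// config.h -- illustrative compile-time constants for a Qwen3-0.6B-class model.
#pragma once

constexpr int kDim        = 1024;    // hidden size
constexpr int kNumLayers  = 28;      // transformer blocks
constexpr int kNumHeads   = 16;      // attention heads
constexpr int kNumKVHeads = 8;       // grouped-query KV heads
constexpr int kHeadDim    = 128;     // per-head dimension
constexpr int kVocabSize  = 151936;  // tokenizer vocabulary
constexpr int kMaxSeqLen  = 4096;    // KV-cache length chosen at build time (illustrative)

// Derived sizes are also known at compile time, so no runtime config parsing is needed.
constexpr int kKVDim = kNumKVHeads * kHeadDim;
```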

---

## Getting Started

### Step 1: Model Setup

Download the [Qwen3-0.6B model](https://huggingface.co/Qwen/Qwen3-0.6B). For reference, Hugging Face provides a guide to [cloning repositories](https://huggingface.co/docs/hub/en/repositories-getting-started).  

After downloading, verify the weights file (`model.safetensors`) with a checksum:

```bash
sha256sum <model_dir>/<safetensors-file-name>
```

Expected result:

```
f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b
```

Clone this repository:

```bash
git clone https://github.com/yassa9/qwen600
cd qwen600
```

Convert the Hugging Face tokenizer into qwen600's format:

```bash
python export.py <model_dir>
```

This generates the required template files, including the critical `tokenizer.bin`.


### Step 2: Build the Engine

Requirements:

- CUDA + nvcc
- cuBLAS + CUB

Compile with:

```bash
mkdir build && cd build
cmake .. && make -j$(nproc)
```
That’s it—no bulky dependencies, no extra layers.


### Step 3: Running the Model

Inside `qwen600/build`, run:

```bash
./qwen600
```

This prints the command-line usage manual:

```
usage:   ./qwen600 <model_dir> [options]
example: ./qwen600 <model_dir> -r 1
```

#### Arguments

| Flag | Type | Description |
|------|------|-------------|
| `-r` | int | Reasoning mode: 0 = off (default), 1 = on |
| `-s` | int | Random seed |
| `-k` | int | Top-k sampling (default: 20) |
| `-t` | float | Temperature (default: 0.6) |
| `-p` | float | Top-p (nucleus) sampling (default: 0.95) |
| `-i` | string | Input prompt |
| `-y` | string | System prompt for chat mode (optional) |

Example run:

```bash
./qwen600 <model_dir> -r 1 -t 0.65 -p 0.9 -k 20
```

Or simply:

```bash
./qwen600 <model_dir> -r 1
```
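
For readers new to C-style command-line handling, here is a hypothetical sketch of how flags like these can be parsed into a settings struct. It is not qwen600's actual parser; the struct name, field names, and error handling are my own, and the defaults simply mirror the table above.

```cpp
// Hypothetical sketch of parsing qwen600-style flags (not the project's actual parser).
#include <cstdio>
#include <cstdlib>
#include <cstring>

struct SamplerArgs {
    int   reasoning = 0;            // -r : 0 = off (default), 1 = on
    int   seed = 0;                 // -s : random seed
    int   top_k = 20;               // -k : top-k sampling
    float temperature = 0.6f;       // -t : softmax temperature
    float top_p = 0.95f;            // -p : nucleus sampling threshold
    const char *prompt = nullptr;   // -i : input prompt
    const char *system = nullptr;   // -y : optional system prompt
};

static SamplerArgs parse_args(int argc, char **argv)
{
    SamplerArgs a;  // defaults mirror the arguments table above
    // argv[1] is <model_dir>; options follow as "-x value" pairs.
    for (int i = 2; i + 1 < argc; i += 2) {
        if      (!strcmp(argv[i], "-r")) a.reasoning   = atoi(argv[i + 1]);
        else if (!strcmp(argv[i], "-s")) a.seed        = atoi(argv[i + 1]);
        else if (!strcmp(argv[i], "-k")) a.top_k       = atoi(argv[i + 1]);
        else if (!strcmp(argv[i], "-t")) a.temperature = static_cast<float>(atof(argv[i + 1]));
        else if (!strcmp(argv[i], "-p")) a.top_p       = static_cast<float>(atof(argv[i + 1]));
        else if (!strcmp(argv[i], "-i")) a.prompt      = argv[i + 1];
        else if (!strcmp(argv[i], "-y")) a.system      = argv[i + 1];
        else { fprintf(stderr, "unknown flag: %s\n", argv[i]); exit(1); }
    }
    return a;
}
```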

### Recommended Settings

The official Hugging Face model card recommends the following sampling parameters (a minimal sketch of how they interact follows the list):

- **Reasoning mode** (`enable_thinking=True`)
  - Temperature: 0.6
  - TopP: 0.95
  - TopK: 20
  - Avoid greedy decoding to prevent repetition and performance loss
- **Non-reasoning mode** (`enable_thinking=False`)
  - Temperature: 0.7
  - TopP: 0.8
  - TopK: 20
  - MinP: 0
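
To show how temperature, top-k, and top-p work together, here is a small host-side reference sketch of sampling one token from a logits vector. It is a generic implementation, not qwen600's sampler; it assumes temperature > 0 (temperature 0, i.e. greedy decoding, would reduce to an argmax).

```cpp
// Generic temperature + top-k + top-p sampling over a logits vector (host-side sketch).
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

int sample_token(const std::vector<float> &logits,
                 float temperature, int top_k, float top_p, std::mt19937 &rng)
{
    const int n = static_cast<int>(logits.size());

    // 1) Temperature: scale logits, then softmax (max-subtracted for stability).
    std::vector<float> probs(n);
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.f;
    for (int i = 0; i < n; ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float &p : probs) p /= sum;

    // 2) Top-k: keep only the k most probable tokens.
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });
    if (top_k > 0 && top_k < n) idx.resize(top_k);

    // 3) Top-p (nucleus): truncate further to the smallest prefix whose
    //    cumulative probability reaches top_p.
    float cum = 0.f;
    size_t cutoff = idx.size();
    for (size_t i = 0; i < idx.size(); ++i) {
        cum += probs[idx[i]];
        if (cum >= top_p) { cutoff = i + 1; break; }
    }
    idx.resize(cutoff);

    // 4) Renormalize over the survivors and draw one token.
    float kept = 0.f;
    for (int i : idx) kept += probs[i];
    std::uniform_real_distribution<float> dist(0.f, kept);
    float r = dist(rng), acc = 0.f;
    for (int i : idx) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return idx.back();  // numerical fallback
}
```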

---

## Experiments

### Without Reasoning Mode

```bash
./qwen600 <model_dir> -r 0
```

Input:

```
>> what is capital of Greece ?
```

Output:

```
The capital of Greece is Athens
[231.71 tk/s, 19 tokens in 0.08s]
```

### With Reasoning Mode Enabled

```bash
./qwen600 <model_dir> -r 1
```

Input:

```
>> what are llms used for ?
```

The engine produces a structured reasoning process and then a final, comprehensive answer about Large Language Model (LLM) applications across industries such as customer service, education, healthcare, and creative fields. Performance metrics (tokens per second and generation time) are printed after the answer, as in the example above.
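
The tokens-per-second figure is simply the number of generated tokens divided by wall-clock decode time. A minimal sketch of how such a measurement can be taken around the decode loop is below; `decode_one_token` is a placeholder, not a real qwen600 function.

```cpp
// Measuring tokens/sec around a decode loop (sketch; decode_one_token is a placeholder).
#include <chrono>
#include <cstdio>

int decode_one_token();  // placeholder: one forward pass + sampling step

void timed_generation(int max_new_tokens)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();

    int generated = 0;
    for (; generated < max_new_tokens; ++generated) {
        if (decode_one_token() < 0) break;   // e.g. end-of-sequence token
    }

    double seconds = std::chrono::duration<double>(clock::now() - start).count();
    // Mirrors the engine's "[231.71 tk/s, 19 tokens in 0.08s]" style of output.
    printf("[%.2f tk/s, %d tokens in %.2fs]\n", generated / seconds, generated, seconds);
}
```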


---

## Benchmarking

All benchmarks were performed on the same hardware setup:

- GPU: RTX 3050 8GB, CUDA 13.0
- CPU: AMD Ryzen 5 3500
- RAM: 16GB
- OS: Void Linux

Test configuration:

- Prompt: `what are llms used for ?`
- Mode: reasoning enabled
- Temperature: 0 (greedy decoding for consistency)
- Each result is an average of 5 runs

| Inference Engine | Tokens/sec |
|------------------|-----------|
| Hugging Face + flash-attn | 29.57 |
| llama.cpp | 107.19 |
| qwen600 | 116.15 |

While the project is educational, these results highlight the benefits of static compile-time optimizations and CUDA-level improvements.


---

## Development Roadmap

Current progress:

- [x] Fused RMSNorm kernel (a sketch of this kind of kernel follows the list)
- [x] Skip connections fused with cuBLAS
- [ ] Fix softmax kernel and dispatcher
- [ ] Explore pre-computed RoPE values
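
For readers curious what a kernel like the first roadmap item can look like, below is a minimal fused RMSNorm sketch: one thread block normalizes one row, CUB's `BlockReduce` (CUB is one of the project's listed dependencies) computes the sum of squares, and the learned scale is applied in the same pass. This is an educational sketch, not qwen600's actual kernel; the block size, epsilon, and bf16 handling are assumptions.

```cuda
// Fused RMSNorm sketch: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
// One block handles one row; CUB reduces the per-thread partial sums of squares.
#include <cub/block/block_reduce.cuh>
#include <cuda_bf16.h>

constexpr int kBlockSize = 256;   // assumed launch configuration

__global__ void rmsnorm_kernel(__nv_bfloat16 *out,
                               const __nv_bfloat16 *x,
                               const __nv_bfloat16 *weight,
                               int dim, float eps)
{
    const __nv_bfloat16 *row_in  = x   + blockIdx.x * dim;   // one row per block
    __nv_bfloat16       *row_out = out + blockIdx.x * dim;

    // 1) Each thread accumulates part of the sum of squares in fp32.
    float local = 0.f;
    for (int i = threadIdx.x; i < dim; i += blockDim.x) {
        float v = __bfloat162float(row_in[i]);
        local += v * v;
    }

    // 2) Block-wide reduction with CUB; only thread 0 holds the full sum,
    //    so it broadcasts the scale factor through shared memory.
    using BlockReduce = cub::BlockReduce<float, kBlockSize>;
    __shared__ typename BlockReduce::TempStorage temp;
    float sum_sq = BlockReduce(temp).Sum(local);

    __shared__ float inv_rms;
    if (threadIdx.x == 0) inv_rms = rsqrtf(sum_sq / dim + eps);
    __syncthreads();

    // 3) Normalize and apply the learned per-channel weight in the same pass.
    for (int i = threadIdx.x; i < dim; i += blockDim.x) {
        float v = __bfloat162float(row_in[i]) * inv_rms * __bfloat162float(weight[i]);
        row_out[i] = __float2bfloat16(v);
    }
}

// Example launch (illustrative values): one block per token row.
// rmsnorm_kernel<<<num_rows, kBlockSize>>>(d_out, d_x, d_weight, 1024, 1e-6f);
```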

---

## FAQ

**Q: Who is this project for?**
A: Developers and learners who want to understand the inner workings of CUDA and LLM inference engines. It is not designed for production deployment.

**Q: Why not just use existing frameworks?**
A: The goal is educational: to practice CUDA, minimize dependencies, and understand every part of the pipeline.

**Q: Does it support multiple GPUs?**
A: Currently, no. The implementation is single-GPU only.

**Q: Why does it perform better than llama.cpp in some cases?**
A: Optimizations such as static compilation and kernel fusion provide measurable performance improvements.
