Burn: A Friendly Deep-Dive into the Next-Gen Deep Learning Framework for Everyone

A practical walk-through for junior college graduates and working engineers who want to train, tune, and ship models—without juggling three different languages.


Table of Contents

  1. Why yet another framework?
  2. What exactly is Burn?
  3. Performance in plain English
  4. Hardware support at a glance
  5. Training & inference—end-to-end
  6. Your first model in five minutes
  7. Moving models in and out of Burn
  8. Real examples you can run today
  9. Common questions & answers
  10. Where to go next

Why yet another framework?

Every popular framework solves part of the problem, but each typically leaves you with:

  • Python for research, C++/CUDA for production—two code bases, twice the bugs.
  • No painless path from a laptop to a browser or an embedded board.
  • Hand-written GPU kernels if you want real speed—steep learning curve.

Burn’s goal is simple: one Rust code base handles research, training, and deployment on anything from a Raspberry Pi to an NVIDIA A100. No rewrites, no drama.


What exactly is Burn?

| Term you might know | Burn’s counterpart | What it means for you |
|---|---|---|
| PyTorch nn.Module | #[derive(Module)] struct | Same mental model, compile-time safety |
| CUDA kernels | Auto-generated WGSL | You write high-level math, Burn writes the shader |
| torch.save | Record trait + MPK/ZIP | Readable, version-tolerant snapshots |
| Python GIL headache | Rust ownership | True multi-threading, no locks |
| ONNX export | Built-in ONNX import | Bring models from TF/PyTorch into Burn |

Burn is 100 % open source under MIT + Apache 2.0; you can embed it in commercial products without opening your own code.


Performance in plain English

Burn breaks speed into six practical techniques. Here is the short, no-jargon version of each.

1. Automatic kernel fusion – “many ops become one GPU call”

When you chain small operations (add → mul → activation), Burn stitches them into one GPU kernel.
Example: the custom GELU below turns into ~60 lines of WGSL at runtime—hand-written speed without hand-written pain.

use burn::tensor::{backend::Backend, Tensor};
use core::f64::consts::SQRT_2;

// GELU(x) = x/2 * (1 + erf(x / sqrt(2))); Burn fuses the chained ops into one kernel.
fn gelu_custom<B: Backend, const D: usize>(x: Tensor<B, D>) -> Tensor<B, D> {
    let x = x.clone() * ((x / SQRT_2).erf() + 1);
    x / 2
}

You keep the readable math; Burn keeps the speed.

2. Asynchronous execution – “CPU and GPU stop waiting for each other”

Burn’s first-party backends queue work in the background.
Your Python-style loop keeps feeding the GPU while the GPU keeps crunching numbers; no blocking, no wasted cycles.
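
For intuition, here is a rough sketch, assuming the Wgpu back-end from a plain `cargo add burn --features wgpu` setup: every tensor call below only enqueues work, and the read-back at the end is the synchronization point.

use burn::backend::Wgpu;
use burn::tensor::{Distribution, Tensor};

fn main() {
    let device = Default::default();
    let a = Tensor::<Wgpu, 2>::random([1024, 1024], Distribution::Default, &device);
    let b = Tensor::<Wgpu, 2>::random([1024, 1024], Distribution::Default, &device);

    // Both operations are queued on the device and return immediately on the CPU side.
    let c = a.matmul(b).mean();

    // Reading the value back flushes the queue and waits for the result.
    println!("mean = {}", c.into_scalar());
}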

3. Thread-safe building blocks – “multi-GPU without locks”

Each module owns its weights. Clone the module, send it to another thread, compute gradients, send them back—no mutexes, no race conditions.
PyTorch’s .grad mutations simply don’t exist here.
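
A minimal sketch of that pattern, using plain `std::thread` plus the `Module` trait bound (this is not a specific Burn training API, and the worker body is left as a comment):

use burn::module::Module;
use burn::tensor::backend::Backend;
use std::thread;

// Any struct deriving `Module` is plain, clonable data, so a copy can be moved
// into a worker thread without a mutex.
fn spawn_worker<B, M>(model: &M) -> thread::JoinHandle<M>
where
    B: Backend,
    M: Module<B> + Clone + Send + 'static,
{
    let worker = model.clone();
    thread::spawn(move || {
        // ...run forward/backward on `worker` here, then hand the result back...
        worker
    })
}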

4. Smart memory management – “reuse instead of malloc”
  • Memory pool keeps frequently used buffers alive.
  • Ownership rules let Burn mutate tensors in place when safe; small wins add up to large-model savings.
5. Automatic kernel selection – “right kernel, right size, right hardware”

Matrix multiply has dozens of tile/block choices. Burn micro-benchmarks once, caches the best, and reuses the decision. First run slower, every later run faster.

6. Hardware-specific features – “when your chip has a secret weapon, Burn uses it”
  • NVIDIA Tensor Cores: used in CUDA, LibTorch, Candle, WGPU back-ends.
  • Apple Metal Performance Shaders: used in Metal back-end.
    Future chips will come online as soon as WGSL extensions land.

Hardware support at a glance

| Back-end | Devices | Maintainer | Special notes |
|---|---|---|---|
| CUDA | NVIDIA GPUs (Linux, Windows) | Burn team | Tensor Cores active |
| ROCm | AMD GPUs (Linux) | Burn team | |
| Metal | Apple GPUs (macOS, iOS) | Burn team | Apple Silicon supported |
| Vulkan | Most desktop GPUs | Burn team | Linux & Windows |
| Wgpu | Any Vulkan/Metal/DX12 target | Burn team | Also compiles to WebAssembly |
| NdArray | Any CPU | Community | Works in no_std |
| LibTorch | CPU & GPU (via PyTorch C++) | Community | Reuses PyTorch kernels |
| Candle | NVIDIA, Apple, CPU | Community | HuggingFace back-end |

Combining back-ends (optional decorators)

  • Autodiff – adds automatic differentiation to any back-end (see the sketch after this list).
  • Fusion – adds kernel fusion to supported back-ends (WGPU, CUDA today).
  • Router (Beta) – routes parts of the graph to GPU, others to CPU.
  • Remote (Beta) – send ops over the network to a GPU box.
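
As a quick illustration of the Autodiff decorator, here is a minimal sketch, assuming a recent Burn release with the wgpu and autodiff features enabled (the numbers are arbitrary):

use burn::backend::{Autodiff, Wgpu};
use burn::tensor::Tensor;

// Autodiff wraps any inner back-end and adds gradient tracking on top of it.
type TrainingBackend = Autodiff<Wgpu>;

fn main() {
    let device = Default::default();
    let x = Tensor::<TrainingBackend, 1>::from_floats([2.0, 3.0], &device).require_grad();

    // y = sum(x * x), so dy/dx = 2x = [4, 6].
    let y = (x.clone() * x.clone()).sum();

    let grads = y.backward();
    println!("{:?}", x.grad(&grads));
}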

Example: setting up a CPU+GPU mixed run in a few lines.

type Backend = Router<(Wgpu, NdArray)>;
let gpu = MultiDevice::B1(WgpuDevice::DiscreteGpu(0));
let cpu = MultiDevice::B2(NdArrayDevice::Cpu);

Training & inference

1. Terminal dashboard

A Ratatui-based TUI shows live metrics, checkpointing, and early stopping—no TensorBoard required.

Dashboard screenshot

2. ONNX import

Bring any ONNX model (exported from TensorFlow, PyTorch, etc.) into Burn:

cargo add burn-import --features onnx

See the ONNX import book section for supported ops.
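
The import itself is usually driven from a build script. Here is a short sketch following the Burn book's ONNX guide (the .onnx path and output directory are placeholders):

// build.rs
use burn_import::onnx::ModelGen;

fn main() {
    // Generates Rust source for the model at build time; the output lands in
    // OUT_DIR and is pulled into your crate with include!.
    ModelGen::new()
        .input("src/model/my_model.onnx")
        .out_dir("model/")
        .run_from_script();
}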

3. PyTorch & Safetensors weights

Load pre-trained weights directly:

use burn::record::FullPrecisionSettings;
use burn_import::pytorch::PyTorchFileRecorder;

let record = PyTorchFileRecorder::<FullPrecisionSettings>::default()
    .load("model.pt".into(), &device)?;
let model = MyModelConfig::new().init(&device).load_record(record);

4. Browser inference

Compile Burn + WGPU → WebAssembly and ship a 2 MB .wasm file.
Try the live MNIST demo—no server round-trip.

5. Embedded (no_std)

NdArray back-end runs on bare-metal ARM Cortex-M.
Limitation: only CPU inference for now.


Your first model in five minutes

Step 1 — install Rust (once per machine)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Step 2 — create a project

cargo new hello_burn
cd hello_burn
cargo add burn --features wgpu

Step 3 — define a tiny network

use burn::nn;
use burn::module::Module;
use burn::tensor::{backend::Backend, Tensor};

#[derive(Module, Debug)]
pub struct Mlp<B: Backend> {
    linear1: nn::Linear<B>,
    linear2: nn::Linear<B>,
    relu: nn::Relu,
}

impl<B: Backend> Mlp<B> {
    pub fn forward<const D: usize>(&self, x: Tensor<B, D>) -> Tensor<B, D> {
        let x = self.linear1.forward(x);
        let x = self.relu.forward(x);
        self.linear2.forward(x)
    }
}
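
To actually run it, the struct needs a constructor and a main function. Below is a hedged sketch that continues the same file; the 784 → 128 → 10 sizes and the Wgpu back-end are illustrative choices, not part of the example above.

use burn::backend::Wgpu;

impl<B: Backend> Mlp<B> {
    // Hypothetical constructor sized for flattened 28 x 28 MNIST images.
    pub fn new(device: &B::Device) -> Self {
        Self {
            linear1: nn::LinearConfig::new(784, 128).init(device),
            linear2: nn::LinearConfig::new(128, 10).init(device),
            relu: nn::Relu::new(),
        }
    }
}

fn main() {
    let device = Default::default();
    let model = Mlp::<Wgpu>::new(&device);
    let x = Tensor::<Wgpu, 2>::zeros([1, 784], &device);
    let y = model.forward(x);
    println!("output shape: {:?}", y.dims()); // [1, 10]
}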

Step 4 — train on MNIST

git clone https://github.com/tracel-ai/burn
cd burn/examples/mnist
cargo run --release

First run compiles shaders and benchmarks kernels; subsequent runs start immediately.


Moving models in and out of Burn

| From | To | Steps |
|---|---|---|
| ONNX | Burn | burn-import crate, one function call |
| PyTorch .pt | Burn | Load directly with PyTorchFileRecorder (or convert to .safetensors first) |
| Burn | ONNX | Export is planned; currently use your own forward loop |
| Burn | WebAssembly | cargo build --target wasm32-unknown-unknown |

Real examples you can run today

| Example | What you learn | Command |
|---|---|---|
| MNIST training | End-to-end CNN training | cargo run --release |
| Custom CSV dataset | Read any tabular file | cargo run --release |
| Image classification web demo | Browser inference | trunk serve |
| Text classification | Transformer fine-tuning | cargo run --release -- --dataset ag_news |
| Custom WGPU kernel | Write your own GPU op | cargo run --release |

All examples are self-contained; clone and cargo run.


Common questions & answers

Q1: Is Rust hard to learn?

Most engineers pick up the syntax in a weekend. Burn hides the hard parts—ownership errors show up at compile time, not runtime.

Q2: Does Burn work on Windows?

Yes. Vulkan, WGPU, CUDA, and LibTorch back-ends are tested on Windows CI every day.

Q3: Do I need bleeding-edge drivers?

– CUDA: driver ≥ 515
– Vulkan: any driver that supports Vulkan 1.2
– Apple: macOS 12+

Q4: I saved a model last month—will it still load?

Burn 0.14–0.16 can read older snapshots if you enable the `record-backward-compat` feature. Load and re-save once to migrate to the new format.

Q5: How do I write a custom GPU kernel?

Write WGSL in a `.wgsl` file and register it with the WGPU back-end. The [custom-wgpu-kernel](./examples/custom-wgpu-kernel) example has 60 lines of copy-pasteable code.


Where to go next

  1. Read the Burn Book
    https://burn.dev/books/burn/
    Short chapters: Tensors → Modules → Training → Advanced.

  2. Chat with the community
    Discord: https://discord.gg/uPEBbYYDB6
    Ask anything—#beginners channel is active.

  3. Contribute
    See CONTRIBUTING.md in the GitHub repository for setup instructions and open issues.

License

Burn is dual-licensed under MIT and Apache 2.0. Use it in open-source or commercial projects with no strings attached.


Ready to ship your next model in pure Rust?
Clone the repo, run an example, and see the difference.