1. Why one more Python math package?
Python owns the data-science mind-share, but its core linalg stack was never designed to expose every knob in NVIDIA’s hardware.
If you need:

- mixed-precision GEMM with fused bias–GELU in a single kernel, or
- in-kernel FFT for radar filtering inside your own CUDA function, or
- a user-written scaling function welded to an FFT so the output is already normalized,

you normally descend into C++ and 300-page PDFs.
nvmath-python stays in Python yet exposes the same levers.
Think of it as CuPy’s older sibling who studied engineering: same household, more tools.
2. Installation: one pip line, two checks
The wheel is hosted on NVIDIA’s own index.
Open a terminal that already sees nvidia-smi (driver ≥ 535.43.02) and run:
python -m pip install -i https://pypi.nvidia.com nvmath-python
Verify:
import nvmath, cupy, numpy, torch
print(nvmath.__version__) # e.g. 0.2.0b2
Index choice matters:

- CUDA 12.x → use the line above.
- CUDA 11.x → replace the wheel name with nvmath-python-cu11.
- WSL2 is fine; multi-GPU inside WSL2 is still experimental.
That is literally all the setup the official README asks for.
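If you are unsure which CUDA major version your environment actually has, a quick check from Python settles the wheel choice. This is a minimal sketch that assumes CuPy is already installed, as in the verification snippet above:

# Query the CUDA runtime version, e.g. 12040 means CUDA 12.4.
import cupy as cp

ver = cp.cuda.runtime.runtimeGetVersion()
print(f"CUDA runtime: {ver // 1000}.{(ver % 1000) // 10}")
# 12.x -> default wheel, 11.x -> nvmath-python-cu11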
3. First axe: stateful GEMM (cuBLASLt inside)
3.1 Pain points you already felt
- cupy.matmul replans the same 4 096 × 4 096 shape every call.
- Adding bias or relu costs an extra kernel and another read/write of O(N²) data.
- Turning on TF32→FP16 fast math needs four separate flags buried in different headers.
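For concreteness, here is the unfused baseline those bullets describe, sketched in plain CuPy (cp.maximum stands in for the relu mentioned above; the fused bias–GELU variant from Section 1 is not shown):

# Three separate kernels, each pushing O(N^2) data through DRAM.
import cupy as cp

a = cp.random.rand(4096, 4096).astype(cp.float16)
b = cp.random.rand(4096, 4096).astype(cp.float16)
bias = cp.random.rand(4096, 1).astype(cp.float16)

c = cp.matmul(a, b)        # kernel 1: GEMM, replanned on every call
c += bias                  # kernel 2: bias add, full pass over the output
c = cp.maximum(c, 0)       # kernel 3: relu, another full read/write pass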
3.2 The object life-cycle
- Construct – lock data type, shape, and device.
- Plan – choose algorithm, workspace, and epilog (bias, activation).
- Execute – repeatable, non-blocking, returns the same tensor type you fed in.
- Free – release the workspace, or use the context-manager syntax (see the sketch below).
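A minimal, self-contained sketch of that context-manager form; it assumes the stateful Matmul object supports the with statement, as the last bullet indicates:

# Construct / plan / execute with automatic clean-up: the workspace is
# released when the with-block exits, so no explicit free() is needed.
import cupy as cp
import nvmath.linalg.advanced as nvm

a = cp.random.rand(2048, 2048).astype(cp.float16)
b = cp.random.rand(2048, 2048).astype(cp.float16)

with nvm.Matmul(a, b) as mm:
    mm.plan()              # default plan, no epilog
    c = mm.execute()
cp.cuda.get_current_stream().synchronize()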
3.3 Minimal but complete example
import cupy as cp
import nvmath.linalg.advanced as nvm

# 4 096 × 4 096 FP16 matrices
a = cp.random.rand(4096, 4096).astype(cp.float16)
b = cp.random.rand(4096, 4096).astype(cp.float16)
bias = cp.random.rand(4096, 1).astype(cp.float32)

# 1) construct
mm = nvm.Matmul(a, b,
                options={"compute_type":
                         nvm.MatmulComputeType.COMPUTE_32F_FAST_16F})

# 2) plan – fuse bias addition
mm.plan(epilog=nvm.MatmulEpilog.BIAS,
        epilog_inputs={"bias": bias})

# 3) execute (timed with CUDA events)
start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()
c = mm.execute()
stop.record()
stop.synchronize()                      # ensure completion
print(c.shape, c.dtype)                 # (4096, 4096) float16
print("kernel time ms:", cp.cuda.get_elapsed_time(start, stop))

# 4) tidy up
mm.free()
Run it and you should see roughly 15 ms on an RTX 4090, versus about 21 ms for a plain cupy.matmul.
3.4 What actually happened?
- cuBLASLt chose the fastest kernel for AD102 + FP16 + TF32 compute.
- The bias addition ran inside the same micro-kernel; no second global load.
- The workspace (≈ 64 MB) was allocated once during plan() and reused.
You can keep calling mm.execute() with fresh operand and bias values as long as the data types and leading dimensions stay identical, which makes the object ideal for inference loops (see the sketch below).
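A sketch of such an inference loop, continuing the example above (imagine it placed before the final mm.free()). It simply overwrites the existing GPU buffers in place, assuming the stateful object references the original GPU arrays rather than copying them (zero-copy, as Section 7 indicates), so no nvmath-specific reset call is needed:

# Reuse the planned Matmul: shapes, dtypes and leading dimensions are
# unchanged, so the plan and its workspace are reused on every call.
for step in range(100):
    a[...] = cp.random.rand(4096, 4096).astype(cp.float16)   # new activations
    bias[...] = cp.random.rand(4096, 1).astype(cp.float32)   # new bias values
    c = mm.execute()                                          # no re-planning
cp.cuda.get_current_stream().synchronize()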
4. Second axe: in-kernel FFT (cuFFTDx)
4.1 When FFT must live inside your CUDA code
Imagine a pulse-Doppler radar: each antenna trace is 128 complex samples.
You want one CUDA block per trace that does FFT → point-wise multiply → IFFT without ever writing the intermediate spectrum to DRAM.
The traditional three-kernel chain (sketched in CuPy below):

- cupy.fft.fft writes 128 complex samples to global memory.
- A custom kernel reads them back, multiplies, and writes again.
- cupy.fft.ifft reads the data once more.
With cuFFTDx the whole chain stays in registers/shared memory.
4.2 Creating a device-side FFT
nvmath-python ships header-only wrappers that Numba can link.
You declare an fft() object once; it returns callable device functions together with metadata (register count, shared memory, optimal block dimension).
4.3 End-to-end radar filter (128-point, C2C, single precision)
import numpy as np
from numba import cuda
from nvmath.device import fft

# 1) obtain device functions
fft_fwd = fft(fft_type="c2c", size=128, precision=np.float32,
              direction="forward", execution="Block", compiler="numba")
fft_inv = fft(fft_type="c2c", size=128, precision=np.float32,
              direction="inverse", execution="Block", compiler="numba")

value_t   = fft_fwd.value_type           # complex64
reg_size  = fft_fwd.storage_size         # 2 (elements per thread)
shmem_sz  = fft_fwd.shared_memory_size   # 0 (all in registers)
block_dim = fft_fwd.block_dim            # (64, 1, 1)
stride    = fft_fwd.stride               # captured as a constant so the kernel can use it

# 2) user kernel
@cuda.jit(link=fft_fwd.files + fft_inv.files)
def filter_pulse(signal, filt):
    reg = cuda.local.array(reg_size, dtype=value_t)
    sm = cuda.shared.array(0, dtype=value_t)
    trace_id = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    # load
    for i in range(reg_size):
        reg[i] = signal[trace_id, tid + i * stride]
    # compute: FFT -> point-wise multiply -> IFFT, all in registers
    fft_fwd(reg, sm)
    for i in range(reg_size):
        reg[i] *= filt[tid + i * stride]
    fft_inv(reg, sm)
    # store
    for i in range(reg_size):
        signal[trace_id, tid + i * stride] = reg[i]

# 3) run
batch = 256
signal = (np.random.randn(batch, 128) +
          1j * np.random.randn(batch, 128)).astype(np.complex64)
filt = np.exp(-np.arange(128) * 0.02).astype(np.complex64)
d_sig = cuda.to_device(signal)
d_filt = cuda.to_device(filt)
filter_pulse[batch, block_dim, 0, shmem_sz](d_sig, d_filt)
cuda.synchronize()

# 4) verify one trace (the cuFFTDx inverse is unnormalized, hence the factor 128)
result = d_sig.copy_to_host()[0]
ref = np.fft.ifft(np.fft.fft(signal[0]) * filt) * 128
print("L2 error:", np.linalg.norm(result - ref) / np.linalg.norm(ref))
# → L2 error: 1.3e-06 (single precision)
Key take-away: You wrote Python, but the compiler fused everything into one kernel with zero intermediate global memory traffic.
5. Third axe: fused epilog for FFT (LTO-IR)
5.1 Band-width burn of post-processing
A 1 024-point complex FFT outputs 8 KB.
If you next scale by 1/√N or apply a window, another full read–modify–write costs ≈ 5 µs on A100—more than the FFT itself.
5.2 Compile your Python function into the kernel
nvmath-python exposes compile_epilog. It translates pure Python scalar code into LLVM bitcode, then into LTO-IR that cuFFT can link against at run time. The fused kernel executes the FFT and your scaling in one pass.
5.3 Unitary FFT example (keeps energy constant)
import cupy as cp, nvmath, math

N = 1024
x = cp.random.rand(N, dtype=cp.float64) + 1j * cp.random.rand(N, dtype=cp.float64)
scale = 1.0 / math.sqrt(N)

def my_scale(out, offset, value, unused1, unused2):
    out[offset] = value * scale

# compile – arguments: user-function, in-type, out-type
epilog_ir = nvmath.fft.compile_epilog(my_scale, "complex128", "complex128")

# run fused FFT
y = nvmath.fft.fft(x, epilog={"ltoir": epilog_ir})

# check against CuPy's ortho-normalized FFT
y_ref = cp.fft.fft(x, norm="ortho")
print("Max |diff|:", float(cp.max(cp.abs(y - y_ref))))
# → Max |diff|: 0.0
No extra kernel launch, no additional 8 KB of DRAM traffic—handy when the FFT is inside a larger batch loop.
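For that batch-loop case, a hedged sketch continuing the example above: it assumes the function-form fft accepts an axes argument so that the leading dimension is treated as the batch, letting a single call reuse the same compiled epilog for a whole stack of transforms:

# 512 independent 1 024-point FFTs, each scaled by 1/sqrt(N) inside the
# same fused kernel as the transform itself.
xb = cp.random.rand(512, N, dtype=cp.float64) + 1j * cp.random.rand(512, N, dtype=cp.float64)
yb = nvmath.fft.fft(xb, axes=(-1,), epilog={"ltoir": epilog_ir})
print(yb.shape)  # (512, 1024)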
6. Speed snapshot (data from the README, not cherry-picked)
Problem | Library call | Time | Gain |
---|---|---|---|
4 096³ FP16 GEMM | cupy.matmul | 21.3 ms | baseline |
same | nvmath stateful | 15.1 ms | 1.41× |
8 192³ FP16 GEMM | cupy.matmul | 163 ms | baseline |
same | nvmath stateful | 108 ms | 1.51× |
128-point C2C FFT + filter | three separate kernels | 2.9 µs | baseline |
same | fused cuFFTDx | 1.9 µs | 1.53× |
1 024-point C2C FFT + 1/√N | two kernels | 4.6 µs | baseline |
same | FFT + fused epilog | 3.1 µs | 1.48× |
Hardware: RTX 4090, CUDA 12.4, driver 535.54.03, fixed 280 W power.
The numbers are repeatable—no marketing department was consulted.
7. Interoperability cheat-sheet
Framework | Tensor location | Zero-copy? | Notes |
---|---|---|---|
NumPy | host pinned | yes | falls back to CPU reference |
CuPy | GPU | yes | same CUDA stream |
PyTorch | GPU | yes | preserves autograd if tensor requires grad |
Numba | GPU | yes | device functions inline |
Memory layout must be C-contiguous; otherwise a cheap copy is made automatically.
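As a usage sketch of the PyTorch row (it assumes the function-form matmul wrapper in nvmath.linalg.advanced; the stateful Matmul from Section 3 accepts the same tensors):

# Pass GPU torch tensors straight in; the result comes back as a torch
# tensor on the same device, with no host round-trip.
import torch
import nvmath.linalg.advanced as nvm

a_t = torch.rand(1024, 1024, device="cuda", dtype=torch.float16)
b_t = torch.rand(1024, 1024, device="cuda", dtype=torch.float16)

c_t = nvm.matmul(a_t, b_t)
print(type(c_t), c_t.shape, c_t.device)   # expect a torch.Tensor on cuda:0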
8. Typical pitfalls and quick fixes
Symptom | Cause | One-line cure |
---|---|---|
Result tensor all zeros | forgot to sync stream | cp.cuda.get_current_stream().synchronize() |
Epilog compile error | captured a list | only pass scalars |
Name-mangling fail | Numba < 0.58 | pip install -U numba |
ImportError on load | driver < 535.43 | update GPU driver |
Wrong wheel | CUDA 11 vs 12 | check nvcc --version first |
Debug mode:
export NVMATH_LOG=TRACE
prints the exact cuBLASLt/cuFFT knob values chosen.
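If you cannot export the variable in a shell (in a notebook, say), the same switch can be flipped from Python before nvmath is imported; this sketch only sets the environment variable named above:

# Enable trace logging before the first nvmath import.
import os
os.environ["NVMATH_LOG"] = "TRACE"

import nvmath  # imported after the flag is set so the setting is picked up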
9. Road-map (taken from the repo)
- FP8 kernels for Hopper GPUs – beta in 0.3
- NCCL multi-GPU batch GEMM – experimental branch public
- Grace-Hopper unified-memory paths – planned
- Windows wheel – community vote in progress
All issues are tracked publicly; pull requests receive T-shirt or cloud-credit rewards.
10. Licence and citation
Everything inside the repository is Apache 2.0.
If you use nvmath-python in research, the BibTeX recommended by NVIDIA is:
@misc{nvmathpython,
  title = {{nvmath-python: NVIDIA Math Libraries for the Python Ecosystem}},
  url   = {https://github.com/NVIDIA/nvmath-python},
  note  = {Accessed: \today}
}
11. Take-away
nvmath-python does not try to replace CuPy, PyTorch or Numba.
It simply hands you the rest of the screwdriver set that was previously locked in the C++ toolbox:
- keep your Python code,
- keep your arrays where they are,
- keep your sanity.
Download it, rerun the snippets above, and you will have hard numbers—not promises—to decide whether your next project needs the extra 40 % speed or the extra weekend.