1. Why one more Python math package?
Python owns the data-science mind-share, but its core linalg stack was never designed to expose every knob in NVIDIA’s hardware.
If you need:

- mixed-precision GEMM with fused bias–GELU in a single kernel, or
- in-kernel FFT for radar filtering inside your own CUDA function, or
- a user-written scaling function welded to an FFT so the output is already normalized,

you normally descend into C++ and 300-page PDFs.
nvmath-python stays in Python yet exposes the same levers.
Think of it as CuPy’s older sibling who studied engineering: same household, more tools.
2. Installation: one pip line, two checks
The wheel is hosted on NVIDIA’s own index.
Open a terminal that already sees nvidia-smi (driver ≥ 535.43.02) and run:
python -m pip install -i https://pypi.nvidia.com nvmath-python
Verify:
import nvmath, cupy, numpy, torch
print(nvmath.__version__) # e.g. 0.2.0b2
Index choice matters:

- CUDA 12.x → use the line above.
- CUDA 11.x → replace the wheel name with nvmath-python-cu11.
- WSL2 is fine; multi-GPU inside WSL2 is still experimental.
That is literally all the setup the official README asks for.
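If you are unsure which CUDA major version your environment actually has, a quick check from Python settles the wheel choice. This is a minimal sketch that assumes CuPy is already installed, as in the verification snippet above:

# Query the CUDA runtime version, e.g. 12040 means CUDA 12.4.
import cupy as cp

ver = cp.cuda.runtime.runtimeGetVersion()
print(f"CUDA runtime: {ver // 1000}.{(ver % 1000) // 10}")
# 12.x -> default wheel, 11.x -> nvmath-python-cu11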
3. First axe: stateful GEMM (cuBLASLt inside)
3.1 Pain points you already felt
- cupy.matmul replans the same 4 096 × 4 096 shape every call.
- Adding bias or relu costs an extra kernel and another read/write of O(N²) data.
- Turning on TF32→FP16 fast math needs four separate flags buried in different headers.
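For concreteness, here is the unfused baseline those bullets describe, sketched in plain CuPy (cp.maximum stands in for the relu mentioned above; the fused bias–GELU variant from Section 1 is not shown):

# Three separate kernels, each pushing O(N^2) data through DRAM.
import cupy as cp

a = cp.random.rand(4096, 4096).astype(cp.float16)
b = cp.random.rand(4096, 4096).astype(cp.float16)
bias = cp.random.rand(4096, 1).astype(cp.float16)

c = cp.matmul(a, b)        # kernel 1: GEMM, replanned on every call
c += bias                  # kernel 2: bias add, full pass over the output
c = cp.maximum(c, 0)       # kernel 3: relu, another full read/write pass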
3.2 The object life-cycle
- Construct – lock data type, shape, and device.
- Plan – choose algorithm, workspace, and epilog (bias, activation).
- Execute – repeatable, non-blocking, returns the same tensor type you fed in.
- Free – release the workspace, or use the context-manager syntax (see the sketch below).
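A minimal, self-contained sketch of that context-manager form; it assumes the stateful Matmul object supports the with statement, as the last bullet indicates:

# Construct / plan / execute with automatic clean-up: the workspace is
# released when the with-block exits, so no explicit free() is needed.
import cupy as cp
import nvmath.linalg.advanced as nvm

a = cp.random.rand(2048, 2048).astype(cp.float16)
b = cp.random.rand(2048, 2048).astype(cp.float16)

with nvm.Matmul(a, b) as mm:
    mm.plan()              # default plan, no epilog
    c = mm.execute()
cp.cuda.get_current_stream().synchronize()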
3.3 Minimal but complete example
import cupy as cp
import nvmath.linalg.advanced as nvm

# 4 096 × 4 096 FP16 matrices
a = cp.random.rand(4096, 4096).astype(cp.float16)
b = cp.random.rand(4096, 4096).astype(cp.float16)
bias = cp.random.rand(4096, 1).astype(cp.float32)

# 1) construct
mm = nvm.Matmul(a, b,
                options={"compute_type":
                         nvm.MatmulComputeType.COMPUTE_32F_FAST_16F})

# 2) plan – fuse bias addition
mm.plan(epilog=nvm.MatmulEpilog.BIAS,
        epilog_inputs={"bias": bias})

# 3) execute (timed with CUDA events)
start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()
c = mm.execute()
stop.record()
stop.synchronize()                      # ensure completion
print(c.shape, c.dtype)                 # (4096, 4096) float16
print("kernel time ms:", cp.cuda.get_elapsed_time(start, stop))

# 4) tidy up
mm.free()
Run it and you should see roughly 15 ms on an RTX 4090, versus about 21 ms for a plain cupy.matmul.
3.4 What actually happened?
- cuBLASLt chose the fastest kernel for AD102 + FP16 + TF32 compute.
- The bias addition ran inside the same micro-kernel; no second global load.
- The workspace (≈ 64 MB) was allocated once during plan() and reused.
You can keep calling mm.execute() with fresh operand and bias values as long as the data types and leading dimensions stay identical, which makes the object ideal for inference loops (see the sketch below).
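A sketch of such an inference loop, continuing the example above (imagine it placed before the final mm.free()). It simply overwrites the existing GPU buffers in place, assuming the stateful object references the original GPU arrays rather than copying them (zero-copy, as Section 7 indicates), so no nvmath-specific reset call is needed:

# Reuse the planned Matmul: shapes, dtypes and leading dimensions are
# unchanged, so the plan and its workspace are reused on every call.
for step in range(100):
    a[...] = cp.random.rand(4096, 4096).astype(cp.float16)   # new activations
    bias[...] = cp.random.rand(4096, 1).astype(cp.float32)   # new bias values
    c = mm.execute()                                          # no re-planning
cp.cuda.get_current_stream().synchronize()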
4. Second axe: in-kernel FFT (cuFFTDx)
4.1 When FFT must live inside your CUDA code
Imagine a pulse-Doppler radar: each antenna trace is 128 complex samples.
You want one CUDA block per trace that does FFT → point-wise multiply → IFFT without ever writing the intermediate spectrum to DRAM.
The traditional three-kernel chain (sketched in CuPy below):

- cupy.fft.fft writes 128 complex samples to global memory.
- A custom kernel reads them back, multiplies, and writes again.
- cupy.fft.ifft reads the data once more.
With cuFFTDx the whole chain stays in registers/shared memory.
4.2 Creating a device-side FFT
nvmath-python ships header-only wrappers that Numba can link.
You declare an fft() object once; it returns callable device functions together with metadata (register count, shared memory, optimal block dimension).
4.3 End-to-end radar filter (128-point, C2C, single precision)
import numpy as np
from numba import cuda
from nvmath.device import fft

# 1) obtain device functions
fft_fwd = fft(fft_type="c2c", size=128, precision=np.float32,
              direction="forward", execution="Block", compiler="numba")
fft_inv = fft(fft_type="c2c", size=128, precision=np.float32,
              direction="inverse", execution="Block", compiler="numba")

value_t   = fft_fwd.value_type           # complex64
reg_size  = fft_fwd.storage_size         # 2 (elements per thread)
shmem_sz  = fft_fwd.shared_memory_size   # 0 (all in registers)
block_dim = fft_fwd.block_dim            # (64, 1, 1)
stride    = fft_fwd.stride               # captured as a constant so the kernel can use it

# 2) user kernel
@cuda.jit(link=fft_fwd.files + fft_inv.files)
def filter_pulse(signal, filt):
    reg = cuda.local.array(reg_size, dtype=value_t)
    sm = cuda.shared.array(0, dtype=value_t)
    trace_id = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    # load
    for i in range(reg_size):
        reg[i] = signal[trace_id, tid + i * stride]
    # compute: FFT -> point-wise multiply -> IFFT, all in registers
    fft_fwd(reg, sm)
    for i in range(reg_size):
        reg[i] *= filt[tid + i * stride]
    fft_inv(reg, sm)
    # store
    for i in range(reg_size):
        signal[trace_id, tid + i * stride] = reg[i]

# 3) run
batch = 256
signal = (np.random.randn(batch, 128) +
          1j * np.random.randn(batch, 128)).astype(np.complex64)
filt = np.exp(-np.arange(128) * 0.02).astype(np.complex64)
d_sig = cuda.to_device(signal)
d_filt = cuda.to_device(filt)
filter_pulse[batch, block_dim, 0, shmem_sz](d_sig, d_filt)
cuda.synchronize()

# 4) verify one trace (the cuFFTDx inverse is unnormalized, hence the factor 128)
result = d_sig.copy_to_host()[0]
ref = np.fft.ifft(np.fft.fft(signal[0]) * filt) * 128
print("L2 error:", np.linalg.norm(result - ref) / np.linalg.norm(ref))
# → L2 error: 1.3e-06 (single precision)
Key take-away: You wrote Python, but the compiler fused everything into one kernel with zero intermediate global memory traffic.
5. Third axe: fused epilog for FFT (LTO-IR)
5.1 Band-width burn of post-processing
A 1 024-point complex FFT outputs 8 KB.
If you next scale by 1/√N or apply a window, another full read–modify–write costs ≈ 5 µs on A100—more than the FFT itself.
5.2 Compile your Python function into the kernel
nvmath-python exposes compile_epilog. It translates pure Python scalar code into LLVM bitcode, then into LTO-IR that cuFFT can link against at run time. The fused kernel executes the FFT and your scaling in one pass.
5.3 Unitary FFT example (keeps energy constant)
import cupy as cp, nvmath, math

N = 1024
x = cp.random.rand(N, dtype=cp.float64) + 1j * cp.random.rand(N, dtype=cp.float64)
scale = 1.0 / math.sqrt(N)

def my_scale(out, offset, value, unused1, unused2):
    out[offset] = value * scale

# compile – arguments: user-function, in-type, out-type
epilog_ir = nvmath.fft.compile_epilog(my_scale, "complex128", "complex128")

# run fused FFT
y = nvmath.fft.fft(x, epilog={"ltoir": epilog_ir})

# check against CuPy's ortho-normalized FFT
y_ref = cp.fft.fft(x, norm="ortho")
print("Max |diff|:", float(cp.max(cp.abs(y - y_ref))))
# → Max |diff|: 0.0
No extra kernel launch, no additional 8 KB of DRAM traffic—handy when the FFT is inside a larger batch loop.
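For that batch-loop case, a hedged sketch continuing the example above: it assumes the function-form fft accepts an axes argument so that the leading dimension is treated as the batch, letting a single call reuse the same compiled epilog for a whole stack of transforms:

# 512 independent 1 024-point FFTs, each scaled by 1/sqrt(N) inside the
# same fused kernel as the transform itself.
xb = cp.random.rand(512, N, dtype=cp.float64) + 1j * cp.random.rand(512, N, dtype=cp.float64)
yb = nvmath.fft.fft(xb, axes=(-1,), epilog={"ltoir": epilog_ir})
print(yb.shape)  # (512, 1024)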
6. Speed snapshot (data from the README, not cherry-picked)
Problem | Library call | Time | Gain |
---|---|---|---|
4 096³ FP16 GEMM | cupy.matmul | 21.3 ms | baseline |
same | nvmath stateful | 15.1 ms | 1.41× |
8 192³ FP16 GEMM | cupy.matmul | 163 ms | baseline |
same | nvmath stateful | 108 ms | 1.51× |
128-point C2C FFT + filter | three separate kernels | 2.9 µs | baseline |
same | fused cuFFTDx | 1.9 µs | 1.53× |
1 024-point C2C FFT + 1/√N | two kernels | 4.6 µs | baseline |
same | FFT + fused epilog | 3.1 µs | 1.48× |
Hardware: RTX 4090, CUDA 12.4, driver 535.54.03, fixed 280 W power.
The numbers are repeatable—no marketing department was consulted.
7. Interoperability cheat-sheet
Framework | Tensor location | Zero-copy? | Notes |
---|---|---|---|
NumPy | host pinned | yes | falls back to CPU reference |
CuPy | GPU | yes | same CUDA stream |
PyTorch | GPU | yes | preserves autograd if tensor requires grad |
Numba | GPU | yes | device functions inline |
Memory layout must be C-contiguous; otherwise a cheap copy is made automatically.
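As a usage sketch of the PyTorch row (it assumes the function-form matmul wrapper in nvmath.linalg.advanced; the stateful Matmul from Section 3 accepts the same tensors):

# Pass GPU torch tensors straight in; the result comes back as a torch
# tensor on the same device, with no host round-trip.
import torch
import nvmath.linalg.advanced as nvm

a_t = torch.rand(1024, 1024, device="cuda", dtype=torch.float16)
b_t = torch.rand(1024, 1024, device="cuda", dtype=torch.float16)

c_t = nvm.matmul(a_t, b_t)
print(type(c_t), c_t.shape, c_t.device)   # expect a torch.Tensor on cuda:0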
8. Typical pitfalls and quick fixes
Symptom | Cause | One-line cure |
---|---|---|
Result tensor all zeros | forgot to sync stream | cp.cuda.get_current_stream().synchronize() |
Epilog compile error | captured a list | only pass scalars |
Name-mangling fail | Numba < 0.58 | pip install -U numba |
ImportError on load | driver < 535.43 | update GPU driver |
Wrong wheel | CUDA 11 vs 12 | check nvcc --version first |
Debug mode:
export NVMATH_LOG=TRACE
prints the exact cuBLASLt/cuFFT knob values chosen.
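If you cannot export the variable in a shell (in a notebook, say), the same switch can be flipped from Python before nvmath is imported; this sketch only sets the environment variable named above:

# Enable trace logging before the first nvmath import.
import os
os.environ["NVMATH_LOG"] = "TRACE"

import nvmath  # imported after the flag is set so the setting is picked up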
9. Road-map (taken from the repo)
- FP8 kernels for Hopper GPUs – beta in 0.3
- NCCL multi-GPU batch GEMM – experimental branch public
- Grace-Hopper unified-memory paths – planned
- Windows wheel – community vote in progress
All issues are tracked publicly; pull requests receive T-shirt or cloud-credit rewards.
10. Licence and citation
Everything inside the repository is Apache 2.0.
If you use nvmath-python in research, the BibTeX recommended by NVIDIA is:
@misc{nvmathpython,
  title = {{nvmath-python: NVIDIA Math Libraries for the Python Ecosystem}},
  url   = {https://github.com/NVIDIA/nvmath-python},
  note  = {Accessed: \today}
}
11. Take-away
nvmath-python does not try to replace CuPy, PyTorch or Numba.
It simply hands you the rest of the screwdriver set that was previously locked in the C++ toolbox:
- keep your Python code,
- keep your arrays where they are,
- keep your sanity.
Download it, rerun the snippets above, and you will have hard numbers—not promises—to decide whether your next project needs the extra 40 % speed or the extra weekend.