How to Calculate the Number of GPUs Needed to Deploy a Large Language Model (LLM): A Step-by-Step Guide

In the realm of AI, deploying large language models (LLMs) like Gemma-3, LLaMA, or Qwen demands more than picking a GPU at random. It requires mathematical precision, an understanding of transformer architecture, and hardware profiling. This article walks through the math, code, and interpretation needed to determine how many GPUs a given LLM deployment requires, taking performance benchmarks, FLOPs, memory constraints, and concurrency into account.

What Affects Deployment Requirements?

The cost of serving an LLM during inference primarily depends on several factors:

  • Model architecture: The number of layers (L), hidden size (D), attention heads (H), and sequence length (S) all impact the model’s computational and memory requirements.
  • Quantization precision: Different precisions like FP16, INT8, or Q6_K affect both the model’s size and its computational needs.
  • User concurrency: The number of simultaneous user queries directly influences the memory and computational resources required.
  • Latency target: How quickly tokens must be delivered (time to first token, or a per-user tokens-per-second target) determines how much compute has to be provisioned.
  • GPU throughput: The TFLOPs and memory per GPU determine how efficiently the model can be served.

Mathematical Model of FLOPs and Memory

To calculate the FLOPs per token:

FLOPs_per_token = L × (12 × D² + 2 × D × S)
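As a quick sanity check, plug in the placeholder dimensions used in the script below (L = 24, D = 2048, S = 512):

FLOPs_per_token = 24 × (12 × 2048² + 2 × 2048 × 512) ≈ 1.26 × 10⁹

At a target of 1.5 tokens/sec per user and 100 concurrent users, that works out to roughly 1.26 × 10⁹ × 1.5 × 100 ≈ 0.19 TFLOPs of sustained compute.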

The total number of model parameters (weights) is approximately:

Params = 12 × L × D²
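With the same placeholder dimensions, Params = 12 × 24 × 2048² ≈ 1.21 × 10⁹, or about 2.25 GB of weights at FP16 (2 bytes per parameter). Note that these dimensions are illustrative stand-ins, not the published Gemma 3 27B configuration; always take L, D, H, and S from the model card of the model you actually deploy.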

The KV cache (attention keys and values, hence the factor of 2 below) grows linearly with concurrency and sequence length:

KV_Memory = L × H × (D/H) × S × 2 × Bytes × concurrency

Finally, add roughly 10% overhead on top of weights plus KV cache to cover activations, framework buffers, and memory fragmentation.
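Continuing the example at FP16 (2 bytes) with S = 512 and 100 concurrent users:

KV_Memory = 24 × 32 × 64 × 512 × 2 × 2 × 100 ≈ 9.4 GB

Adding the weights (≈ 2.25 GB) and the 10% overhead gives roughly (2.25 + 9.4) × 1.10 ≈ 12.8 GB in total.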

Python Code for Estimation

Here’s a Python script that combines these calculations to estimate GPU requirements:

import math

MODEL_SPECS = {
    "Gemma 3 27B": {"L": 24, "D": 2048, "S": 512, "H": 32},
    # Add more models as needed
}

GPU_DATABASE = {
    "L40S (48 GB)": {"memory_gb": 48, "fp16_tflops": 40},
    "H100 SXM 80 GB": {"memory_gb": 80, "fp16_tflops": 100},
    "B200 192 GB": {"memory_gb": 192, "fp16_tflops": 150},
    # Add more GPUs as needed
}

BYTES_PER_PARAM = {
    "fp16": 2,
    "int8": 1,
    # Add more precisions as needed
}

PRECISION_TO_GPU_KEY = {
    "fp16": "fp16_tflops",
}

def estimate_gpu_requirements(model_name, concurrency, time_to_first_token, precision):
    L, D, S, H = (MODEL_SPECS[model_name][k] for k in ("L", "D", "S", "H"))
    dtype_bytes = BYTES_PER_PARAM[precision]

    # Compute demand: FLOPs per token x tokens/sec per user x concurrent users
    flops_tok = L * (12 * D**2 + 2 * D * S)
    tf_needed = flops_tok * (1 / time_to_first_token) * concurrency / 1e12

    # Weight memory: 12 * L * D^2 parameters at dtype_bytes each
    params = 12 * L * D**2
    w_gb = params * dtype_bytes / 1024**3

    # KV cache: keys + values for every layer, head, and cached position, per user
    d_head = D / H
    kv_gb = (L * H * d_head * S * 2 * dtype_bytes * concurrency) / 1024**3
    tot_gb = (w_gb + kv_gb) * 1.10  # 10% overhead for runtime buffers

    results = []
    for gpu, spec in GPU_DATABASE.items():
        peak = spec.get(PRECISION_TO_GPU_KEY[precision])
        if peak is None:
            continue

        cards_mem_w = math.ceil(w_gb / spec["memory_gb"])    # cards to hold weights only
        cards_mem_t = math.ceil(tot_gb / spec["memory_gb"])  # cards to hold weights + KV + overhead
        cards_comp = tf_needed / peak                        # cards to meet the compute target

        req_w = max(cards_mem_w, cards_comp)
        req_t = max(cards_mem_t, cards_comp)

        results.append(
            (gpu,
             round(req_t, 1), round(req_w, 1),
             round(w_gb, 1), round(kv_gb, 1), round(tot_gb, 1))
        )
    return results

if __name__ == "__main__":
    model = "Gemma 3 27B"
    concurrency = 100
    t_first = 1 / 1.5  # seconds per token, i.e. a 1.5 tok/s per-user target
    precision = "fp16"

    rows = estimate_gpu_requirements(model, concurrency, t_first, precision)
    print(f"\n── {model} · {precision.upper()} ─────────────────────────────────")
    for (gpu, req_t, req_w, w_gb, kv_gb, tot_gb) in rows:
        print(f"{gpu:<18} | GPUs Tot:{req_t:>4}  Wt:{req_w:>4} "
              f"| Weights:{w_gb:>5} GB  KV:{kv_gb:>6} GB  Tot:{tot_gb:>6} GB")

Interpreting the Output

The output provides valuable insights:

  • Tot: The number of GPUs needed considering both total memory and compute requirements.
  • Wt: The number of GPUs needed if only the model weights (no KV cache or overhead) had to fit in memory, still subject to the compute target.
  • Lower precision shrinks the weight and KV-cache footprint and typically raises per-GPU throughput, so fewer GPUs are required (see the sketch after this list).
  • Larger GPUs (like B200, H200) can serve massive models at high concurrency with fewer cards.
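To make the precision effect concrete, the tables above can be extended with an INT8 path. The snippet below is a minimal sketch that assumes the script above has already been run; the int8_tops values are illustrative placeholders, not vendor specifications, so substitute the figures from your GPU's datasheet.

# Sketch: add an INT8 entry per GPU and map the precision to it.
# The int8_tops numbers below are placeholders, not official specs.
GPU_DATABASE["L40S (48 GB)"]["int8_tops"] = 80
GPU_DATABASE["H100 SXM 80 GB"]["int8_tops"] = 200
GPU_DATABASE["B200 192 GB"]["int8_tops"] = 300
PRECISION_TO_GPU_KEY["int8"] = "int8_tops"

# BYTES_PER_PARAM already maps "int8" to 1 byte, so the estimated weight and
# KV-cache memory roughly halve compared with FP16.
for gpu, req_t, req_w, w_gb, kv_gb, tot_gb in estimate_gpu_requirements(
    "Gemma 3 27B", 100, 1 / 1.5, "int8"
):
    print(f"{gpu:<18} | GPUs Tot:{req_t:>4} | Total Mem:{tot_gb:>6} GB")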

Advanced Options and Considerations

  • Adjust the concurrency parameter based on your service scale (a quick sweep is sketched after this list).
  • Swap models from MODEL_SPECS to evaluate different LLMs.
  • Consider quantization techniques like Q6_K for deployment on edge GPUs.
  • Always add a buffer (e.g., 10-20%) to your calculated memory estimates to account for unexpected overheads.
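For example, a minimal concurrency sweep that reuses estimate_gpu_requirements from the script above shows how quickly the KV cache starts to dominate the memory budget:

# Sketch: sweep concurrency levels with the function and tables defined above.
for users in (10, 50, 100, 500):
    for gpu, req_t, req_w, w_gb, kv_gb, tot_gb in estimate_gpu_requirements(
        "Gemma 3 27B", users, 1 / 1.5, "fp16"
    ):
        print(f"{users:>4} users | {gpu:<18} | GPUs:{req_t:>5} | KV:{kv_gb:>7} GB")

Because the weights are fixed while the KV cache scales linearly with users, the GPU count eventually becomes memory-bound rather than compute-bound.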

Conclusion

Deploying LLMs efficiently requires balancing compute (FLOPs), KV-cache memory, and precision. Lower precision can significantly reduce memory and compute requirements, lowering costs. High-end GPUs like GB200, B200, or H100 SXM can handle massive models at high concurrency. Use this method to choose the most cost-effective configuration that meets your SLA. This approach isn't just about deployment; it's about budgeting, benchmarking, and optimizing real-world AI systems. Mastering these calculations helps keep your LLM infrastructure both cost-efficient and future-proof.