FOP Optimizer Revolution: Scaling Neural Network Training to Batch Sizes of 32,768 with a 5x Speed Boost

高效码农

FOP Optimizer: Enhancing Large-Scale Neural Network Training Efficiency

1. Background and Challenges

Deep learning faces significant efficiency challenges as models and datasets grow. Modern GPUs, despite their computational power, struggle with traditional optimization methods when handling massive training batches.

1.1 Large-Batch Training Problems

• Reduced Gradient Noise: First-order optimizers such as SGD and AdamW rely on gradient noise to explore the loss landscape. Large batches produce more deterministic gradients, which limits that exploration.
• Second-Order Method Instability: Kronecker-Factored Approximate Curvature (KFAC) methods require excessive damping coefficients at large batch sizes; the damping drowns out the curvature information, so the update effectively degrades to plain gradient descent (a numerical sketch of this effect follows below).

1.2 Typical Failure Scenario …
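The damping effect noted in 1.1 can be seen in a few lines of NumPy. The sketch below is illustrative only, not FOP or KFAC code: the curvature matrix F, the gradient g, and the damping values are made-up stand-ins. It shows that as the damping coefficient grows, the damped second-order step (F + λI)^(-1) g aligns with the raw gradient, i.e., the curvature information is effectively discarded.

```python
import numpy as np

# Illustrative only: a tiny random SPD matrix stands in for one layer's
# curvature estimate, and a random vector stands in for its gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
F = A @ A.T + 1e-3 * np.eye(4)   # symmetric positive-definite curvature proxy
g = rng.standard_normal(4)       # gradient proxy

def damped_natural_gradient(F, g, lam):
    """Return (F + lam * I)^-1 @ g, the damped second-order update direction."""
    return np.linalg.solve(F + lam * np.eye(F.shape[0]), g)

for lam in (1e-2, 1e0, 1e2, 1e4):
    step = damped_natural_gradient(F, g, lam)
    # Cosine similarity between the damped update and the raw gradient.
    cos = step @ g / (np.linalg.norm(step) * np.linalg.norm(g))
    print(f"lam={lam:>8.0e}  cos(step, g)={cos:.4f}")

# As lam grows, (F + lam*I)^-1 approaches I/lam, so the update direction
# converges to the plain gradient: the preconditioner no longer contributes.
```

Running the loop shows the cosine similarity climbing toward 1.0 as the damping increases, which is the sense in which a heavily damped second-order method behaves like simple gradient descent.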