How Much Do Language Models Really Remember? The 3.6 Bits/Parameter Revelation

Groundbreaking research reveals that GPT-style architectures store roughly 3.6 bits per parameter. Once the training data exceeds that capacity, models shift from rote memorization toward genuine generalization.

Core Discoveries at a Glance

- Quantified memory capacity: GPT models average about 3.6 bits per parameter under half-precision training.
- Dual-phase phenomenon: once the data surpasses model capacity, unintended memorization decreases while generalization surges.
- Text vs. random data: training on real text yields 15-20% lower memorization than training on random data.
- Scaling law: membership-inference success scales with the ratio of model capacity to dataset size (a back-of-the-envelope sketch of this ratio follows at the end of this section).

I. The Fundamental Industry Dilemma

When 8-billion-parameter models (like Dubey et al., 2024) …
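To make the capacity arithmetic concrete, here is a minimal Python sketch of the ratio referenced in the scaling-law bullet above. Only the 3.6 bits/parameter figure comes from the research summarized here; the function names (`model_capacity_bits`, `capacity_to_data_ratio`), the example parameter and token counts, and the bits-per-token figure are illustrative assumptions, not values from the paper.

```python
# Back-of-the-envelope capacity estimate based on the ~3.6 bits/parameter figure.
# Everything except that figure (parameter count, dataset size, bits per token)
# is an illustrative assumption.

BITS_PER_PARAM = 3.6  # measured average for GPT-style models (half precision)


def model_capacity_bits(num_params: float) -> float:
    """Total estimated storage capacity of the model, in bits."""
    return num_params * BITS_PER_PARAM


def capacity_to_data_ratio(num_params: float,
                           dataset_tokens: float,
                           bits_per_token: float = 16.0) -> float:
    """Ratio of model capacity to dataset information content.

    bits_per_token is an assumed rough figure for the information content
    of a training token; the scaling law above ties membership-inference
    success to this kind of capacity/dataset-size ratio.
    """
    dataset_bits = dataset_tokens * bits_per_token
    return model_capacity_bits(num_params) / dataset_bits


if __name__ == "__main__":
    params = 8e9      # e.g. an 8B-parameter model (illustrative)
    tokens = 15e12    # e.g. a 15T-token corpus (illustrative)
    print(f"Estimated capacity: {model_capacity_bits(params):.3e} bits")
    print(f"Capacity / dataset-size ratio: {capacity_to_data_ratio(params, tokens):.2e}")
    # A ratio far below 1 means the data vastly exceeds capacity: the regime
    # in which the findings above predict generalization rather than
    # rote memorization.
```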