Understanding Deep Learning Requires Rethinking Generalization


  • Remarkably neural networks overfit little
  • This is normally attributed to properties of the model family or good regularization
  • Experiments in this paper show how these approaches fail to explain why nn’s generalize well
  • Random label experiments, or noise label Experiments


  • Randomization tests from statistics: deep neural networks easily fit random labels
  • Specifically, training error easily gets zero, test error random chance because there is no correlation between training and test labels

Statistical implications

  • The nn capacity is large enough to simply memorize the dataset
  • Optimization on random labels is easy
  • Randomization of the labels is simply a data transformation, leaves all other properties of the learning problem unchanged (Hmmmmm???)

More experiments

  • Also replaced the input images with random Gaussian pixels and the neural net can still learn the data with zero training error.
  • Interpolating between noise and signal shows a gradual deterioration of the generalization
  • This means neural networks fit both the signal and the noise
  • This rules out VC-dimension, Rademacher complexity, and uniform stability as possible explanations for the generalization performance of state of the art neural networks

The role of explicit regularization

  • Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error
  • In classical convex empirical risk minimization regularization is necessary to rule out trivial solutions
  • In neural nets it plays a rather different Role
  • The absence of regularization does not imply poor generalization

Finite sample expressivity

  • A theoretical construction of a two layer RELU network with p=2n+d parameters where n is the sample size and d is the input dimension that can express any labelling of any sample

The role of implicit regularization

  • This appeals to e.g. SGD
  • This implicit regularization is shown for linear models in which SGD always corresponds to a solution of small norm.


  • The classical view of machine learning rests on the idea of extracting a low dimensional pattern from the data.
  • Memorization is not thought of as a good problem solver, yet it can be effective in some problem solving
  • This paper shows that many neural networks have the capacity to simply memorize the data.
  • It is likely that in deep neural networks learning is part extracting patterns and part memorization.
  • Therefore, classical approaches are not suited to analyze deep neural networks.
  • This paper argues that these findings show that we need to rethink generalization to understand deep neural networks