Understanding Deep Learning Requires Rethinking Generalization
- Remarkably, large neural networks overfit surprisingly little in practice
- This is conventionally attributed to properties of the model family or to regularization
- The experiments in this paper show that these conventional explanations fail to account for why neural networks generalize well
- Random-label experiments (noise-label experiments)
- Randomization tests from statistics: deep neural networks easily fit random labels
- Specifically, training error easily reaches zero while test error is at the level of random chance, since there is no correlation between training and test labels
- The networks' capacity is large enough to simply memorize the entire dataset
- Optimization on random labels is easy
- Randomizing the labels is merely a data transformation; all other properties of the learning problem are left unchanged (is that entirely true?)
- The authors also replaced the input images with random Gaussian pixels, and the network still fits the data with zero training error
- Interpolating between noise and signal (varying the fraction of corrupted labels) shows a gradual deterioration of generalization
- This means neural networks fit both the signal and the noise
- This rules out VC dimension, Rademacher complexity, and uniform stability as explanations for the generalization performance of state-of-the-art neural networks
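A toy sketch of the randomization-test logic: below, a learner that does nothing but memorize (a lookup table standing in for the high-capacity network; the paper of course trains real CNNs on real images) reaches zero training error on random labels while its test accuracy stays at chance, since the labels carry no transferable signal.

```python
import random

def randomization_test(n_train=200, n_test=200, n_classes=10):
    """Sketch of the randomization-test protocol with a trivially
    memorizing learner standing in for a deep net (an assumption:
    the paper trains actual CNNs; only the protocol is shown here)."""
    random.seed(0)
    # each example is just an id; labels are pure noise
    train = [(i, random.randrange(n_classes)) for i in range(n_train)]
    test = [(i + n_train, random.randrange(n_classes)) for i in range(n_test)]
    memo = dict(train)  # "training" = memorize every (input, label) pair

    def predict(x):
        return memo.get(x, 0)  # unseen inputs: constant guess

    train_acc = sum(predict(x) == y for x, y in train) / n_train
    test_acc = sum(predict(x) == y for x, y in test) / n_test
    return train_acc, test_acc
```

Running this yields perfect training accuracy and roughly chance-level (about 1/10) test accuracy, mirroring the paper's zero-training-error / random-chance-test-error observation.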
The role of explicit regularization
- Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error
- In classical convex empirical risk minimization, regularization is necessary to rule out trivial solutions
- In neural networks it plays a rather different role
- The absence of regularization does not imply poor generalization
Finite sample expressivity
- A theoretical construction of a two-layer ReLU network with p = 2n + d parameters, where n is the sample size and d the input dimension, that can express any labelling of any sample of size n
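The construction can be made concrete. The sketch below is my own implementation of the standard interpolation argument, not the paper's exact proof: build f(x) = Σ_j w_j · ReLU(a·x − b_j), where a random projection a gives n distinct scalar values (d parameters), the biases b_j sit between consecutive sorted projections (n parameters), and the weights w solve a lower-triangular system (n parameters), for 2n + d in total.

```python
import random

def fit_two_layer_relu(X, y):
    """Construct f(x) = sum_j w_j * relu(a.x - b_j) interpolating
    arbitrary labels y on samples X (assumes distinct projections,
    which a random a gives with probability 1)."""
    n, d = len(X), len(X[0])
    a = [random.gauss(0, 1) for _ in range(d)]          # d parameters
    z = [sum(ai * xi for ai, xi in zip(a, x)) for x in X]
    order = sorted(range(n), key=lambda i: z[i])
    zs = [z[i] for i in order]
    ys = [y[i] for i in order]
    # biases: unit j is active exactly on sorted samples j, j+1, ..., n-1
    b = [zs[0] - 1.0] + [(zs[j - 1] + zs[j]) / 2 for j in range(1, n)]
    # forward-substitute the lower-triangular system
    # sum_{j<=i} w_j * (zs[i] - b[j]) = ys[i]
    w = []
    for i in range(n):
        acc = sum(w[j] * (zs[i] - b[j]) for j in range(i))
        w.append((ys[i] - acc) / (zs[i] - b[i]))
    def f(x):
        s = sum(ai * xi for ai, xi in zip(a, x))
        return sum(wj * max(0.0, s - bj) for wj, bj in zip(w, b))
    return f
```

Because the bias of unit j lies strictly between the (j−1)-th and j-th sorted projections, the system is triangular with positive diagonal, so the interpolating weights always exist.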
The role of implicit regularization
- This refers to regularization performed implicitly by the training algorithm itself, e.g. SGD
- The paper demonstrates this implicit regularization for linear models, where SGD always converges to the minimum-norm solution
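The minimum-norm claim is easy to check on a tiny underdetermined least-squares problem. The sketch below uses full-batch gradient descent rather than SGD (a simplification; the argument is the same) started from w = 0: the iterates stay in the row space of the data, so the limit matches the closed-form minimum-norm interpolant Xᵀ(XXᵀ)⁻¹y.

```python
def min_norm_check():
    """Gradient descent on an underdetermined linear system from a
    zero init converges to the minimum-norm solution (toy example:
    2 equations, 3 unknowns; full-batch GD stands in for SGD)."""
    X = [[1.0, 2.0, 0.0],
         [0.0, 1.0, 1.0]]
    y = [1.0, 2.0]
    w = [0.0, 0.0, 0.0]
    lr = 0.05
    for _ in range(5000):
        r = [sum(Xi[k] * w[k] for k in range(3)) - yi
             for Xi, yi in zip(X, y)]                       # residuals
        g = [2 * sum(X[i][k] * r[i] for i in range(2))      # gradient of
             for k in range(3)]                             # ||Xw - y||^2
        w = [wk - lr * gk for wk, gk in zip(w, g)]
    # closed-form minimum-norm interpolant: w* = X^T (X X^T)^{-1} y
    G = [[sum(X[i][k] * X[j][k] for k in range(3)) for j in range(2)]
         for i in range(2)]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    Ginv = [[G[1][1] / det, -G[0][1] / det],
            [-G[1][0] / det, G[0][0] / det]]
    alpha = [sum(Ginv[i][j] * y[j] for j in range(2)) for i in range(2)]
    wstar = [sum(X[i][k] * alpha[i] for i in range(2)) for k in range(3)]
    return w, wstar
```

The GD iterate and the closed-form minimum-norm solution agree to high precision, even though infinitely many other interpolating solutions exist.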
- The classical view of machine learning rests on the idea of extracting a low dimensional pattern from the data.
- Memorization is usually not regarded as a viable learning strategy, yet it can be effective for some problems
- This paper shows that many neural networks have the capacity to simply memorize the data.
- It is likely that learning in deep neural networks is part pattern extraction and part memorization
- Therefore, classical generalization analyses are not well suited to deep neural networks
- This paper argues that these findings show that we need to rethink generalization to understand deep neural networks