## Understanding Deep Learning Requires Rethinking Generalization

## Overview

- Remarkably, large neural networks generalize well despite having far more parameters than training samples
- This is usually attributed to properties of the model family or to good regularization
- The experiments in this paper show that these explanations fail to account for why neural networks generalize well
- The main tool: random-label (noisy-label) experiments

## Contributions

- Randomization tests from statistics: *deep neural networks easily fit random labels*. Training error easily reaches zero, while test error is at chance level, because there is no correlation between the randomized training labels and the test labels
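A toy version of this test can be sketched in a few lines (everything here is illustrative, not the paper's setup): a two-layer ReLU network with a frozen random first layer and a least-squares output layer, wide enough to interpolate, fits random ±1 labels perfectly while test accuracy stays at chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 40, 10, 400          # samples, input dim, hidden units (width >> n)

# Random inputs and *random* labels: there is no signal to learn.
X_train = rng.standard_normal((n, d))
y_train = rng.choice([-1.0, 1.0], size=n)
X_test = rng.standard_normal((n, d))
y_test = rng.choice([-1.0, 1.0], size=n)

# Two-layer ReLU net: frozen random first layer, output weights by least squares.
W = rng.standard_normal((d, width))
H_train = np.maximum(X_train @ W, 0.0)
v, *_ = np.linalg.lstsq(H_train, y_train, rcond=None)

train_acc = np.mean(np.sign(H_train @ v) == y_train)
test_acc = np.mean(np.sign(np.maximum(X_test @ W, 0.0) @ v) == y_test)
print(train_acc)   # 1.0: the network memorizes the random labels
print(test_acc)    # near 0.5: chance level, since the labels carry no information
```

Because the hidden-feature matrix has full row rank, least squares interpolates the labels exactly, which is the "training error easily gets zero" half of the test.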

#### Statistical implications

- The effective capacity of neural networks is large enough to simply memorize the entire dataset
- Optimization remains easy even on random labels
- Randomizing the labels is purely a data transformation; it leaves all other properties of the learning problem unchanged

#### More experiments

- Replacing the input images with random Gaussian pixels, the network still fits the data with zero training error
- Interpolating between true labels and noise shows a gradual deterioration of generalization
- This means neural networks fit both the signal and the noise
- This rules out VC dimension, Rademacher complexity, and uniform stability as explanations for the generalization performance of state-of-the-art neural networks
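The interpolation experiment can be sketched under simplified assumptions (a linear teacher and a random-features ReLU model stand in for real data and a deep net; `w_teacher`, `experiment`, and all sizes are my own inventions): the model always reaches zero training error, while test accuracy degrades as the fraction of corrupted labels grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, width, n_test = 200, 10, 1000, 2000
w_teacher = rng.standard_normal(d)          # hypothetical ground-truth direction

def experiment(corrupt_frac):
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_teacher)              # true labels...
    flip = rng.random(n) < corrupt_frac     # ...a fraction replaced by noise
    y[flip] = rng.choice([-1.0, 1.0], size=int(flip.sum()))

    # Wide random-features ReLU model; least squares interpolates the labels.
    W = rng.standard_normal((d, width))
    v, *_ = np.linalg.lstsq(np.maximum(X @ W, 0.0), y, rcond=None)

    X_test = rng.standard_normal((n_test, d))
    pred = np.sign(np.maximum(X_test @ W, 0.0) @ v)
    return np.mean(pred == np.sign(X_test @ w_teacher))

accs = [experiment(f) for f in (0.0, 0.5, 1.0)]
print(accs)    # test accuracy falls toward chance as label noise grows
```

Training error is zero at every corruption level, so only the test curve moves, mirroring the paper's observation that the network fits signal and noise alike.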

#### The role of explicit regularization

- Explicit regularization may improve generalization, but it is neither necessary nor by itself sufficient for controlling generalization error
- In classical convex empirical risk minimization, regularization is necessary to rule out trivial solutions
- In neural networks it plays a rather different role
- The absence of regularization does not imply poor generalization

#### Finite sample expressivity

- A theoretical construction of a two-layer ReLU network with p = 2n + d parameters, where n is the sample size and d the input dimension, that can express any labelling of any sample of size n

#### The role of implicit regularization

- This appeals to the training algorithm itself, e.g. SGD
- Implicit regularization is shown for linear models, where SGD always converges to a solution of small (minimum) norm
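A minimal sketch of that linear-model result (full-batch gradient descent stands in for SGD; the problem sizes are arbitrary): on an underdetermined least-squares problem, gradient descent from zero initialization converges to the minimum-norm interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 100                       # n < d: infinitely many zero-loss solutions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on squared loss, starting from zero.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-norm interpolating solution, in closed form.
w_min = X.T @ np.linalg.solve(X @ X.T, y)

print(np.max(np.abs(X @ w - y)))     # ~0: gradient descent fits the data exactly
print(np.linalg.norm(w - w_min))     # ~0: and it finds the minimum-norm solution
```

The updates always lie in the row space of X, so starting from zero the iterates can only converge to the interpolating solution of minimum norm.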

## Conclusion

- The classical view of machine learning rests on the idea of extracting a low-dimensional pattern from the data.
- Memorization is usually not regarded as genuine problem solving, yet it can be effective for some problems.
- This paper shows that many neural networks have the capacity to simply memorize the data.
- It is likely that in deep neural networks learning is part extracting patterns and part memorization.
- Therefore, classical complexity-based approaches are ill-suited to analyzing deep neural networks.
- The paper argues that these findings show we need to rethink generalization in order to understand deep neural networks.