mixup: Beyond Empirical Risk Minimization

7/4/2021, 7:08:00 AM


Problems of Large Deep Neural Networks
Memorization and sensitivity to adversarial examples
⇒ Solution: mixup
training using convex combinations of pairs of examples and their labels
mixup regularizes the neural network to favor simple linear behavior in-between training examples
Effects of using mixup
mixup improves the generalization of SOTA neural network architectures
reduces the memorization of corrupt labels (Section 3.4)
increases the robustness to adversarial examples (Section 3.5)
stabilizes the training of generative adversarial networks (Section 3.7)

1. Introduction

Shared commonalities of successful large deep neural networks
Trained to minimize their average error over the training data: the Empirical Risk Minimization (ERM) principle
The size of SOTA neural networks scales linearly with the number of training examples
Classical result in learning theory (Vapnik & Chervonenkis, 1971):
the convergence of ERM is guaranteed as long as the size of the learning machine does not increase with the number of training data.
Challenges toward the suitability of ERM
ERM allows large neural networks to memorize the training data even in the presence of strong regularization (Zhang et al., 2017)
Neural networks trained with ERM change their predictions drastically when evaluated on adversarial examples (Szegedy et al., 2014)
→ This evidence suggests that ERM cannot explain or provide generalization on test distributions that differ only slightly from the training data
How to train on examples that are similar to, but different from, the training data → Data Augmentation
Additional virtual examples can be drawn from the vicinity distribution of the training examples to enlarge the support of the training distribution
Drawbacks of conventional data augmentation methods:
1) While data augmentation consistently leads to improved generalization, the procedure is dataset-dependent and thus requires expert knowledge
2) Data augmentation assumes that the examples in the vicinity share the same class
⇒ does not model the vicinity relation across examples of different classes

Contribution of this Research

Simple and data-agnostic data augmentation routine: mixup
mixup constructs virtual training examples as convex combinations of pairs of training examples and their labels
Effects of using mixup
mixup achieves new state-of-the-art performance on the CIFAR-10, CIFAR-100, and ImageNet-2012 image classification datasets
increases the robustness of neural networks when learning from corrupt labels, or facing adversarial examples
improves generalization on speech and tabular data, and can be used to stabilize the training of GANs

2. From Empirical Risk Minimization to mixup

Concepts of mixup

In supervised learning, we look for a function $f$ that maps a feature vector $x$ to a target vector $y$. Given a loss function $\ell$, we minimize the average of $\ell$ over the data distribution $P$, also known as the expected risk:
$$R(f) = \int \ell(f(x), y) \, dP(x, y)$$
The distribution $P$ is unknown in most practical situations. Using the training data $D = \{(x_i, y_i)\}_{i=1}^{n}$, we may approximate $P$ by the empirical distribution:
$$P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)$$
where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at $(x_i, y_i)$
Using the empirical distribution $P_\delta$, we can now approximate the expected risk by the empirical risk:
$$R_\delta(f) = \int \ell(f(x), y) \, dP_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$
→ Learning the function $f$ by minimizing $R_\delta(f)$ is known as the Empirical Risk Minimization (ERM) principle (Vapnik, 1998)
While efficient to compute, the empirical risk $R_\delta(f)$ monitors the behaviour of $f$ only at a finite set of $n$ examples.
When considering functions with a number of parameters comparable to $n$ (e.g. large neural networks), one trivial way to minimize $R_\delta(f)$ is to memorize the training data (Zhang et al., 2017)
⇒ Memorization in turn leads to undesirable behaviour of $f$ outside the training data (Szegedy et al., 2014)
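To make the memorization problem concrete, here is a minimal sketch. The toy data, the 0-1 loss, and the lookup-table "model" are all hypothetical choices made for illustration. It shows that a function can reach zero empirical risk while being wrong on points that sit right next to the training examples, because ERM only inspects the n training pairs.

```python
def empirical_risk(f, data, loss):
    """Average loss of f over the given (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def zero_one_loss(prediction, target):
    return 0.0 if prediction == target else 1.0


# Toy training set (hypothetical values): 1-D features, binary labels.
train = [(0.1, 0), (0.4, 1), (0.5, 0), (0.9, 1)]

# A "memorizer": returns the stored label for a seen x,
# and an arbitrary default (0) everywhere else.
table = dict(train)


def memorizer(x):
    return table.get(x, 0)


train_risk = empirical_risk(memorizer, train, zero_one_loss)

# Points very close to training examples labeled 1; ERM never looks here.
nearby = [(0.41, 1), (0.89, 1)]
nearby_risk = empirical_risk(memorizer, nearby, zero_one_loss)

print(train_risk, nearby_risk)
```

The memorizer is perfect on the training set and fails on both nearby points, which is the gap that vicinal approaches try to close.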
There are other options to approximate the true distribution P
In the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2000), the distribution $P$ is approximated by
$$P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)$$
where $\nu$ is a vicinity distribution that measures the probability of finding the virtual feature-target pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training feature-target pair $(x_i, y_i)$
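As an illustration of a vicinity distribution, the sketch below uses a Gaussian vicinity, the kind considered by Chapelle et al. (2000): the features are perturbed with additive Gaussian noise and the label is left unchanged. The function names, the noise level `sigma`, and the toy data are assumptions made for this example.

```python
import random


def sample_gaussian_vicinity(x, y, sigma, rng):
    """Draw one virtual pair around (x, y): noisy features, unchanged label."""
    x_virtual = [xi + rng.gauss(0.0, sigma) for xi in x]
    return x_virtual, y


def vicinal_dataset(data, sigma, m, rng):
    """Build m virtual pairs: pick a training pair uniformly, then sample its vicinity."""
    virtual = []
    for _ in range(m):
        x, y = rng.choice(data)
        virtual.append(sample_gaussian_vicinity(x, y, sigma, rng))
    return virtual


rng = random.Random(0)
train = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]  # hypothetical toy data
virtual = vicinal_dataset(train, sigma=0.1, m=5, rng=rng)
for x_virtual, y_virtual in virtual:
    print(x_virtual, y_virtual)
```

Note that every virtual example keeps the class of the training example it came from, so this vicinity says nothing about the region between examples of different classes.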
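The contribution of this paper is to propose a generic vicinal distribution, called mixup:
$$\mu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda} \left[ \delta(\tilde{x} = \lambda \cdot x_i + (1 - \lambda) \cdot x_j, \; \tilde{y} = \lambda \cdot y_i + (1 - \lambda) \cdot y_j) \right]$$
where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, for $\alpha \in (0, \infty)$
In a nutshell, sampling from the mixup vicinal distribution produces virtual feature-target vectors
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $(x_i, y_i)$ and $(x_j, y_j)$ are two feature-target vectors drawn at random from the training data, and $\lambda \in [0, 1]$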
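Sampling one virtual example from the mixup vicinal distribution can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `mixup_pair` is a hypothetical helper, labels are assumed to be one-hot vectors, and the toy values are made up.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Return one virtual pair as a convex combination of two training pairs."""
    lam = rng.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


rng = random.Random(0)
x_i, y_i = [0.0, 2.0], [1.0, 0.0]  # features and one-hot label of example i
x_j, y_j = [4.0, 6.0], [0.0, 1.0]  # features and one-hot label of example j
x_mix, y_mix, lam = mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=rng)
print(lam, x_mix, y_mix)
```

Because the same λ is applied to features and labels, the mixed label stays a valid probability vector whose entries are λ and 1 − λ.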
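Alternative Design Choices related to mixup (Discussed in Section 3.8)
1) Convex combinations of three or more examples (weights sampled from a Dirichlet distribution): no further gain, but higher computation cost
2) Single data loader to obtain one minibatch, and then mixup is applied to the same minibatch after random shuffling: works equally well while reducing I/O requirements
3) Interpolating only between inputs with equal label: did not lead to the performance gains of mixup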
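Design choice 2) can be sketched as follows. This is an assumption-laden illustration in plain Python rather than a deep learning framework: `mixup_minibatch` is a hypothetical helper, a single λ is drawn for the whole minibatch, and labels are assumed to be one-hot.

```python
import random


def mixup_minibatch(xs, ys, alpha, rng):
    """Mix a minibatch with a shuffled copy of itself, using one lambda per batch."""
    lam = rng.betavariate(alpha, alpha)
    perm = list(range(len(xs)))
    rng.shuffle(perm)  # perm[k] is the partner index of example k

    def blend(u, v):
        return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

    mixed_xs = [blend(xs[k], xs[perm[k]]) for k in range(len(xs))]
    mixed_ys = [blend(ys[k], ys[perm[k]]) for k in range(len(ys))]
    return mixed_xs, mixed_ys, lam, perm


rng = random.Random(0)
xs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # hypothetical features
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # one-hot labels
mixed_xs, mixed_ys, lam, perm = mixup_minibatch(xs, ys, alpha=0.4, rng=rng)
print(lam, perm)
```

Only one minibatch has to be loaded per step, which is where the I/O saving comes from. An example may occasionally be paired with itself, in which case it passes through unchanged.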

What is mixup doing?

The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model f to behave linearly in-between training examples
This linear behaviour reduces the amount of undesirable oscillations when predicting outside the training examples
Figure 1b
Figure 2: The average behaviors of two neural network models trained on the CIFAR-10 dataset using ERM and mixup
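One way to see how the interpolated labels push the model toward in-between predictions: with a cross-entropy loss, the loss against a mixed label equals the λ-weighted average of the losses against the two original labels. The sketch below checks this identity numerically. The predicted probabilities, the labels, and λ are hypothetical values, and cross-entropy is an assumed choice of loss.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target))


probs = [0.7, 0.2, 0.1]  # model output on a mixed input (hypothetical)
y_i = [1.0, 0.0, 0.0]    # one-hot label of the first example
y_j = [0.0, 0.0, 1.0]    # one-hot label of the second example
lam = 0.3

y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]

loss_soft = cross_entropy(probs, y_mix)
loss_split = lam * cross_entropy(probs, y_i) + (1.0 - lam) * cross_entropy(probs, y_j)
print(loss_soft, loss_split)
```

The two values agree up to floating-point error, so a training loop can use either form; the split form avoids building soft label vectors when labels are stored as class indices.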

3. Experiments

3.1 ImageNet Classification

ImageNet-2012 classification dataset: contains 1.3 million training images and 50,000 validation images, from a total of 1,000 classes
We follow standard data augmentation practices: scale and aspect ratio distortions, random crops, and horizontal flips (Goyal et al., 2017)
During evaluation, only the 224 × 224 central crop of each image is tested.
We use mixup and ERM to train several state-of-the-art ImageNet-2012 classification models
For mixup, we find that $\alpha \in [0.1, 0.4]$ leads to improved performance over ERM, whereas for large $\alpha$, mixup leads to underfitting
Models with higher capacities and/or longer training runs are the ones to benefit the most from mixup
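The role of α becomes clearer when looking at how λ ~ Beta(α, α) is distributed. The sketch below (the helper name and sample counts are assumptions) estimates the average of min(λ, 1 − λ), that is, how far a virtual example typically sits from the nearer of its two parents. Small α keeps λ close to 0 or 1, so virtual examples stay near real ones; large α pulls λ toward 0.5.

```python
import random


def mean_distance_from_endpoints(alpha, n_samples, rng):
    """Monte Carlo estimate of E[min(lambda, 1 - lambda)] for lambda ~ Beta(alpha, alpha)."""
    total = 0.0
    for _ in range(n_samples):
        lam = rng.betavariate(alpha, alpha)
        total += min(lam, 1.0 - lam)
    return total / n_samples


rng = random.Random(0)
small_alpha = mean_distance_from_endpoints(0.2, 10000, rng)
large_alpha = mean_distance_from_endpoints(8.0, 10000, rng)
print(small_alpha, large_alpha)
```

With α = 0.2 most virtual examples are near-copies of a training example, while with α = 8 they are close to even blends, which matches the observation that large α acts as a stronger regularizer.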

3.2 CIFAR-10 and CIFAR-100

3.3 Speech Data

3.4 Memorization and Corrupted Labels

Hypothesis: increasing the strength of mixup interpolation $\alpha$ should generate virtual examples further from the training examples, making memorization more difficult to achieve.
We compare in these experiments mixup, dropout, mixup + dropout, and ERM
Table 2
We note the best test error achieved during the training session, as well as the final test error after 200 epochs
To quantify the amount of memorization, we also evaluate the training errors at the last epoch on real labels and corrupted labels
mixup with a large $\alpha$ (e.g. 8 or 32) outperforms dropout on both the best and last epoch test errors, and achieves lower training error on real labels while remaining resistant to noisy labels.
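The two training-error measurements described above can be sketched as follows. The corruption scheme (each label replaced by a uniformly random class with a given probability), the 50% rate, and the "perfect memorizer" predictions are assumptions for illustration, not the exact experimental setup.

```python
import random


def corrupt_labels(labels, num_classes, corruption_rate, rng):
    """Replace each label by a uniformly random class with probability corruption_rate."""
    corrupted = []
    for y in labels:
        if rng.random() < corruption_rate:
            corrupted.append(rng.randrange(num_classes))
        else:
            corrupted.append(y)
    return corrupted


def error_rate(predictions, labels):
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


rng = random.Random(0)
real = [rng.randrange(10) for _ in range(1000)]  # hypothetical true labels
noisy = corrupt_labels(real, num_classes=10, corruption_rate=0.5, rng=rng)

# Predictions of a model that has fully memorized the noisy training labels.
memorized_predictions = list(noisy)

err_on_corrupted = error_rate(memorized_predictions, noisy)
err_on_real = error_rate(memorized_predictions, real)
print(err_on_corrupted, err_on_real)
```

A model that memorizes shows zero training error on the corrupted labels and a high error on the real ones; a model that resists memorization shows the opposite pattern.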

3.5 Robustness to Adversarial Examples

Previous methods to improve robustness to adversarial examples
Penalize the norm of the Jacobian of the model to control its Lipschitz constant (Drucker & Le Cun, 1992; Cisse et al., 2017; Bartlett et al., 2017; Hein & Andriushchenko, 2017)
Perform data augmentation by producing and training on adversarial examples (Goodfellow et al., 2015)
→ All of these methods add significant computational overhead to ERM
mixup can significantly improve the robustness of neural networks without hindering the speed of ERM
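For context on what an adversarial example is, here is a minimal sketch of a one-step gradient-sign perturbation in the style of Goodfellow et al. (2015). The logistic-regression model, its weights, the input, and ε are all hypothetical; for this model the gradient of the cross-entropy loss with respect to the input is (p − y)·w.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(g):
    return (g > 0) - (g < 0)


def gradient_sign_attack(w, b, x, y, epsilon):
    """Move x by epsilon in the direction that increases the cross-entropy loss."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0   # hypothetical model parameters
x, y = [0.5, -0.5], 1     # a correctly classified input
x_adv = gradient_sign_attack(w, b, x, y, epsilon=0.5)

p_clean = predict(w, b, x)
p_adv = predict(w, b, x_adv)
print(p_clean, p_adv)
```

A small, bounded change to each input coordinate is enough to drag the confidence in the correct class down to chance level, which is the kind of sensitivity mixup is meant to reduce.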

3.6 Tabular Data

3.7 Stabilization of Generative Adversarial Networks

Solving the min-max equation for training a GAN is a notoriously difficult optimization problem (Goodfellow, 2016), since the discriminator often provides the generator with vanishing gradients.
mixup should stabilize GAN training because it acts as a regularizer on the gradients of the discriminator
→ The smoothness of the discriminator guarantees a stable source of gradient information to the generator
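Written out, the change to the GAN objective is small. The following is a sketch in the notation of this post, assuming $\ell$ is the binary cross-entropy loss, $d$ the discriminator, $g$ the generator, $x$ a real sample, and $z$ a noise vector. The first line is the usual objective; the second replaces the two separate discriminator terms with a single term evaluated on an interpolation of real and generated samples, with $\lambda$ as the soft target.

```latex
% Standard GAN objective
\max_{g} \min_{d} \; \mathbb{E}_{x, z} \; \ell(d(x), 1) + \ell(d(g(z)), 0)

% mixup GAN objective
\max_{g} \min_{d} \; \mathbb{E}_{x, z, \lambda} \; \ell\big(d(\lambda x + (1 - \lambda) g(z)), \lambda\big)
```

Since the discriminator is asked to output $\lambda$ along the whole segment between a real and a generated sample, its output is encouraged to change gradually rather than jump, which is where the smoother gradient signal for the generator comes from.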

3.8 Ablation Studies

Compare interpolating raw inputs with interpolating latent representations
Compare mixing random pairs of inputs (RP) with mixing nearest neighbors (KNN)
Compare mixing all the classes (AC) with mixing within the same class (SC)
Compare mixing inputs and labels with mixing inputs only
mixup is the best data augmentation method we test, and is significantly better than the second best method (mix input + label smoothing)
For ERM a large weight decay works better, whereas for mixup a small weight decay is preferred, confirming its regularization effects.

4. Related Work

Like the method of DeVries & Taylor (2017), mixup does not require significant domain knowledge.
Like label smoothing, mixup keeps the supervision of every example from being overly dominated by the ground-truth label.
Unlike both of these approaches, the mixup transformation establishes a linear relationship between data augmentation and the supervision signal.
The linearity constraint, through its effect on the derivatives of the function approximated, also relates mixup to other methods such as Sobolev training of neural networks (Czarnecki et al., 2017) or WGAN-GP (Gulrajani et al., 2017).

5. Discussion

mixup is a form of vicinal risk minimization, which trains on virtual examples constructed as the linear interpolation of two random examples from the training set and their labels
With increasingly large $\alpha$, the training error on real data increases, while the generalization gap decreases.
We do not yet have a good theory for understanding the ‘sweet spot’ of this bias-variance trade-off.
In CIFAR-10 classification we can get very low training error on real data even when $\alpha \to \infty$
In ImageNet classification, the training error on real data increases significantly with $\alpha \to \infty$
Questions for Further Exploration
Is it possible to make similar ideas work on other types of supervised learning problems, such as regression and structured prediction?
Can similar methods prove helpful beyond supervised learning?
Can we extend mixup to feature-label extrapolation to guarantee a robust model behavior far away from the training data?