### Abstract

•

Problems of large deep neural networks

◦

Memorization

◦

Sensitivity to Adversarial examples

⇒ Solution: mixup

•

mixup

◦

training using convex combinations of pairs of examples and their labels

◦

mixup regularizes the neural network to favor simple linear behavior in-between training examples

◦

Effects of using mixup

▪

mixup improves the generalization of SOTA neural network architectures

▪

reduces the memorization of corrupt labels (Section 3.4)

▪

increases the robustness to adversarial examples (Section 3.5)

▪

stabilizes the training of generative adversarial networks (Section 3.7)

### 1. Introduction

•

Shared commonalities of successful large deep neural networks

◦

Trained to minimize the average error over the training data: the Empirical Risk Minimization (ERM) principle

◦

The size of SOTA neural networks scales linearly with the number of training examples

⇒ Contradiction!

•

Classical result in learning theory (Vapnik & Chervonenkis, 1971):

the convergence of ERM is guaranteed as long as the size of the learning machine does not increase with the number of training data.

•

Challenges toward the suitability of ERM

◦

ERM allows large neural networks to memorize the training data even in the presence of strong regularization (Zhang et al., 2017)

◦

Neural networks trained with ERM change their predictions drastically when evaluated on adversarial examples (Szegedy et al., 2014)

→ ERM might be unable to explain or provide generalization on testing distributions that differ only slightly from the training data...?

•

How to train on examples that are similar to, but different from, the training data → Data Augmentation

◦

Additional virtual examples can be drawn from the vicinity distribution of the training examples to enlarge the support of the training distribution

◦

Setbacks of conventional data augmentation methods:

1) While data augmentation consistently leads to improved generalization

⇒ the procedure is dataset-dependent, and thus requires the use of expert knowledge

2) Data augmentation assumes that the examples in the vicinity share the same class

⇒ does not model the vicinity relation across examples of different classes

#### Contribution of this Research

•

Simple and data-agnostic data augmentation routine: mixup

•

mixup constructs virtual training examples

•

Effects of using mixup

◦

mixup achieves new state-of-the-art performance on the CIFAR-10, CIFAR-100, and ImageNet-2012 image classification datasets

◦

increases the robustness of neural networks when learning from corrupt labels, or facing adversarial examples

◦

improves generalization on speech and tabular data, and can be used to stabilize the training of GANs

### 2. From Empirical Risk Minimization to mixup

#### Concepts of mixup

•

In supervised learning, we minimize the average of the loss function $\ell$ over the data distribution $P$, also known as the expected risk:

$$R(f) = \int \ell(f(x), y)\, \mathrm{d}P(x, y)$$

•

The distribution $P$ is unknown in most practical situations. Using the training data $D = \{(x_i, y_i)\}_{i=1}^{n}$, we may approximate $P$ by the empirical distribution:

$$P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)$$

where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at $(x_i, y_i)$

•

Using the empirical distribution $P_\delta$, we can now approximate the expected risk by the empirical risk:

$$R_\delta(f) = \int \ell(f(x), y)\, \mathrm{d}P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

→ Learning the function $f$ by minimizing $R_\delta(f)$ is known as the Empirical Risk Minimization (ERM) principle (Vapnik, 1998)

•

While efficient to compute, the empirical risk ${ R }_{ \delta }(f)$ monitors the behaviour of f only at a finite set of n examples.

•

When considering functions with a number of parameters comparable to $n$ (e.g. large neural networks)

→ The network can memorize the training data (Zhang et al., 2017) ⇒ This leads to undesirable behaviour of $f$ outside the training data (Szegedy et al., 2014)

•

There are other options to approximate the true distribution P

◦

In the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2000), the distribution $P$ is approximated by

$$P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)$$

where $\nu$ is a vicinity distribution that measures the probability of finding the virtual feature-target pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training feature-target pair $(x_i, y_i)$

•

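The contribution of this paper is to propose a generic vicinal distribution, called mixup:

$$\mu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda} \left[ \delta(\tilde{x} = \lambda x_i + (1 - \lambda) x_j,\ \tilde{y} = \lambda y_i + (1 - \lambda) y_j) \right]$$

where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, for $\alpha \in (0, \infty)$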

◦

In a nutshell, sampling from the mixup vicinal distribution produces virtual feature-target vectors

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two feature-target vectors drawn at random from the training data, and $\lambda \in [0, 1]$

•

Alternative design choices related to mixup (discussed in Section 3.8)

1) Convex combinations of three or more examples

2) Using a single data loader to obtain one minibatch, then applying mixup between that minibatch and a randomly shuffled copy of it

3) Interpolating only between inputs with equal label
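
The interpolation above, combined with design choice 2 (one minibatch mixed with a shuffled copy of itself), can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `mixup_batch`, the use of one-hot labels, and drawing a single $\lambda$ per minibatch are assumptions made here for brevity.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix one minibatch with a randomly shuffled copy of itself.

    Assumptions (not from the notes): labels are one-hot, and a single
    lambda is drawn per minibatch rather than per example.
    """
    lam = rng.beta(alpha, alpha)        # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))      # partner index j for every i
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam, perm

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                   # 4 toy examples, 3 features
y = np.eye(2)[np.array([0, 1, 1, 0])]         # one-hot labels for 2 classes
x_mixed, y_mixed, lam, perm = mixup_batch(x, y, alpha=0.2, rng=rng)
```

Each row of `y_mixed` is still a valid probability vector, because it is a convex combination of two one-hot vectors.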

#### What is mixup doing?

•

The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model f to behave linearly in-between training examples

•

This linear behaviour reduces the amount of undesirable oscillations when predicting outside the training examples

•

Figure 1b

•

Figure 2: The average behaviors of two neural network models trained on the CIFAR-10 dataset using ERM and mixup
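
One way to see how the mixed labels enter the training signal: cross-entropy is linear in the target, so the loss against a mixed label equals the same convex combination of the losses against the two original labels. The sketch below checks this numerically; the helper `cross_entropy` and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (soft) target."""
    return float(-np.sum(target * np.log(probs)))

probs = np.array([0.7, 0.2, 0.1])     # toy model output at the mixed input
y_i = np.array([1.0, 0.0, 0.0])       # one-hot label of example i
y_j = np.array([0.0, 0.0, 1.0])       # one-hot label of example j
lam = 0.3

# Loss against the mixed label ...
loss_mixed = cross_entropy(probs, lam * y_i + (1 - lam) * y_j)
# ... equals the lambda-weighted sum of the two single-label losses.
loss_combo = lam * cross_entropy(probs, y_i) + (1 - lam) * cross_entropy(probs, y_j)
```

In other words, at an interpolated input the model is asked to split its prediction between the two classes in the proportions $\lambda$ and $1 - \lambda$.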

### 3. Experiments

#### 3.1 ImageNet Classification

•

ImageNet-2012 classification dataset: contains 1.3 million training images and 50,000 validation images, from a total of 1,000 classes

•

We follow standard data augmentation practices: scale and aspect ratio distortions, random crops, and horizontal flips (Goyal et al., 2017)

•

During evaluation, only the 224 × 224 central crop of each image is tested.

•

We use mixup and ERM to train several state-of-the-art ImageNet-2012 classification models

•

For mixup, we find that $\alpha \in [0.1, 0.4]$ leads to improved performance over ERM, whereas for large $\alpha$, mixup leads to underfitting

•

Models with higher capacities and/or longer training runs are the ones to benefit the most from mixup

#### 3.2 CIFAR-10 and CIFAR-100

#### 3.3 Speech Data

#### 3.4 Memorization and Corrupted Labels

•

Hypothesis: increasing the strength of mixup interpolation $\alpha$ should generate virtual examples further from the training examples, making memorization more difficult to achieve.

•

In these experiments we compare mixup, dropout, mixup + dropout, and ERM

•

Table 2

◦

We note the best test error achieved during the training session, as well as the final test error after 200 epochs

◦

To quantify the amount of memorization, we also evaluate the training errors at the last epoch on real labels and corrupted labels

▪

mixup with a large $\alpha$ (e.g. 8 or 32) outperforms dropout on both the best and last epoch test errors

▪

mixup with a large $\alpha$ also achieves lower training error on real labels while remaining resistant to noisy labels.
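
The hypothesis at the top of this section can be illustrated numerically. With $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, a small $\alpha$ puts most of the mass near 0 or 1, so the virtual example stays close to one of the two originals, while a large $\alpha$ concentrates $\lambda$ around 0.5. The sketch below uses the mean of $\min(\lambda, 1 - \lambda)$ as a rough proxy for how far virtual examples move from the training examples; that proxy and the helper name are assumptions made here, not a measure from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_mixing_strength(alpha, n_samples=100_000):
    """Average weight given to the 'other' example, min(lambda, 1 - lambda)."""
    lam = rng.beta(alpha, alpha, size=n_samples)
    return float(np.mean(np.minimum(lam, 1.0 - lam)))

# The proxy lies in [0, 0.5]; it grows toward 0.5 as alpha increases.
strengths = {alpha: mean_mixing_strength(alpha) for alpha in (0.2, 1.0, 8.0, 32.0)}
for alpha, strength in strengths.items():
    print(f"alpha={alpha:>4}: mean min(lambda, 1-lambda) = {strength:.3f}")
```

For $\alpha = 1$ the distribution is uniform on $[0, 1]$, so the proxy is close to 0.25.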

#### 3.5 Robustness to Adversarial Examples

•

Previous methods to improve robustness to adversarial examples

◦

Penalize the norm of the Jacobian of the model to control its Lipschitz constant (Drucker & Le Cun, 1992; Cisse et al., 2017; Bartlett et al., 2017; Hein & Andriushchenko, 2017)

◦

Perform data augmentation by producing and training on adversarial examples (Goodfellow et al., 2015)

→ All of these methods add significant computational overhead to ERM

⇒ mixup can significantly improve the robustness of neural networks without hindering the speed of ERM

#### 3.6 Tabular Data

#### 3.7 Stabilization of Generative Adversarial Networks

•

Solving the min-max equation for training a GAN is a notoriously difficult optimization problem (Goodfellow, 2016), since the discriminator often provides the generator with vanishing gradients.

•

mixup should stabilize GAN training because it acts as a regularizer on the gradients of the discriminator

→ The smoothness of the discriminator guarantees a stable source of gradient information to the generator
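
For reference, a sketch of how the interpolation enters the GAN objective, written from the formulation in Section 3.7 of the paper and worth checking against it. The symbols are not defined in these notes and are assumed here: $d$ is the discriminator, $g$ the generator, $z$ the noise input, and $\ell$ the binary cross-entropy loss. The standard objective is

$$\max_{g} \min_{d} \; \mathbb{E}_{x, z} \left[ \ell(d(x), 1) + \ell(d(g(z)), 0) \right]$$

and the mixup variant trains the discriminator on interpolations of real and generated samples, with $\lambda$ as the soft target:

$$\max_{g} \min_{d} \; \mathbb{E}_{x, z, \lambda} \left[ \ell\big(d(\lambda x + (1 - \lambda)\, g(z)),\ \lambda\big) \right]$$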

#### 3.8 Ablation Studies

•

Compare interpolating raw inputs with interpolating latent representations

•

Compare mixing random pairs of inputs (RP) with mixing nearest neighbors (KNN)

•

Compare mixing all the classes (AC) with mixing within the same class (SC)

•

Compare mixing inputs and labels with mixing inputs only

•

mixup is the best data augmentation method we test, and is significantly better than the second best method (mix input + label smoothing)

•

For ERM a large weight decay works better, whereas for mixup a small weight decay is preferred, confirming its regularization effects.
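
The pairing strategies compared above differ only in how the partner index $j$ is chosen for each example $i$. A minimal sketch of two of them, random pairs across all classes (AC with RP) and random pairs within the same class (SC with RP), is given below; the function names are hypothetical, and the nearest-neighbor (KNN) variant is left out.

```python
import numpy as np

def random_partners(labels, rng):
    """AC + RP: partner each example with a random example from the batch."""
    return rng.permutation(len(labels))

def same_class_partners(labels, rng):
    """SC + RP: partner each example with a random example of the same class."""
    partners = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        partners[idx] = rng.permutation(idx)  # shuffle within the class only
    return partners

rng = np.random.default_rng(0)
labels = np.array([0, 1, 0, 2, 1, 0, 2, 1])
rp = random_partners(labels, rng)
sc = same_class_partners(labels, rng)
```

With `sc`, `labels[sc]` equals `labels`, so mixing the labels has no effect and only the inputs are interpolated.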

### 4. Related Work

•

Like the method of DeVries & Taylor (2017), mixup does not require significant domain knowledge.

•

As with label smoothing, under mixup the supervision of every example is not overly dominated by the ground-truth label.

•

Unlike both of these approaches, the mixup transformation establishes a linear relationship between data augmentation and the supervision signal.

•

The linearity constraint, through its effect on the derivatives of the function approximated, also relates mixup to other methods such as Sobolev training of neural networks (Czarnecki et al., 2017) or WGAN-GP (Gulrajani et al., 2017).

### 5. Discussion

•

mixup is a form of vicinal risk minimization, which trains on virtual examples constructed as the linear interpolation of two random examples from the training set and their labels

•

With increasingly large $\alpha$, the training error on real data increases, while the generalization gap decreases.

•

We do not yet have a good theory for understanding the ‘sweet spot’ of this bias-variance trade-off.

◦

In CIFAR-10 classification we can get very low training error on real data even when $\alpha \rightarrow \infty$

◦

In ImageNet classification, the training error on real data increases significantly with $\alpha \rightarrow \infty$

•

Questions for Further Exploration

◦

Is it possible to make similar ideas work on other types of supervised learning problems, such as regression and structured prediction?

◦

Can similar methods prove helpful beyond supervised learning?

◦

Can we extend mixup to feature-label extrapolation to guarantee a robust model behavior far away from the training data?