## AdderNet

- CVPR'20
- Convolution == cross-correlation, which measures the similarity between the input feature and the convolution filters
- Replaces convolutions with additions to reduce the computation cost
- Uses the L1-norm distance between the filters and the input feature
- Requires a special back-propagation approach
- Proposes an adaptive learning rate strategy

## Introduction

- Convolutions rely on billions of floating-point multiplications on GPUs
  - Too expensive to run on mobile devices

### Simplifying approaches to minimize the costs

- BinaryConnect: binary weights
- BNN: binary weights and binary activations
- Low bit-width gradients for binarized networks

Binarizing the filters of deep neural networks significantly reduces the computation cost.

### Drawbacks of binarized networks

- The original recognition accuracy is not preserved
- Unstable training
- Slower convergence with a small learning rate

Instead, let's reduce the computation cost by replacing the computing operation itself!

## Related works

- Network pruning
  - Removes redundant weights from filters
- Efficient block design
- Knowledge distillation

## Networks without multiplication

### Similarity in the network operation

"Template matching"

- $Y$ indicates the similarity between the filter and the input feature
- $S(\cdot, \cdot)$ is a pre-defined similarity measure
  - If $S(x, y) = x \times y$, it becomes a convolution (cross-correlation)
  - If the filter size $d = 1$, it becomes an FC layer

In the end, all that matters is how the similarity between the input and the filter is measured.
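The template-matching view can be sketched as a single sliding-window loop with a pluggable similarity $S$ (a toy NumPy sketch, not the paper's code; all names are illustrative):

```python
import numpy as np

def template_match(X, F, S):
    """Toy 'template matching' layer: slide filter F over input X and
    aggregate a pluggable similarity S over each patch.
    X: (H, W, c_in), F: (d, d, c_in) -> Y: (H-d+1, W-d+1)."""
    d = F.shape[0]
    H, W, _ = X.shape
    Y = np.zeros((H - d + 1, W - d + 1))
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]
            Y[m, n] = np.sum(S(patch, F))
    return Y

conv_sim = lambda x, f: x * f            # S(x, y) = x * y   -> cross-correlation
adder_sim = lambda x, f: -np.abs(x - f)  # S(x, y) = -|x - y| -> adder layer

X = np.random.randn(8, 8, 3)
F = np.random.randn(3, 3, 3)
Y_conv = template_match(X, F, conv_sim)
Y_adder = template_match(X, F, adder_sim)
```

Swapping `conv_sim` for `adder_sim` is the entire change AdderNet makes; the sliding-window structure itself is untouched.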

### Adder Networks

Use a distance metric that relies on additions as much as possible.

Calculating the L1 distance between the filter and the input feature turns Eq. 1 into Eq. 2.

Subtraction can be implemented as addition using complement code.

- Different outputs of a convolution filter and an adder filter
  - The output of a convolution filter can be positive or negative
    (a weighted summation of values in the input feature map)
  - The output of an adder filter is always non-positive (a negated L1 distance)
- Use batch normalization
  - Normalizes the output of an adder filter to an appropriate range
  - All the activation functions used in conventional CNNs can then be applied
  - The multiplication cost of batch normalization is significantly lower than that of convolution
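Why batch normalization matters here can be seen with a toy sketch (a hand-rolled per-tensor normalization, illustrative only, not the paper's code): it maps the always-non-positive adder outputs into a zero-centered range where the usual activations behave normally.

```python
import numpy as np

def batch_norm(Y, eps=1e-5):
    # Plain per-tensor normalization (learnable scale/shift omitted for brevity)
    return (Y - Y.mean()) / np.sqrt(Y.var() + eps)

Y_adder = -np.abs(np.random.randn(4, 6, 6))  # adder outputs are all <= 0
Y_bn = batch_norm(Y_adder)                   # zero-centered; some values now > 0
```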

## Optimization of back propagation

- Partial derivative of the output feature $Y$ with respect to the filters $F$
- Partial derivative of $Y$ with respect to the input $X$
- sgn = the sign function, which returns only -1, 0, or 1

Updating with the sign gradient of Eq. 4 is exactly a signSGD update; Eq. 5 instead uses the full-precision gradient.

- Limitations of signSGD
  - signSGD almost never takes the direction of steepest descent
  - The direction of signSGD only gets worse as the dimensionality grows
- Therefore, the filters can be updated precisely by using the full-precision gradient, as in Eq. 5
- Difference between the two derivatives ($\partial Y/\partial F$ and $\partial Y/\partial X$)
  - The derivative of $Y$ w.r.t. $F$ only affects the gradient of $F$ itself
  - The derivative of $Y$ w.r.t. $X$ influences the gradients not only of its own layer but also of the preceding layers (by the chain rule)
- With the full-precision gradient, the magnitude of $\partial Y/\partial X$ could explode, so it is clipped to $[-1, 1]$
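The two gradients can be sketched elementwise (my own NumPy illustration of the full-precision filter gradient and the clipped, HardTanh-style input gradient; function names are mine):

```python
import numpy as np

def adder_grad_F(X_patch, F):
    # Full-precision gradient of Y = -sum|X - F| w.r.t. the filter F:
    # use (X - F) instead of the sign sgn(X - F)
    return X_patch - F

def adder_grad_X(X_patch, F):
    # Gradient w.r.t. the input X, clipped to [-1, 1] so the chained
    # gradient does not explode through the preceding layers
    return np.clip(F - X_patch, -1.0, 1.0)

X_patch = np.array([2.5, -0.3, 0.1])
F = np.array([0.5, 0.2, 0.1])
gF = adder_grad_F(X_patch, F)  # full precision; only updates F itself
gX = adder_grad_X(X_patch, F)  # clipped; flows back through the chain rule
```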

## Adaptive learning rate scaling

### Difference of variance of outputs

- In conventional CNNs, assume the weights and the input features are independent
- The variance of the CNN output can then be derived
- The variance of the output in AdderNet is derived similarly
- AdderNet tends to produce a much larger output variance than a conventional CNN
  - The filter variance is usually small (e.g., $10^{-3}$ to $10^{-4}$), so multiplying by the filters yields small outputs, while accumulating L1 distances yields large ones
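A quick empirical check (my own toy experiment, not the paper's derivation) makes the gap concrete: with a realistically small filter variance, the conv output variance collapses while the adder output variance stays large.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c_in, n = 3, 64, 10_000
X = rng.standard_normal((n, d * d * c_in))             # unit-variance inputs
F = rng.standard_normal(d * d * c_in) * np.sqrt(1e-3)  # Var[F] ~ 1e-3

Y_conv = (X * F).sum(axis=1)           # cross-correlation output per sample
Y_adder = -np.abs(X - F).sum(axis=1)   # adder output per sample

# multiplication shrinks the output along with Var[F]; addition does not
print(Y_conv.var(), Y_adder.var())
```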

### Adoption of adaptive learning rate

- Batch normalization
- Gradient of the loss w.r.t. $X$ (Eq. 11)
- The larger variance of $Y$ ($\sigma$) in AdderNet makes Eq. 11 much smaller than in conventional CNNs
- Because the norm of the gradient differs from layer to layer, simply raising the global learning rate is not enough
- An adaptive learning rate is needed for each layer of an AdderNet
- The update for each adder layer $l$ is:
  ($\gamma$ is the global learning rate, $\alpha_l$ is the local learning rate applied to the gradient of the filter)
- The local learning rate is defined as:
- Total algorithm
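A minimal sketch of the layer-wise update, assuming the local learning rate has the form $\alpha_l = \eta \sqrt{k} / \lVert \Delta L(F_l) \rVert_2$ with $k$ the number of elements in $F_l$ ($\eta$, the shapes, and all names are illustrative):

```python
import numpy as np

def local_lr(grad_F, eta=0.1):
    # alpha_l = eta * sqrt(k) / ||grad_F||_2, k = number of filter elements
    k = grad_F.size
    return eta * np.sqrt(k) / (np.linalg.norm(grad_F) + 1e-12)

def update_filters(F, grad_F, gamma=0.01, eta=0.1):
    alpha = local_lr(grad_F, eta)       # per-layer local learning rate
    return F - gamma * alpha * grad_F   # global rate gamma times local rate

F = np.random.randn(3, 3, 16, 32)      # one adder layer's filters
g = np.random.randn(*F.shape) * 1e-4   # tiny raw gradients, as in AdderNet
F_new = update_filters(F, g)
# the step size is now gamma * eta * sqrt(k), independent of ||g||
```

Because the step normalizes by the gradient norm, layers with vanishingly small gradients still take steps of a fixed, useful magnitude.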

## Experiment

### Experiments on MNIST

- Using LeNet-5-BN on MNIST
  - Images resized to 32 × 32
- Replace the convolutional filters, and the multiplications in the FC layers, with adder filters


- Much lower latency with similar accuracy

### Experiments on CIFAR

- Compared with BNN, AdderNet preserves accuracy while using no multiplications

### Experiments on ImageNet

- Tested on both shallow and deep models
- AdderNet shows better accuracy than BNN
  - Adder filters can indeed extract good feature maps

### Visualization Results

- AdderNets use the L1 distance instead of cross-correlation
- In conventional CNNs, the classes are divided by angle
- In AdderNets, the classes are clustered according to the L1 norm

#### Visualization of filters

- The filters of AdderNets and CNNs show similar patterns

#### Visualization of the weight distribution

### Ablation Study

- Without the adaptive learning rate, the networks can hardly be trained due to the small gradients
- ILR (increased learning rate) with a value of 100 works best among {10, 50, 100, 200, 500}