
# AdderNet: Do We Really Need Multiplications in Deep Learning?

Presentation date
2020/12/14
Tags
CNN
Vision
Presenter
황인중

CVPR'20
Convolution is essentially cross-correlation: it measures the similarity between the input feature and the convolution filters
The paper replaces convolutions with additions to reduce computation cost
uses the L1-norm distance between filters and input features
with a special back-propagation approach
proposes an adaptive learning rate strategy

## Introduction

Convolutions take advantage of billions of floating-point multiplications on GPUs
Too expensive to run on mobile devices

### Simplifying approaches to minimize the costs

BinaryConnect: binary weight
BNN: binary weight, binary activations
low bit-width gradient of binarized networks
Binarizing filters of deep neural networks significantly reduces the computation cost

### Drawbacks of binarized networks

The original recognition accuracy is not preserved
Unstable training
Slower convergence speed with a small learning rate
Let's reduce the computation cost through the replacement of computing operation!

## Related works

Network pruning
Remove redundant weights of filters
Efficient Blocks Design
Knowledge Distillation

## Networks without multiplication

### Similarity in the network operation

"Template matching"
Y indicates the similarity between the filter and the input feature
S() is a pre-defined similarity measure
if S(x, y) = x × y, then it becomes convolution
if d = 1, then it becomes an FC layer
In the end, all that matters is how to measure the similarity between the input and the filter
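For reference, the template-matching output (Eq. 1) and its L1-distance instantiation (Eq. 2), reconstructed from the description above:

$$Y(m,n,t)=\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} S\big(X(m+i,n+j,k),\,F(i,j,k,t)\big)$$

$$Y(m,n,t)=-\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \big|X(m+i,n+j,k)-F(i,j,k,t)\big|$$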

Use a distance metric that maximizes the use of additions
Calculating the L1 distance between the filter and the input feature turns Eq. 1 into Eq. 2
Subtraction is just addition with the help of complement code
Different outputs of a convolution filter and an adder filter
Output of a convolution filter can be positive or negative (a weighted summation of values in the input feature map)
Output of an adder filter is always negative
Use batch normalization
Normalize the output of an adder filter to an appropriate range
All the activation functions used in conventional CNNs can then be applied
The multiplication cost of batch normalization is significantly lower than that of convolution
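A minimal NumPy sketch contrasting the two filter outputs at a single output position (the patch/filter sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A single d x d x c_in input patch and one filter of the same shape.
X = rng.standard_normal((3, 3, 8))
F = rng.standard_normal((3, 3, 8))

# Conventional convolution output: cross-correlation, S(x, y) = x * y.
y_conv = np.sum(X * F)

# AdderNet output (Eq. 2): negative L1 distance, S(x, y) = -|x - y|.
y_adder = -np.sum(np.abs(X - F))

print(y_conv)   # can be positive or negative
print(y_adder)  # always <= 0, hence the need for batch normalization
```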

## Optimization of back propagation

Partial derivative of the output feature Y with respect to the filter F
The exact L1 derivative is sgn(X − F), where sgn is the sign function returning only −1, 0, or 1 (Eq. 4)
Updating with Eq. 4 amounts to signSGD, so the paper instead uses the full-precision gradient X − F, i.e., the derivative of the L2 norm (Eq. 5)
Limitation of signSGD
SignSGD almost never takes the direction of steepest descent
The direction of signSGD only gets worse as dimensionality grows
Therefore, by utilizing the full-precision gradient, the filters can be updated precisely, as in Eq. 5
Difference between the two derivatives (∂Y/∂F and ∂Y/∂X)
The derivative of Y w.r.t. F only affects the gradient of F itself
The derivative of Y w.r.t. X influences not only its own layer but also the preceding layers (by the chain rule)
With the full-precision gradient, the magnitude of ∂Y/∂X could explode beyond [−1, 1] through the chain rule, so it is clipped with HardTanh
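The two gradients can be sketched as follows (shapes and the helper name are mine; `np.clip` plays the role of HardTanh):

```python
import numpy as np

def adder_grads(X, F):
    """Gradients used by AdderNet for Y = -sum(|X - F|) at one position.

    Assumes X and F have the same shape; shapes and names are illustrative.
    """
    # Full-precision gradient w.r.t. the filter (Eq. 5): X - F,
    # the L2-style surrogate instead of the exact sign gradient.
    dY_dF = X - F
    # Gradient w.r.t. the input: same form, but clipped to [-1, 1]
    # with HardTanh so it cannot blow up through the chain rule.
    dY_dX = np.clip(F - X, -1.0, 1.0)
    return dY_dF, dY_dX
```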

### Difference of variance of outputs

In conventional CNNs, provided that the weights and the input features are independent
The variance of the output is
The variance of the output in AdderNet
AdderNet tends to produce a much larger output variance than a conventional CNN
Filter variance is usually small (e.g., $10^{-3}$ or $10^{-4}$), so the product Var[X]·Var[F] in a CNN stays small, while the sum Var[X] + Var[F] in AdderNet stays large
Hence the variance of outputs in AdderNet is much bigger than in conventional CNNs
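A quick Monte-Carlo check of this gap (the layer size and filter variance below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each output sums k = d*d*c_in terms; inputs ~ N(0, 1),
# filters have a small variance, as is typical after training.
n, k = 20_000, 9 * 16
var_f = 1e-3
X = rng.standard_normal((n, k))
F = rng.normal(0.0, np.sqrt(var_f), size=(n, k))

var_conv = np.var(np.sum(X * F, axis=1))           # ~ k * Var[X] * Var[F]
var_adder = np.var(-np.sum(np.abs(X - F), axis=1)) # grows with Var[X] + Var[F]

print(var_conv, var_adder)  # the adder variance is orders of magnitude larger
```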

Batch normalization
The larger variance of Y (larger $\sigma$) in AdderNet makes the gradient in Eq. 11 much smaller than in conventional CNNs
Because the norm of the gradient differs across layers, we cannot simply raise the global learning rate
The update for each adder layer $l$ is
($\gamma$ is the global learning rate, $\alpha_l$ is the local learning rate, $\Delta L(F_l)$ is the gradient of the filter in layer $l$)
The local learning rate is defined as
Total algorithm
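A minimal sketch of the per-layer update with the local learning rate $\alpha_l = \eta\sqrt{k}/\lVert\Delta L(F_l)\rVert_2$, where $k$ is the number of elements in $F_l$; the function name, default values, and `eps` guard are mine:

```python
import numpy as np

def adaptive_lr_update(F, grad, global_lr=0.1, eta=1.0, eps=1e-12):
    """One adaptive-learning-rate step for a single adder layer.

    alpha_l = eta * sqrt(k) / ||grad||_2 normalizes the step so every
    layer moves by a comparable magnitude regardless of its gradient norm.
    """
    k = grad.size
    alpha = eta * np.sqrt(k) / (np.linalg.norm(grad) + eps)
    return F - global_lr * alpha * grad
```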

## Experiment

### Experiments on MNIST

Using LeNet-5-BN on MNIST
Images resized to 32 × 32
Convolutional filters, and the multiplications in FC layers, are replaced with adder filters
| Network | #Add | #Mul | Accuracy | Latency |
| --- | --- | --- | --- | --- |
| CNN (LeNet-5-BN) | ~435K | ~435K | 99.4% | ~2.6M |
| AdderNet | ~870K | ~0 | 99.4% | ~1.7M |
Much lower latency with similar accuracy

### Experiments on CIFAR

Unlike BNN, AdderNet preserves accuracy while still using no multiplications

### Experiments on ImageNet

Tested on both shallow and deep models
AdderNet shows better accuracy than BNN
Adder filters can actually extract good feature maps