AdderNet: Do We Really Need Multiplications in Deep Learning?

2021/07/04 06:53


Convolution == cross-correlation to measure the similarity between input feature and convolution filters
replace convolutions with additions to reduce the computation costs
use L1-norm distance between filters and input feature
with a special back-propagation approach
propose an adaptive learning rate strategy


Convolutions take advantage of billions of floating-point multiplications on GPUs
Too expensive to run on mobile devices

Simplifying approaches to minimize the costs

BinaryConnect: binary weight
BNN: binary weight, binary activations
low bit-width gradient of binarized networks
Binarizing filters of deep neural networks significantly reduces the computation cost

Drawbacks of binarized networks

The original recognition accuracy is not preserved
Unstable training step
Slower convergence speed with a small learning rate
Let's reduce the computation cost through the replacement of computing operation!

Related works

Network pruning
Remove redundant weights of filters
Efficient Blocks Design
Knowledge Distillation

Networks without multiplication

Similarity in the network operation

"Template matching"
Y indicates the similarity between the filter and the input feature
S() is a pre-defined similarity measure
if S(x, y) = x × y, then it becomes convolution
if d = 1, then it becomes a FC layer
In the end, all that matters is how to measure similarity between the input and the filter

Adder Networks

Use distance metrics that maximize the use of additions
By calculating L1 distance between the filter and the input feature, Eq1 becomes Eq2
Subtraction can be implemented as addition using complement code
Different outputs of a convolution filter and an adder filter
Output of a convolution filter could be positive and negative (a weighted summation of values in the input feature map)
Output of an adder filter is always non-positive (it is the negative of an L1 distance)
Use batch normalization
Normalize the output of an adder filter to an appropriate range
all the activation functions used in conventional CNN can be applied
Multiplication cost in batch normalization is significantly lower than that of convolution
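The two output types above can be sketched in a few lines of NumPy (function names are mine; each function computes a single filter response at one spatial location, not a full sliding-window layer):

```python
import numpy as np

def conv_filter_output(X, F):
    """Cross-correlation: a weighted sum of the input patch X with
    filter F. The result can be positive or negative."""
    return np.sum(X * F)

def adder_filter_output(X, F):
    """Adder filter: the negative L1 distance between the input patch
    X and filter F. Uses only additions/subtractions, and the result
    is always non-positive (zero only when X == F)."""
    return -np.sum(np.abs(X - F))
```

Because adder outputs live in (-inf, 0], batch normalization is applied afterwards to bring them back to a range where the usual activation functions behave well.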

Optimization of back propagation

Partial derivatives of output features Y with respect to the filters F
Partial derivatives of Y with respect to F
sgn = sign function which returns only -1, 0, 1
Eq4 (the sign of X − F) amounts to a signSGD update of the L2-norm, in contrast to the full-precision gradient of Eq5
Limitation of signSGD
SignSGD almost never takes the direction of steepest descent
The direction of signSGD only gets worse as dimensionality grows
Therefore, by utilizing the full-precision gradient, the filters can be updated precisely, as in Eq5
Difference of two derivatives (Y by F, Y by X)
Derivatives of Y w.r.t. F only affect the gradient of F itself
Derivatives of Y w.r.t. X influence not only the gradient of its own layer but also those of the preceding layers (by the chain rule)
With the full-precision gradient, the magnitude of the derivative of Y w.r.t. X could exceed [-1, 1], so it is clipped with HardTanh (HT)
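A minimal NumPy sketch of the two modified gradients, assuming the paper's choices of a full-precision gradient for the filter and a HardTanh-clipped one for the input (function names are mine):

```python
import numpy as np

def grad_wrt_filter(X, F):
    """Full-precision gradient of Y = -sum|X - F| w.r.t. the filter:
    X - F is used instead of the true (sign) gradient sgn(X - F),
    avoiding the poor descent directions of signSGD."""
    return X - F

def hard_tanh(x):
    """HT: identity on [-1, 1], saturated to -1/+1 outside."""
    return np.clip(x, -1.0, 1.0)

def grad_wrt_input(X, F):
    """Gradient w.r.t. the input, clipped with HardTanh so that the
    magnitudes passed backward through the chain rule stay in [-1, 1]
    and do not explode across layers."""
    return hard_tanh(F - X)
```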

Adaptive learning rate scaling

Difference of variance of outputs

In conventional CNNs, provided that the weights and the input features are independent
The variance of the output is
The variance of the output in AdderNet
AdderNet tends to produce a much larger output variance than a conventional CNN
Filter variance is usually small (e.g., 10^-3, 10^-4) → the CNN output variance involves a product with Var[F] (small), while the AdderNet output variance involves a sum with Var[F] (big)
Hence the variance of outputs in AdderNet is much bigger than in conventional CNNs
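The variance gap can be checked with a quick Monte Carlo estimate (a sketch; the kernel size, channel count, and variances below are illustrative values I chose, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, cin, n = 3, 64, 10_000            # kernel size, input channels, trials
var_x, var_f = 1.0, 1e-3             # filter variance is typically tiny

# n random input patches and one random filter, both flattened
X = rng.normal(0.0, np.sqrt(var_x), size=(n, d * d * cin))
F = rng.normal(0.0, np.sqrt(var_f), size=(d * d * cin,))

conv_out = X @ F                          # weighted summation (CNN)
adder_out = -np.abs(X - F).sum(axis=1)    # negative L1 distance (AdderNet)

# Var[conv] scales with var_x * var_f (small product);
# Var[adder] scales with var_x + var_f (large sum)
print(conv_out.var(), adder_out.var())
```

With these numbers the adder outputs show a variance orders of magnitude larger than the convolution outputs, which is why the gradient magnitudes after batch normalization shrink and an adaptive learning rate becomes necessary.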

Adoption of adaptive learning rate

Batch normalization
Gradient of loss w.r.t X
The larger variance of Y (σ) in AdderNet makes Eq11 much smaller than in conventional CNNs
Due to difference of the norm of gradient for each layer, can not just raise learning rate
Need an adaptive learning rate for different layers in AdderNets
Update for each adder layer l is
(γ is the global learning rate, α_l is the local learning rate for layer l, and the last term is the gradient of the filter)
Local learning rate is defined as,
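The per-layer update can be sketched as follows (a minimal sketch: `eta` stands for the hyper-parameter scaling the local rate, and the small epsilon guarding against a zero-norm gradient is my addition):

```python
import numpy as np

def local_lr(grad, eta=1.0):
    """Local learning rate for one adder layer:
    alpha_l = eta * sqrt(k) / ||grad||_2, where k is the number of
    elements in the filter. This normalizes away the per-layer
    differences in gradient norm."""
    k = grad.size
    return eta * np.sqrt(k) / (np.linalg.norm(grad) + 1e-12)

def update_filter(F, grad, global_lr, eta=1.0):
    """One adder-layer update: F <- F - global_lr * alpha_l * grad,
    where global_lr is the global rate shared by all layers."""
    return F - global_lr * local_lr(grad, eta) * grad
```

Because α_l divides by the gradient's own L2 norm, every layer receives an update of comparable magnitude regardless of how small its raw gradients are.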
Total algorithm


Experiments on MNIST

using LeNet-5-BN on MNIST
Images resized to 32 × 32
Replace the convolutional filters and the multiplications in FC layers with adder filters
LeNet-5-BN with adder
Much lower latency with similar accuracy

Experiments on CIFAR

AdderNet preserves accuracy with no multiplications, whereas BNN loses accuracy

Experiments on ImageNet

Tested on both relatively shallow and deep models
AddNN shows better accuracy than BNN
Adder filters can indeed extract good feature maps

Visualization Results

AdderNets utilize L1 distance instead of cross correlation
Classes are divided by angle in conventional CNNs
Classes are clustered following L1-norm

Visualization on filters

They show similar patterns

Visualization on distribution of weights

Ablation Study

Without changing its learning rate, the network can hardly be trained due to small gradients
ILR (Increased Learning Rate) with a value of 100, which is best of {10, 50, 100, 200, 500}