AdderNet
• CVPR'20
• Convolution == cross-correlation, measuring the similarity between the input feature and the convolution filters
• Replace convolutions with additions to reduce the computation cost
• Use the L1-norm distance between the filters and the input feature
• With a special back-propagation approach
• Propose an adaptive learning rate strategy
Introduction
• Convolutions take advantage of billions of floating-point multiplications on GPUs
  ◦ Too expensive to run on mobile devices
Simplifying approaches to minimize the cost
• BinaryConnect: binary weights
• BNN: binary weights, binary activations
• Low bit-width gradients for binarized networks
Binarizing the filters of deep neural networks significantly reduces the computation cost
Drawbacks of binarized networks
• The original recognition accuracy is not preserved
• Unstable training
• Slower convergence with a small learning rate
Let's reduce the computation cost by replacing the computing operation itself!
Related works
• Network pruning
  ◦ Remove redundant weights of filters
• Efficient blocks design
• Knowledge distillation
• Networks without multiplication
Similarity in the network operation
"Template matching"
• Y indicates the similarity between the filter and the input feature
S(·,·) is a pre-defined similarity measure
If S(x, y) = x × y, then it becomes a convolution
If d = 1, then it becomes a fully-connected (FC) layer
In the end, all that matters is how to measure the similarity between the input and the filter (Eq. 1, written out below)
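For reference, the general template-matching form (Eq. 1 in the paper) can be written roughly as follows; this is recalled from the paper, so treat the exact index bounds as approximate:

    Y(m,n,t) = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} S\big(X(m+i,\, n+j,\, k),\; F(i,j,k,t)\big)

where X ∈ R^{H×W×c_in} is the input feature, F ∈ R^{d×d×c_in×c_out} are the filters, and S(·,·) is the similarity measure.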
Adder Networks
Use a distance metric that maximizes the use of additions
By using the L1 distance between the filter and the input feature, Eq. 1 becomes Eq. 2 (written out below)
Subtraction can be implemented as addition with the help of complement code
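Eq. 2, the adder-layer output, as recalled from the paper (same approximate indexing as Eq. 1 above):

    Y(m,n,t) = -\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \big|\, X(m+i,\, n+j,\, k) - F(i,j,k,t) \,\big|

The output is the negative L1 distance between each input patch and the filter, so it needs only additions and subtractions.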
• Different outputs of a convolution filter and an adder filter
  ◦ The output of a convolution filter can be either positive or negative (a weighted summation of the values in the input feature map)
  ◦ The output of an adder filter is always negative (the negative of a sum of absolute values)
• Use batch normalization
  ◦ Normalizes the output of an adder filter to an appropriate range
  ◦ All the activation functions used in conventional CNNs can then be applied
  ◦ The multiplication cost of batch normalization is significantly lower than that of convolution (see the forward-pass sketch after this list)
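To make the operation concrete, here is a minimal PyTorch-style sketch of the adder layer's forward pass. It is a naive, memory-hungry broadcast version for illustration only; the class name AdderConv2d and all shape choices are my own, not the authors' optimized implementation, and the backward pass here is plain autograd rather than the modified gradients described in the next section.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdderConv2d(nn.Module):
        """Illustrative adder layer: Y = -sum |X_patch - filter| over each window."""
        def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
            super().__init__()
            self.kernel_size, self.stride, self.padding = kernel_size, stride, padding
            self.weight = nn.Parameter(
                torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.1)

        def forward(self, x):
            n, _, h, w = x.shape
            # Sliding windows: (N, C_in*k*k, L), where L = number of output positions.
            patches = F.unfold(x, self.kernel_size, padding=self.padding, stride=self.stride)
            w_flat = self.weight.view(self.weight.size(0), -1)        # (C_out, C_in*k*k)
            # Negative L1 distance between every patch and every filter (broadcasted).
            diff = patches.unsqueeze(1) - w_flat.view(1, w_flat.size(0), -1, 1)
            out = -diff.abs().sum(dim=2)                              # (N, C_out, L)
            h_out = (h + 2 * self.padding - self.kernel_size) // self.stride + 1
            w_out = (w + 2 * self.padding - self.kernel_size) // self.stride + 1
            return out.view(n, -1, h_out, w_out)

    # Usage: an adder layer followed by batch normalization, matching the design above.
    block = nn.Sequential(AdderConv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
    y = block(torch.randn(2, 3, 32, 32))   # -> shape (2, 16, 32, 32)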
Optimization of back propagation
• Partial derivative of the output feature Y with respect to the filter F
• The exact derivative of the L1 distance is sgn(X − F) (Eq. 4)
  ◦ sgn = the sign function, which returns only -1, 0, or +1
Considering Eq. 5 (the derivative of the L2-norm), Eq. 4 is exactly a signSGD update: it keeps only the sign of the full-precision derivative (both equations written out below)
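As recalled from the paper (indexing approximate), the two gradient choices for the filter are:

    \frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = \operatorname{sgn}\!\big(X(m+i,n+j,k) - F(i,j,k,t)\big) \qquad \text{(Eq. 4, exact L1 derivative)}

    \frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i,n+j,k) - F(i,j,k,t) \qquad \text{(Eq. 5, full-precision gradient used by AdderNet)}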
• Limitation of signSGD
  ◦ signSGD almost never takes the direction of steepest descent
  ◦ The direction of signSGD only gets worse as dimensionality grows
• Therefore, by utilizing the full-precision gradient, the filters can be updated precisely, as in Eq. 5
• Difference between the two derivatives (∂Y/∂F and ∂Y/∂X)
  ◦ The derivative of Y w.r.t. F only affects the gradient of F itself
  ◦ The derivative of Y w.r.t. X influences the gradients not only of its own layer but also of the layers before it (by the chain rule)
• With the full-precision gradient, the magnitude of ∂Y/∂X could explode beyond [-1, 1], so it is clipped to that range (Eq. 6, written out below)
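The clipped gradient with respect to the input, as recalled from the paper (HT denotes the HardTanh function):

    \frac{\partial Y(m,n,t)}{\partial X(m+i,n+j,k)} = \operatorname{HT}\!\big(F(i,j,k,t) - X(m+i,n+j,k)\big), \qquad \operatorname{HT}(z) = \max\!\big(-1,\ \min(1,\ z)\big) \quad \text{(Eq. 6)}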
Adaptive learning rate scaling
Difference of variance of outputs
• In conventional CNNs, provided that the weights and the input features are independent, the variance of the output is proportional to the product Var[X] · Var[F]
• The variance of the output in AdderNet instead depends on the sum Var[X] + Var[F] (both worked out below)
• AdderNet tends to produce a much larger output variance than a conventional CNN
  ◦ The filter variance is usually small (e.g., 10^-3 or 10^-4) → the product (multiplication) is tiny, while the sum (addition) is not
• So the variance of the outputs in AdderNet is much bigger than in conventional CNNs
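Under the usual assumption that X and F are independent, zero-mean and Gaussian, the two output variances work out roughly as follows; this is my reconstruction of the paper's variance equations, so the constant should be double-checked against the original:

    \mathrm{Var}\big[Y_{\mathrm{CNN}}\big] = \sum_{i,j,k} \mathrm{Var}[X \times F] = d^2 c_{in}\, \mathrm{Var}[X]\,\mathrm{Var}[F]

    \mathrm{Var}\big[Y_{\mathrm{AdderNet}}\big] = \sum_{i,j,k} \mathrm{Var}\big[\,|X - F|\,\big] = \Big(1 - \tfrac{2}{\pi}\Big)\, d^2 c_{in}\, \big(\mathrm{Var}[X] + \mathrm{Var}[F]\big)

With Var[F] around 10^-3, the product in the CNN case collapses, while the sum in the AdderNet case stays on the order of Var[X], which is why the AdderNet output variance is so much larger.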
Adoption of adaptive learning rate
• Batch normalization: y = γ(x − μ_B)/σ_B + β (Eq. 10)
• Gradient of the loss w.r.t. x, the input of batch normalization (Eq. 11): it carries a factor γ/σ_B, so the gradient shrinks as the batch variance grows
• The larger variance of Y (i.e., a larger σ_B) in AdderNet makes the gradient in Eq. 11 much smaller than in conventional CNNs
• Because the norm of the gradient differs from layer to layer, we cannot simply raise the global learning rate
• Need an adaptive learning rate for different layers in AdderNets
• The update for each adder layer is ΔF_l = γ × α_l × ΔL(F_l) (Eq. 12)
  (γ is the global learning rate, α_l is the local learning rate of layer l, ΔL(F_l) is the gradient of the filters in that layer)
• The local learning rate is defined as α_l = η√k / ‖ΔL(F_l)‖₂, where k is the number of elements in F_l and η is a hyper-parameter (Eq. 13)
• Total algorithm: train with these gradients and the layer-wise adaptive learning rate (a small sketch of the per-layer update follows below)
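A minimal sketch of the per-layer update implied by these formulas; the function and argument names (adaptive_lr_step, eta) are my own, not taken from the paper's code:

    import math
    import torch

    def adaptive_lr_step(filters, grads, global_lr, eta):
        """One update step: dF_l = global_lr * alpha_l * grad_l for every adder layer l,
        with the local learning rate alpha_l = eta * sqrt(k) / ||grad_l||_2."""
        with torch.no_grad():
            for f, g in zip(filters, grads):
                k = g.numel()                                        # number of elements in F_l
                alpha = eta * math.sqrt(k) / (g.norm(p=2) + 1e-12)   # local learning rate
                f -= global_lr * alpha * g                           # scaled gradient step

Here filters and grads would be the adder filters of each layer and their (full-precision) gradients.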
Experiment
Experiments on MNIST
• Using LeNet-5-BN on MNIST
  ◦ Images resized to 32 × 32
• Replace the convolutional filters and the multiplications in the FC layers with adder filters
• Much lower latency with similar accuracy
Experiments on CIFAR
• AdderNet preserves accuracy with no multiplications, in contrast to BNN
Experiments on ImageNet
• Tested on both a rather shallow model and a deep model
• AdderNet shows better accuracy than BNN
  ◦ Adder filters can actually extract good feature maps
Visualization Results
• AdderNets utilize the L1 distance instead of cross-correlation
• In conventional CNNs, classes are divided by their angles
• In AdderNets, classes are clustered according to the L1-norm
Visualization on filters
• The filters of AdderNets and CNNs show similar patterns
Visualization on distribution of weights
• AdderNet weights follow a Laplace distribution, while CNN weights follow a Gaussian distribution
Ablation Study
• Without changing the learning rate, the networks can hardly be trained due to the small gradients
• ILR (increased learning rate) works best with a value of 100, chosen from {10, 50, 100, 200, 500}