Back to Simplicity: How to Train Accurate BNNs from Scratch?

Contents

Back to Simplicity: How to Train Accurate BNNs from Scratch?

Introduction
Related Work
Study on Common Techniques
Proposed Approach

Golden Rules for Training Accurate BNNs
ResNetE
BinaryDenseNet

Main Results

Paper link
Code link

Introduction

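The main contributions are:

1: The paper gives concrete guidance on how to train a highly accurate binary network and shows that several previously used techniques are less effective than assumed.

2: It proposes general design guidelines for BNNs and, building on them, introduces BinaryDenseNet.

3: It releases open-source code.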

Related Work

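Recent work falls into three main categories: compact network design, networks with quantized weights, and networks with quantized weights and activations.

Compact Network Design: replacing $3\times 3$ filters with $1\times 1$ filters, depth-wise separable convolution, and channel shuffling. However, these approaches all still rely on GPUs and do not yield speedups on CPUs.

Quantized Weights and Real-valued Activations: BinaryConnect (BC), Binary Weight Network (BWN), and Trained Ternary Quantization (TTQ). Memory usage drops and the accuracy loss is small, but the speedup is limited.

Quantized Weights and Activations: DoReFa-Net, High-Order Residual Quantization (HORQ), and SYQ achieve good results with 1-bit weights and multi-bit activations.

Binary Weights and Activations: Binarized Neural Network (BNN), XNOR-Net, ABC-Nets.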

Study on Common Techniques

Implementation of Binary Layers

Values are binarized with the sign function, and gradients are backpropagated with the straight-through estimator (STE):
$\operatorname{sign}(x)=\left\{\begin{array}{l}{+1 \text { if } x \geq 0} \\ {-1 \text { otherwise }}\end{array}\right.\tag{1}$

$\begin{array}{c}{\text { Forward: } r_{o}=\operatorname{sign}\left(r_{i}\right)} \\ {\text { Backward: } \frac{\partial c}{\partial r_{i}}=\frac{\partial c}{\partial r_{o}} 1_{\left|r_{i}\right| \leq t_{\text {clip }}}}\end{array}\tag{2}$
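
As a rough illustration of Eqs. (1) and (2), the following pure-Python sketch binarizes a list of values and applies the clipped straight-through gradient. The function names and the default `t_clip=1.0` are assumptions made for this example; they are not taken from the authors' code.

```python
def sign(x):
    """Eq. (1): +1 for x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0


def binarize_forward(r_i):
    """Forward pass of Eq. (2): element-wise sign."""
    return [sign(v) for v in r_i]


def binarize_backward(r_i, grad_out, t_clip=1.0):
    """Backward pass of Eq. (2): pass the upstream gradient through unchanged
    where |r_i| <= t_clip and zero it elsewhere (t_clip=1.0 is an assumed default)."""
    return [g if abs(v) <= t_clip else 0.0 for v, g in zip(r_i, grad_out)]


inputs = [-0.5, 0.0, 2.0]      # made-up pre-binarization values
upstream = [0.1, 0.2, 0.3]     # made-up gradients arriving from the next layer
print(binarize_forward(inputs))
print(binarize_backward(inputs, upstream))
```

The last input lies outside the clipping range, so its gradient is dropped while the other two pass through unchanged.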

Scaling Methods

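Based on their experiments, the authors argue that the BN layer already absorbs the effect of scaling, so scaling followed by BN performs the same as BN alone. They therefore drop the scaling factor.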
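
One way to see why a scaling factor becomes redundant is that normalization cancels any positive per-channel scale. The sketch below checks this numerically for a single channel; the `batch_norm` helper (without the learned affine part), the sample values, and `alpha` are all assumptions made for illustration.

```python
import math


def batch_norm(values, eps=1e-12):
    """Normalize to zero mean and unit variance (learned affine part omitted)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


conv_out = [3.0, -1.0, 5.0, -7.0]   # made-up outputs of a binary convolution
alpha = 0.37                        # hypothetical positive scaling factor

plain = batch_norm(conv_out)
scaled = batch_norm([alpha * v for v in conv_out])

# The two normalized outputs agree up to the tiny eps term.
max_diff = max(abs(a - b) for a, b in zip(plain, scaled))
print(max_diff)
```

This only covers a single positive scalar per channel; it is meant as intuition for the experimental finding, not as a proof for every scaling scheme.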

Full-Precision Pre-Training

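The authors compare three training setups: fully from scratch, fine-tuning a full-precision ResNetE18 pre-trained with ReLU, and fine-tuning one pre-trained with clip as the activation function. Clip performs worst, and training from scratch is slightly better than starting from the ReLU model. The authors attribute this to the fact that BNNs do not use ReLU, so the pre-trained model transfers poorly.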

Backward Pass of the Sign Function
$\frac{\partial c}{\partial r_{i}}=\frac{\partial c}{\partial r_{o}} 1_{\left|r_{i}\right| \leq t_{\text {clip }}} \cdot\left\{\begin{array}{l}{2-2 r_{i} \text { if } r_{i} \geq 0} \\ {2+2 r_{i} \text { otherwise. }}\end{array}\right.\tag{3}$
This seems to help mainly when fine-tuning; in the general case it makes little difference.
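
For concreteness, Eq. (3) can be sketched in pure Python as follows. The function name and the default `t_clip=1.0` are assumptions for this example.

```python
def approxsign_backward(r_i, grad_out, t_clip=1.0):
    """Backward pass of Eq. (3): inside the clipping range the upstream
    gradient is multiplied by a piecewise-linear slope, 2 - 2*r for r >= 0
    and 2 + 2*r otherwise; outside the range it is zero."""
    grads = []
    for v, g in zip(r_i, grad_out):
        if abs(v) > t_clip:
            grads.append(0.0)
        else:
            slope = 2.0 - 2.0 * v if v >= 0 else 2.0 + 2.0 * v
            grads.append(g * slope)
    return grads


# Made-up inputs with an upstream gradient of 1 everywhere.
print(approxsign_backward([0.0, 0.5, -0.25, 1.5], [1.0, 1.0, 1.0, 1.0]))
```

Compared with the plain STE, the gradient is largest near zero and shrinks linearly toward the edge of the clipping range.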

Proposed Approach

Golden Rules for Training Accurate BNNs

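The core idea is maintaining rich information flow of the network.

Not every real-valued network is suitable for binarization. Compact networks, for example, are a poor fit, because the two design philosophies are mutually exclusive: one is eliminating redundancy, the other is compensating information loss.

Avoid bottleneck designs where possible (in a bottleneck, $1\times 1$ convolutions are used to reduce dimensionality).

To preserve information flow, seriously consider using full-precision downsampling layers.

Shortcut connections are especially important for BNNs.

To overcome bottlenecks in the information flow, increase the width and depth of the network appropriately.

The previously proposed scaling factor, approxsign, and FP pre-training bring little benefit; the network can simply be trained from scratch.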

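Consider the main weakness of a BNN: in theory, its information density is 32 times lower than that of a full-precision network, so other means are needed to compensate:

1: Use shortcut connections

2: Reduce bottlenecks

3: Keep certain crucial layers in full precision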
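
The factor of 32 mentioned above corresponds to storing each value in 1 bit instead of 32. A small sketch of the arithmetic, using a hypothetical convolution layer whose shape is chosen only for illustration:

```python
def weight_bytes(num_weights, bits_per_weight):
    """Storage for num_weights values, rounded up to whole bytes."""
    return (num_weights * bits_per_weight + 7) // 8


# Hypothetical 3x3 convolution with 64 input and 64 output channels.
num_weights = 3 * 3 * 64 * 64

fp32_bytes = weight_bytes(num_weights, 32)
binary_bytes = weight_bytes(num_weights, 1)
print(fp32_bytes, binary_bytes, fp32_bytes // binary_bytes)
```

For this layer the full-precision weights take 147456 bytes and the binary weights 4608 bytes, a ratio of 32.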

ResNetE

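ResNetE makes two changes to ResNet:
1: The bottleneck is removed: the three convolutions (kernel sizes 1, 3, 1) are replaced with two $3\times3$ convolutions. (This increases the model size and the number of parameters.)
2: A full-precision downsampling convolution layer is used.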
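
To get a feel for why removing the bottleneck increases the parameter count, the sketch below compares the two block layouts. It assumes a block width of 256 channels, a bottleneck reduction of 4, and that the two $3\times3$ convolutions run at the full block width; these numbers are illustrative and not taken from the paper.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution without bias."""
    return k * k * c_in * c_out


def bottleneck_params(channels, reduction=4):
    """1x1 reduce -> 3x3 -> 1x1 expand, as in a ResNet bottleneck block."""
    mid = channels // reduction
    return (conv_params(1, channels, mid)
            + conv_params(3, mid, mid)
            + conv_params(1, mid, channels))


def two_conv3x3_params(channels):
    """Two 3x3 convolutions at full width, with no bottleneck."""
    return 2 * conv_params(3, channels, channels)


channels = 256  # assumed block width, for illustration only
print(bottleneck_params(channels), two_conv3x3_params(channels))
```

Under these assumptions the bottleneck block has 69632 weights and the two-convolution block 1179648, although with binary weights each of them costs only one bit.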

BinaryDenseNet

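Since ResNet works well, the authors also try DenseNet, because DenseNet has even more shortcut connections than ResNet. However, removing the bottleneck layers does not work well for DenseNet. The authors attribute this to the limited representation capacity of binary layers. There are two ways to address it: increase the growth rate parameter k, which is the number of newly concatenated features from each layer, or use a larger number of blocks.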
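
The two knobs, growth rate k and number of blocks, both control how many channels a dense stage accumulates. A minimal sketch of that bookkeeping follows; the function names and all channel counts are hypothetical.

```python
def dense_out_channels(c_in, growth_rate, num_blocks):
    """Each block concatenates growth_rate new feature maps onto its input."""
    return c_in + growth_rate * num_blocks


def block_input_channels(c_in, growth_rate, num_blocks):
    """Input width seen by each successive block."""
    return [c_in + growth_rate * i for i in range(num_blocks)]


# Hypothetical settings: a larger k with few blocks, or a smaller k with more blocks.
print(dense_out_channels(64, 64, 4))
print(dense_out_channels(64, 32, 8))
print(block_input_channels(64, 32, 4))
```

Both settings end at 320 channels, so halving k while doubling the number of blocks keeps the final width the same and adds more shortcut connections along the way.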

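Another difference between BinaryDenseNet and ResNetE lies in the downsampling layer. Again there are two options. The first is a full-precision downsampling layer; to reduce computation, $\text{MaxPool}\rightarrow \text{ReLU}\rightarrow 1\times1\text{-Conv}$ is used in place of $1\times1\text{-Conv}\rightarrow \text{AvgPool}$. The second is to use a binary downsampling conv-layer with a lower reduction rate, or even no reduction at all, in place of the full-precision layer.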
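
Pooling before the $1\times1$ convolution saves computation because the convolution then runs on a smaller feature map. The sketch below counts multiply-accumulate operations for both orderings, assuming a stride-2 pooling step and a made-up layer shape, and ignoring the cost of the pooling and ReLU themselves.

```python
def conv1x1_macs(height, width, c_in, c_out):
    """Multiply-accumulates of a 1x1 convolution over a height x width map."""
    return height * width * c_in * c_out


def conv_then_pool_macs(height, width, c_in, c_out):
    """1x1-Conv -> AvgPool: the convolution sees the full-resolution map."""
    return conv1x1_macs(height, width, c_in, c_out)


def pool_then_conv_macs(height, width, c_in, c_out, stride=2):
    """MaxPool -> ReLU -> 1x1-Conv: the convolution sees the pooled map."""
    return conv1x1_macs(height // stride, width // stride, c_in, c_out)


# Hypothetical transition layer: 56x56 map, 256 -> 128 channels.
before = conv_then_pool_macs(56, 56, 256, 128)
after = pool_then_conv_macs(56, 56, 256, 128)
print(before, after, before // after)
```

With stride-2 pooling the convolution cost drops by a factor of 4 in this example.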

Main Results
