Quantization Methods for Deep Learning Models (Paper Notes and TensorFlow Lite Quantization)

Quantization

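Method overview

Results

The paper (reference 1) runs experiments on ImageNet (1.2M training images, 1K categories, 50K validation images) with several models and compares the effect of quantization from several angles. The results are as follows: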

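Forward results
AlexNet model:

| Model | Weights | Metric | No quantization | Q tilt (ReLU) | sign tilt (hard tanh) | Q (HWGQ) + ReLU backward | sign + ReLU backward |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AlexNet | Full weight | top-1 | 55.7 | 55.7 | 46.7 | 49.5 | |
| | Full weight | top-5 | 79.3 | 79.3 | 71 | 73.7 | |
| | Binary weight | top-1 | 52.4 | 53.9 | 43.9 | 46.8 | 39.5 |
| | Binary weight | top-5 | 75.9 | 77.3 | 68.3 | 71 | 63.6 |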

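a) Comparing full weights with binary weights, the gap is small, which confirms that weight compression works well;
b) ReLU clearly outperforms sign;
c) Quantizing the activation function does cost more accuracy than quantizing the weights.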

Backward results
With all weights binarized and HWGQ as the forward function, the choices of backward function and the corresponding results are as follows.

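The clipped and log-tailed backward functions perform similarly, and both beat using plain ReLU as the backward function, because their gradients mismatch the forward HWGQ less.
no-opt takes a model that was already trained with binary weights and then quantizes its activation functions. Because that model was never trained with activation quantization in place, it performs poorly.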
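
The mismatch argument above can be made concrete with a small NumPy sketch of the three backward gradients. This is an illustration under assumptions, not the paper's code: the function names and the value of `q_max` (the largest quantization level, where the forward quantizer saturates) are hypothetical, and the log-tailed form with `tau = q_max - 1` follows my reading of reference 1.

```python
import numpy as np

def relu_grad(x):
    """Gradient of plain ReLU: 1 for x > 0, else 0."""
    return (x > 0).astype(np.float64)

def clipped_relu_grad(x, q_max):
    """Gradient of ReLU clipped at q_max: 1 on (0, q_max], else 0."""
    return ((x > 0) & (x <= q_max)).astype(np.float64)

def log_tailed_relu_grad(x, q_max):
    """Gradient of log-tailed ReLU: 1 on (0, q_max], 1 / (x - tau) above q_max."""
    tau = q_max - 1.0  # assumed offset; keeps the gradient equal to 1 at x = q_max
    grad = np.zeros_like(x, dtype=np.float64)
    grad[(x > 0) & (x <= q_max)] = 1.0
    tail = x > q_max
    grad[tail] = 1.0 / (x[tail] - tau)
    return grad

x = np.array([-1.0, 0.5, 2.0, 4.0, 10.0])
q_max = 2.0  # hypothetical largest quantization level

print(relu_grad(x))                    # stays at 1 for large inputs
print(clipped_relu_grad(x, q_max))     # drops to 0 above q_max
print(log_tailed_relu_grad(x, q_max))  # decays gradually above q_max
```

For inputs far above `q_max` the forward output no longer changes, yet plain ReLU still passes a gradient of 1. The clipped and log-tailed variants suppress that gradient, which is the smaller mismatch the text refers to.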
Effect of bit width
With binarized weights, HWGQ forward, and clipped backward, using different bit widths for the forward quantization gives the following results.

Accuracy improves as the bit width grows, and uniform and non-uniform quantization perform similarly.
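
As a rough illustration of why more bits help, the sketch below quantizes the positive half of Gaussian samples with a uniform quantizer and reports the best mean squared error per bit width. It measures quantization error only, not network accuracy. The level count `2 ** bits - 1`, the step grid, and the function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def uniform_quantize(x, step, n_levels):
    """Round to the nearest of {0, step, ..., n_levels * step}; negatives map to 0."""
    idx = np.clip(np.round(x / step), 0, n_levels)
    return idx * step

def best_mse(x, bits, steps):
    """Smallest MSE against ReLU(x) over a grid of candidate step sizes."""
    n_levels = 2 ** bits - 1  # assumption: one code is reserved for zero
    target = np.maximum(x, 0.0)
    return min(
        float(np.mean((uniform_quantize(x, s, n_levels) - target) ** 2))
        for s in steps
    )

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)     # stand-in for pre-activation values
steps = np.linspace(0.05, 2.0, 40)   # hypothetical search grid for the step size

errors = {bits: best_mse(x, bits, steps) for bits in (1, 2, 3, 4)}
for bits, err in errors.items():
    print(f"{bits}-bit: mse = {err:.5f}")
```

The error shrinks with every added bit because a finer grid can cover the same range with smaller steps.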
Overall comparison
Binarized weights, HWGQ forward, clipped backward.

Quantization in TensorFlow

Pretrained model

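How to quantize a model

TensorFlow ships with production-grade support for eight-bit computation. It can also convert a floating-point model into an equivalent graph that uses quantized computation for inference. The following example converts the latest GoogLeNet into an eight-bit representation: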

curl -L "https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz" |
  tar -C tensorflow/examples/label_image/data -xz
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=tensorflow/examples/label_image/data/inception_v3_2016_08_28_frozen.pb \
  --out_graph=/tmp/quantized_graph.pb \
  --inputs=input \
  --outputs=InceptionV3/Predictions/Reshape_1 \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
    remove_nodes(op=Identity, op=CheckNumerics) fold_constants(ignore_errors=true)
    fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
    strip_unused_nodes sort_by_execution_order'

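This produces a new model that runs the same operations as the original but uses eight-bit computation internally. The new file is roughly a quarter of the original size, since each 32-bit float weight is now stored as an 8-bit value. You can still feed it exactly the same inputs, and the results should be equivalent.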
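
The quarter-size figure follows from simple storage arithmetic. Here is a minimal sketch, assuming a hypothetical convolution kernel and a min/max 8-bit encoding; it does not reproduce what `quantize_weights` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3x3 convolution kernel with 64 input and 128 output channels.
weights = rng.standard_normal((3, 3, 64, 128)).astype(np.float32)

w_min, w_max = float(weights.min()), float(weights.max())
# Map [w_min, w_max] linearly onto the 256 codes 0..255.
codes = np.round((weights - w_min) / (w_max - w_min) * 255).astype(np.uint8)

float_bytes = weights.nbytes
quant_bytes = codes.nbytes + 2 * 4  # uint8 codes plus float32 min and max
print(float_bytes, quant_bytes, quant_bytes / float_bytes)
```

The only overhead is the pair of range values per tensor, so the ratio stays very close to 1/4 for any tensor of realistic size.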

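Implementation (8-bit)
1. Structure
  Quantization is implemented by converting common operations into equivalent eight-bit versions. The operations covered include convolution, matrix multiplication, activation functions, pooling, and concatenation. The conversion script first replaces each known operation with its quantized equivalent. It then adds subgraphs containing conversion functions before and after the operation, which convert the input from float to 8-bit and the output from 8-bit back to float. Here is the ReLU example:
  
  After conversion, it looks like the following figure:
  
  After conversion, the inputs and outputs are still float; only the intermediate computation is done in 8-bit.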
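
The float in, 8-bit compute, float out structure can be sketched in NumPy for the ReLU case. All function names here are hypothetical and do not correspond to TensorFlow ops. The sketch also assumes the input is not constant, and it widens the range to include 0.0 so that zero has a code.

```python
import numpy as np

def quantize(x, x_min, x_max):
    """Float -> uint8 codes, linear over [x_min, x_max]."""
    scale = (x_max - x_min) / 255.0
    return np.clip(np.round((x - x_min) / scale), 0, 255).astype(np.uint8)

def dequantize(q, x_min, x_max):
    """uint8 codes -> float, inverse of quantize up to rounding error."""
    scale = (x_max - x_min) / 255.0
    return q.astype(np.float32) * scale + x_min

def quantized_relu(q, x_min, x_max):
    """ReLU on codes: clamp every code below the code that represents 0.0."""
    zero_code = quantize(np.array(0.0), x_min, x_max)
    return np.maximum(q, zero_code)

def relu_via_8bit(x):
    """Float input and output; the ReLU itself runs on 8-bit codes."""
    x_min = min(float(x.min()), 0.0)  # widen the range so it contains 0.0
    x_max = max(float(x.max()), 0.0)
    q = quantize(x, x_min, x_max)
    q = quantized_relu(q, x_min, x_max)
    return dequantize(q, x_min, x_max)

x = np.array([-10.0, -2.5, 0.0, 7.5, 30.0], dtype=np.float32)
print(relu_via_8bit(x))
print(np.maximum(x, 0.0))  # float reference
```

The 8-bit result differs from the float ReLU by at most about half a quantization step, which is the noise the network is expected to tolerate.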
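2. (de)quantization
  
  This section describes the logic of the quantize and dequantize functions.
  
  quantize takes the min and max of the input and maps them to the smallest (0) and largest (255) quantized values. The interval [min, max] is split evenly into 255 sub-intervals, and each input value is mapped to its corresponding position in that grid. Dequantization runs the same mapping in reverse. For example, given an input whose maximum is 30.0 and minimum is -10.0, the quantized values are:

  | Quantized | Float |
  | --------- | ----- |
  | 0 | -10.0 |
  | 255 | 30.0 |
  | 128 | 10.0 |

  TensorFlow gives the following rationale for this approach:
  1. The values of weight and activation tensors usually fall within a relatively small range (weights: -15 to 15, activations: -500 to 1000);
  2. Neural networks tolerate noise well, so quantizing values to a smaller set does not change the overall result much;
  3. Quantization makes dot products considerably more efficient to compute.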
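
The quantize and dequantize mapping described above can be sketched as follows. This is a minimal NumPy version of the linear min/max scheme from the text, with hypothetical function names; TensorFlow's actual kernels may round or adjust the range differently.

```python
import numpy as np

def quantize(x, x_min, x_max):
    """Map floats in [x_min, x_max] linearly onto the integer codes 0..255."""
    scale = (x_max - x_min) / 255.0  # width of one of the 255 sub-intervals
    codes = np.round((np.asarray(x, dtype=np.float64) - x_min) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)

def dequantize(codes, x_min, x_max):
    """Map integer codes 0..255 back to floats in [x_min, x_max]."""
    scale = (x_max - x_min) / 255.0
    return np.asarray(codes, dtype=np.float64) * scale + x_min

x_min, x_max = -10.0, 30.0
print(quantize([-10.0, 30.0], x_min, x_max))      # the range endpoints
print(dequantize([0, 128, 255], x_min, x_max))    # codes from the example
```

With this range, code 128 decodes to about 10.08, which the example table rounds to 10.0.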
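3. Consecutive quantized operations
  
  One optimization applies when several quantized operations appear in a row: there is no need to dequantize and quantize around each one, because the dequantize after one operation and the quantize before the next cancel each other out. For example, see the figure below: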
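
The cancellation can be checked numerically. The sketch below assumes both conversions share the same min/max range, and the function names are hypothetical; under that assumption a dequantize followed by a quantize returns every 8-bit code unchanged.

```python
import numpy as np

def quantize(x, x_min, x_max):
    """Float -> uint8 codes, linear over [x_min, x_max]."""
    scale = (x_max - x_min) / 255.0
    return np.clip(np.round((x - x_min) / scale), 0, 255).astype(np.uint8)

def dequantize(q, x_min, x_max):
    """uint8 codes -> float64 values in [x_min, x_max]."""
    scale = (x_max - x_min) / 255.0
    return q.astype(np.float64) * scale + x_min

x_min, x_max = -10.0, 30.0
codes = np.arange(256, dtype=np.uint8)  # every possible 8-bit code

# Output of one quantized op, dequantized, then re-quantized for the next op.
round_trip = quantize(dequantize(codes, x_min, x_max), x_min, x_max)
print(np.array_equal(round_trip, codes))
```

Since the pair is an identity on the codes, a graph rewrite can drop both nodes and pass the 8-bit tensor straight to the next quantized operation.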

References
1. Deep Learning with Low Precision by Half-wave Gaussian Quantization, https://arxiv.org/abs/1702.00953
2. High performance ultra-low-precision convolutions on mobile devices, https://arxiv.org/abs/1712.02427
3. TensorFlow Lite page: https://www.tensorflow.org/performance/quantization#what_representation_is_used_for_quantized_tensors
