【Paper Reading】Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Conference: ICCV 2019 poster
Source: https://arxiv.org/abs/1908.05900
Unofficial Code: https://github.com/WenmuZhou/PAN.pytorch
Feature: Fast + Curve

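The text detection idea follows PSENet: segmentation + kernel + expansion.
The light-weight backbone is used only for feature extraction; a separately designed module performs feature fusion in place of FPN / U-shape, which reduces computation and improves speed.
The parameters of the expansion step are learned during training. This is the Pixel Aggregation (PA) proposed in the paper, which uses pixel-wise predicted similarity vectors to connect/build text lines.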

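FPEM (Feature Pyramid Enhancement Module) can be viewed as a U-shaped network with low computation cost
Uses 3x3 separable convolution (a 3x3 depthwise conv followed by a 1x1 pointwise conv)
Consists of two phases: up-scale enhancement and down-scale enhancement
input: feature maps of different resolutions taken from different levels of the backbone, i.e. a feature pyramid
output: enhanced feature pyramid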
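
To make the "low computation cost" claim concrete, here is a small sketch that counts the weights of a standard 3x3 convolution against a 3x3 depthwise + 1x1 pointwise pair. The function names and the choice of 128 channels are illustrative assumptions, not values taken from these notes, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise conv plus a 1x1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


# 128 channels is an illustrative assumption, not a value from these notes.
channels = 128
standard = standard_conv_params(channels, channels)
separable = separable_conv_params(channels, channels)
print(standard, separable, round(standard / separable, 2))
# prints: 147456 17536 8.41
```

Under these assumptions the separable layer needs roughly 8x fewer weights, and the multiply-accumulate count per output position shrinks by the same factor.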
Additional advantages over FPN / U-shape:

FFM (Feature Fusion Module) fuses the feature pyramids output by the individual FPEMs

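PA borrows the idea of clustering: kernels can be seen as cluster centers, and text pixels are the samples to be clustered
Whether a text pixel is aggregated into a kernel is decided by measuring the distance between the pixel and the kernel
During training, an aggregation loss enforces this rule:
- $N$ is the number of text instances; $\delta_{agg}$ is a constant, set to 0.5, used to filter out easy samples
- $F(p)$ is the similarity vector of the pixel $p$
- $G(K_i)$ is the similarity vector of the kernel $K_i$, which can be calculated by $\sum_{q \in K_i} F(q) / |K_i|$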
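
With these symbols, the aggregation loss can be written as follows. This is a reconstruction from the definitions above and my reading of the paper, so treat the exact notation as an assumption; $T_i$ (the set of pixels of the $i$-th text instance) and the distance function $D$ are symbols introduced here.

$$
L_{agg} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|T_i|}\sum_{p \in T_i}\ln\left(D(p, K_i) + 1\right), \qquad
D(p, K_i) = \max\left(\lVert F(p) - G(K_i)\rVert - \delta_{agg},\ 0\right)^2
$$

A pixel whose similarity vector already lies within $\delta_{agg}$ of its kernel contributes zero, which is how the constant filters out easy samples.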
In addition, the kernels of different text instances must stay far enough apart; during training, a discrimination loss expresses this rule:
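
A form consistent with this description, as I recall it from the paper (the exact notation is an assumption), is:

$$
L_{dis} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \neq i}}^{N}\ln\left(D(K_i, K_j) + 1\right), \qquad
D(K_i, K_j) = \max\left(\delta_{dis} - \lVert G(K_i) - G(K_j)\rVert,\ 0\right)^2
$$

Here $N$ and $G(K_i)$ have the same meaning as in the aggregation loss, and a pair of kernels is penalized only while their similarity vectors are closer than $\delta_{dis}$.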

$L_{dis}$ keeps the distance between kernels no lower than $\delta_{dis}$
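
The two training losses can be sketched in plain Python as below. This is a minimal illustration, not the authors' implementation: it assumes each loss term has the form $\ln(d^2 + 1)$, where $d$ is a hinge on the Euclidean distance between similarity vectors (distance minus $\delta_{agg}$ for aggregation, $\delta_{dis}$ minus distance for discrimination). The function names, the list-based inputs, and the default `delta_dis=3.0` are assumptions that do not come from these notes.

```python
import math


def mean_vector(vectors):
    """G(K_i): mean of the similarity vectors of the pixels in one kernel."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def aggregation_loss(text_vecs, kernel_vecs, delta_agg=0.5):
    """text_vecs[i] / kernel_vecs[i]: similarity vectors of the pixels in
    text instance T_i / kernel K_i (hypothetical input layout)."""
    total = 0.0
    for t_vecs, k_vecs in zip(text_vecs, kernel_vecs):
        g = mean_vector(k_vecs)
        # Pixels closer than delta_agg to their kernel contribute ln(1) = 0.
        terms = [
            math.log(max(math.dist(f, g) - delta_agg, 0.0) ** 2 + 1.0)
            for f in t_vecs
        ]
        total += sum(terms) / len(terms)
    return total / len(text_vecs)


def discrimination_loss(kernel_vecs, delta_dis=3.0):
    """Penalize pairs of kernels whose mean vectors are closer than delta_dis.
    delta_dis=3.0 is an assumed default, not a value from these notes."""
    centers = [mean_vector(k) for k in kernel_vecs]
    n = len(centers)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                gap = max(delta_dis - math.dist(centers[i], centers[j]), 0.0)
                total += math.log(gap ** 2 + 1.0)
    return total / (n * (n - 1))


# One pixel at distance 1.5 from its kernel: max(1.5 - 0.5, 0)^2 = 1, so ln(2).
loss_agg = aggregation_loss([[[1.5, 0.0]]], [[[0.0, 0.0]]])
# Two kernels at distance 1: max(3 - 1, 0)^2 = 4, so ln(5).
loss_dis = discrimination_loss([[[0.0, 0.0]], [[1.0, 0.0]]])
```

A real training loop would compute the same quantities on tensors so that gradients can flow back to the predicted similarity vectors.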
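At test time, the predicted similarity vectors are used to aggregate pixels into kernels:
i. Find the connected components in the kernel segmentation result; each connected component is a separate kernel
ii. For each kernel $K_i$, merge the surrounding pixels whose similarity vector lies within a threshold Euclidean distance of the similarity vector of $K_i$
iii. Repeat ii until no eligible pixel remains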
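
The three steps above can be sketched as a breadth-first expansion over a pixel grid. This is an illustrative sketch under stated assumptions rather than the paper's post-processing code: masks and similarity vectors are plain nested lists, neighbors are 4-connected, each candidate pixel is compared against the mean similarity vector of the kernel, and the names `neighbors`, `label_kernels`, `aggregate`, and `dist_thresh` are hypothetical.

```python
from collections import deque
import math


def neighbors(y, x, h, w):
    """Yield the 4-connected neighbors of (y, x) inside an h x w grid."""
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w:
            yield ny, nx


def label_kernels(kernel_mask):
    """Step i: label connected components of the kernel mask (0 = background)."""
    h, w = len(kernel_mask), len(kernel_mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if kernel_mask[y][x] and labels[y][x] == 0:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in neighbors(cy, cx, h, w):
                        if kernel_mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def aggregate(text_mask, kernel_mask, sim, dist_thresh):
    """Steps ii and iii: grow each kernel over nearby text pixels whose
    similarity vector is within dist_thresh of the kernel's mean vector."""
    h, w = len(text_mask), len(text_mask[0])
    labels, count = label_kernels(kernel_mask)
    dim = len(sim[0][0])
    sums = [[0.0] * dim for _ in range(count + 1)]
    sizes = [0] * (count + 1)
    queue = deque()
    for y in range(h):
        for x in range(w):
            k = labels[y][x]
            if k:
                sizes[k] += 1
                for d in range(dim):
                    sums[k][d] += sim[y][x][d]
                queue.append((y, x))
    centers = [None] + [[s / sizes[k] for s in sums[k]] for k in range(1, count + 1)]
    while queue:  # the loop ends when no eligible pixel remains
        y, x = queue.popleft()
        k = labels[y][x]
        for ny, nx in neighbors(y, x, h, w):
            if (text_mask[ny][nx] and labels[ny][nx] == 0
                    and math.dist(sim[ny][nx], centers[k]) < dist_thresh):
                labels[ny][nx] = k
                queue.append((ny, nx))
    return labels


# Two touching text instances in one row, told apart by their similarity vectors.
text = [[1, 1, 1, 1, 1, 1, 1, 1]]
kernel = [[0, 1, 0, 0, 0, 0, 1, 0]]
sim = [[[0.0], [0.0], [0.0], [0.0], [5.0], [5.0], [5.0], [5.0]]]
result = aggregate(text, kernel, sim, dist_thresh=1.0)
print(result)  # prints: [[1, 1, 1, 1, 2, 2, 2, 2]]
```

In the toy example the two text instances touch, so the text mask alone cannot separate them; the similarity vectors stop each kernel from growing into its neighbor.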