DensePose: Dense Human Pose Estimation In The Wild

Rıza Alp Güler, Natalia Neverova, Iasonas Kokkinos
DensePose-COCO: a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images (one can imagine how hard the annotation must have been)
DensePose-RCNN: densely regresses part-specific UV coordinates within every human region at multiple frames per second (surprisingly, it still relies on something as basic as regression? And on R-CNN?)

Abstract

dense human pose estimation: dense correspondences between an RGB image and a surface-based representation of the human body
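Annotating even a single image is clearly difficult, so the authors introduce an efficient annotation method
in the wild: in the presence of background, occlusions and scale variations; annotating such data is even harder, let alone predicting on it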

Introduction

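2D image understanding and 3D reconstruction are closely related
Builds on DenseReg: a CNN regresses point correspondences between a 3D model and an RGB image. The problem here is harder than in DenseReg because it is in the wild, where human pose varies much more drastically.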
contributions:
1. introduce the first manually-collected ground truth dataset for the task, by gathering dense correspondences between the SMPL model and persons appearing in the COCO dataset
2. use the resulting dataset to train CNN-based systems that deliver dense correspondence ‘in the wild’, by regressing body surface coordinates at any image pixel, observing a superiority of region-based models over fully-convolutional networks
3. use sparse correspondences defined over a randomly chosen subset of image pixels per training sample to ‘inpaint’ the supervision signal in the rest of the image domain

COCO-DensePose Dataset

Head, Torso, Lower/Upper Arms, Lower/Upper Legs, Hands and Feet
head, hands and feet: use the manually obtained UV fields provided in the SMPL model
rest of the parts: obtain the unwrapping via multidimensional scaling applied to pairwise geodesic distances
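
To make the unwrapping step concrete, here is a rough sketch of classical multidimensional scaling: given a matrix of pairwise distances, it finds 2D coordinates whose Euclidean distances approximate them. This is only an illustration under stated assumptions. The paper's exact MDS variant is not given in these notes, the function name is hypothetical, and the toy input uses planar Euclidean distances as a stand-in for geodesic distances on the body mesh.

```python
import math
import random


def classical_mds_2d(dist, iters=500, seed=0):
    """Embed n points in 2D from an n x n symmetric distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Power iteration for the dominant eigenvector of b.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue (v has unit length here).
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy input: corners of a 3 x 1 rectangle (stand-in for mesh vertices).
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 1.0), (3.0, 1.0)]
distances = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points]
             for p in points]
uv = classical_mds_2d(distances)
```

For truly planar input the embedding reproduces the distances up to rotation and reflection. Geodesic distances on a curved surface can only be approximated, which is why the result is a flattening of the part rather than an exact map.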

Accuracy of human annotators

Human annotations also contain errors, especially on fine-grained parts such as the head, hands and feet

Evaluation Measures

Pointwise evaluation: evaluates correspondence accuracy over the whole image domain through the Ratio of Correct Point (RCP) correspondences (a correspondence is declared correct if the geodesic distance is below a certain threshold). With $f(t)$ denoting the RCP at threshold $t$, the AUC (Area Under the Curve) is computed over thresholds up to a limit $a$:

$$\mathrm{AUC}_{a} = \frac{1}{a} \int_{0}^{a} f(t) \, dt$$
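
A minimal sketch of how the RCP curve and its normalized AUC could be computed from a list of per-point geodesic errors. The function names and the sample numbers are hypothetical, not taken from the paper's evaluation code. Because the empirical $f(t)$ is a sum of step functions, each point with error $e$ contributes $\max(0, a - e)$ to the integral, so no numerical integration is needed.

```python
def ratio_of_correct_points(geodesic_errors, t):
    """f(t): fraction of points whose geodesic error is below threshold t."""
    n = len(geodesic_errors)
    return sum(1 for e in geodesic_errors if e < t) / n


def auc_of_rcp(geodesic_errors, a):
    """AUC_a = (1/a) * integral_0^a f(t) dt for the empirical RCP curve.

    The indicator [e < t] integrated over t in [0, a] equals max(0, a - e),
    so the integral reduces to an average over points.
    """
    n = len(geodesic_errors)
    return sum(max(0.0, a - e) for e in geodesic_errors) / (n * a)


# Illustrative errors (made-up values, in the same unit as the thresholds).
errors = [5.0, 15.0, 40.0]
rcp_at_10 = ratio_of_correct_points(errors, 10.0)
auc_30 = auc_of_rcp(errors, 30.0)
```

A perfect predictor (all errors zero) gets an AUC of 1, and points whose error exceeds $a$ contribute nothing.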

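Per-instance evaluation: geodesic point similarity (GPS). $P_{j}$ is the set of points annotated on person instance $j$, $i_{p}$ is the ground-truth surface position of point $p$, ${\hat{i}}_{p}$ is the estimated one, $g$ is the geodesic distance between them and $κ$ is a normalizing parameter

$$\mathrm{GPS}_{j} = \frac{1}{|P_{j}|} \sum_{p \in P_{j}} \exp\left(\frac{- g (i_{p}, {\hat{i}}_{p})^{2}}{2 κ^{2}}\right)$$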
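
The GPS score for one person instance can be sketched as below. This assumes the geodesic distances between ground-truth and estimated surface points have already been computed; the function name and the way `kappa` is passed in are illustrative choices, not the paper's code.

```python
import math


def geodesic_point_similarity(geodesic_dists, kappa):
    """GPS_j: mean over annotated points of exp(-g^2 / (2 * kappa^2)).

    geodesic_dists: geodesic distance g(i_p, i_hat_p) for each annotated
                    point p of one person instance (assumed precomputed).
    kappa:          normalizing parameter controlling how fast the score
                    decays with distance.
    """
    if not geodesic_dists:
        raise ValueError("instance has no annotated points")
    total = sum(math.exp(-(g * g) / (2.0 * kappa * kappa)) for g in geodesic_dists)
    return total / len(geodesic_dists)
```

The score is 1 when every point is matched exactly and decays smoothly toward 0 as the geodesic errors grow, much like object keypoint similarity for sparse keypoints.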

Learning Dense Human Pose Estimation

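DensePose-RCNN: combines the DenseReg approach (FCN) with the Mask-RCNN architecture by proposing regions of interest (ROI), extracting region-adapted features through ROI pooling, and feeding the resulting features into a region-specific branch
As in the first figure, the pipeline resembles Faster R-CNN: features are extracted first, RoI proposals are obtained and RoI-pooled, and further convolutions on the pooled features yield the class and the patch. The patch is then fed into the Mask R-CNN-like multi-task model of the second figure to produce the final result. Cross-cascading is applied here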