Shih-En Wei, The Robotics Institute, Carnegie Mellon University
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation.
1. Introduction
CPM (convolutional pose machines)
- inherit the benefits of the pose machine architecture: the implicit learning of long-range dependencies between image and multi-part cues, tight integration between learning and inference, and a modular sequential design
- learn feature representations for both image and spatial context directly from data.
- allow for globally joint training with backpropagation.
- efficiently handle large training datasets.
CPMs operate on 2D belief maps for the location of each part. At a particular stage in the CPM, the spatial context of part beliefs provides strong disambiguating cues to a subsequent stage. As a result, each stage of a CPM produces belief maps with increasingly refined estimates for the locations of each part.
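To make the notion of a belief map concrete, the following minimal sketch treats one part's belief map as a small 2D grid of scores and reads off the most likely location as the highest-scoring cell. The function name `locate_part` and the toy values are assumptions made for illustration; the notes do not say how the final location is extracted from a belief map.

```python
from typing import List, Tuple


def locate_part(belief_map: List[List[float]]) -> Tuple[int, int]:
    """Return (row, col) of the highest-scoring cell in a 2D belief map."""
    best_row, best_col = 0, 0
    for r, row in enumerate(belief_map):
        for c, score in enumerate(row):
            if score > belief_map[best_row][best_col]:
                best_row, best_col = r, c
    return best_row, best_col


# Hypothetical 3x4 belief map for a single part (values are made up).
toy_belief = [
    [0.01, 0.02, 0.05, 0.01],
    [0.03, 0.10, 0.70, 0.04],
    [0.01, 0.02, 0.06, 0.01],
]

print(locate_part(toy_belief))  # (1, 2)
```

A later stage that refines the beliefs would ideally produce a map whose peak is sharper and closer to the true part location.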
We find, through experiments, that large receptive fields on the belief maps are crucial for learning long-range spatial relationships and result in improved accuracy.
- learning implicit spatial models via a sequential composition of convolutional architectures
- a systematic approach to designing and training such an architecture to learn both image features and image-dependent spatial models for structured prediction tasks, without the need for any graphical model style inference.
2. Related work
3. Method
Our goal is to predict the image locations $Y = (Y_1, \ldots, Y_P)$ for all $P$ parts.
A classifier in the first stage, $t = 1$, therefore produces the following belief values:
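The expression itself is missing from these notes. A plausible reconstruction in the notation used here is given below, where $g_1$ denotes the first-stage classifier, $\mathbf{x}_z$ the image features extracted at location $z$, and $b_1^p$ the belief assigned to part $p$. The symbol names and the exact index range are assumptions and should be checked against the paper.

```latex
g_1(\mathbf{x}_z) \;\rightarrow\; \left\{ b_1^p(Y_p = z) \right\}_{p \in \{1, \ldots, P\}}
```

Read this as: for every location $z$, the first-stage classifier maps the local features to one belief score per part.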
In subsequent stages, the classifier predicts a belief for assigning a location to each part, $Y_p = z,\ \forall z \in \mathcal{Z}$, based on (1) features of the image data $\mathbf{x}_z^t \in \mathbb{R}^d$ again, and (2) contextual information from the preceding classifier in the neighborhood around each $Y_p$:
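The sequential structure described above can be sketched as a loop in which the first stage sees only image features and every later stage sees the image features plus the beliefs from the preceding stage. This is a toy illustration rather than the paper's implementation: locations are 1D, and `run_stages`, `toy_first_stage`, and `toy_later_stage` are hypothetical stand-ins for learned predictors.

```python
from typing import Callable, List

Beliefs = List[List[float]]  # beliefs[p][z]: score of part p at location z
Features = List[float]       # one toy feature value per location z


def run_stages(
    features: Features,
    first_stage: Callable[[Features], Beliefs],
    later_stages: List[Callable[[Features, Beliefs], Beliefs]],
) -> List[Beliefs]:
    """Run stage 1 on features alone, then feed each later stage the
    features together with the previous stage's beliefs."""
    beliefs = first_stage(features)
    outputs = [beliefs]
    for stage in later_stages:
        beliefs = stage(features, beliefs)
        outputs.append(beliefs)
    return outputs


def toy_first_stage(features: Features) -> Beliefs:
    # Assumption: two parts that both start from the same local evidence.
    return [list(features), list(features)]


def toy_later_stage(features: Features, prev: Beliefs) -> Beliefs:
    # Assumed context feature: mean belief over all parts in a 3-wide
    # neighbourhood around z, blended equally with the local feature.
    n = len(features)
    out = []
    for _ in prev:
        row = []
        for z in range(n):
            lo, hi = max(0, z - 1), min(n, z + 2)
            context = sum(sum(b[lo:hi]) for b in prev) / (len(prev) * (hi - lo))
            row.append(0.5 * features[z] + 0.5 * context)
        out.append(row)
    return out


outputs = run_stages([0.1, 0.8, 0.2], toy_first_stage,
                     [toy_later_stage, toy_later_stage])
print(len(outputs))  # one set of beliefs per stage
```

Keeping every stage's output, rather than only the last, mirrors the idea that each stage produces its own belief maps that the next stage refines.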
3.2 Convolutional Pose Machines
3.2.1 Keypoint Localization Using Local Image Evidence
The first stage of a convolutional pose machine predicts part beliefs from only local image evidence.
3.2.2 Sequential Prediction with Learned Spatial Context Features
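A predictor in subsequent stages ($g_{t>1}$) can use the spatial context $\psi_{t>1}(\cdot)$ of the belief maps produced by the preceding stage, in a region around each location $z$, to improve its predictions.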
The design of the network is guided by achieving a receptive field at the output layer of the second stage network that is large enough to allow the learning of potentially complex and long-range correlations between parts.
Accuracy improves with the size of the receptive field.
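To see how layer choices determine the receptive field, the following sketch computes the receptive field of a stack of convolution and pooling layers using the standard recurrence: each layer adds (kernel - 1) times the current jump, and the jump is multiplied by the layer's stride. It assumes no dilation, and the layer list is a hypothetical example, not the architecture from the paper.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field (in input pixels, along one axis) of a layer stack.

    Each layer is (kernel_size, stride). Assumes no dilation.
    """
    size, jump = 1, 1  # one output unit sees 1 pixel before any layer
    for kernel, stride in layers:
        size += (kernel - 1) * jump  # grow by the kernel extent at the current jump
        jump *= stride               # strides compound the spacing between units
    return size


# Hypothetical stack: 9x9 convolutions interleaved with 2x2 stride-2 pooling,
# followed by a 5x5 convolution.
hypothetical_layers = [(9, 1), (2, 2), (9, 1), (2, 2), (9, 1), (2, 2), (5, 1)]

print(receptive_field(hypothetical_layers))  # 96
```

Because the jump compounds with every strided layer, large kernels placed after pooling enlarge the receptive field much faster than the same kernels placed before it.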
3.3 Learning in Convolutional Pose Machines
4. Evaluation
Addressing vanishing gradients:
Benefit of end-to-end learning:
Comparison on training schemes:
4.2 Datasets and Quantitative Analysis
Leeds Sports Pose (LSP) Dataset: