Beyond View Transformation: Cycle-Consistent Global and Partial Perception GAN for View-Invariant Gait Recognition (paper translation and notes)

Format: the paper's text is given section by section, with the translator's notes in parentheses.

The paper is explained below with reference to this figure:

[Fig. 1: overall architecture of the proposed CA-GAN]

2. RELATED WORK

2.1. Gait recognition

Gait recognition remains a challenging problem due to changes in camera viewpoint, clothing, and carried belongings.
Among these, view variation is the main and most frequent factor in practical applications. Existing approaches generally fall into two categories: model-based [12] and appearance-based [1, 8, 11] methods. For example, Chen et al. projected the coefficients of the 3-D Gravity Center Trajectory (GCT) onto different view planes to obtain view-varied gait features by means of estimated limb parameters [12]. Muramatsu et al. put forward an Arbitrary View Transformation Model (AVTM) that accurately matches a pair of gait features from arbitrary views [8]. Recently, some researchers have employed deep learning methods for view-invariant feature learning [3, 13]. In contrast, this paper proposes a novel framework that simultaneously perceives global contexts and local body parts to synthesize target-view gait images for cross-view gait recognition.


2.2. Generative adversarial network (GAN)

The GAN framework was first proposed by Goodfellow et al. [15] to generate visually realistic images, and has been used for image synthesis [14, 17, 18], image translation [19], etc. The recently proposed TP-GAN [17] synthesized frontal-view faces by simultaneously considering global structure and local details. CycleGAN by Zhu et al. [19] learned to translate an image from a source domain to a target domain in the absence of paired examples by employing an adversarial loss. GaitGAN [14] is related to our work: it was used as a regressor to generate a view-invariant side-view Gait Energy Image (GEI), and contained two discriminators, a fake/real discriminator and an identification discriminator. In contrast, this paper proposes CA-GAN for synthesizing target-view gait images, combining a forward cycle-consistency loss with an adversarial loss.

(Translator's note: CA-GAN only synthesizes the target-view gait energy image; it does not itself perform recognition. The recognition method is described in Section 4.3.)

3. PROPOSED METHOD

Our aim is to improve the performance of cross-view gait recognition via image synthesis. First, we translate gait images from source views to the target view using our designed CA-GAN. Subsequently, we exploit the synthesized
target-view gait images for the cross-view gait recognition task. In this work, we focus on learning CA-GAN's mapping functions from the source domain $I^S$ (gait images under source views) to the target domain $I^T$ (gait images under the target view), given training gait images $\{i^S_m\}_{m=1}^{M} \in I^S$ and $\{i^T_n\}_{n=1}^{N} \in I^T$, where $M$ and $N$ are the numbers of gait images under the source views and the target view, respectively. As illustrated in Fig. 1, our model includes two mappings: $G: I^S \rightarrow I^T$ and $R: I^T \rightarrow I^S$. Specifically, $G$ and $R$ denote the two-branch generative network and the reconstruction network, respectively. Moreover, we introduce an attentive adversarial network $D$, which aims to distinguish between the real gait images $\{i^T_n\}$ of the target domain and the synthesized target-view gait images $\{G(i^S_m)\}$.

3.1. Network architecture

3.1.1. Two-branch generative network

In order to take advantage of both global contexts and local body parts, we construct a two-branch generative network with an encoder-decoder structure to synthesize target-view gait images. The general architecture is shown in Fig. 1. Our proposed two-branch generative network $G$ is parametrized by $\theta$ and consists of one global network $G_{\theta^g}$ processing the global structure and three body-part networks $G_{\theta_i^l}$, $i \in \{0, 1, 2\}$, generating the local details. We employ the architecture from CycleGAN [19] for our global network $G_{\theta^g}$ and reconstruction network $R_\theta$, which has shown impressive results for image translation and super-resolution. As shown in Fig. 1, $G_{\theta^g}$ and $R_\theta$ are composed of a down-sampling encoder and an up-sampling decoder.


The three inputs of the local networks $G_{\theta_i^l}$ are the head part, the two forearms, and the two lower legs, respectively. Each $G_{\theta_i^l}$, $i \in \{0, 1, 2\}$, learns a separate parameter set for rotating its body part to the target view. We adopt the local-pathway architecture of TP-GAN [17] for our local networks, owing to its outstanding ability to learn robust local descriptors; it is also based on an encoder-decoder structure.

Fig. 2. Weight generation part architecture.



To effectively integrate the information from the global and local networks, we first fuse the output feature maps of the three body-part networks into one feature map. Specifically, we place each body-part network's output feature map on a template according to the location of that part. The three templates are integrated into one feature map through a max-out fusing strategy; this feature map has the same spatial resolution as the output of the global network. Subsequently, we concatenate the output feature maps of the global and local networks into an overall feature map and feed it to successive convolution layers to generate the synthesized target-view gait image.

(Translator's note: I am not sure what is gained by synthesizing silhouettes directly from silhouettes; something about this feels off.)
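The template-and-max-out fusion described above can be sketched in a few lines of NumPy. The sizes and the (top, left, height, width) boxes for each body part below are illustrative assumptions; the paper does not state exact dimensions.

```python
import numpy as np

# Hypothetical feature-map size and part locations (assumptions, not from the paper).
H, W, C = 64, 64, 16
PART_BOXES = {
    "head":       (0,  16, 16, 32),
    "forearms":   (24,  8, 16, 48),
    "lower_legs": (44, 12, 20, 40),
}

def fuse_local_parts(part_maps):
    """Place each part's output feature map on an empty template at that
    part's location, then merge the templates with element-wise max."""
    templates = []
    for name, fmap in part_maps.items():
        top, left, h, w = PART_BOXES[name]
        template = np.zeros((H, W, C), dtype=fmap.dtype)
        template[top:top + h, left:left + w] = fmap
        templates.append(template)
    return np.maximum.reduce(templates)  # max-out fusion

# Concatenate the fused local map with the global network's output along
# the channel axis, as input to the final convolution layers.
rng = np.random.default_rng(0)
part_maps = {k: rng.random((h, w, C)) for k, (t, l, h, w) in PART_BOXES.items()}
global_map = rng.random((H, W, C))
fused = fuse_local_parts(part_maps)
overall = np.concatenate([global_map, fused], axis=-1)  # shape (64, 64, 32)
```

Because the templates overlap nowhere in this sketch, max-out simply pastes each part into place; where parts did overlap, the stronger activation would win.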

3.1.2. Attentive adversarial network

The objective of CA-GAN is to synthesize photorealistic target-view gait images without introducing any artifacts. When we employ a single strong discriminator network, CA-GAN tends to over-emphasize high-frequency image features to fool the current discriminator, which degrades its ability to learn gait details. However, any local patch sampled from a synthesized gait image should have a similar structure to the corresponding real image patch. In addition, an attention mechanism [20] can adaptively focus on the local patches most valuable to the image synthesis task.

(Translator's note: I feel I need a deeper understanding of the attention mechanism here.)

Because each local patch of a gait image contributes differently to target-view image synthesis, we design an AAN that adaptively learns different weights for all local image patches and classifies each of them as real or fake. As illustrated in Fig. 1, our AAN outputs a $W \times H$ weighted probability map over local image patches instead of one scalar value. Each probability value corresponds to a local detail region rather than the whole gait image. When training our CA-GAN, we utilize the weighted sum of the cross-entropy loss values over all local patches to update the AAN. The AAN consists of two parts: a discriminative feature generation part and a weight generation part. We adapt the PatchGAN discriminator [19] for the discriminative feature generation part, which aims to classify whether each local image patch is real or fake. The AAN is divided into two branches at the second convolution layer (conv2 layer).

(Translator's notes: one probability per local region is reminiscent of fine-grained recognition; there are as many cross-entropy loss terms as there are local patches; in the figure, the GT and SG images are fed together into the first convolution layer.)

For the weight generation part, we design a tiny convolutional neural network that outputs a $W \times H$ weight matrix over local image patches. It is composed of one stride-2 convolution layer with a kernel size of $4 \times 4$ that produces a feature map with 32 channels, three fully connected layers, two ReLU nonlinear units, and one softmax layer, as shown in Fig. 2. We employ the ReLU nonlinearity by virtue of its good convergence performance.

(Translator's note: the $W \times H$ probabilities are presumably the attention weights. I really like this module.)
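A minimal sketch of the AAN's weighted loss: per-patch binary cross-entropy values combined with a softmax-normalized weight map. The $4 \times 4$ patch grid and the random inputs are stand-ins; the paper does not state the actual $W$ and $H$.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax, as the final layer of the tiny weight net."""
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_patch_loss(patch_probs, weights, labels):
    """Weighted sum of per-patch binary cross-entropy values.

    patch_probs: (W, H) real/fake probabilities from the discriminative
                 feature generation part (a PatchGAN-style output).
    weights:     (W, H) weight matrix from the weight generation part,
                 softmax-normalized so it sums to 1.
    labels:      (W, H) targets, 1 for real patches, 0 for fake ones.
    """
    eps = 1e-7
    p = np.clip(patch_probs, eps, 1 - eps)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(np.sum(weights * bce))

# Illustrative 4x4 patch map with random scores.
rng = np.random.default_rng(0)
probs = rng.uniform(0.1, 0.9, size=(4, 4))
weights = softmax(rng.normal(size=16)).reshape(4, 4)
loss_real = weighted_patch_loss(probs, weights, labels=np.ones((4, 4)))
```

Since the weights sum to 1, this is a convex combination of the per-patch losses: patches the weight net deems more informative contribute more gradient.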

3.2. Synthesis loss function

3.2.1. Adversarial loss

To effectively exploit prior knowledge of the distribution of target-view gait images during training, we employ an adversarial loss [15] to match the distribution of the synthesized images to the data distribution of the target domain. Specifically,
for the mapping function $G: I^S \rightarrow I^T$ and its discriminator $D$, the objective can be expressed as:

$$\mathcal{L}_{GAN}(G, D, I^S, I^T) = \mathbb{E}_{i^T \sim p_{\mathrm{data}}(i^T)}\big[\log D(i^T)\big] + \mathbb{E}_{i^S \sim p_{\mathrm{data}}(i^S)}\big[\log\big(1 - D(G(i^S))\big)\big]$$

where $G$ tries to generate gait images $G(i^S)$ that look similar to images from the target domain $I^T$, while $D$ aims to distinguish between synthesized gait images $G(i^S)$ and real target-domain samples $i^T$.

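The two terms of this objective can be evaluated numerically. The sketch below is a Monte-Carlo estimate over mini-batches of discriminator outputs; the batch size and score values are illustrative only.

```python
import numpy as np

EPS = 1e-7  # guard against log(0)

def adversarial_objective(d_real, d_fake):
    """Estimate of L_GAN = E[log D(i_T)] + E[log(1 - D(G(i_S)))].
    D maximizes this value; G minimizes the second term."""
    d_real = np.clip(d_real, EPS, 1 - EPS)
    d_fake = np.clip(d_fake, EPS, 1 - EPS)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

# A confident discriminator pushes the objective toward its maximum of 0;
# an undecided one (0.5 everywhere) sits at 2 * log(0.5).
confident = adversarial_objective(np.full(8, 0.99), np.full(8, 0.01))
undecided = adversarial_objective(np.full(8, 0.5), np.full(8, 0.5))
```

The same quantity, seen from $G$'s side, shrinks as $D(G(i^S))$ rises, which is what drives the generator toward the target-domain distribution.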

3.2.2. Forward cycle-consistency loss

Although the mapping function $G$ and its discriminator $D$ can generate gait images with the same distribution as the target domain $I^T$ through adversarial training, they do not guarantee that the input and the output share the same identity, because $G$ can map the input gait images to any random permutation of gait images in the target domain. To give $G$ an identity-preserving ability, we exploit a forward cycle-consistency loss [19] that translates the synthesized target-domain gait images back to the original images, i.e., $i^S \rightarrow G(i^S) \rightarrow R(G(i^S)) \approx i^S$, as shown in Fig. 1. It can be defined as:


$$\mathcal{L}_{cyc}(G, R) = \mathbb{E}_{i^S \sim p_{\mathrm{data}}(i^S)}\Big[\big\| R(G(i^S)) - i^S \big\|_1\Big]$$

Notice that the mapping function $R: I^T \rightarrow I^S$ can be viewed as a "discriminator" that generates the corresponding reconstructed image from the synthesized gait image. The forward cycle-consistency loss enforces a small distance between the reconstructed image and the corresponding real gait image of the source domain.

(Translator's note: i.e., the smaller the distance, the more similar the reconstruction is to the original.)
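The forward cycle can be sketched with toy stand-ins for the two networks; the lambdas below are placeholders for $G$ and $R$, not real models, and the mean L1 distance follows the CycleGAN-style loss above.

```python
import numpy as np

def forward_cycle_loss(i_s, G, R):
    """Mean L1 distance between a source image and its reconstruction
    R(G(i_s)) along the forward cycle i_s -> G(i_s) -> R(G(i_s))."""
    return float(np.mean(np.abs(R(G(i_s)) - i_s)))

# Toy stand-ins: a G/R pair that exactly invert each other drives the loss
# to (numerically) zero, while a non-inverting R leaves it large.
G = lambda x: x + 1.0   # placeholder "rotate to target view"
R = lambda x: x - 1.0   # placeholder "map back to source view"
i_s = np.random.default_rng(0).random((64, 64))
loss = forward_cycle_loss(i_s, G, R)
```

In training, this loss is what forces $G$ to keep enough identity information in its output for $R$ to undo the view change.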

3.2.3. Overall objective function

The overall objective function is a weighted sum of all the losses defined above:


$$\mathcal{L}(G, R, D) = \mathcal{L}_{GAN}(G, D, I^S, I^T) + \lambda\, \mathcal{L}_{cyc}(G, R)$$
where $\lambda$ controls the relative importance of the two losses.

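Combining the two losses is a one-liner; the default weight of 10.0 below is CycleGAN's usual setting, assumed here since the paper's exact $\lambda$ is not given in this excerpt.

```python
def overall_objective(l_gan, l_cyc, lam=10.0):
    """L = L_GAN + lambda * L_cyc. The weight lam is a hyperparameter;
    10.0 is CycleGAN's common default, the paper's value may differ."""
    return l_gan + lam * l_cyc
```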

See the paper for further experimental details.
