Real-Time 3D Face Tracking with Deep Learning

Snapchat was made popular by putting funny dog ears on people’s heads, swapping faces, and other tricks that go beyond funny and look impossible, even magical. I am in the digital visual effects industry, so I am familiar with that magic… and with the desire to understand how it works behind the scenes.

Behind the magic

Modifying people’s faces is routine work in Hollywood visual effects; it’s a well-understood craft nowadays, but it typically requires tens of digital artists to achieve a photorealistic face transformation. How can we automate that?

Here’s a simplified breakdown of the steps these artists follow:


  1. Tracking the position, shape and movement of the face relative to the camera in 3D


  2. Animation of the 3D models to snap onto the tracked face (e.g. a dog nose)

  3. Lighting and rendering of the 3D models into 2D images


  4. Compositing of the rendered CGI images with the live action footage


Automation of steps 2 and 3 is not very different from what happens in video games; it’s relatively straightforward. Compositing can be simplified to 3D foreground over a live background, easy. The challenge is the tracking: how can a program ‘see’ the complex motion of a human head?

Tracking faces with Artificial Intelligence

The Computer Science community has been trying to track faces automatically for a long time, and it’s hard. In recent years, Machine Learning came to the rescue, and many Deep Learning papers are published every year on the topic. I’ve spent a while looking for the “state of the art” and realised doing this in real-time is VERY HARD! A good reason to try and tackle the challenge (and it would work nicely with the AR beauty mode I have implemented).

“trying to track faces.. it’s hard.. doing this in real-time is VERY HARD!”


Here’s how I did it.


Designing the network

Convolutional Neural Networks are popular for visual analysis of images and commonly used for applications such as object detection and image recognition.


Image from this publication⁹

For a deep neural network to be evaluated in real-time (at least 30 times per second), a compact network is desired¹. With the popularity of Machine Learning and smart phones, new models are discovered every year that push the limit of efficiency — offering a trade-off between computational precision and overhead. Among such models, MobileNet, SqueezeNet and ShuffleNet are popular for applications on mobile devices, thanks to their compactness.


Architecture of ShuffleNet V2 for different levels of complexity (from the authors¹)

ShuffleNet V2¹ was recently introduced and offers state-of-the-art performance, coming in various sizes to balance speed and accuracy. It ships with PyTorch, one more reason to pick that model.

Choosing the features to learn

Image from this paper²

Now I need to find what features the CNN should learn. A common approach is to define a list of anchor points for different key parts of the face, also called ‘facial landmarks’.

The points are numbered and placed strategically around the eyes, eyebrows, nose, mouth and jawline. I want to train the network to identify the coordinates of each point, so I can later reconstruct masks or geometric meshes based on them.

Building a training dataset

Because I want to augment videos with 3D effects, I looked for a dataset with 3D landmark coordinates. 300W-LP is one of the few datasets that come with 3D positions; it’s pretty large and, as a bonus, offers a good diversity of face angles. I want to benchmark my solution against the state of the art: recent publications test their models on AFLW2000–3D, so I go for 300W-LP for training and test on AFLW2000–3D for comparison.

Images from 300W-LP³, profile views are generated mathematically

A note on these datasets: they are meant for the research community and generally not free for commercial use.

Augmenting the dataset

Dataset augmentation improves the accuracy of the training by adding even more variation to the set. I apply the following transformations, by a random amount, to each image and its landmarks to create new ones: rotation up to ±40° around the centre, up to 10% translation and scale, and horizontal flip. For additional augmentation, I apply a different random transformation in memory on each image for each learning pass (epoch).
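A key detail is that the image and its landmarks must be transformed by the same matrix. A minimal numpy sketch (helper names are my own; the resulting matrix would also be applied to the image, e.g. via OpenCV’s `warpAffine`):

```python
import numpy as np

rng = np.random.default_rng()

def random_similarity(center, max_rot_deg=40.0, max_shift=0.1, max_scale=0.1, size=224):
    """Draw a random rotation/scale/translation around the image centre."""
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    shift = rng.uniform(-max_shift, max_shift, size=2) * size
    c, s = np.cos(theta) * scale, np.sin(theta) * scale
    R = np.array([[c, -s], [s, c]])
    t = center - R @ center + shift  # rotate/scale about the centre, then translate
    return R, t

def transform_landmarks(pts, R, t):
    """Apply the same similarity transform to (N, 2) landmark coordinates."""
    return pts @ R.T + t
```

Horizontal flips need extra care: mirroring the image also swaps left and right landmarks, so their indices must be re-mapped accordingly.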

It’s also necessary to crop the input image close to the bounding box of the landmarks, so the CNN learns to recognise the landmarks at consistent relative locations. That’s done as a preprocessing step, to save on load time from disk during training.

Designing the loss function

Image from the publication⁴

Typically, an L2 loss function is used to measure the prediction error for landmark positions. A recent publication⁴ describes a so-called Wing loss function that performs better for this application, which I could verify. I parametrise it with w = 10 and ε = 2, as suggested by the authors, and sum the result over all landmark coordinates.
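For reference, the Wing loss for a single coordinate error behaves like a (rescaled) log for small errors and like L1 with an offset for large ones; a constant C stitches the two branches together continuously at |x| = w. A plain-Python sketch:

```python
import math

def wing_loss(x, w=10.0, eps=2.0):
    """Wing loss for one landmark-coordinate error x = predicted - target."""
    # C makes the log branch and the L1 branch meet at |x| = w
    C = w - w * math.log(1.0 + w / eps)
    ax = abs(x)
    if ax < w:
        return w * math.log(1.0 + ax / eps)  # amplifies small and medium errors
    return ax - C  # behaves like L1 for large errors (outliers)
```

In training, this is applied to every coordinate of every landmark and summed.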

Training the network

Training a deep neural network is a very expensive operation that requires powerful computers. Using my laptop would literally have taken weeks for one training phase, and building a decent setup costs thousands of dollars. I decided to leverage the cloud, so I can pay for just the compute power I need.

I chose Genesis Cloud, which offers very competitive prices and $50 of free credit to get started. I build a Linux VM with a GeForce GTX 1080 Ti, prepare an OS and storage image where I set up PyTorch, and upload my code and the datasets, all through ssh. Once the system is set up, it can be started and shut down on demand; creating a snapshot allows me to resume the work where I left it.

Plot of mean error for each epoch

The inner training loop processes mini-batches of 32 images to maximise the parallel computation on the GPU. A learning pass (epoch) processes the entire set of about 60,000 images and takes about 4 minutes. The training converges around 70 epochs, so I let it run overnight for 100 epochs to be safe.

I use the popular Adam optimiser, which automatically adapts the learning rate, starting with a rate of 0.001. I found that setting the initial learning rate right is critical: if it’s too small, the training converges too early to a sub-optimal solution; if it’s too large, it has difficulty converging at all. I found the value through trial and error, which is time-consuming… and actually costly when paying for the cloud per use!
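Put together, the training loop looks roughly like this (a toy model and random tensors stand in for ShuffleNet V2 and the 300W-LP loader; the Wing loss⁴ appears here in vectorised form):

```python
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Vectorised Wing loss, summed over all landmark coordinates."""
    C = w - w * math.log(1.0 + w / eps)
    d = (pred - target).abs()
    return torch.where(d < w, w * torch.log(1.0 + d / eps), d - C).sum()

# Stand-ins: the real project uses ShuffleNet V2 and ~60,000 cropped faces
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 204))
data = TensorDataset(torch.randn(64, 3, 8, 8), torch.randn(64, 204))
loader = DataLoader(data, batch_size=32, shuffle=True)  # mini-batches of 32

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(2):  # 100 epochs in the real training
    for images, landmarks in loader:
        optimizer.zero_grad()
        loss = wing_loss(model(images), landmarks)
        loss.backward()
        optimizer.step()
```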

Evaluation

All these efforts paid off: with the bigger network, ShuffleNet V2 2x, I obtain a Normalised Mean Error (NME) of 2.796 on AFLW2000–3D. That’s better than the state-of-the-art model⁵ on that dataset and its NME of 3.07, by a good margin, despite that model being much heavier!

A comparison of the predicted landmarks with the ground truth confirms the theoretical result: the landmarks find their way with precision even for large angles (though AFLW2000–3D contains challenging cases with extreme angles where my model fails).

Evaluation of my model on AFLW2000–3D, red markers are from ground truth, green are predicted

I don’t have a big CUDA GPU for inference (evaluation of the model) on my MacBook Pro. To maximise the utilisation of hardware resources, I convert the model to the portable format ONNX and use the ONNX Runtime library from Microsoft, which can infer the model much more efficiently on my machine (yes, it runs on OSX!). The inference time is under 100 ms on the CPU, which is pretty good, despite not being real-time. But keep in mind I use the biggest version of ShuffleNet V2 (2x) here for maximal precision; I can opt for a smaller one for speed (e.g. the 1x version is 4 times faster). I could also get better numbers running on a GPU.

All was set for a robust facial tracking system… or so I thought.

Limitations with videos

I finally get to plug the trained network into a video stream. An additional step is required: using a separate model to detect the face bounding box, so we can crop the image close to where the landmarks will be, like in the training dataset. After a good amount of research, I opt to use this lightweight face detector, which is very fast to evaluate yet precise (I did try others).

Despite all my efforts, with great sadness, I find that the result on videos is not robust at all: the markers shake a lot, drift over time and don’t detect eye blinks.

What did I do wrong?


Shaky markers

After digging further, I realised what it is. The training dataset consists of still photographs, not video frames! The model does seem to work well on stills, but videos are different. They present an additional challenge: temporal consistency (and motion blur).

Image from this publication⁶

As noted in this work⁶ from the University of Technology Sydney and Facebook Reality Labs, the ground truth markers in the training dataset are annotated manually on each photo, which is not very precise. Different annotators position the landmarks at slightly different locations, as illustrated above.

Our network actually learns imprecise landmark locations, which results in random jittering within a small zone during prediction rather than an exact location.

Illustration of the 1€ filter from the interactive demo

A common approach to address this issue is to stabilise the predicted markers after the fact. To that effect, as suggested by this other publication⁷ from Google Research, I use the 1€ filter⁸ to smooth the motion noise. I found it gives better results than the traditional Kalman filter and is simpler to implement. I also use it to stabilise the box returned by the face detector, because it shakes a lot as well, and that is not helping.
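For reference, here is a minimal single-value implementation of the 1€ filter⁸ (one instance per landmark coordinate; the parameter defaults are illustrative and need tuning per application):

```python
import math

class OneEuroFilter:
    """Adaptive low-pass filter: smooths jitter at low speed, follows fast motion."""

    def __init__(self, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # baseline cutoff frequency (Hz)
        self.beta = beta              # how much the cutoff grows with speed
        self.d_cutoff = d_cutoff      # cutoff for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0
        self.t_prev = None

    @staticmethod
    def _alpha(cutoff, dt):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, t):
        if self.x_prev is None:
            self.x_prev, self.t_prev = x, t
            return x
        dt = t - self.t_prev
        # Low-pass filtered derivative (speed estimate)
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Cutoff adapts to speed: slow -> smooth, fast -> responsive
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev, self.t_prev = x_hat, dx_hat, t
        return x_hat
```

The intuition: at low speeds the cutoff stays near `min_cutoff` and jitter is smoothed away; at high speeds the cutoff rises with `beta`, so fast motion is not lagged.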

The motion filter makes a big difference. But it’s not perfect, especially with a low-quality webcam like my laptop’s, which produces noisy videos that the model doesn’t seem to like very much.

The paper from UTS and Facebook RL mentioned above suggests fine-tuning the trained model, essentially as follows:

Image from this publication⁶
  • Predict the landmark positions by computing the optical flow forward across frames (OpenCV can do that)

  • Compute the optical flow backward from that result

  • Compute a loss function based on the distance between the prediction from optical flow and the prediction from the pre-trained model

I haven’t tried it myself but the demo video looks promising.


Stiff eyes and eyebrows

The trained network struggles with eye motion: it doesn’t see eye blinks and can’t fully ‘close’ the eye landmarks. The eyebrow landmarks also seem stiff in their motion compared to what they are tracking.

My model on a wide face angle

My model became pretty good at predicting profiles, because there are a ton of them in the dataset (it’s actually perfectly balanced in terms of face angles), but it did not learn eyes that well. A CNN can perform better if the features it learns tend to live around a consistent location; that’s why we have to detect the face bounding box before inference and crop around the landmarks during training: that puts the face at the centre every time.

As the face angle varies, the features inside the face — eyes, mouth and nose — change shape and position in the frame, and can even disappear at wide angles. This makes it more difficult for the model to recognise them well every time. I tried augmenting the dataset for extreme eye poses (closed, wide open), but that had little effect.

Next steps

This project was fun but costly in time and money ($15 from my pocket on top of the initial free credit of $50). It was a great opportunity to learn, but I don’t intend to reinvent the wheel. Since I started this journey, Google has open-sourced MediaPipe, which offers a cross-platform implementation of their model. Initial demos of MediaPipe do show some drifting, stiffness and shaking, though, which validates my outcomes.

Still, I have some ideas to improve my approach (starting with the optical flow based fine-tuning mentioned above).


3D Morphable Face Models

4DFM morphable model

To be able to build an app like Snapchat, we actually need more than a couple of 3D points. A common approach is to use a 3D Morphable Face Model (3DMM) and fit it to the annotated 3D points. The dataset 300W-LP actually comes with such data. The network could learn all the vertices of the 3D model instead of a small number of landmarks, such that a 3D face would be predicted directly by the model, ready to use for some AR fun!

Multiple Networks

In order to improve overall precision, the aforementioned Wing loss paper, as well as the work from Google Research, suggests predicting some landmarks with a first, lightweight model. Thanks to that initial information, we can crop and align the face vertically, to help the second model deal with less pose variation.

I believe we can push that concept of specialised models further to involve multiple networks. We could train different CNNs for specific face angles, say one network for large angles (60°+), one for intermediate angles (30° to 60°) and one for frontal angles (0° to 30°), and somehow combine them.


Alternatively, different networks could learn different face parts: one for the eyes, one for the mouth, etc., with the hope that simplifying the work of each model will increase precision. A first pass could detect the rough position and orientation of the head as above, then crop and align the area of interest for the relevant expert network.

Conclusion

The application of deep learning to face tracking is an active area of research where a lot of progress has been made in the last couple of years. There is still room for improvement, in particular for tracking videos. The temporal coherence of predicted facial features and the precision for key areas like the eyes and mouth remain a challenge.

I am excited to see what the future holds on that topic!


[1]: N. Ma et al., ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design (2018), ECCV

[2]: X. Dong et al., Style Aggregated Network for Facial Landmark Detection (2018), CVPR

[3]: X. Zhu et al., Face Alignment Across Large Poses: A 3D Solution (2016), CVPR

[4]: Z. Feng et al., Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks (2018), CVPR

[5]: J. Guo et al., Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment (2018), BMVC

[6]: X. Dong et al., Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors (2018), CVPR

[7]: Y. Kartynnik et al., Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs (2019), CVPR

[8]: G. Casiez et al., 1€ filter: a simple speed-based low-pass filter for noisy input in interactive systems (2012), CHI

[9]: V.H. Phung et al., A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets (2019), Applied Sciences

Translated from: https://towardsdatascience.com/real-time-3d-face-tracking-with-deep-learning-963b91bb5ad4
