VAE Handwritten Digit Project (with Detailed Comments): Understanding the Variational Autoencoder (VAE) Through a Small Project

Project and code source:

https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/variational_autoencoder.py

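Before reading the code, it helps to get a basic grasp of the concepts behind a VAE. I recommend this Zhihu article: https://zhuanlan.zhihu.com/p/55557709

and also https://yuanxiaosc.github.io/2018/08/26/%E5%8F%98%E5%88%86%E8%87%AA%E7%BC%96%E7%A0%81%E5%99%A8/

The second one in particular is where the full VAE pipeline finally clicked for me. While working through it I noticed a few small tricks in the model, which I have marked in the comments together with my own understanding. This should let people who have never touched a VAE get started directly and understand it from the program.

The translated comments below are all my personal understanding and are meant to be friendly to VAE newcomers. After reading them you should have a deeper sense of how a VAE works. If anything in the comments is unclear or wrong, please point it out.

Below is the code for a project that generates MNIST handwritten digit images. It runs as-is under TensorFlow 1.x, and at the end it displays the images decoded from a grid of evenly spaced z values, so you can see how the generated images change from one z to the next.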

# -*- coding: utf-8 -*-
from __future__ import division, print_function, absolute_import

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import tensorflow as tf

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST/", one_hot=True)

# Parameters
learning_rate = 0.001
num_steps = 3000  # number of training iterations (alternative value: 30000)
batch_size = 64
print(num_steps)
# Network Parameters
image_dim = 784 # MNIST images are 28x28 pixels
hidden_dim = 512
latent_dim = 2  # dimensionality of the latent vector

# A custom initialization (see Xavier Glorot init)
def glorot_init(shape):
    return tf.random_normal(shape=shape, stddev=1. / tf.sqrt(shape[0] / 2.))

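# When initializing the network weights we need reasonable random values, so that
# units do not start out identical (the symmetry problem).
# Weights are usually drawn from a zero-mean random distribution (Gaussian or uniform).
# If the initial weights are too small (the variance is too small), changes in the
# input signal produce changes that are too small in the later layers.
# Likewise, if the initial weights are too large, changes in the input signal produce
# changes that are too large in the later layers.
# The Xavier method provides a reasonable way to initialize the weights.

# In short: initialize a neuron's weights from a zero-mean random distribution
# (Gaussian or uniform) with variance Var(W) = 1 / n_in,

# where n_in is the number of inputs to that neuron.
# Note that glorot_init above uses stddev = 1 / sqrt(n_in / 2), i.e. variance 2 / n_in.
# tf.random_normal draws a tensor of the given shape from the specified normal distribution.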

# Variables
weights = {
    'encoder_h1': tf.Variable(glorot_init([image_dim, hidden_dim])),
    'z_mean': tf.Variable(glorot_init([hidden_dim, latent_dim])),  # mean of Q
    'z_std': tf.Variable(glorot_init([hidden_dim, latent_dim])),   # log-variance of Q (despite the name)
    'decoder_h1': tf.Variable(glorot_init([latent_dim, hidden_dim])),
    'decoder_out': tf.Variable(glorot_init([hidden_dim, image_dim]))
}
biases = {
    'encoder_b1': tf.Variable(glorot_init([hidden_dim])),
    'z_mean': tf.Variable(glorot_init([latent_dim])),
    'z_std': tf.Variable(glorot_init([latent_dim])),
    'decoder_b1': tf.Variable(glorot_init([hidden_dim])),
    'decoder_out': tf.Variable(glorot_init([image_dim]))
}

# Building the encoder
input_image = tf.placeholder(tf.float32, shape=[None, image_dim])
encoder = tf.matmul(input_image, weights['encoder_h1']) + biases['encoder_b1']
encoder = tf.nn.tanh(encoder)
print(encoder)
z_mean = tf.matmul(encoder, weights['z_mean']) + biases['z_mean']
print(z_mean)
z_std = tf.matmul(encoder, weights['z_std']) + biases['z_std']


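# x -> z -> x
# Assume there is a distribution Q(z) whose samples z are likely to be transformed
# into the x we want.
# Assume every z can be transformed into an x, and every x maps back to a z; that
# distribution is P(z|X). Our goal is to make Q approach P(z|X).
# We use the KL divergence, the "distance" from Q to P(z|X), to measure how good Q is.
# The KL divergence quantifies the difference between two probability distributions
# P and Q. It is also called relative entropy.

# The prior of z is Gaussian, so we simply assume its posterior is Gaussian too.
#     A Gaussian is described by two quantities, mean and variance, so the network
#     outputs two values: one for the mean and one for the variance.
#     The distribution defined by this mean and variance is Q, and z is obtained by
#     sampling from Q. We have an analytic expression for Q, but after feeding x into
#     the encoder and getting Q's mean and variance, z comes from sampling this
#     distribution, and gradients cannot propagate back through a sampling operation.
# The VAE authors use a trick: introduce epsilon ~ N(0, 1), with mean 0 and variance 1.
#     We sample one epsilon from that distribution and set
#     z = (mean of Q) + (standard deviation of Q) * epsilon.
#     This avoids sampling from Q directly. We only sample from the distribution of
#     epsilon, and epsilon needs no gradient,
#     so gradients can flow back through the whole network.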

# Sampler: Normal (gaussian) random distribution
eps = tf.random_normal(tf.shape(z_std), dtype=tf.float32, mean=0., stddev=1.0,
                       name='epsilon')
z = z_mean + tf.exp(z_std / 2) * eps
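# mean of Q + standard deviation of Q * eps
# We can simply think of the latent variable as transformation coefficients of the data.
# Why is this not written directly as mean + eps * std? A variance must be positive,
# but a network output is not necessarily positive.
# So we treat the network output z_std as the log of the variance, and recover the
# standard deviation as exp(z_std / 2).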

# Building the decoder (with scope to re-use these layers later)
decoder = tf.matmul(z, weights['decoder_h1']) + biases['decoder_b1']
decoder = tf.nn.tanh(decoder)
decoder = tf.matmul(decoder, weights['decoder_out']) + biases['decoder_out']
decoder = tf.nn.sigmoid(decoder)


# Define VAE Loss
def vae_loss(x_reconstructed, x_true):
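    # Reconstruction loss
    encode_decode_loss = x_true * tf.log(1e-10 + x_reconstructed) \
                         + (1 - x_true) * tf.log(1e-10 + 1 - x_reconstructed)
    # This is a per-pixel binary cross-entropy, a common loss for image generation.
    encode_decode_loss = -tf.reduce_sum(encode_decode_loss, 1)
    # KL divergence loss: measures how closely the latent distribution matches
    # the unit Gaussian N(0, 1).
    # The KL divergence (relative entropy) equals cross-entropy minus entropy.
    # z_std holds the log-variance, hence the exp(z_std) term below.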
    kl_div_loss = 1 + z_std - tf.square(z_mean) - tf.exp(z_std)
    kl_div_loss = -0.5 * tf.reduce_sum(kl_div_loss, 1)
    return tf.reduce_mean(encode_decode_loss + kl_div_loss)

loss_op = vae_loss(decoder, input_image)
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session() as sess:

    # Run the initializer
    sess.run(init)

    for i in range(1, num_steps+1):
        # Prepare Data
        # Get the next batch of MNIST data (only images are needed, not labels)
        batch_x, _ = mnist.train.next_batch(batch_size)
        print (i)
        # Train
        feed_dict = {input_image: batch_x}
        _, l = sess.run([train_op, loss_op], feed_dict=feed_dict)
        if i % 1000 == 0 or i == 1:
            print('Step %i, Loss: %f' % (i, l))

    # Testing
    # Generator takes noise as input
    noise_input = tf.placeholder(tf.float32, shape=[None, latent_dim])
    # Rebuild the decoder to create image from noise
    decoder = tf.matmul(noise_input, weights['decoder_h1']) + biases['decoder_b1']
    decoder = tf.nn.tanh(decoder)
    decoder = tf.matmul(decoder, weights['decoder_out']) + biases['decoder_out']
    decoder = tf.nn.sigmoid(decoder)

    # Building a manifold of generated digits
    n = 20
    x_axis = np.linspace(-3, 3, n)
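    # Why -3 to 3? Because that interval covers almost all of the mass of a
    # standard normal distribution.
    # With a VAE we can work out by hand how z maps to output images, so we sweep z
    # in a fixed order and compare how the output evolves.
    print(x_axis)  # np.linspace arguments are (start, end, num_points)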
    y_axis = np.linspace(-3, 3, n)

    canvas = np.empty((28 * n, 28 * n))
    for i, yi in enumerate(x_axis):
        for j, xi in enumerate(y_axis):
            z_mu = np.array([[xi, yi]] * batch_size)
            x_mean = sess.run(decoder, feed_dict={noise_input: z_mu})
            print(x_mean.shape)  # (64, 784), i.e. (batch_size, number of pixels)
            canvas[(n - i - 1) * 28:(n - i) * 28, j * 28:(j + 1) * 28] = \
            x_mean[0].reshape(28, 28)

    plt.figure(figsize=(8, 10))  # size of the displayed figure
    Xi, Yi = np.meshgrid(x_axis, y_axis)  # coordinate matrices with x_axis as rows and y_axis as columns (unused below)
    plt.imshow(canvas, origin="upper", cmap="gray")
    plt.show()
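
The KL term inside vae_loss is the closed-form KL divergence between the encoder's Gaussian and the N(0, 1) prior, written in terms of the log-variance. As a rough sanity check, here is a minimal standard-library sketch that compares that formula with a numerical integral for a single latent dimension. The function names kl_closed_form and kl_numeric and the test values are my own illustrative choices, not part of the original code.

```python
import math


def kl_closed_form(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)) for one dimension, with log_var = log(sigma^2)."""
    return -0.5 * (1.0 + log_var - mu ** 2 - math.exp(log_var))


def kl_numeric(mu, log_var, steps=20001, width=10.0):
    """Riemann-sum approximation of the integral of q(z) * log(q(z) / p(z))."""
    sigma = math.exp(0.5 * log_var)
    lo, hi = mu - width * sigma, mu + width * sigma
    dz = (hi - lo) / (steps - 1)
    total = 0.0
    for k in range(steps):
        z = lo + k * dz
        # log-densities of q = N(mu, sigma^2) and p = N(0, 1) at z
        log_q = -0.5 * math.log(2 * math.pi) - math.log(sigma) - (z - mu) ** 2 / (2 * sigma ** 2)
        log_p = -0.5 * math.log(2 * math.pi) - z ** 2 / 2
        total += math.exp(log_q) * (log_q - log_p) * dz
    return total


mu, log_var = 1.0, math.log(0.25)  # illustrative values; sigma = 0.5
print(kl_closed_form(mu, log_var), kl_numeric(mu, log_var))
```

The two printed numbers should agree closely, and the closed form gives zero when mu = 0 and log_var = 0, that is, when Q already equals the prior.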

A few small tricks worth understanding:

1. Setting up the normally distributed random number epsilon

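The prior of z is Gaussian, so we simply assume its posterior is Gaussian as well. A Gaussian is described by two quantities, mean and variance, so we let the network output two values, one for each. The distribution defined by this mean and variance is Q, and we obtain z by sampling from Q. We have an analytic expression for Q, but after feeding x into the encoder and getting Q's mean and variance, z is produced by sampling from that distribution, and gradients cannot propagate back through a sampling operation.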

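The VAE authors use a trick: introduce epsilon ~ N(0, 1), with mean 0 and variance 1. We sample one epsilon from this distribution and set z = (mean of Q) + (standard deviation of Q) * epsilon. This avoids sampling from Q directly. We only sample from the distribution of epsilon, and epsilon needs no gradient, so gradients can flow back through the whole network.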
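
To make the trick concrete, here is a minimal standard-library sketch (no TensorFlow). The helper sample_z and the values of mu and sigma are hypothetical, chosen only for illustration. It draws epsilon from N(0, 1), forms z = mu + sigma * epsilon, and checks that the resulting samples have roughly the intended mean and standard deviation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_z(mu, sigma):
    """Reparameterized sample: all randomness lives in eps."""
    eps = random.gauss(0.0, 1.0)  # eps ~ N(0, 1); no gradient is needed for it
    return mu + sigma * eps       # deterministic function of mu and sigma given eps


mu, sigma = 1.5, 0.5  # illustrative values, not from the original post
samples = [sample_z(mu, sigma) for _ in range(20000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

Because z is an ordinary arithmetic expression in mu and sigma once epsilon is fixed, an automatic differentiation framework can differentiate it with respect to both, which is what the line z = z_mean + tf.exp(z_std / 2) * eps relies on in the code above.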

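2. Setting up the latent variable z (we can simply think of the latent variable as transformation coefficients of the data)

From point 1, the latent variable in a VAE is z = (mean of Q) + (standard deviation of Q) * eps. Here we can see why the code does not compute mean + epsilon * std directly from the network output: a variance must be positive, but a neural network's output is not necessarily positive. So we treat the network output as the log of the variance and recover the standard deviation as exp(z_std / 2).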
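
A small sketch of that mapping, using only the standard library. The helper name std_from_log_var and the sample inputs are hypothetical. Whatever real number the network emits, interpreting it as a log-variance yields a strictly positive standard deviation.

```python
import math


def std_from_log_var(log_var):
    """Map an unconstrained network output to a strictly positive standard deviation."""
    return math.exp(0.5 * log_var)


# Arbitrary example outputs, including a negative one.
for raw in (-4.0, 0.0, 3.0):
    print(raw, std_from_log_var(raw))
```

This is the same computation as tf.exp(z_std / 2) in the sampler, just on plain floats.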

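3. The trick in the final image display:

With a VAE we can work out by hand how z maps to output images, so we feed z values in a fixed order and compare how the generated images evolve.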
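
The grid of z values can be sketched without NumPy or TensorFlow. The helpers linspace and normal_mass below are hypothetical stand-ins written for this illustration. The sketch builds the 20 x 20 grid of latent points over [-3, 3] and computes how much of a standard normal's probability mass that interval covers (about 99.7%, which is why the code comment says the range covers almost all points).

```python
import math


def linspace(start, stop, n):
    """Pure-Python stand-in for np.linspace(start, stop, n)."""
    step = (stop - start) / (n - 1)
    return [start + k * step for k in range(n)]


def normal_mass(a, b):
    """Probability that a standard normal variable falls in [a, b]."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return cdf(b) - cdf(a)


n = 20
axis = linspace(-3.0, 3.0, n)
grid = [(x, y) for y in axis for x in axis]  # n * n latent points to feed the decoder
print(len(grid), normal_mass(-3.0, 3.0))
```

In the original code each grid point is repeated batch_size times before being fed to the decoder, and only the first decoded image is placed on the canvas.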


4. [Figure]

5. Two figures for understanding the overall structure: [Figures]