1. Autoencoders
An autoencoder first maps the input $x\in[0,1]^{d}$ to a hidden representation $y\in[0,1]^{d^{'}}$ through the following mapping (the encoder):
$y=s(Wx+b)$
where $s$ is a nonlinear function such as the sigmoid. The hidden representation $y$, i.e. the code, is then mapped back (via the decoder) to a reconstruction $z$ with the same shape as the input $x$:
$z=s(W^{'}y+b^{'})$
Here $W^{'}$ does not necessarily denote the transpose of $W$. $z$ should be viewed as a prediction of $x$ given the code $y$. Optionally, the weight matrix $W^{'}$ of the reverse mapping may be constrained to be the transpose of the forward mapping:
$W^{'}=W^{T}$, which is referred to as tied weights. The parameters of this model ($W$, $b$, $b^{'}$) are optimized so that the average reconstruction error is minimized.
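As a concrete illustration, here is a minimal NumPy sketch of one encode/decode pass with tied weights (the dimensions, the initialization range, and the variable names are hypothetical choices for this example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
d, d_prime = 6, 3                     # hypothetical: d visible units, d' hidden units
W = rng.uniform(-0.1, 0.1, size=(d, d_prime))
b = np.zeros(d_prime)                 # hidden bias
b_prime = np.zeros(d)                 # visible bias

x = rng.uniform(0.0, 1.0, size=d)     # an input in [0, 1]^d

y = sigmoid(x @ W + b)                # encoder: y = s(Wx + b)
z = sigmoid(y @ W.T + b_prime)        # decoder with tied weights: z = s(W'y + b'), W' = W^T

print(y.shape, z.shape)
```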
The reconstruction error can be measured in many ways, depending on what distributional assumptions are appropriate for the input given the code. The traditional squared error $L(x,z)=||x-z||^{2}$ can be used. If the input is interpreted as either a bit vector or a vector of bit probabilities, the cross-entropy of the reconstruction can be used as well:
$L_{H}(x,z)=-\sum_{k=1}^{d}[x_{k}\log z_{k}+(1-x_{k})\log(1-z_{k})]$
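This cross-entropy is straightforward to compute directly. The sketch below evaluates it in plain NumPy for a binary input; the small clipping constant is a numerical safeguard added for this example, not part of the formula:

```python
import numpy as np

def cross_entropy(x, z, eps=1e-12):
    # L_H(x, z) = -sum_k [x_k log z_k + (1 - x_k) log(1 - z_k)]
    z = np.clip(z, eps, 1 - eps)      # avoid log(0) for saturated reconstructions
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

x = np.array([1.0, 0.0, 1.0])
print(cross_entropy(x, x))                  # near-perfect reconstruction: loss ~ 0
print(cross_entropy(x, np.full(3, 0.5)))    # uninformative reconstruction: 3 * log(2)
```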
The hope is that the code $y$ is a distributed representation that captures the coordinates along the main factors of variation in the data, somewhat analogous to PCA.
Indeed, if there is a single linear hidden layer and the network is trained with the mean squared error criterion, then the $k$ hidden units learn to project the input onto the span of the first $k$ principal components of the data. If the hidden layer is nonlinear, the autoencoder behaves differently from PCA, because it can capture other aspects of the input distribution.
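The PCA connection can be checked numerically: projecting onto the first $k$ principal components (which is what a linear autoencoder trained with squared error would learn, up to a rotation of the hidden space) yields a lower reconstruction error than projecting onto any other $k$-dimensional subspace. A small NumPy sketch on toy data (all names, dimensions, and scales here are hypothetical):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, k = 200, 5, 2
# Anisotropic, zero-mean toy data: a few directions carry most of the variance.
X = rng.randn(n, d) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])
X -= X.mean(axis=0)

# Top-k principal directions from the SVD of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                        # d x k "encoder" weights

X_pca = (X @ V_k) @ V_k.T             # encode then decode: project onto principal subspace
Q, _ = np.linalg.qr(rng.randn(d, k))  # a random k-dim subspace for comparison
X_rand = (X @ Q) @ Q.T

err_pca = np.mean((X - X_pca) ** 2)
err_rand = np.mean((X - X_rand) ** 2)
print(err_pca < err_rand)             # the principal subspace reconstructs better
```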
Because $y$ can be viewed as a lossy compression of $x$, it cannot be a good (small-loss) compression for all inputs. Optimization makes it a good compression for the training examples, and hopefully for other inputs as well, but not for arbitrary inputs: "other inputs" here generally means inputs drawn from the same distribution as the training set. That is the sense in which an autoencoder generalizes: it yields low reconstruction error on test examples from the same distribution as the training examples, but generally high reconstruction error on inputs sampled randomly from the input space.
First, let's implement the autoencoder in Theano:
```python
import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams


class dA(object):
    """Auto-Encoder class

    A denoising autoencoder reconstructs the true input by mapping a corrupted
    version of it to the hidden space and then mapping back to the input space:
    (1) corrupt the true input
    (2) map the corrupted input to the hidden space
    (3) reconstruct the true input
    (4) compute the reconstruction error

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                    (1)

        y = s(W \tilde{x} + b)                                          (2)

        z = s(W' y + b')                                                (3)

        L(x, z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log(1-z_k)]     (4)

    """

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class: number of visible units, number of hidden
        units, and the corruption level. The constructor also receives
        symbolic variables for the input, weights and biases.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to generate
                          weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `numpy_rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between the dA and
                     another architecture; if dA should be standalone set this
                     to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between the dA and
                     another architecture; if dA should be standalone set this
                     to None
        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so that
            # the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden units
        self.b = bhid
        # b_prime corresponds to the bias of the visible units
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]
```
In a stacked autoencoder, the output of one layer serves as the input of the next.
Now let's compute the hidden representation and the reconstructed signal:
```python
    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """ Computes the reconstructed input given the values of the hidden
        layer """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
```
Using these functions, we can compute the cost and the parameter updates:
```python
    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in the minibatch
        L = -T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)
```
We can now define a function that repeatedly applies these updates so as to minimize the reconstruction error:
```python
    da = dA(
        numpy_rng=rng,
        theano_rng=theano_rng,
        input=x,
        n_visible=28 * 28,
        n_hidden=500
    )

    cost, updates = da.get_cost_updates(
        corruption_level=0.,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        }
    )
```
If there is no constraint besides minimizing the reconstruction error, one would naturally expect an autoencoder whose code has the same dimension as the input to simply learn the identity mapping, reconstructing the input perfectly.
However, experiments reported in [Bengio07] show that, in practice, nonlinear autoencoders with more hidden units than inputs (called overcomplete) yield more useful representations ("useful" here meaning lower classification error).
To achieve good reconstruction of continuous inputs, a one-hidden-layer autoencoder with nonlinear hidden units needs small weights in the first (encoding) layer, to bring the hidden units into the approximately linear regime of their activation function, and very large weights in the decoding layer.
With binary inputs, the reconstruction error must likewise be minimized. Since explicit or implicit regularization makes it difficult for the decoding weights to reach large values, the optimization algorithm finds encodings that only work well for examples similar to those in the training set, which is exactly what we want.
2. Denoising Autoencoders (dA)
The idea behind denoising autoencoders is simple: to force the hidden layer to learn a more robust representation and prevent it from simply learning an equivalent (identity-like) representation, we train the autoencoder to reconstruct the true input from a corrupted version of it.
A denoising autoencoder does two things: encode the input and undo the negative effect of the corruption.
In [Vincent08], the stochastic corruption process consists in randomly setting some of the inputs to zero; the denoising autoencoder thus tries to predict the corrupted (i.e. missing) values from the uncorrupted (i.e. non-missing) values.
To convert the autoencoder above into a denoising one, all we need to add is an operation that stochastically corrupts the input. The input can be corrupted in many ways; here we only consider randomly setting a subset of its entries to zero.
The code below generates binomial random numbers with the same shape as the input and multiplies them elementwise with the input:
```python
    from theano.tensor.shared_randomstreams import RandomStreams

    def get_corrupted_input(self, input, corruption_level):
        """ This function keeps ``1 - corruption_level`` entries of the inputs
        the same and zeroes out a randomly selected subset of size
        ``corruption_level``.
        Note : the first argument of theano_rng.binomial is the shape (size)
               of the random numbers that it should produce;
               the second argument is the number of trials;
               the third argument is the probability of success of any trial.

        This will produce an array of 0s and 1s where 1 has a probability of
        ``1 - corruption_level`` and 0 has a probability of
        ``corruption_level``.
        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level) * input
```
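The same masking can be mimicked in plain NumPy, which makes it easy to check that roughly a `1 - corruption_level` fraction of the entries survives. A sketch with hypothetical shapes:

```python
import numpy as np

rng = np.random.RandomState(0)
corruption_level = 0.3
x = np.ones((1000, 20))   # dummy batch of ones, so the kept fraction is easy to read off

# Same idea as theano_rng.binomial(size=..., n=1, p=1 - corruption_level):
# each entry is kept with probability 1 - corruption_level, zeroed otherwise.
mask = rng.binomial(n=1, p=1 - corruption_level, size=x.shape)
x_tilde = mask * x

print(x_tilde.mean())     # fraction of surviving entries, close to 0.7
```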
With this, the final dA class becomes:
```python
class dA(object):
    """Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first into a latent space and reprojecting
    it afterwards back into the input space. Please refer to Vincent et al.,
    2008 for more details. If x is the input, then equation (1) computes a
    partially destroyed version of x by means of a stochastic mapping q_D.
    Equation (2) computes the projection of the input into the latent space.
    Equation (3) computes the reconstruction of the input, while equation (4)
    computes the reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                    (1)

        y = s(W \tilde{x} + b)                                          (2)

        z = s(W' y + b')                                                (3)

        L(x, z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log(1-z_k)]     (4)

    """

    def __init__(self, numpy_rng, theano_rng=None, input=None,
                 n_visible=784, n_hidden=500, W=None, bhid=None, bvis=None):
        """
        Initialize the dA class by specifying the number of visible units
        (the dimension d of the input), the number of hidden units (the
        dimension d' of the latent or hidden space) and the corruption level.
        The constructor also receives symbolic variables for the input,
        weights and bias. Such symbolic variables are useful when, for
        example, the input is the result of some computations, or when the
        weights are shared between the dA and an MLP layer. When dealing with
        SdAs this always happens: the dA on layer 2 gets as input the output
        of the dA on layer 1, and the weights of the dA are used in the second
        stage of training to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to generate
                          weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `numpy_rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between the dA and
                     another architecture; if dA should be standalone set this
                     to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between the dA and
                     another architecture; if dA should be standalone set this
                     to None
        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # from -4.*sqrt(6./(n_visible+n_hidden)) and
            # 4.*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype theano.config.floatX so that
            # the code is runnable on GPU
            initial_W = numpy.asarray(numpy_rng.uniform(
                low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                size=(n_visible, n_hidden)), dtype=theano.config.floatX)
            W = theano.shared(value=initial_W, name='W')

        if not bvis:
            bvis = theano.shared(value=numpy.zeros(n_visible,
                                                   dtype=theano.config.floatX),
                                 name='bvis')

        if not bhid:
            bhid = theano.shared(value=numpy.zeros(n_hidden,
                                                   dtype=theano.config.floatX),
                                 name='bhid')

        self.W = W
        # b corresponds to the bias of the hidden units
        self.b = bhid
        # b_prime corresponds to the bias of the visible units
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]

    def get_corrupted_input(self, input, corruption_level):
        """ This function keeps ``1 - corruption_level`` entries of the inputs
        the same and zeroes out a randomly selected subset of size
        ``corruption_level``.
        Note : the first argument of theano_rng.binomial is the shape (size)
               of the random numbers it should produce;
               the second argument is the number of trials;
               the third argument is the probability of success of any trial.

        This will produce an array of 0s and 1s where 1 has a probability of
        ``1 - corruption_level`` and 0 has a probability of
        ``corruption_level``.
        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level) * input

    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """ Computes the reconstructed input given the values of the hidden
        layer """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)

    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per example
        #        in the minibatch
        L = -T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the cross-entropy
        #        cost of the reconstruction of the corresponding example of
        #        the minibatch. We need to compute the average of all these
        #        to get the cost of the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)
```
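To see the corrupt/encode/decode/update pipeline work end to end without Theano, here is a self-contained NumPy sketch of one denoising autoencoder trained by batch gradient descent on toy redundant binary data. The data, hyperparameters, and hand-derived gradients are illustrative assumptions for this sketch, not part of the tutorial code:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
n_samples, n_visible, n_hidden = 500, 20, 8
lr, corruption = 0.5, 0.3

# Toy data: 5 random bits, each repeated 4 times, so zeroed-out
# entries can be recovered from their surviving copies.
base = (rng.rand(n_samples, 5) < 0.5).astype(float)
X = np.repeat(base, 4, axis=1)

r = 4 * np.sqrt(6.0 / (n_hidden + n_visible))   # same init range as the dA class
W = rng.uniform(-r, r, size=(n_visible, n_hidden))
b = np.zeros(n_hidden)
b_prime = np.zeros(n_visible)

losses = []
for step in range(200):
    mask = rng.binomial(1, 1 - corruption, size=X.shape)
    x_tilde = mask * X                          # (1) corrupt
    y = sigmoid(x_tilde @ W + b)                # (2) encode
    z = sigmoid(y @ W.T + b_prime)              # (3) decode with tied weights
    z_c = np.clip(z, 1e-9, 1 - 1e-9)
    # (4) mean cross-entropy against the *uncorrupted* input
    losses.append(-np.sum(X * np.log(z_c) + (1 - X) * np.log(1 - z_c), axis=1).mean())

    # Hand-derived gradients: with sigmoid output + cross-entropy, the output
    # delta simplifies to (z - X); W gets contributions from both the encoder
    # and the decoder because the weights are tied.
    dz = (z - X) / n_samples
    dy = (dz @ W) * y * (1 - y)
    gW = x_tilde.T @ dy + dz.T @ y
    W -= lr * gW
    b -= lr * dy.sum(axis=0)
    b_prime -= lr * dz.sum(axis=0)

print(losses[0], losses[-1])                    # the cost drops as training proceeds
```

This mirrors what `get_cost_updates` builds symbolically, with Theano's `T.grad` replaced by the explicit backpropagation formulas for this small tied-weight architecture.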