Word2Vec: How to check the trained model for the value of the vector? (Answers)

【Question title】: Word2Vec: How to check the trained model for the value of the vector?
【Posted】: 2018-09-14 03:58:47
【Question description】:

I recently tried using word2vec. I trained my model and obtained all the assigned vectors. However, I don't know how to find the value of each vector.

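I tried printing the model, but it just outputs all the vectors it has trained. I still don't understand it: I thought the vectors were per word, but somehow everything is in one list.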

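My understanding of word2vec is that each word (call it W1) has its own vectors, and each vector represents the similarity between the current word (W1) and a second word (W2). Since each word is assigned a sparse vector, it should contain many vectors just for W1. However, when I print my model, I get output for (perhaps) only one word, and I'm not sure which word it is. Can anyone help me?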

My code:

import collections
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

batch_size = 20
embedding_size = 2
num_sampled = 15


sentences  = ["I have something that I want to say to him",
            "How are you",
            "We can see many stars tonight",
            "That's our house",
            "sung likes cats",
            "she loves dogs",
            "Do you know what he has done",
            "cats are great companions when they want to be",
            "We need to invest in clean, renewable energy",
            "women love his man",
            "queen love his king",
            "girl love his boy",
            "The line is too long. Why don't you come back tomorrow",
            "man and women roam in park",
            "Does it really matter",
            "dynasty king remain mortal"]

words = " ".join(sentences).split()
count = collections.Counter(words).most_common()
# Build dictionaries
reverse_dictionary = [i[0] for i in count] #reverse dic, idx -> word
dic = {w: i for i, w in enumerate(reverse_dictionary)} #dic, word -> id
voc_size = len(dic)
data = [dic[word] for word in words]


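# Build ([left context, right context], target) pairs with a window of one word on each side
cbow_pairs = []
for i in range(1, len(data)-1) :
    cbow_pairs.append([[data[i-1], data[i+1]], data[i]])

# Expand each CBOW pair into two skip-gram [target, context] pairs
skip_gram_pairs = []
for c in cbow_pairs:
    skip_gram_pairs.append([c[1], c[0][0]])
    skip_gram_pairs.append([c[1], c[0][1]])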



def generate_batch(size):
    assert size < len(skip_gram_pairs)
    x_data=[]
    y_data = []
    r = np.random.choice(range(len(skip_gram_pairs)), size, replace=False)
    for i in r:
        x_data.append(skip_gram_pairs[i][0])  # n dim
        y_data.append([skip_gram_pairs[i][1]])  # n, 1 dim
    return x_data, y_data

# Input data
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Ops and variables pinned to the CPU because of missing GPU implementation
with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs) # lookup table

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.random_uniform([voc_size, embedding_size],-1.0, 1.0))
nce_biases = tf.Variable(tf.zeros([voc_size]))

# Compute the average NCE loss for the batch.
# This does the magic:
#   tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes ...)
# It automatically draws negative samples when we evaluate the loss.
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed, num_sampled, voc_size))
# Use the adam optimizer
train_op = tf.train.AdamOptimizer(1e-1).minimize(loss)


# Launch the graph in a session
with tf.Session() as sess:
    # Initializing all variables
    tf.global_variables_initializer().run()

    for step in range(100):
        batch_inputs, batch_labels = generate_batch(batch_size)
        _, loss_val = sess.run([train_op, loss],
                feed_dict={train_inputs: batch_inputs, train_labels: batch_labels})

    # Final embeddings are ready for you to use. Need to normalize for practical use
    trained_embeddings = embeddings.eval()
    print(trained_embeddings)

Current output: somehow this output seems to be for only one word, not for all the words in the corpus.

[[-0.751498   -1.4963825 ]
 [-0.7022982  -1.4211462 ]
 [-1.6240289  -0.96706766]
 [-3.2109795  -1.2967492 ]
 [-0.8835893  -1.5251521 ]
 [-1.4316636  -1.4322135 ]
 [-1.8665589  -1.1734825 ]
 [-0.4726948  -1.836668  ]
 [-0.11171409 -2.0847342 ]
 [-1.0599283  -0.9792351 ]
 [-1.6748023  -0.9584413 ]
 [-0.8855507  -1.3226773 ]
 [-0.9565117  -1.5730425 ]
 [-1.2891663  -1.1687953 ]
 [-0.06940217 -1.7782353 ]
 [-0.92220575 -1.8264929 ]
 [-3.2258956  -1.105678  ]
 [-2.4262347  -0.9806146 ]
 [-0.36716968 -2.3782976 ]
 [-0.4972397  -1.9926786 ]
 [-0.65995616 -1.2129989 ]
 [-0.53334516 -1.5244756 ]
 [-1.4961753  -0.5592766 ]
 [-0.57391864 -1.9852302 ]
 [-0.6580112  -1.0749325 ]
 [-0.7821078  -1.598069  ]
 [-1.264001   -1.002861  ]
 [-0.23881587 -2.103974  ]
 [-0.3729657  -1.9456012 ]
 [-0.9266953  -1.516872  ]
 [-1.4948957  -1.1232641 ]
 [-1.109361   -1.3108519 ]
 [-2.0748782  -0.93853486]
 [-2.0241299  -0.8716516 ]
 [-0.9448593  -1.0530868 ]
 [-1.4578291  -0.57673496]
 [-0.31915158 -1.4830168 ]
 [-1.2568909  -1.0629684 ]
 [-0.50458056 -2.2233846 ]
 [-1.2059065  -1.0402468 ]
 [-0.17204402 -1.8913956 ]
 [-1.5484996  -1.0246676 ]
 [-1.7026784  -1.4470854 ]
 [-2.114282   -1.2304462 ]
 [-1.6737207  -1.2598573 ]
 [-0.9031189  -1.8086503 ]
 [-1.4084693  -0.9171761 ]
 [-1.261698   -1.5333931 ]
 [-2.7891722  -0.69629264]
 [-2.7634912  -1.0250676 ]
 [-2.171037   -1.3402877 ]
 [-1.5588827  -1.4741637 ]
 [-2.012083   -1.6028976 ]
 [-1.4286829  -1.485801  ]
 [-0.06908941 -2.370034  ]
 [-1.3277153  -1.2935033 ]
 [-0.52055264 -1.2549478 ]
 [-2.4971442  -0.6335571 ]
 [-2.7244987  -0.6136059 ]
 [-0.7155211  -1.8717885 ]
 [-2.1862056  -0.78832203]
 [-2.068198   -0.96536046]
 [-0.9023069  -1.6741301 ]
 [-0.39895654 -1.584905  ]
 [-0.656657   -1.6787726 ]
 [ 0.13354267 -2.105389  ]
 [-1.248123   -1.7273897 ]
 [-0.6168909  -1.3929827 ]
 [-0.1866242  -2.0612721 ]
 [-2.3246803  -1.1561321 ]
 [ 0.88145804  0.35487294]]

Example of expected output:

[-0.751498 -1.4963825 ] shown together with the word these two values belong to, for example "How" or "are".

【Question discussion】:

Can anyone help me with this?

Tags: python word2vec

【Solution 1】:

If you have trained a Word2Vec model to learn a 2-dimensional vector for each word, then every word will have exactly one 2-dimensional vector.

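I can't evaluate your full implementation; you should probably use a known-good, off-the-shelf Word2Vec library. Also, Word2Vec really depends on large, varied training data: toy-sized examples usually won't show realistic behavior or benefits.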

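But since your sentences appear to contain a few dozen unique words, it seems correct that the printout of your full trained_embeddings contains a few dozen 2-dimensional vectors.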
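As a rough check of this point, the vocabulary can be rebuilt the same way the question's code builds it, by counting whitespace-separated tokens. This is a minimal sketch that assumes the `sentences` list is copied unchanged from the question and needs no TensorFlow:

```python
import collections

# Sentences copied from the question's code.
sentences = ["I have something that I want to say to him",
             "How are you",
             "We can see many stars tonight",
             "That's our house",
             "sung likes cats",
             "she loves dogs",
             "Do you know what he has done",
             "cats are great companions when they want to be",
             "We need to invest in clean, renewable energy",
             "women love his man",
             "queen love his king",
             "girl love his boy",
             "The line is too long. Why don't you come back tomorrow",
             "man and women roam in park",
             "Does it really matter",
             "dynasty king remain mortal"]

# Same tokenization as the question: split on whitespace, keep case and punctuation.
words = " ".join(sentences).split()
count = collections.Counter(words).most_common()

reverse_dictionary = [w for w, _ in count]              # row index -> word
dic = {w: i for i, w in enumerate(reverse_dictionary)}  # word -> row index
voc_size = len(dic)

print(voc_size)               # number of rows in the embeddings matrix
print(reverse_dictionary[0])  # word stored in row 0
```

With this tokenization the vocabulary has 71 entries, which matches the 71 rows printed in the question. Row 0 belongs to "to", the only token that occurs four times, so the printed matrix covers every word in the corpus, one row per word.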

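If you want just one word's vector, you need to look it up in the full set, at whatever position that word was assigned before training.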
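In the question's code the position of each word is recorded in `dic` (word to row index), so a single word's vector is one row of `trained_embeddings`. The following is a minimal sketch under stated assumptions: the helper name `vector_for` is hypothetical, the vocabulary is a tiny stand-in, and the pairing of words with these particular rows is for illustration only.

```python
import numpy as np

# Stand-in for the question's `dic` (word -> row index); illustrative only.
dic = {"to": 0, "you": 1, "love": 2}

# Stand-in for `trained_embeddings`: one row per word, embedding_size columns.
trained_embeddings = np.array([
    [-0.751498, -1.4963825],
    [-0.7022982, -1.4211462],
    [-1.6240289, -0.96706766],
])

def vector_for(word, dic, embeddings):
    """Return the embedding row for `word`, or None if it is not in the vocabulary."""
    idx = dic.get(word)
    if idx is None:
        return None
    return embeddings[idx]

print(vector_for("to", dic, trained_embeddings))     # the row stored for "to"
print(vector_for("zebra", dic, trained_embeddings))  # not in the vocabulary
```

Applied to the real session, the same lookup would be `trained_embeddings[dic[word]]`, using the `dic` built before training.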

【Discussion】:

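Thanks for the explanation, but could you explain my output in more detail? I'm still a beginner and don't understand why the final output looks like this. May I know what each row actually represents? Is it similarity? For example, what does [-0.751498 -1.4963825 ] represent? I know that a vector usually represents the similarity of one word to another, so based on [-0.751498 -1.4963825 ] it has 2 similar words. But since the output contains only the above, I'm confused.
Sorry for my lack of understanding. I finally get it. Yes, there is nothing wrong with the code.
No need to apologize! As you may now realize, the 2-dimensional vectors themselves aren't differences between words. They are points in a space that happen to represent the input words well for the word-prediction problem that Word2Vec training solves. Then, luckily for us, these points are also useful for other semantic things we want to know about words. For example, the relative distances between the points are a fair measure of word similarity, and even the directions of the differences often correlate with human understanding of how words relate to each other.
Thanks again. If possible, could you take a look at this post of mine and give me some feedback on my understanding of feature extraction and word2vec? Thank you. stackoverflow.com/questions/52379317/…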
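
The point made in the discussion above, that relative distances between the points act as a similarity measure, can be sketched with cosine similarity over the rows of the embeddings matrix. This is a minimal sketch and not the answerer's code: the helper names `cosine_similarity` and `most_similar`, the three-word vocabulary, and the vector values are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, dic, embeddings, topn=2):
    """Rank the other words by cosine similarity to `word`, most similar first."""
    reverse = {i: w for w, i in dic.items()}
    # Normalize rows to unit length so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[dic[word]]
    order = [int(i) for i in np.argsort(-scores)]
    ranked = [(reverse[i], float(scores[i])) for i in order if i != dic[word]]
    return ranked[:topn]

# Hypothetical vocabulary and vectors, chosen so "king" and "queen" point the same way.
dic = {"king": 0, "queen": 1, "dogs": 2}
embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [-0.2, 1.0],
])

print(cosine_similarity(embeddings[0], embeddings[1]))
print(most_similar("king", dic, embeddings))
```

Normalizing the rows first is in line with the "Need to normalize for practical use" comment in the question's code. With only 16 short sentences, the rankings from the real embeddings are unlikely to be meaningful, as the answer already notes about toy-sized data.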