tag 'Text_4' not seen in training corpus/invalid

【Question Title】: tag 'Text_4' not seen in training corpus/invalid
【Posted】: 2018-03-31 08:07:33
【Question Description】:

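I need some help diagnosing a problem I ran into while vectorizing text. I am trying to use doc2vec embeddings to get vectors for a classification task. After running my code I get an error that I find hard to figure out, since I am new to this. The code and output are below.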

    def constructLabeledSentences(data):
        sentences = []
        for index, row in data.iteritems():
            sentences.append(TaggedDocument(utils.to_unicode(row).split(), ['Text' + '_%s' % str(index)]))
        return sentences

    x_raw_doc_sentences = constructLabeledSentences(x_raw_train['Text'])
    x_raw_doc_model = Doc2Vec(min_count=5, window=5, vector_size=300, sample=0.001, negative=5, workers=4, epochs=10, seed=1)
    x_raw_doc_model.build_vocab(x_raw_doc_sentences)
    x_raw_doc_model.train(x_raw_doc_sentences, total_examples=x_raw_doc_model.corpus_count, epochs=x_raw_doc_model.epochs)

After training the model, I tried to extract the vectors like this:

    x_raw_doc_train_arrays = np.zeros((x_raw_train.shape[0], 300))
    for i in range(x_raw_train.shape[0]):
        x_raw_doc_train_arrays[i] = x_raw_doc_model.docvecs['Text_' + str(i)]

This is the output I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-106-bc0222fef295> in <module>()
      1 x_raw_doc_train_arrays = np.zeros((x_raw_train.shape[0], 300))
      2 for i in range (x_raw_train.shape[0]):
----> 3     x_raw_doc_train_arrays[i]=x_raw_doc_model.docvecs['Text_'+str(i)]
      4 
      5 

~\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py in __getitem__(self, index)
   1197                 return self.vectors_docs[self._int_index(index, self.doctags, self.max_rawint)]
   1198             return vstack([self[i] for i in index])
-> 1199         raise KeyError("tag '%s' not seen in training corpus/invalid" % index)
   1200 
   1201     def __contains__(self, index):

KeyError: "tag 'Text_4' not seen in training corpus/invalid"

Is there something I did wrong, or something I should have done but didn't?

【Discussion】:

Tags: python python-3.x gensim doc2vec

【Solution 1】:

Have you looked at sentences to confirm that it contains a TaggedDocument whose tags include 'Text_4'?
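One way to run that check is to collect every tag in the corpus and list the expected ones that never appear. This is a minimal sketch, not the asker's code: the TaggedDocument namedtuple below is a stand-in assumed to mirror the two fields of gensim's class, and missing_tags is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument
# (assumption: same two fields, words and tags).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def missing_tags(documents, expected_tags):
    """Return the expected tags that no document in the corpus carries."""
    seen = set()
    for doc in documents:
        seen.update(doc.tags)
    return [tag for tag in expected_tags if tag not in seen]


# Toy corpus of five documents whose tags skip 'Text_4', as can happen
# when tags are built from row labels that are not a contiguous 0..n-1 range.
corpus = [
    TaggedDocument(["first", "doc"], ["Text_0"]),
    TaggedDocument(["second", "doc"], ["Text_1"]),
    TaggedDocument(["third", "doc"], ["Text_2"]),
    TaggedDocument(["fourth", "doc"], ["Text_3"]),
    TaggedDocument(["fifth", "doc"], ["Text_9"]),
]
expected = ["Text_%d" % i for i in range(len(corpus))]
print(missing_tags(corpus, expected))
```

If the list is non-empty, the lookup loop is asking for tags that were never given to the model, and the KeyError would follow from that alone.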

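If so, is there anything unusual about that document that might prevent it from contributing its tag? For example, is it empty, either originally or after min_count is applied and all rare words are ignored (which is usually a good idea for vector quality)?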
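To see which documents end up empty, one can count word frequencies over the whole corpus and drop words below the threshold. The sketch below is a simplification written in plain Python: it only mimics the min_count cutoff, ignores the sample downsampling setting, and empty_after_min_count is a hypothetical helper rather than a gensim function.

```python
from collections import Counter


def empty_after_min_count(tokenized_docs, min_count):
    """Return positions of documents with no words left once rare words are dropped."""
    # Corpus-wide frequency of every word.
    counts = Counter(word for doc in tokenized_docs for word in doc)
    empty = []
    for position, doc in enumerate(tokenized_docs):
        kept = [word for word in doc if counts[word] >= min_count]
        if not kept:
            empty.append(position)
    return empty


docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    [],                   # empty from the start
    ["zebra", "quokka"],  # contains only rare words
]
print(empty_after_min_count(docs, min_count=2))
```

With the asker's setting the call would use min_count=5 on the word lists taken from x_raw_doc_sentences.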

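Note also that you can use raw integers as the individual tag values in tags. (In that case the docvecs array is initialized with a vector for every index up to the highest one you use. A value like 4 that corresponds to a no-op example would therefore still get a vector, but that vector would not be adjusted at all during training and would stay at its random initial value.)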
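A sketch of the integer-tag idea, under stated assumptions: rows stands in for the (label, text) pairs produced by iterating over a pandas Series whose labels are not contiguous, the TaggedDocument namedtuple is again a stand-in for gensim's class, and tag_by_position is a hypothetical helper. Each text is tagged with its 0-based position instead of its row label.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument
# (assumption: same two fields, words and tags).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_by_position(labeled_texts):
    """Tag each text with its 0-based position, ignoring the original row label."""
    return [
        TaggedDocument(text.split(), [position])
        for position, (_label, text) in enumerate(labeled_texts)
    ]


# Row labels 10, 3, 42 are deliberately non-contiguous.
rows = [(10, "first text"), (3, "second text"), (42, "third text")]
documents = tag_by_position(rows)
print([doc.tags for doc in documents])
```

Because the tags now run from 0 to n-1, a loop over range(n) that looks up each integer should line up with the training documents regardless of the original labels.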

【Discussion】:

Running the model again made it work perfectly, thanks.