Transforming Text To Vector: Answers

【Question title】: Transforming Text To Vector
【Posted】: 2019-01-01 22:33:59
【Question】:

I have a dictionary that holds words and the frequency of each word.

{'cxampphtdocsemployeesphp': 1,
'emptiness': 1, 
'encodingundefinedconversionerror': 1, 
'msbuildexe': 2,
'e5': 1, 
'lnk4049': 1,
'specifierqualifierlist': 2, .... }

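Now I want to build a bag-of-words model from this dictionary. (I don't want to use standard libraries and functions; I want to implement the algorithm myself.)

Find the N most popular words in the dictionary and enumerate them. Now we have a dictionary of the most popular words.
For each title, create a zero vector whose dimension equals N.
For each text in the corpus, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.

I have my texts, which I will turn into vectors using a function.

The function looks like this: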

def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    words_to_index: a dict mapping each word to its position in the vector
    dict_size: size of the dictionary

    return a vector which is a bag-of-words representation of 'text'
    """


Let's say we have N = 4 and the list of the most popular words is

['hi', 'you', 'me', 'are']

Then we need to enumerate them, for example, like this:

{'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text, which we want to transform to the vector:
'hi how are you'

For this text we create a corresponding zero vector 
[0, 0, 0, 0]

And iterate over all words, and if the word is in the dictionary, we increase the value of the corresponding position in the vector:
'hi':  [1, 0, 0, 0]
'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
'are': [1, 0, 0, 1]
'you': [1, 1, 0, 1]

The resulting vector will be 
[1, 1, 0, 1]

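Any help implementing this function would be much appreciated. I am using Python.

Thanks,

Neil

【Comments】:

Please provide a sample output for your input. (What exactly would my_bag_of_words return?)

Tags: python python-3.x nlp text-processing information-retrieval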

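【Solution 1】:

First you need to compute the corpus frequency of each term and keep those counts in a frequency dictionary. Say cherry happens to occur 78 times in your corpus; then you keep cherry --> 78. Next, sort the frequency dictionary by frequency value in descending order and keep the first N pairs.

Then, for your enumeration, you can keep a dictionary as an index. For example, cherry --> term2 in the index dictionary.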
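The two steps above (keep the top N terms, then number them) could be sketched roughly as follows. The function name build_index and the sample counts are made up for illustration, and breaking ties alphabetically is my own assumption, added only so the result is reproducible.

```python
def build_index(word_counts, n):
    """Map each of the n most frequent words to a column index."""
    # Sort by descending frequency; ties are broken alphabetically (an assumption)
    ranked = sorted(word_counts, key=lambda w: (-word_counts[w], w))
    # Keep only the first n words and number them from 0
    return {word: i for i, word in enumerate(ranked[:n])}


# Hypothetical frequency dictionary
counts = {'cherry': 78, 'apple': 120, 'plum': 3, 'pear': 40}
index = build_index(counts, 3)
print(index)  # {'apple': 0, 'cherry': 1, 'pear': 2}
```

Here cherry ends up at position 1, which corresponds to term2 when the columns are numbered from 1.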

Now an incidence matrix needs to be prepared. It will hold the document vectors, like this:

doc_id   term1 term2 term3 .... termN
doc1       35     0    23         1
doc2        0     0    13         2
   .        .     .     .         .
docM        3     1     2         0

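Each document (text, title, sentence) in your corpus needs an id or index, as listed above. Now it is time to create the vector for a document. Iterate over your documents and tokenize each one to get its terms, so that every document has its tokens. Iterate over the tokens and check whether the next token exists in your frequency dictionary. If it does, update your zero vector using your index dictionary and frequency dictionary.

Say doc5 contains cherry, and it is among our top N popular terms. Get its frequency (78) and its index (term2). Now update the zero vector of doc5: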

doc_id   term1 term2 term3 .... termN
doc1       35     0    23         1
doc2        0     0    13         2
   .        .     .     .         .
doc5        0    78     0         0 (under process)

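You need to do this for every token of every document in your corpus, against all the popular terms.

In the end you will have an M x N matrix (one row per document, one column per term) containing the vectors of the M documents in your corpus.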
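As a rough sketch of building that matrix with plain lists: note that this version adds 1 for each occurrence of a term in the document, as the question describes, rather than writing the corpus-wide frequency into the cell. That choice, the name build_matrix, and the sample documents are my assumptions.

```python
def build_matrix(docs, word_to_index):
    """Return one count vector per document, as a list of rows."""
    n = len(word_to_index)
    matrix = []
    for doc in docs:
        row = [0] * n  # zero vector for this document
        for token in doc.split():  # naive whitespace tokenization
            col = word_to_index.get(token)
            if col is not None:  # skip tokens outside the top-N index
                row[col] += 1
        matrix.append(row)
    return matrix


word_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
docs = ['hi how are you', 'you and me', 'are you you']
matrix = build_matrix(docs, word_to_index)
for row in matrix:
    print(row)
# [1, 1, 0, 1]
# [0, 1, 1, 0]
# [0, 2, 0, 1]
```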

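I'd suggest taking a look at the IR-Book: https://nlp.stanford.edu/IR-book/information-retrieval-book.html

You might also consider using a tf-idf based matrix, which the book covers as well, instead of the corpus-frequency based term incidence matrix.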
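For reference, here is a minimal sketch of tf-idf weighting applied to a count matrix. It assumes the common tf x log10(M / df) form, where M is the number of documents and df is the number of documents containing the term. The name to_tfidf and the sample counts are made up for illustration, and other weighting variants exist.

```python
import math


def to_tfidf(count_matrix):
    """Weight raw term counts by inverse document frequency."""
    m = len(count_matrix)
    n = len(count_matrix[0]) if m else 0
    # df[j]: number of documents in which term j occurs at least once
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n)]
    # idf[j] = log10(M / df[j]); terms that never occur get weight 0
    idf = [math.log10(m / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n)] for row in count_matrix]


# Hypothetical count matrix: 3 documents, 4 terms
counts = [[1, 1, 0, 1],
          [0, 1, 1, 0],
          [0, 2, 0, 1]]
weighted = to_tfidf(counts)
```

A term that appears in every document (the second column here) gets weight 0, since log10(3 / 3) = 0, while rarer terms are weighted up.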

Hope this post helps,

Cheers

【Comments】:

【Solution 2】:

I did my research from start to finish and want to share my answer too!

My data looks like this and is stored in a list:

data_list = ['draw stacked dotplot r',
 'mysql select records datetime field less specified value',
 'terminate windows phone 81 app',
 'get current time specific country via jquery',
 'configuring tomcat use ssl',...]

Next, I counted the frequency of each word in the list:

words_counts = {}
for text in data_list:
    for word in text.split():
        if word in words_counts:
            words_counts[word] += 1
        else:
            words_counts[word] = 1

So my words_counts dictionary contains every word in my data_list together with its frequency. It looks like this:

 {'detailed': 6,
 'ole_handle': 1,
 'startmonitoringsignificantlocationchanges': 2,
 'pccf02102': 1,
 'insight': 2,
 'combinations': 26,
 'tuplel': 1}

Now, for our my_bag_of_words function, I need to sort my words_counts dictionary in descending order and assign an index to each word.

index_to_word = sorted(words_counts.keys(), key=lambda x: words_counts[x], reverse=True)
words_to_index = {word: i for i, word in enumerate(index_to_word)}

Now our words_to_index looks like this:

  {'address': 387,
 'behind': 706,
 'page': 23,
 'inherited': 1617,
 '106': 4677,
 'posting': 1293,
 'expressions': 876,
 'occured': 3241,
 'highest': 2989}

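Now we can finally use the dictionary we created to get the vector for a text:

import numpy as np

def my_bag_of_words(text, words_to_index, size_of_dictionary):
    # Start from a zero vector with one slot per dictionary word
    word_vector = np.zeros(size_of_dictionary)
    for word in text.split():
        # Words outside the dictionary are ignored
        if word in words_to_index:
            word_vector[words_to_index[word]] += 1
    return word_vector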
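If you would rather avoid NumPy as well, a plain-list variant might look like the sketch below, checked against the example from the question. The name bag_of_words_plain is made up, and the extra index check is my own addition: words_to_index as built above covers every word, so an index can exceed the vector length when a smaller dictionary size N is passed, which would otherwise raise an IndexError.

```python
def bag_of_words_plain(text, words_to_index, dict_size):
    """Bag-of-words vector as a plain Python list of counts."""
    vector = [0] * dict_size
    for word in text.split():
        index = words_to_index.get(word)
        # Skip unknown words and words ranked outside the top dict_size
        if index is not None and index < dict_size:
            vector[index] += 1
    return vector


words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
print(bag_of_words_plain('hi how are you', words_to_index, 4))  # [1, 1, 0, 1]
```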

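This was a really good way to learn and understand the concept. Thanks, everyone, for your help and support.

Happy learning

Neil

【Comments】:

Congratulations, and thank you for sharing your experience, effort, and solution.
@berkin Thank you for your answer as well. Happy learning!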