Preprocessing a list of lists to remove stopwords for doc2vec using map without losing word order

【Question title】: Preprocessing a list of list removing stopwords for doc2vec using map without losing words order
【Posted】: 2021-07-19 00:49:54
【Question】:

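I am implementing a simple doc2vec model with gensim (doc2vec, not word2vec).

I need to remove stopwords without losing the original word order.

Each list is a document and, as I understand doc2vec, the model takes a list of TaggedDocuments as input: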

model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)

dataset = [['We should remove the stopwords from this example'],
     ['Otherwise the algo'],
     ["will not work correctly"],
     ['dont forget Gensim doc2vec takes list_of_list' ]]

STOPWORDS = ['we','i','will','the','this','from']


def word_filter(lst):
  lower=[word.lower() for word in lst]
  lst_ftred = [word for word in lower if not word in STOPWORDS]
  return lst_ftred

lst_lst_filtered= list(map(word_filter,dataset))
print(lst_lst_filtered)

Output:

[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]

Expected output:

[[' should remove the stopwords   example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]

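What is my mistake, and how do I fix it?
Is there another efficient way to solve this without losing the correct word order?

Questions I checked before asking: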

How to apply a function to each sublist of a list in python?

I studied this one and tried to apply it to my specific case.

Removing stopwords from list of lists

Order matters, so I can't use a set.

Removing stopwords from a list of text files

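This could be a possible solution; it is similar to what I have implemented.
I understand the difference, but I don't know how to deal with it. In my case the documents are not tokenized (and should not be tokenized, because this is doc2vec, not word2vec).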

How to remove stop words using nltk or python

That question deals with a flat list, not a list of lists.

【Comments on the question】:

Tags: python list gensim stop-words

【Solution 1】:

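First, note that removing stopwords is not that important for Doc2Vec training. Second, note that such a tiny toy dataset will not give interesting results from Doc2Vec. The algorithm, like Word2Vec, only begins to show its value when trained on large datasets that have (1) many more unique words than the number of vector dimensions; and (2) many varied examples of each word's usage: at least a few, ideally dozens or hundreds.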
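A rough way to check both conditions on a corpus before training is to count its vocabulary. The sketch below is plain Python and does not call gensim; `corpus_stats` and the threshold `min_examples` are hypothetical names introduced for illustration, and `vector_size=5` is simply the value used in the question.

```python
from collections import Counter

def corpus_stats(tokenized_docs, vector_size, min_examples=5):
    """Summarize vocabulary size and how many words have enough usage examples.

    Hypothetical helper for illustration; not part of gensim.
    """
    # Count how often each token occurs across the whole corpus
    counts = Counter(token for doc in tokenized_docs for token in doc)
    well_attested = sum(1 for c in counts.values() if c >= min_examples)
    return {
        "unique_words": len(counts),
        "words_per_dimension": len(counts) / vector_size,
        "words_with_enough_examples": well_attested,
    }

toy = [
    "we should remove the stopwords from this example".split(),
    "otherwise the algo".split(),
    "will not work correctly".split(),
]
stats = corpus_stats(toy, vector_size=5)
print(stats)
```

On this toy data no word occurs five times, which is the situation the answer warns about.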

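Still, if you want to strip stopwords, it is easiest to do so after tokenizing the original strings. (That is, after splitting each string into a list of words. That is the format Doc2Vec needs anyway.) Also, you do not want your dataset to be a list of lists that each hold a single string. Instead, you want it to be a list of strings (at first), and then a list of lists that each hold many tokens.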

The following should work:

string_dataset = [
     'We should remove the stopwords from this example',
     'Otherwise the algo',
     "will not work correctly",
     'dont forget Gensim doc2vec takes list_of_list',
]

STOPWORDS = ['we','i','will','the','this','from']

# Python list comprehension to break into tokens
tokenized_dataset = [s.split() for s in string_dataset]

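def filter_words(tokens):
    """Lowercase each token, and keep it only if it is not in STOPWORDS."""
    # Lowercase before the membership test so that 'We' matches 'we'
    return [token.lower() for token in tokens if token.lower() not in STOPWORDS]

# Apply the filter to each tokenized document; token order is preserved
filtered_dataset = [filter_words(sent) for sent in tokenized_dataset]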
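
On the ordering concern raised in the question: storing the stopwords in a set only speeds up the membership test. The tokens themselves stay in a list, so their order is unchanged. A minimal sketch, where `STOPWORD_SET` and `remove_stopwords` are illustrative names that do not appear in the original:

```python
# Stopwords from the question, stored in a frozenset for fast lookups
STOPWORD_SET = frozenset(['we', 'i', 'will', 'the', 'this', 'from'])

def remove_stopwords(tokens):
    """Lowercase tokens and drop stopwords, keeping the original order."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOPWORD_SET]

tokens = 'We should remove the stopwords from this example'.split()
result = remove_stopwords(tokens)
print(result)  # ['should', 'remove', 'stopwords', 'example']
```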

Finally, because Doc2Vec needs multiple examples of each word to work well, as noted above, using min_count=1 is almost always a bad idea.
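
To get a feel for what a higher threshold would do to a corpus, the effect can be approximated in plain Python. gensim's `min_count` ignores words whose total frequency is below the threshold; the sketch below only imitates that rule with a `Counter` and does not call gensim. `apply_min_count` is a hypothetical helper, and the three documents are made-up sample data.

```python
from collections import Counter

def apply_min_count(tokenized_docs, min_count):
    """Drop tokens whose corpus-wide frequency is below min_count.

    Illustration only: imitates the documented effect of gensim's
    min_count parameter; it is not gensim's implementation.
    """
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized_docs]

docs = [
    ['remove', 'stopwords', 'example'],
    ['another', 'example', 'stopwords'],
    ['rare', 'example'],
]
kept = apply_min_count(docs, min_count=2)
print(kept)  # [['stopwords', 'example'], ['example', 'stopwords'], ['example']]
```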

【Comments】:

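Thanks. Yes, I tried to create a minimal reproducible example, but the real dataset is certainly larger, at 100k+ documents. What was not clear to me from the documentation is that in doc2vec you also have to tokenize all the strings that make up a document.
The documentation for the TaggedDocument class (radimrehurek.com/gensim/models/…), which is the recommended type for Doc2Vec training examples, describes its first argument, words, as a "list of unicode string tokens", and the introductory example shows strings being preprocessed into lists of tokens, as in radimrehurek.com/gensim/auto_examples/tutorials/…. Did some part of the documentation or examples suggest otherwise to you?
I'm sure I misinterpreted an unofficial tutorial that I can no longer find (which is why I suspected I had made a mistake), because before asking this question I kept wondering why I was using a list of lists. By the way, I will go through all the official tutorials again, and more carefully.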

【Solution 2】:

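lower is a list with a single element (the whole sentence), so the test word in STOPWORDS is always False and nothing is filtered out. Take the first item of the list by index and split it on whitespace: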

lst_ftred = ' '.join([word for word in lower[0].split() if word not in STOPWORDS])
# output: ['should remove stopwords example', 'otherwise algo', 'not work correctly', 'dont forget gensim doc2vec takes list_of_list']
# 'the' is also in STOPWORDS
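
Put together with the data from the question, the change might look like the following runnable sketch. It keeps the question's one-string-per-document layout and returns each filtered document as a single string, as the line above does; tokenizing for Doc2Vec would still be a separate step. The variable name `lst_filtered` is introduced here for illustration.

```python
dataset = [['We should remove the stopwords from this example'],
           ['Otherwise the algo'],
           ["will not work correctly"],
           ['dont forget Gensim doc2vec takes list_of_list']]

STOPWORDS = ['we', 'i', 'will', 'the', 'this', 'from']

def word_filter(lst):
    # Each document is a list holding one sentence string
    lower = [sentence.lower() for sentence in lst]
    # Split the single sentence into words, filter, and rejoin in order
    return ' '.join(word for word in lower[0].split() if word not in STOPWORDS)

lst_filtered = list(map(word_filter, dataset))
print(lst_filtered)
```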

【Comments】: