spaCy aggressive lemmatization and removing unexpected words: answers

【Question title】: spaCy aggressive lemmatization and removing unexpected words
【Posted】: 2020-11-27 16:30:28
【Question】:

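I am trying to clean some text data. First I remove the stop words, then I try to lemmatize the text. But some words, such as nouns, are being removed.

Sample data

https://drive.google.com/file/d/1p9SKWLSVYeNScOCU_pEu7A08jbP-50oZ/view?usp=sharing

Updated code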

# Libraries  
import spacy
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['covid', 'COVID-19', 'coronavirus'])

article= pd.read_csv("testdata.csv")
data = article.title.values.tolist()
nlp = spacy.load('en_core_web_sm')

def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; simple_preprocess itself lowercases and drops punctuation

data_words = list(sent_to_words(data))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
print ("*** Text  After removing Stop words:   ")
print(data_words_nostops)
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON'])
print ("*** Text  After Lemmatization:   ")

print(data_lemmatized)

The output after removing the stop words is:

[['qaia', 'flags', 'amman', 'melbourne', 'jetstar', 'flights', 'recovery', 'plan'],
['western', 'amman', 'suburb', 'new', 'nsw', 'ground', 'zero', 'children'],
['flight', 'returned', 'amman', 'qaia', 'staff', 'contract', 'driving']]

The output after lemmatization:

[['flight', 'recovery', 'plan'],
['suburb', 'ground'],
['return', 'contract', 'driving']]

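For each record, there is something I don't understand:

- 1st record: why were these words removed: 'qaia', 'flags', 'amman', 'melbourne', 'jetstar'?

- 2nd record: essential words were removed, the same as in the first record. In addition, I expected 'children' to be converted to 'child'.

- 3rd record: 'driving' was not converted to 'drive'.

I expected words such as "Amman" not to be removed, plural words to be converted to singular, and verbs to be converted to the infinitive...

What am I missing here? Thanks in advance.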

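【Comments】:

The removed words look like proper nouns. Try adding PROPN to your allowed_postags. Your expectations for lemmatization are correct, but spaCy's lemmatizer isn't very good. If you need better performance, you could try lemminflect.
By the way... I notice you are running a version of the sentence with the stop words removed through spaCy's nlp. This can upset the POS tag assignment, which in turn interferes with lemmatization. Check the tags spaCy assigns to your test sentences to see whether they are correct, and consider processing your complete sentences through nlp.
@bivouac0 Thanks for your comment. Regarding the stop words, I extended the English word list like this: stop_words = stopwords.words('english'); stop_words.extend(['covid', 'COVID-19', 'coronavirus']), but I have disabled it because I wanted to check the lemmatizer's behavior.
@bivouac0 I added PROPN to allowed_postags. That works for words like "Amman" and "flights", BUT words like "children" are still not converted to "child".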
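The advice above to check the tags spaCy assigns can be sketched as follows. This is only an illustration: `tag_table`, `dropped_by_filter` and `demo` are hypothetical helper names, the second test sentence in `demo` is made up for comparison, and the sketch assumes the `en_core_web_sm` model is installed. It lists each token's coarse POS tag and lemma, then shows which tokens a POS whitelist would silently discard.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, str, str]


def tag_table(nlp: Callable, text: str) -> List[Row]:
    """Return (text, coarse POS tag, lemma) for every token the pipeline yields."""
    return [(tok.text, tok.pos_, tok.lemma_) for tok in nlp(text)]


def dropped_by_filter(rows: Iterable[Row],
                      allowed_postags: Iterable[str]) -> List[Tuple[str, str]]:
    """List the (text, POS) pairs that a POS whitelist would discard."""
    allowed = set(allowed_postags)
    return [(text, pos) for text, pos, _ in rows if pos not in allowed]


def demo() -> None:
    # Imported here so the helpers above can be used without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: this model is installed
    samples = (
        # A lower-cased, stop-word-stripped title, as in the question
        "flight returned amman qaia staff contract driving",
        # A hypothetical full sentence, for comparison only
        "The flight returned to Amman with staff on contract.",
    )
    for text in samples:
        rows = tag_table(nlp, text)
        print(rows)
        print("dropped:", dropped_by_filter(
            rows, ["NOUN", "ADJ", "VERB", "ADV", "PRON"]))


# Call demo() to print the tables; the tags you see depend on the model version.
```

If a word you care about shows up in the "dropped" list with the tag PROPN, the whitelist is the culprit rather than the lemmatizer.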

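Tags: python nlp nltk spacy lemmatization

【Answer 1】:

I'm guessing most of your problems come from not feeding spaCy complete sentences, so it does not assign the correct part-of-speech tags to your words. That can cause the lemmatizer to return wrong results. However, since you only provided snippets of code and none of the original text, it is hard to answer this question. Next time, consider boiling your question down to a few lines of code that others can run on their own machines, and provide a sample input that fails. See Minimal Reproducible Example.

Here is a working example that is close to what you are doing.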

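import spacy
from nltk.corpus import stopwords  # requires a prior nltk.download('stopwords')
stop_words = stopwords.words('english')
# PROPN (proper noun) keeps names such as "Amman"
allow_postags = set(['NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'])
# The 'en' shortcut only works on spaCy 2.x; on spaCy 3+ load 'en_core_web_sm' instead
nlp = spacy.load('en')
text = 'The children in Amman and Melbourne are too young to be driving.'
words = []
# Tag and lemmatize the full, original sentence first, then filter the tokens
for token in nlp(text):
    if token.text not in stop_words and token.pos_ in allow_postags:
        words.append(token.lemma_)
print(' '.join(words))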

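This returns child Amman Melbourne young drive

【Discussion】:

Could you check my updated code, @bivouac0?
You still have the same problem. spaCy's lemmatizer does not work properly after you run the text through gensim, because by then you have removed the stop words and punctuation and converted everything to lowercase. You need to run spaCy on the original sentences and then extract the words you want to keep. You don't need gensim (or pandas) at all. By the way, I think you also want PROPN rather than PRON in your allowed_postags.
Regarding spacy.load('en'), it may be worth referring to this post. For example, spacy.load('en_core_web_sm') may be sufficient for some use cases.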
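Applied to a whole list of titles, the advice in this discussion (parse the original text first, filter afterwards) might look like the sketch below. The names `lemmatize_docs` and `lemmatize_titles` are hypothetical, the sketch assumes `en_core_web_sm` is installed, and it compares stop words case-insensitively, which is a small departure from the answer's example rather than something the answer prescribes.

```python
from typing import Iterable, List

# PROPN instead of PRON, as suggested in the discussion
ALLOWED_POSTAGS = frozenset({"NOUN", "ADJ", "VERB", "ADV", "PROPN"})


def lemmatize_docs(docs: Iterable, stop_words: Iterable[str],
                   allowed_postags=ALLOWED_POSTAGS) -> List[List[str]]:
    """Reduce already-parsed documents to the lemmas of their content words."""
    stops = {word.lower() for word in stop_words}
    result = []
    for doc in docs:
        result.append([
            tok.lemma_
            for tok in doc
            # Filter only after tagging, so the tagger saw the full sentence
            if tok.pos_ in allowed_postags and tok.text.lower() not in stops
        ])
    return result


def lemmatize_titles(titles: Iterable, stop_words: Iterable[str]) -> List[List[str]]:
    """Parse the raw, unmodified titles with spaCy, then filter and lemmatize."""
    # Imported here so lemmatize_docs can be used without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumption: this model is installed
    # nlp.pipe processes the texts as a stream, in batches
    return lemmatize_docs(nlp.pipe(str(title) for title in titles), stop_words)
```

With the question's data this would be called as `lemmatize_titles(article.title.values.tolist(), stop_words)`, skipping the gensim preprocessing step entirely. The exact lemmas still depend on the model and spaCy version.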