【Title】: Combining text stemming and removal of punctuation in NLTK and scikit-learn
【Posted】: 2014-11-25 09:54:22
【Question】:

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming and tokenization.

Below is an example of the plain usage of CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

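# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('Vocabulary: %s' % list(vec.get_feature_names_out()))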
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

which prints:

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

import nltk  # tokenize() below calls nltk.word_tokenize, so the package itself must be imported
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

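# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('Vocabulary: %s' % list(vect.get_feature_names_out()))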
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

【Question comments】:

    Tags: python text scikit-learn nltk


    【Solution 1】:

    There are several options. One is to remove the punctuation before tokenization, but this means that don't -> dont:

    import string
    
    def tokenize(text):
        text = "".join([ch for ch in text if ch not in string.punctuation])
        tokens = nltk.word_tokenize(text)
        stems = stem_tokens(tokens, stemmer)
        return stems
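
The pre-tokenization variant can also be written with `str.translate`, which deletes all punctuation through a translation table instead of a Python-level character loop. This is only a minimal sketch: the helper name `strip_punctuation` is made up for illustration, and it covers only the ASCII characters in `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


# Contractions lose their apostrophe, as noted above: don't -> dont
print(strip_punctuation("The swimmer doesn't swim."))
# The swimmer doesnt swim
```

The result can then be passed to `nltk.word_tokenize` and the stemmer exactly as in `tokenize` above.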
    

    Or remove the punctuation after tokenization:

    def tokenize(text):
        tokens = nltk.word_tokenize(text)
        tokens = [i for i in tokens if i not in string.punctuation]
        stems = stem_tokens(tokens, stemmer)
        return stems
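
One caveat with the post-tokenization filter: `i not in string.punctuation` is a substring test against a single string, so multi-character tokens such as `...` pass through unchanged. A stricter filter could check every character of the token instead. The sketch below assumes that only ASCII punctuation matters; the names `is_punctuation` and `drop_punctuation_tokens` are hypothetical helpers, not part of NLTK or scikit-learn.

```python
import string

PUNCT = set(string.punctuation)


def is_punctuation(token):
    """True if the token is non-empty and consists only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


def drop_punctuation_tokens(tokens):
    """Keep only the tokens that contain at least one non-punctuation character."""
    return [tok for tok in tokens if not is_punctuation(tok)]


# Token list written by hand for illustration; it is not real tokenizer output.
tokens = ["The", "swimmer", "swims", "...", "``", "do", "n't", "."]
print(drop_punctuation_tokens(tokens))
# ['The', 'swimmer', 'swims', 'do', "n't"]
```

Tokens that mix letters and punctuation, such as `n't`, are kept, so contractions survive this variant.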
    

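    EDITED

    The code above works, but it is rather slow because it loops over the same text multiple times:

    • Once to remove the punctuation
    • A second time to tokenize
    • A third time to stem.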

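    If you have more steps, such as removing digits, removing stop words, or lowercasing, it is better to lump them together as much as possible. If your data needs more preprocessing, several other answers describe more efficient approaches.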
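
As a rough illustration of lumping the steps together, the sketch below extracts word tokens with one regular expression (so punctuation never becomes a token) and applies stop-word removal and stemming in a single comprehension. It is not the approach from the answers referred to above. The regular expression mirrors CountVectorizer's default `token_pattern`; `make_tokenizer` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in so the example runs without NLTK.

```python
import re

# Same pattern as CountVectorizer's default token_pattern: runs of 2+ word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def make_tokenizer(stem, stop_words=frozenset()):
    """Build a tokenizer that lowercases, skips punctuation and stop words, and stems."""
    def tokenize(text):
        return [stem(tok)
                for tok in TOKEN_RE.findall(text.lower())
                if tok not in stop_words]
    return tokenize


def toy_stem(word):
    """Stand-in stemmer for illustration only: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


tokenize = make_tokenizer(toy_stem, stop_words={"the", "so", "he"})
print(tokenize("The swimmer likes swimming so he swims."))
# ['swimmer', 'like', 'swimming', 'swim']
```

In practice `PorterStemmer().stem` from the question could be passed as `stem`, and the resulting function given to `CountVectorizer(tokenizer=...)`. Note that this pattern splits contractions: `don't` yields `don`, and the single-character `t` is dropped.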

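    【Comments】:

    • Simple and effective. Thanks!
    • Note that the second one won't catch ... or other multi-character punctuation.
    • @FredFoo and others: how would you compare GENSIM and Scikit for extracting keywords rather than processing ordinary documents? Could you answer my question? stackoverflow.com/questions/40436110/rake-with-gensim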