[Question title]: CountVectorizer to build dictionary for removing extra words
[Posted]: 2021-01-26 00:54:43
[Question description]:

I have a list of sentences in a pandas column:

sentence
I am writing on Stackoverflow because I cannot find a solution to my problem.
I am writing on Stackoverflow. 
I need to show some code. 
Please see the code below

I would like to do some text mining and analysis on them, such as getting word frequencies. For this, I am using the following approach:

from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)

How can I apply it to my column and remove extra stop words after building the vocabulary?

[Comments]:

    Tags: python pandas scikit-learn nlp countvectorizer


    [Solution 1]:

    You can pass the stop_words parameter to CountVectorizer, which takes care of removing the stop words:

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer

    text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
    # Use a separate name so the nltk `stopwords` module is not shadowed.
    # Needs nltk.download("stopwords") once; you may add or define your own stop words here.
    stop_words = stopwords.words("english")
    vectorizer = CountVectorizer(stop_words=stop_words)
    vectorizer.fit_transform(text)
    

    If you want to do all the preprocessing inside a pandas DataFrame:

    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer

    text = ["I am writing on Stackoverflow because I cannot find a solution to my problem.", "I am writing on Stackoverflow."]
    df = pd.DataFrame({"text": text})
    # Use a separate name so the nltk `stopwords` module is not shadowed;
    # you may add or define your own stop words here.
    stop_words = stopwords.words("english")
    vectorizer = CountVectorizer(stop_words=stop_words)
    # One list of counts per row, one entry per vocabulary word
    df["counts"] = vectorizer.fit_transform(df["text"]).todense().tolist()
    df
                                                    text              counts
    0  I am writing on Stackoverflow because I cannot...  [1, 1, 1, 1, 1, 1]
    1                     I am writing on Stackoverflow.  [0, 0, 0, 0, 1, 1]
    

    In both cases, you end up with a vocabulary from which the stop words have been removed:

    print(vectorizer.vocabulary_)
    {'writing': 5, 'stackoverflow': 4, 'cannot': 0, 'find': 1, 'solution': 3, 'problem': 2}
    

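    [Comments]:

    • Thanks, Sergey, for the very nice and clear explanation!
    • You're welcome! Note that the counts in the CountVectorizer matrix follow the alphabetically sorted vocabulary (that is, the column positions are the values in the vocabulary_ dict).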
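
To make the column ordering mentioned in the last comment concrete, here is a rough pure-Python imitation of the counting step. It is a sketch, not scikit-learn's actual implementation: it assumes the default settings (lowercasing, tokens of two or more word characters), and the `stop_words` set is a hypothetical subset chosen to cover this example only.

```python
import re
from collections import Counter

docs = [
    "I am writing on Stackoverflow because I cannot find a solution to my problem.",
    "I am writing on Stackoverflow.",
]

# Hypothetical stop-word subset, enough for these two sentences (an assumption).
stop_words = {"am", "on", "because", "to", "my"}

# Mirrors CountVectorizer's default token pattern: words of 2+ characters,
# so single-letter tokens such as "I" and "a" are dropped.
token_pattern = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase, split into tokens, and drop stop words."""
    return [t for t in token_pattern.findall(doc.lower()) if t not in stop_words]

tokenized = [tokenize(doc) for doc in docs]

# Vocabulary: each distinct word mapped to its position in alphabetical order.
all_words = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {word: index for index, word in enumerate(all_words)}

# One row of counts per document; column i holds the count of all_words[i].
rows = []
for tokens in tokenized:
    token_counts = Counter(tokens)
    rows.append([token_counts[word] for word in all_words])

print(vocabulary)
print(rows)
```

Reading a row therefore means pairing position i with the word whose vocabulary value is i, which is why the second sentence has non-zero entries only in the last two columns.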