【Question title】: How to ignore word positions in grams when using CountVectorizer?
【Posted】: 2020-06-14 19:32:18
【Question】:

I have a corpus and I want to get the frequency of every 2-gram in it. This is the code I am using:

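from sklearn.feature_extraction.text import CountVectorizer

# Build a vocabulary of bigrams only, then count them across the corpus
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
bag_of_words = vec.transform(corpus)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]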

The "words_freq" variable holds the frequency of each gram found in the corpus, for example:

print(words_freq)
[('green apple', 10), ('yellow apple',2), ('apple green',5)]

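However, I would like to know how to get the frequency of each gram without taking the order of the words inside the gram into account.

For example, "green apple" and "apple green" should be treated as the same gram, giving the result ('green apple', 15).

Thanks for your help.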

【Comments】:

    Tags: python nlp countvectorizer


    【Solution 1】:

    You can use the code snippet below. Note that it only works for bigrams.

    words_freq = [('green apple', 10), ('yellow apple', 2), ('apple green', 5)]
    alternate_words_freq = {}
    for term, freq in words_freq:
        # Assume that the two words of a bigram are separated by a space,
        # and build the same bigram with its words reversed
        reverse_term = " ".join(term.split(" ")[::-1])

        if term in alternate_words_freq:
            alternate_words_freq[term] += freq
        elif reverse_term in alternate_words_freq:
            # The reversed form was seen first, so add to its count
            alternate_words_freq[reverse_term] += freq
        else:
            alternate_words_freq[term] = freq

    # Prints [('green apple', 15), ('yellow apple', 2)]
    print(list(alternate_words_freq.items()))
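
If you need the same merging for n-grams longer than two words, one possible generalization is to key each gram by its sorted words instead of checking only the reversed form. The sketch below is not part of the original answer: the helper name merge_unordered is hypothetical, and it assumes that grams are whitespace-separated strings, as in the "words_freq" list above. It reports each merged gram under the first spelling it encounters.

```python
from collections import Counter


def merge_unordered(words_freq):
    """Sum the frequencies of n-grams that contain the same words in any order."""
    merged = Counter()
    first_seen = {}  # sorted-word key -> first spelling encountered
    for term, freq in words_freq:
        # Sorting the words gives one canonical key per unordered gram;
        # a tuple (rather than a set) keeps repeated words such as "apple apple".
        key = tuple(sorted(term.split()))
        first_seen.setdefault(key, term)
        merged[key] += freq
    return [(first_seen[key], total) for key, total in merged.items()]


words_freq = [('green apple', 10), ('yellow apple', 2), ('apple green', 5)]
# Prints [('green apple', 15), ('yellow apple', 2)]
print(merge_unordered(words_freq))
```

This post-processes the counts rather than changing how CountVectorizer tokenizes. CountVectorizer also accepts a callable for its analyzer parameter, so the same sorting could probably be applied at extraction time instead, but that variant is not shown or tested here.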
    

    【Comments】:
