How does CountVectorizer max_features handle ngrams with the same frequency?

【Question title】: How does CountVectorizer max_features process ngrams with the same frequencies?
【Posted】: 2018-09-04 13:47:09
【Question】:

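I have a question about CountVectorizer and TfidfVectorizer.

It is not clear to me how ngrams with the same frequency are selected under max_features. Say max_features = 10000 and the corpus contains 100 ngrams that share the same frequency at the cutoff boundary: how does CountVectorizer decide which of them end up in the features and which do not? As a toy example, take a corpus with eight unique words. The words "jeans" and "cat" both have frequency 1, and we set max_features=7. Why does "cat" appear in the features while "jeans" does not, rather than the other way round?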

from sklearn.feature_extraction.text import CountVectorizer

data = ['gpu processor cpu performance',
        'gpu performance ram computer computer',
        'cpu computer ram processor jeans processor cat']

cv = CountVectorizer(ngram_range=(1, 1), max_features=7)
cv_fit = cv.fit_transform(data).toarray()
cv.vocabulary_

out:
{'cat': 0,
 'computer': 1,
 'cpu': 2,
 'gpu': 3,
 'performance': 4,
 'processor': 5,
 'ram': 6}

【Comments】:

Tags: python machine-learning scikit-learn nlp

【Answer 1】:

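CountVectorizer cuts off on term frequency, and it possibly uses an ordinary sort to cut off the items at max_features. From the documentation:

max_features : int or None, default=None. If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

I changed cat to zat in the data, and now jeans makes it into the list.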

>>> data = ['gpu processor cpu performance',
'gpu performance ram computer computer',
'cpu computer ram processor zat processor jeans']
>>> cv = CountVectorizer(ngram_range=(1, 1), max_features=7)
>>> cv_fit = cv.fit_transform(data).toarray()
>>> cv.vocabulary_
{u'ram': 6, u'jeans': 3, u'processor': 5, u'computer': 0, u'performance': 4, u'gpu': 2, u'cpu': 1}

Essentially, the result depends on the sort order.
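The behaviour observed in this answer can be modelled without scikit-learn. The following is only a sketch under a stated assumption: ties in term frequency are broken alphabetically. The helper `top_terms` is hypothetical and is not part of scikit-learn; it reproduces the two vocabularies shown in this thread, but it is not a description of the library's actual implementation.

```python
from collections import Counter


def top_terms(docs, max_features):
    """Hypothetical helper: keep the max_features most frequent terms.

    Assumption: ties are broken alphabetically. This models the observed
    output only; it is not how scikit-learn is documented to behave.
    """
    # Count every whitespace-separated token across the whole corpus.
    counts = Counter(word for doc in docs for word in doc.split())
    # Rank by descending count, then alphabetically among equal counts.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return sorted(ranked[:max_features])


data_cat = ['gpu processor cpu performance',
            'gpu performance ram computer computer',
            'cpu computer ram processor jeans processor cat']
data_zat = ['gpu processor cpu performance',
            'gpu performance ram computer computer',
            'cpu computer ram processor zat processor jeans']

print(top_terms(data_cat, 7))
# ['cat', 'computer', 'cpu', 'gpu', 'performance', 'processor', 'ram']
print(top_terms(data_zat, 7))
# ['computer', 'cpu', 'gpu', 'jeans', 'performance', 'processor', 'ram']
```

In both corpora the two rarest words have count 1, and the one that sorts first alphabetically survives the cutoff.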

【Comments】:

【Answer 2】:

The relevant source code is in the _limit_features helper method:

    # Calculate a mask based on document frequencies
    dfs = _document_frequency(X)
    tfs = np.asarray(X.sum(axis=0)).ravel()
    mask = np.ones(len(dfs), dtype=bool)
    if high is not None:
        mask &= dfs <= high
    if low is not None:
        mask &= dfs >= low
    if limit is not None and mask.sum() > limit:
        mask_inds = (-tfs[mask]).argsort()[:limit]
        new_mask = np.zeros(len(dfs), dtype=bool)
        new_mask[np.where(mask)[0][mask_inds]] = True
        mask = new_mask

    new_indices = np.cumsum(mask) - 1  # maps old indices to new
    removed_terms = set()
    for term, old_index in list(six.iteritems(vocabulary)):
        if mask[old_index]:
            vocabulary[term] = new_indices[old_index]
        else:
            del vocabulary[term]
            removed_terms.add(term)
    kept_indices = np.where(mask)[0]

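Note that limit is a parameter of this helper method, and the value of self.max_features is passed in for it. So, as you can see, an array of term frequencies is computed: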

tfs = np.asarray(X.sum(axis=0)).ravel()

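The code essentially builds a boolean mask from the document-frequency values (controlled by max_df and min_df). Then, to restrict the mask to the limit terms with the highest term frequency, it does: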

mask_inds = (-tfs[mask]).argsort()[:limit]

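This returns the indices that would sort the term-frequency array in descending order, truncated to length limit by the [:limit] slice. Because .argsort uses quicksort by default, the sort is not stable, so I believe there is no guarantee about which term is kept when frequencies are equal. You get whatever quicksort happens to put there. If a stable sort algorithm were used (NumPy offers one via kind='mergesort'), the outcome would be predictable, because the vocabulary is sorted before the _limit_features helper function is called: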

    if not self.fixed_vocabulary_:
        X = self._sort_features(X, vocabulary)

        n_doc = X.shape[0]
        max_doc_count = (max_df
                         if isinstance(max_df, numbers.Integral)
                         else max_df * n_doc)
        min_doc_count = (min_df
                         if isinstance(min_df, numbers.Integral)
                         else min_df * n_doc)
        if max_doc_count < min_doc_count:
            raise ValueError(
                "max_df corresponds to < documents than min_df")
        X, self.stop_words_ = self._limit_features(X, vocabulary,
                                                   max_doc_count,
                                                   min_doc_count,
                                                   max_features)

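So the vocabulary is in lexicographic order at that point. If argsort used a stable algorithm, I believe we could say that among terms with equal frequency the ones that come first lexicographically would be kept. Since the default sort is not stable, however, we cannot make such a guarantee.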
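To make the stable-sort case concrete, here is a minimal NumPy sketch that mirrors the masking lines quoted above, with the one deliberate change of passing kind="mergesort" to argsort. That change is an assumption made for illustration; the quoted scikit-learn code calls argsort() with its default. The terms and tfs arrays are written out by hand for the toy corpus from the question, with the vocabulary already in lexicographic order.

```python
import numpy as np

# Toy vocabulary from the question, already sorted lexicographically,
# with corpus-wide term frequencies filled in by hand.
terms = np.array(["cat", "computer", "cpu", "gpu",
                  "jeans", "performance", "processor", "ram"])
tfs = np.array([1, 3, 2, 2, 1, 2, 3, 2])
limit = 7  # plays the role of max_features

# No max_df / min_df filtering here, so every term starts out allowed.
mask = np.ones(len(tfs), dtype=bool)

# Same selection as in _limit_features, except that a stable sort is
# requested explicitly (assumption for illustration only).
mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]
new_mask = np.zeros(len(tfs), dtype=bool)
new_mask[np.where(mask)[0][mask_inds]] = True

kept = terms[new_mask].tolist()
dropped = terms[~new_mask].tolist()
print(kept)     # ['cat', 'computer', 'cpu', 'gpu', 'performance', 'processor', 'ram']
print(dropped)  # ['jeans']
```

With a stable sort, equal frequencies keep their original (lexicographic) order, so "cat" is ranked ahead of "jeans" and survives the cutoff. With the default kind, the same result may well appear on an input this small, but nothing in the quoted code guarantees it.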

【Comments】: