【Question title】: Group features of TF-IDF vector in scikit-learn
【Posted】: 2020-01-18 14:45:21
【Question】:

I am using scikit-learn to train a text classification model on TF-IDF feature vectors with the following code:

from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer

model = naive_bayes.MultinomialNB()
feature_vector_train = TfidfVectorizer().fit_transform(X)
model.fit(feature_vector_train, Y)

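I need to sort the extracted features in descending order of TF-IDF weight and split them into two non-overlapping groups of features, then train two different classification models. How can I split the main feature vector into an odd set and an even set?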

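【Comments】:

  • Are you trying to group by the features' TF-IDF weights (i.e. before sending them to the model), by the weights the model assigns to each feature, or by a combined weight (TF-IDF weight * model weight)?
  • @acattle I need to group them before sending them to the model.
  • So you want to rank the features by their TF-IDF weights and then split them into two separate feature matrices for two separate classifiers?
  • @acattle Yes, exactly.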

Tags: python scikit-learn text-classification tfidfvectorizer


【Solution 1】:

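The result of TfidfVectorizer is an n x m matrix, where n is the number of documents and m is the number of unique words. Each column of feature_vector_train therefore corresponds to one particular word in the dataset. Adapting the solution from this tutorial should let you extract the highest- and lowest-weighted words: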

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
feature_vector_train = vectorizer.fit_transform(X)
# on scikit-learn versions older than 1.0, use vectorizer.get_feature_names() instead
feature_names = vectorizer.get_feature_names_out()

# this assumes you only want a straight sum of each feature's weight across all documents;
# .sum(axis=0) returns a 1 x m matrix, so flatten it to a 1-D array before zipping
total_tfidf_weights = np.asarray(feature_vector_train.sum(axis=0)).ravel()
# alternatively, you could use vectorizer.transform(feature_names) to get the values of each feature in isolation

# sort the feature names and the tfidf weights together by zipping them;
# key sorts on the weight (element 1 of each pair), reverse=True sorts from largest to smallest
sorted_names_weights = sorted(zip(feature_names, total_tfidf_weights), key=lambda x: x[1], reverse=True)
# unzip the names and weights
sorted_features_names, sorted_total_tfidf_weights = zip(*sorted_names_weights)

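From this point, you should be able to separate the features however you need. Once you have split them into two groups, group1 and group2, you can split the matrix into two matrices like this: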

# create a feature_name to column index mapping
column_mapping = {name: i for i, name in enumerate(feature_names)}

# get the submatrices
group1_column_indexes = [column_mapping[feat] for feat in group1]
group1_feature_vector_train = feature_vector_train[:, group1_column_indexes]  # all rows, but only group1 columns

group2_column_indexes = [column_mapping[feat] for feat in group2]
group2_feature_vector_train = feature_vector_train[:, group2_column_indexes]
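
For the odd/even split the question asks about, one possible reading is to alternate features by rank: ranks 0, 2, 4, ... go to one group and ranks 1, 3, 5, ... to the other. The sketch below makes that assumption and trains one MultinomialNB per group. The toy data `X` and `Y` and the names `even_rank_columns`, `odd_rank_columns`, `model_even`, and `model_odd` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, used only for illustration
X = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project meeting notes attached",
]
Y = ["spam", "spam", "ham", "ham"]

feature_vector_train = TfidfVectorizer().fit_transform(X)

# Rank columns by summed TF-IDF weight, highest first
total_weights = np.asarray(feature_vector_train.sum(axis=0)).ravel()
ranked_columns = np.argsort(-total_weights)

# Alternate ranks between the two groups so they never overlap
even_rank_columns = ranked_columns[0::2]  # ranks 0, 2, 4, ...
odd_rank_columns = ranked_columns[1::2]   # ranks 1, 3, 5, ...

# All rows, but only the columns belonging to each group
even_features = feature_vector_train[:, even_rank_columns]
odd_features = feature_vector_train[:, odd_rank_columns]

# One classifier per feature group
model_even = MultinomialNB().fit(even_features, Y)
model_odd = MultinomialNB().fit(odd_features, Y)
```

At prediction time, new documents would need to go through the same fitted vectorizer and the same column selection before being passed to the corresponding model, so in practice the vectorizer should be kept in a variable rather than created inline.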

【Comments】:
