All preprocessing worsens accuracy: answers

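[Question title]: All preprocessing worsens accuracy
[Posted]: 2017-12-08 07:22:04
[Question description]:

I am running a grid-search cross-validation with a logistic regression model. I first have my default model, and then a model that is supposed to preprocess the data. The data are random text documents, each belonging to one of 4 categories. Even when I make my preprocessor simply return the data unchanged, it seems to lower my accuracy and f1 score, as shown below. The grid search selects its regularization parameter C after the data has gone through this preprocessor, which should not change anything.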

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# transformed_train_data and transformed_dev_data are built earlier (not shown).
# Candidate values for the inverse regularization strength C.
Cs = {'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
gs_clf_LR = GridSearchCV(LogisticRegression(penalty='l2'), Cs, refit=True)
gs_clf_LR.fit(transformed_train_data, train_labels)
preds = gs_clf_LR.predict(transformed_dev_data)
# print(gs_clf_LR.score(transformed_dev_data, dev_labels))
print(gs_clf_LR.best_params_)
# best_score_ is the mean cross-validated score for the best C, not the dev-set accuracy.
print('With optimal C, accuracy score is: ', gs_clf_LR.best_score_)
print('f1 score: ', metrics.f1_score(dev_labels, preds, average='weighted'))
print(metrics.classification_report(dev_labels, preds))
print()

def better_preprocessor(string):
    #return re.sub(r'^[A-Z]', '^[a-z]', string)
    #return re.sub(r'(ing)$', '', string)
    #return re.sub(r'(es)$', '', string)
    #return re.sub(r's$', '', string)
    #return re.sub(r'(ed)$', '', string)
    return string


# Same pipeline, but the vectorizer now uses the custom preprocessor.
vec = CountVectorizer(preprocessor=better_preprocessor)
transformed_preprocessed_train_data = vec.fit_transform(train_data)
transformed_preprocessed_dev_data = vec.transform(dev_data)

gs_clf_LR.fit(transformed_preprocessed_train_data, train_labels)
preds_pp = gs_clf_LR.predict(transformed_preprocessed_dev_data)
# print(gs_clf_LR.score(transformed_preprocessed_dev_data, dev_labels))
print(gs_clf_LR.best_params_)
print('With optimal C, accuracy score is: ', gs_clf_LR.best_score_)
print('f1 score: ', metrics.f1_score(dev_labels, preds_pp, average='weighted'))
print(metrics.classification_report(dev_labels, preds_pp))

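With some real preprocessing, such as the regex lines I have commented out, I also see my accuracy and f1 score drop (which seems plausible, but I am stripping plurals and was told this should improve my score).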

[Question discussion]:

Tags: python scikit-learn countvectorizer

[Solution 1]:

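The problem is that your preprocessing does essentially nothing, because in CountVectorizer the preprocessing step runs before tokenization. This means your function receives the whole text of a document, so the regexes anchored with $ can only match at the very end of the document, not at the end of each word.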
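
To make the point about `$` concrete, here is a small sketch using only Python's `re` module. The sample sentence is made up for illustration and is not from the question's data; the regex is one of the commented-out lines from the question.

```python
import re

# Hypothetical sample document (not from the question's data set).
doc = "The cats chased dogs across yards"

# Applied to the whole document, `s$` can only match at the very end of the
# string, so only the final word loses its trailing "s".
whole_doc = re.sub(r's$', '', doc)
print(whole_doc)

# Applied token by token, the same regex reaches every word
# (including ones it arguably should not touch, such as "across").
per_token = [re.sub(r's$', '', tok) for tok in doc.split()]
print(per_token)
```

The per-token result also hints at why plain suffix regexes are a crude substitute for a real stemmer.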

Here is the result of fitting the vectorizer with your better_preprocessor:

In [16]: data = ['How are you guys doing? Fine! We are very satisfied']

In [17]: vec.fit(data)
Out[17]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function better_preprocessor at 0x000002DB839FF048>,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

In [18]: vec.get_feature_names()
Out[18]: ['Fine', 'We', 'are', 'doing', 'guys', 'ow', 'satisfied', 'very', 'you']
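
The vocabulary above contains 'ow' and keeps 'Fine' and 'We' capitalized. The sketch below reproduces that list with the standard library only, under the assumption (not stated in the answer) that the first regex from the question was active when the output was produced. It also suggests why an identity preprocessor can change results: as far as the scikit-learn documentation describes it, a custom preprocessor replaces the built-in strip_accents and lowercase handling, so tokens are no longer lowercased.

```python
import re

data = 'How are you guys doing? Fine! We are very satisfied'

# Assumption: the first regex from the question was enabled. The replacement
# string is inserted literally, so the leading "H" becomes the text "^[a-z]".
preprocessed = re.sub(r'^[A-Z]', '^[a-z]', data)

# Default token_pattern, as shown in the CountVectorizer repr above:
# tokens are runs of two or more word characters.
tokens = re.findall(r'(?u)\b\w\w+\b', preprocessed)

# No lowercasing happens here, mirroring a vectorizer whose preprocessing
# stage has been overridden.
vocabulary = sorted(set(tokens))
print(preprocessed)
print(vocabulary)
```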

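This means you have to override the analyzer step with your function, not the preprocessor. Compare:

analyzer : string, {'word', 'char', 'char_wb'} or callable. Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.

preprocessor : callable or None (default). Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generating steps.

You will then have to handle tokenization inside your function, but you can use the default '(?u)\\b\\w\\w+\\b', so that is not hard. In any case, I don't think your approach is robust, and I suggest using something like SnowballStemmer from NLTK instead of these regexes.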
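
A minimal sketch of what such an analyzer function might look like, using only the standard library. The function name and the combined suffix regex are hypothetical; the token pattern is the default quoted above, and the suffixes are the ones from the question's commented-out lines.

```python
import re

# Default token pattern from the CountVectorizer repr above.
TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')
# Suffixes taken from the question's commented-out regexes, merged into one
# pattern (an illustrative choice, not the asker's exact code).
SUFFIX_RE = re.compile(r'(ing|es|ed|s)$')

def suffix_stripping_analyzer(doc):
    """Hypothetical analyzer: lowercase, tokenize, strip one suffix per token."""
    tokens = TOKEN_RE.findall(doc.lower())
    # Each token is a single word here, so `$` now means "end of word".
    return [SUFFIX_RE.sub('', tok) for tok in tokens]

features = suffix_stripping_analyzer(
    'How are you guys doing? Fine! We are very satisfied')
print(features)
```

A function like this could be passed as `CountVectorizer(analyzer=suffix_stripping_analyzer)`. Following the answer's recommendation, the regex step could be swapped for `SnowballStemmer('english').stem` from NLTK, which handles cases such as "satisfied" more sensibly than suffix chopping.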

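[Discussion]:

[Solution 2]:

Did you hold out a randomly generated test set from the data (one that sits outside the cross-validation) to test both models? The drop in accuracy might be the result of better generalization, achieved by reducing overfitting to the data.

[Discussion]:

That would be the transformed_dev_data and then the transformed_preprocessed_dev_data variables. Should I preprocess only the training data and not the test data?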