Gridsearch for NLP - How to combine CountVec and other features? Answers

【Question title】: Gridsearch for NLP - How to combine CountVec and other features?
【Posted】: 2021-01-25 17:37:34
【Question description】:

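I am working on a basic NLP project on sentiment analysis, and I would like to use GridSearchCV to optimize my model.

The code below shows the sample dataframe I am working with. 'content' is the column to be passed to CountVectorizer, 'label' is the y column to be predicted, and feature_1 and feature_2 are columns I also want to include in my model.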

import pandas as pd

df = pd.DataFrame([
    {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list',
     'feature_1': '1',
     'feature_2': '34',
     'label': 1},
    {'content': 'UP today Why doe head hurt badly',
     'feature_1': '5',
     'feature_2': '142',
     'label': 1},
    {'content': 'spray tan fail leg foot Ive scrubbing foot look better ',
     'feature_1': '7',
     'feature_2': '123',
     'label': 0},
])

I am following this Stack Overflow answer: Perform feature selection using pipeline and gridsearch

import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin, BaseEstimator

class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, feature_1=True, feature_2=True):
        self.feature_1 = feature_1
        self.feature_2 = feature_2

    def extractor(self, tweet):
        features = []
        if self.feature_2:
            features.append(df['feature_2'])
        if self.feature_1:
            features.append(df['feature_1'])
        return np.array(features)

    def fit(self, raw_docs, y):
        return self

    def transform(self, raw_docs):
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))

Below is the pipeline I am trying to fit to my dataframe for the grid search:

lr = LogisticRegression()

# Pipeline
pipe = Pipeline([('features', FeatureUnion([("vectorizer", CountVectorizer(df['content'])),
                                            ("extractor", CustomFeatureExtractor())]))
                 ,('classifier', lr())
                ])
But this yields: TypeError: 'LogisticRegression' object is not callable

I am wondering whether there is a simpler way to do this.

I have already looked at the threads below, but to no avail: How to combine TFIDF features with other features; Perform feature selection using pipeline and gridsearch

【Question discussion】:

Tags: python nlp pipeline modeling

【Solution 1】:

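lr() will not work: lr is already a LogisticRegression instance, and an instance is not callable. You can only call its methods (such as fit or predict).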
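To see why the parentheses matter, here is a minimal sketch (not part of the original answer) that reproduces the error in isolation. It assumes only that scikit-learn is installed; the variable name `message` is introduced for illustration.

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()  # lr is already an estimator instance

try:
    lr()  # attempts to call the instance as if it were a function
except TypeError as exc:
    # Same kind of TypeError as reported in the question.
    message = str(exc)

print(message)
```

Passing `lr` itself (no parentheses) hands the pipeline the estimator object, which is what it expects.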

Try this instead (lr without parentheses):

lr = LogisticRegression()
pipe = Pipeline([('features', FeatureUnion([("vectorizer", CountVectorizer(df['content'])),
                                            ("extractor", CustomFeatureExtractor())]))
                 ,('classifier', lr)
                ])

Your error message should then go away.
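On the question of a simpler way to combine the text column with the other features, one option that may be worth trying is scikit-learn's ColumnTransformer, which selects columns by name so no custom extractor class is needed. The following is only a sketch under stated assumptions, not the original answerer's method: the toy rows are shortened or made up for illustration, feature_1 and feature_2 are assumed to have been cast to numbers, and the step names and the parameter grid are arbitrary choices. CountVectorizer is constructed without any data here; it receives the 'content' column when the pipeline is fitted.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only), same column layout as in the question.
df = pd.DataFrame({
    "content": [
        "Got flat way today Pot hole",
        "UP today Why doe head hurt badly",
        "spray tan fail leg foot",
        "great day sunny walk park",
        "head hurt week list crap",
        "foot look better today",
    ],
    "feature_1": [1, 5, 7, 2, 4, 6],
    "feature_2": [34, 142, 123, 50, 90, 110],
    "label": [1, 1, 0, 0, 1, 0],
})

features = ColumnTransformer([
    # A plain string (not a list) passes a 1-D column of documents,
    # which is the input shape CountVectorizer expects.
    ("vectorizer", CountVectorizer(), "content"),
    # A list of names passes a 2-D block of numeric columns.
    ("numeric", StandardScaler(), ["feature_1", "feature_2"]),
])

pipe = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression(max_iter=1000)),  # instance, not called
])

# Nested parameters are addressed as <step>__<sub-step>__<parameter>.
param_grid = {
    "features__vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
X = df[["content", "feature_1", "feature_2"]]
search.fit(X, df["label"])

best_params = search.best_params_
predictions = search.predict(X)
print(best_params)
```

With real data, a larger `cv` value and a wider grid would be more appropriate; `cv=2` is used here only because the toy dataframe has six rows.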

【Discussion】: