Oversampling after splitting the dataset - Text classification

[Question title]: Oversampling after splitting the dataset - Text classification
[Posted]: 2021-03-12 10:43:26
[Question]:

I am having some problems oversampling my dataset. This is what I have done:

# Separate input features and target
y_up = df.Label

X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)

# setting up testing and training sets

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)

class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]


# upsample minority
class_1_upsampled = resample(class_1,
                          replace=True, 
                          n_samples=len(class_0), 
                          random_state=27) #

# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])

Since my dataset looks like this:

Label     Text 
1        bla bla bla
0        once upon a time 
1        some other sentences
1        a few sentences more
1        this is my dataset!

I applied a vectorizer to convert the strings into numbers:

X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]

X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)

Then I fitted a logistic regression:

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)

However, I get the following error at this step:

X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)

pred_up_log = upsampled_log.predict(X_test_up)

ValueError: X has 3021 features per sample; expecting 5542

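Since I was told that oversampling should be applied after splitting the dataset into train and test, I did not vectorize the test set beforehand. My doubts are the following:

Is it correct to vectorize the test set afterwards: X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
Is it correct to oversample after splitting the dataset into train and test?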

I also tried SMOTE. The code below works, but if possible I would prefer plain oversampling rather than SMOTE.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df['Text'],df['Label'], test_size=0.2,random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, y_train_up)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)

nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(count_vect.transform(X_test_up))
print(accuracy_score(y_test_up,y_pred))

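Any comments and suggestions would be appreciated. Thanks

[Comments]:

You need to call vectorizer.fit_transform() on the whole dataset, otherwise your train set will have features that your test set does not, and vice versa
You could fill in the missing columns, but it would be super messy
Thanks @StupidWolf. How can I apply it to the whole dataset? Can it be done before splitting into train and test?
OK, I see the problem now. You need to upsample. I would vectorize first, then upsample on the train set. You do not need to convert to a dense array. Let me see if I can write up an answer.
Thank you so much. That would be great.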

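Tags: python scikit-learn vectorization logistic-regression text-classification

[Solution 1]:

It is better to apply the CountVectorizer and the tf-idf transformation to the whole dataset, split it into test and train, and keep it as a sparse matrix without converting back to a data.frame.

For example, here is a dataset: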

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Text':['This is bill','This is mac','here’s an old saying',
                           'at least old','data scientist years','data science is data wrangling', 
                           'This rings particularly','true for data science leaders',
                           'who watch their data','scientists spend days',
                           'painstakingly picking apart','ossified corporate datasets',
                           'arcane Excel spreadsheets','Does data science really',
                           'they just delegate the job','Data Is More Than Just Numbers',
                           'The reason that',
                           'data wrangling is so difficult','data is more than text and numbers'],
                   'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})

We vectorize and transform, then split:

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_tfidf,df['Label'].values, 
                                                              test_size=0.2,random_state=42)
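
As a small check on the "keep it sparse" advice, the sketch below shows that train_test_split accepts a SciPy sparse matrix directly and returns sparse pieces with the same number of columns, so no .todense() call is needed. The texts and labels are toy values assumed purely for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels, assumed for illustration only.
texts = ["this is bill", "this is mac", "at least old",
         "data scientist years", "who watch their data", "scientists spend days"]
labels = [0, 1, 0, 1, 0, 0]

# fit_transform returns a sparse document-term matrix.
X = CountVectorizer().fit_transform(texts)

# The split slices rows of the sparse matrix; columns are untouched.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.33, random_state=42)

print(sparse.issparse(X_tr), sparse.issparse(X_te))
print(X_tr.shape[1] == X_te.shape[1])
```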

Upsampling can be done by resampling the indices of the minority class:

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])
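
np.random.choice is unseeded above, so the drawn indices change from run to run. A reproducible variant, sketched here with sklearn.utils.resample and a fixed random_state, draws the minority indices the same way every time. The label array and the names majority_idx, minority_idx and minority_up are made up for the illustration.

```python
import numpy as np
from sklearn.utils import resample

# Toy training labels, assumed for illustration only (6 majority, 2 minority).
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 0])

majority_idx = np.where(y_train == 0)[0]
minority_idx = np.where(y_train == 1)[0]

# Draw minority indices with replacement until both classes have the same size.
minority_up = resample(minority_idx, replace=True,
                       n_samples=len(majority_idx), random_state=27)

# Row indices of the balanced training set.
up_idx = np.concatenate((majority_idx, minority_up))

# Class counts after upsampling.
print(np.bincount(y_train[up_idx]))
```

The resulting up_idx can be used to index both the sparse feature matrix and the label array, as in the fit call above.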

Prediction then works:

upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])
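
For context on why the question's code raised the ValueError: calling fit_transform a second time on the test text learns a new vocabulary, so the column count no longer matches what the model was trained on. The sketch below illustrates this with two tiny toy lists (assumed, not taken from the question's data) and contrasts it with calling transform on the already fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy texts, assumed for illustration only.
train_texts = ["data science is data wrangling", "this is bill"]
test_texts = ["scientists spend days"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Refitting on the test texts builds a new, unrelated vocabulary.
X_test_refit = CountVectorizer().fit_transform(test_texts)

# Reusing the fitted vectorizer keeps the training vocabulary and column order.
X_test_ok = vectorizer.transform(test_texts)

print(X_train.shape, X_test_refit.shape, X_test_ok.shape)
```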

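You may be worried about data leakage, that is, some information from the test set making its way into training through the use of TfidfTransformer(). Honestly, I have not seen concrete evidence or a demonstration of this, but below is an alternative where the tf-idf step is fitted on the training data only: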

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_counts,df['Label'].values, 
                                                              test_size=0.2,random_state=42)

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsample_y = y_train_up[up_idx]

upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain,upsample_y)

X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)
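
A stricter variant, sketched below, splits the raw text first and fits the vocabulary on the upsampled training text only, so nothing from the test set is seen before prediction. This is one possible arrangement rather than the method used above: TfidfVectorizer is swapped in for the CountVectorizer plus TfidfTransformer pair, and the toy data, the stratify option and the variable names are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data, assumed for illustration only (8 majority, 4 minority rows).
toy = pd.DataFrame({
    "Text": ["this is bill", "this is mac", "at least old",
             "data science is data wrangling", "who watch their data",
             "scientists spend days", "arcane excel spreadsheets",
             "they just delegate the job", "data scientist years",
             "does data science really", "data wrangling is so difficult",
             "here is an old saying"],
    "Label": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# 1. Split the raw text before any fitting happens.
text_train, text_test, y_train, y_test = train_test_split(
    toy["Text"], toy["Label"].values, test_size=0.25,
    random_state=0, stratify=toy["Label"])

# 2. Upsample the minority class by index, on the training part only.
majority_idx = np.where(y_train == 0)[0]
minority_idx = np.where(y_train == 1)[0]
minority_up = resample(minority_idx, replace=True,
                       n_samples=len(majority_idx), random_state=0)
up_idx = np.concatenate((majority_idx, minority_up))

# 3. Fit vocabulary and idf weights on the upsampled training text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train.iloc[up_idx])
model = LogisticRegression(solver="liblinear").fit(X_train, y_train[up_idx])

# 4. Only transform the test text; words unseen in training are ignored.
X_test = vectorizer.transform(text_test)
pred = model.predict(X_test)
print(X_train.shape[1] == X_test.shape[1], len(pred) == len(y_test))
```

The trade-off is that words appearing only in the test set get no column at all, which is also what would happen with genuinely new text at prediction time.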

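[Comments]:

Doing it this way may cause some data leakage. See e.g. stats.stackexchange.com/q/154660/232706
If you do the idf part, yes, some of it can leak. OK, I can edit the answer. Please do not be vague, and point explicitly to the relevant part of your link
@StupidWolf Hi, I found this question and read your answer. I have a similar problem with overfitting. I do not know whether this would interest you, but if you would like to take a look, here is the link: stackoverflow.com/questions/65191701/…. Thank you very much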