【Posted】: 2021-03-12 10:43:26
【Question】:
I ran into some problems when oversampling my dataset. Here is what I did:
# Separate input features and target
y_up = df.Label
X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)
# setting up testing and training sets
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)
class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]
# upsample minority
class_1_upsampled = resample(class_1,
                             replace=True,
                             n_samples=len(class_0),
                             random_state=27)
# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])
My dataset looks like this:
Label Text
1 bla bla bla
0 once upon a time
1 some other sentences
1 a few sentences more
1 this is my dataset!
I applied a vectorizer to convert the strings to numbers:
X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]
X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)
Then I fit a logistic regression:
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)
However, I get the following error at this step:
X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
pred_up_log = upsampled_log.predict(X_test_up)
ValueError: X has 3021 features per sample; expecting 5542
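Since I was told that oversampling should be applied after splitting the dataset into train and test, I did not vectorize the test set. My doubts are the following: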
- Is it correct to vectorize the test set afterwards, like this:
  X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
- Is it correct to oversample after splitting the dataset into train and test?
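On the first doubt, a minimal sketch may help show where the feature-count mismatch comes from. It uses made-up toy texts (not the original data) and a plain CountVectorizer as a stand-in for the unspecified `vectorizer`. Calling fit_transform on the test text learns a new vocabulary, while transform reuses the columns learned from the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, for illustration only
train_texts = [
    "bla bla bla",
    "once upon a time",
    "some other sentences",
    "a few sentences more",
]
test_texts = ["this is my dataset"]

# Learn the vocabulary (the feature columns) from the training text
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Refitting on the test text learns a different, unrelated vocabulary
X_test_refit = CountVectorizer().fit_transform(test_texts)

# Reusing the fitted vectorizer keeps the training columns
X_test = vectorizer.transform(test_texts)

print("train features:", X_train.shape[1])
print("test features after refitting:", X_test_refit.shape[1])
print("test features after transform only:", X_test.shape[1])
```

If this reading is right, the test-set line in the question would call vectorizer.transform(...) rather than vectorizer.fit_transform(...), so that the model sees the same columns it was trained on.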
I also tried SMOTE. The code below works, but if possible I would prefer to use plain oversampling rather than SMOTE.
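from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=42)
# Fit the count vectorizer and the tf-idf transformer on the training text only
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# fit_resample is the current method name; fit_sample was removed in later imbalanced-learn releases
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train_tfidf, y_train_up)
print("Shape after smote is:", X_train_res.shape, y_train_res.shape)
nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
# Apply both fitted transforms to the test text so it matches the training features
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test_up))
y_pred = nb.predict(X_test_tfidf)
print(accuracy_score(y_test_up, y_pred))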
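As a possible alternative to SMOTE, plain random oversampling can be applied to the raw training rows before any vectorization, and the vectorizer can then live inside a scikit-learn Pipeline so that train and test text always pass through the same fitted transform. The sketch below is only an illustration under assumptions: the toy DataFrame is made up, the minority class is detected from the counts, and TfidfVectorizer is swapped in for the CountVectorizer plus TfidfTransformer pair.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils import resample

# Hypothetical toy frame standing in for the original df
df = pd.DataFrame({
    "Text": [
        "bla bla bla", "once upon a time", "some other sentences",
        "a few sentences more", "this is my dataset", "more bla bla",
        "yet another sentence", "once more with feeling",
        "time for some text", "a different story", "upon a hill",
        "final sentence here",
    ],
    "Label": [1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
})

# Split first, so the test rows are never duplicated into the training set
train_df, test_df = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df["Label"]
)

# Randomly oversample the minority class of the training rows only
counts = train_df["Label"].value_counts()
minority, majority = counts.idxmin(), counts.idxmax()
minority_up = resample(
    train_df[train_df["Label"] == minority],
    replace=True,
    n_samples=int(counts.max()),
    random_state=42,
)
train_up = pd.concat([train_df[train_df["Label"] == majority], minority_up])

# The pipeline fits the vectorizer on training text and reuses it at predict time
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(train_up["Text"], train_up["Label"])
y_pred = model.predict(test_df["Text"])
print(train_up["Label"].value_counts())
```

Because the duplicated rows are identical copies, oversampling the text before vectorizing should give the same training matrix as oversampling the vectorized rows, while keeping the code shorter.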
Any comments and suggestions would be greatly appreciated. Thanks.
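【Comments】:
- You need to call vectorizer.fit_transform() on the whole dataset, otherwise your train set will have features that your test set does not, and vice versa.
- You could fill in the missing columns, but it would be super messy.
- Thanks @StupidWolf. How can I apply it to the whole dataset? Can I do it before splitting into train and test?
- OK, I see the problem now. You need to upsample. I would vectorize first and then upsample on the train set. You don't need to convert to a dense array. Let me see if I can write an answer.
- Thanks a lot. That would be great.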
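The approach outlined in the comments (vectorize first, then upsample only the training rows, without converting to a dense array) could look roughly like the sketch below. It is not the commenter's actual answer: the toy texts and labels are made up, TfidfVectorizer is an assumed choice of vectorizer, and the minority class is picked from the training counts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data standing in for df['Text'] and df['Label']
texts = np.array([
    "bla bla bla", "once upon a time", "some other sentences",
    "a few sentences more", "this is my dataset", "more bla bla",
    "yet another sentence", "once more with feeling",
    "time for some text", "a different story", "upon a hill",
    "final sentence here",
])
labels = np.array([1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1])

# Vectorize once, as the comments suggest; the result stays a sparse matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=27, stratify=labels
)

# Upsample the minority class by resampling row indices of the training set
minority = np.bincount(y_train).argmin()
idx_min = np.flatnonzero(y_train == minority)
idx_maj = np.flatnonzero(y_train != minority)
idx_min_up = resample(idx_min, replace=True,
                      n_samples=len(idx_maj), random_state=27)
idx_up = np.concatenate([idx_maj, idx_min_up])

X_train_up = X_train[idx_up]  # sparse row indexing, no todense() needed
y_train_up = y_train[idx_up]

clf = LogisticRegression(solver="liblinear").fit(X_train_up, y_train_up)
pred = clf.predict(X_test)
print(X_train_up.shape, X_test.shape)
```

Fitting the vectorizer on the whole dataset follows the comment literally; fitting it on the training text and calling transform on the test text would give matching columns as well, without the test vocabulary influencing the features.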
Tags: python scikit-learn vectorization logistic-regression text-classification