Stratify split by column (object): answers

【Question title】: Stratify split by column (object)
【Posted】: 2019-07-19 08:31:07
【Question】:

When I try to do a stratified split by a (categorical) column, I get an error.

Country     ColumnA    ColumnB   ColumnC   Label
AB            0.2        0.5       0.1       14  
CD            0.9        0.2       0.6       60
EF            0.4        0.3       0.8       5
FG            0.6        0.9       0.2       15

Here is my code:

X = df.loc[:, df.columns != 'Label']
y = df['Label']

from sklearn.model_selection import train_test_split

# Train/test split, stratified by country
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)

This gives me the following error:

ValueError: could not convert string to float: 'AB'

【Comments】:

Cannot reproduce the error (using "Country" as "country_code").
@ChristianSloper Good point, fixed. Thanks.
@LucaMassaron Can you help? Thanks.

Tags: python machine-learning split scikit-learn linear-regression

【Solution 1】:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
        'Country': ['AB', 'CD', 'EF', 'FG']*20,
        'ColumnA' : [1]*20*4,'ColumnB' : [10]*20*4, 'Label': [1,0,1,0]*20
    })

df['Country_Code'] = df['Country'].astype('category').cat.codes

X = df.loc[:, df.columns.drop(['Label','Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)

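Convert the string values in Country to numbers and save them as a new column.
When building the X training data, drop the Label column (y) as well as the string Country column.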
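
As a quick check on the two steps above, here is a minimal sketch that confirms the split keeps the country proportions. It assumes the same toy data as the answer. It also passes the string Country column directly to `stratify`, which only needs class labels and not numbers; the string column just has to stay out of X. The names `train_share` and `test_share` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data mirroring the answer: 4 countries, 20 rows each
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})

# Features exclude the label and the string column
X = df.drop(columns=['Label', 'Country'])
y = df['Label']

# The stratification key can be the original string column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country'])

# Look up each row's country through the preserved index
train_share = df.loc[X_train.index, 'Country'].value_counts(normalize=True)
test_share = df.loc[X_test.index, 'Country'].value_counts(normalize=True)
print(train_share.sort_index())
print(test_share.sort_index())
```

If the shares in both parts match the shares in the full data, the stratification worked as intended.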

Method 2

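If the test data you want to predict on arrives later, you will need a mechanism to convert its country to a code before predicting. In that case, the recommended approach is to use LabelEncoder: call its fit method to learn the mapping from strings to labels, then use transform to encode the countries in the test data.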

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

df = pd.DataFrame({
        'Country': ['AB', 'CD', 'EF', 'FG']*20,
        'ColumnA' : [1]*20*4,'ColumnB' : [10]*20*4, 'Label': [1,0,1,0]*20
    })

# Train-Validation 
le = preprocessing.LabelEncoder()
df['Country_Code'] = le.fit_transform(df['Country'])
X = df.loc[:, df.columns.drop(['Label','Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train,y_train)

# Test
test_df = pd.DataFrame({'Country': ['AB'], 'ColumnA' : [1],'ColumnB' : [10] })
test_df['Country_Code'] = le.transform(test_df['Country'])
print(lm.predict(test_df.loc[:, test_df.columns.drop(['Country'])]))
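
One caveat with this approach: `LabelEncoder.transform` raises a ValueError when it meets a label it did not see during `fit`. The sketch below shows one possible guard, assuming you would rather set such rows aside than fail. The country 'ZZ' and the names `new_rows`, `unseen`, and `scorable` are hypothetical, made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on the countries seen at training time
le = LabelEncoder()
le.fit(['AB', 'CD', 'EF', 'FG'])

# Later data; 'ZZ' is a hypothetical country the encoder never saw
new_rows = pd.DataFrame({
    'Country': ['AB', 'ZZ'],
    'ColumnA': [1, 1],
    'ColumnB': [10, 10],
})

# Separate rows with known countries from rows with unseen ones
known = new_rows['Country'].isin(le.classes_)
unseen = new_rows.loc[~known, 'Country'].tolist()

# Encode only the rows the encoder can handle
scorable = new_rows.loc[known].copy()
scorable['Country_Code'] = le.transform(scorable['Country'])

print(unseen)
print(scorable)
```

What to do with the unseen rows (skip them, log them, or refit the encoder) depends on the application.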

【Comments】:

【Solution 2】:

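When reproducing your code, I found that the error comes from trying to fit a linear regression model to a set of features that contains strings. This answer gives you some options for handling that. I suggest one-hot encoding your countries after you call train_test_split(), so that you keep the class balance you are looking for: X_train, X_test = pd.get_dummies(X_train.Country), pd.get_dummies(X_test.Country). Note that this one-liner keeps only the country dummy columns; to retain the other features, pass the whole frame, as in pd.get_dummies(X_train, columns=['Country']).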
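
A minimal sketch of the one-hot route, under a few assumptions: it reuses the toy data from Solution 1, stratifies on the string Country column, and realigns the test columns to the training columns in case a country is missing from the test part. The names `X_train_enc` and `X_test_enc` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy data borrowed from Solution 1 (an assumption, not the asker's real data)
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80,
    'ColumnB': [10] * 80,
    'Label': [1, 0, 1, 0] * 20,
})

X = df.drop(columns=['Label'])
y = df['Label']

# Split first, stratifying on the string column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=X['Country'])

# One-hot encode Country while keeping the numeric columns
X_train_enc = pd.get_dummies(X_train, columns=['Country'], dtype=float)

# Give the test set the same columns, in the same order, as the training set
X_test_enc = pd.get_dummies(X_test, columns=['Country'], dtype=float).reindex(
    columns=X_train_enc.columns, fill_value=0)

lm = LinearRegression()
lm.fit(X_train_enc, y_train)
lm_predictions = lm.predict(X_test_enc)
print(X_train_enc.columns.tolist())
```

The `reindex` step matters when the two parts do not contain the same set of countries, since `get_dummies` would otherwise produce differently shaped frames.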

【Comments】: