Is there a way to use lists as values in a DataFrame? (Answers)

【Question title】: Is there a way to use Lists as values in a DataFrame?
【Posted】: 2019-05-13 13:07:31
【Question description】:

I am working on the well-known Kaggle challenge "House Prices". I want to train on my dataset with LinearRegression from sklearn.linear_model.

After reading the following article: https://developers.google.com/machine-learning/crash-course/representation/feature-engineering

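I wrote a function that converts all string values in my training DataFrame into lists. For example, the original feature values might look like [Ex, Gd, Ta, Po], and after conversion they look like this: [1,0,0,0] [0,1,0,0] [0,0,1,0] [0,0,0,1].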

When I try to train on my data, I get the following error:

Traceback (most recent call last):
  File "C:/Users/Owner/PycharmProjects/HousePrices/main.py", line 27, in <module>
    linereg.fit(train_df, target)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 567, in check_array
    array = array.astype(np.float64)
ValueError: setting an array element with a sequence.

This only happens when I convert some of the columns in the way I described.
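The error can be reproduced on a much smaller frame. The following is a minimal sketch, not the asker's actual data: the column names `quality_vec` and `area` are made up for illustration. It mimics the step shown in the traceback, where the frame is converted to a NumPy array and cast to float64, which cannot work when individual cells hold lists.

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical two-row frame: one column holds one-hot lists, one holds plain numbers.
    df = pd.DataFrame({
        "quality_vec": [[1, 0, 0, 0], [0, 1, 0, 0]],
        "area": [1500.0, 2000.0],
    })

    # A column of Python lists is stored with the generic object dtype.
    print(df["quality_vec"].dtype)

    # Casting the whole frame to floats fails, because a single cell
    # containing a list cannot be turned into a single float.
    error = None
    try:
        df.to_numpy().astype(np.float64)
    except (ValueError, TypeError) as exc:
        error = exc

    print(type(error).__name__, error)
    ```

Dropping the `quality_vec` column from this toy frame makes the cast succeed, which matches the observation that the error only appears for the converted columns.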

Is there a way to train a linear regression model with vectors as values?

Here is my conversion function:

def feature_to_boolean_vector(df, feature_name, new_name):
    vectors_list = []  # each list will represent one option
    feature_options = df[feature_name].unique()
    feature_options_length = len(feature_options)

    # creating a list the size of feature_options_length, all 0's
    list_to_be_vector = [0 for i in range(feature_options_length)]

    for i in range(feature_options_length):
        list_to_be_vector[i] = 1 # inserting 1 representing option number i
        vectors_list.append(list_to_be_vector.copy())
        list_to_be_vector[i] = 0

    mapping = dict(zip(feature_options, vectors_list)) # dict from values to vectors
    df[new_name] = df[feature_name].map(mapping)
    df.drop([feature_name], axis=1, inplace=True)

Here is my training attempt (after preprocessing):

linereg = LinearRegression()
linereg.fit(train_df, target)

Thank you in advance.

【Question comments】:

get_dummies is usually used for this purpose. Another option is to treat your data as categories.
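As a rough sketch of the "categories" suggestion in the comment above: a quality-style feature can be stored as an ordered pandas categorical and reduced to one integer code per row. The ordering Po < Ta < Gd < Ex is an assumption made for this illustration; the thread does not state it. Integer codes also imply evenly spaced levels, which a linear model will take literally, so this is not equivalent to one-hot encoding.

    ```python
    import pandas as pd

    feature = pd.Series(["Ex", "Gd", "Ta", "Po"], name="feature")

    # Assumed ranking from worst to best; adjust to the real meaning of the labels.
    quality_order = ["Po", "Ta", "Gd", "Ex"]

    # An ordered categorical maps each label to its position in quality_order.
    categorical = pd.Categorical(feature, categories=quality_order, ordered=True)
    codes = pd.Series(categorical.codes, index=feature.index, name="feature_code")

    print(codes.tolist())  # [3, 2, 1, 0]
    ```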

Tags: python linear-regression sklearn-pandas

【Solution 1】:

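LinearRegression does not support lists as features. I see that you are using one-hot encoding; instead of storing the whole vector in one cell, you can use each dimension as its own feature column. Alternatively, pandas offers a simpler way to do this: pd.get_dummies.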

print(df['feature'])
0    Ex
1    Gd
2    Ta
3    Po
Name: feature, dtype: object

df = pd.get_dummies(df['feature'])
print(df)
   Ex  Gd  Po  Ta
0   1   0   0   0
1   0   1   0   0
2   0   0   0   1
3   0   0   1   0
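To connect this back to the question, here is a small end-to-end sketch under stated assumptions: the toy frame, its column names (`quality`, `area`) and the target values are invented and only stand in for the Kaggle data. It encodes the string column with pd.get_dummies, fits LinearRegression on the result, and also shows how a column that already holds one-hot lists could be spread into separate columns.

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical toy data standing in for the real training set.
    train_df = pd.DataFrame({
        "quality": ["Ex", "Gd", "Ta", "Po", "Gd", "Ex"],
        "area": [2000, 1500, 1200, 900, 1600, 2100],
    })
    target = pd.Series([300000, 200000, 150000, 90000, 210000, 320000])

    # One 0/1 column per category, instead of one list per cell.
    # dtype=int keeps the 0/1 display regardless of the pandas default.
    encoded = pd.get_dummies(train_df, columns=["quality"], dtype=int)
    print(encoded.columns.tolist())

    # Every cell is now a plain number, so fitting no longer raises.
    linereg = LinearRegression()
    linereg.fit(encoded, target)

    # If a column already contains one-hot lists, each position can
    # become its own column in a similar way.
    vectors = pd.Series([[1, 0], [0, 1]], name="vec")
    expanded = pd.DataFrame(vectors.tolist(), index=vectors.index).add_prefix("vec_")
    print(expanded)
    ```

With an intercept in the model, a full set of dummy columns is redundant; passing `drop_first=True` to pd.get_dummies is one common way to drop one level per feature.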
