[Posted]: 2019-11-19 13:44:01
[Question]:
I created a Pandas UDF that takes a DataFrame as input, runs a prediction, and outputs a DataFrame with the primary key and the predictions.
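import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

schema = StructType([StructField('primary_id', IntegerType()),
                     StructField('prediction', FloatType())])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_model(sample_df):
    # run the model on the partitioned data set
    ids = sample_df['primary_id']
    # drop the key columns so that only the feature columns remain
    x_train = sample_df.drop(['primary_id', 'partition_id'], axis=1)
    # model_broadcast is defined elsewhere (presumably a broadcast variable wrapping the fitted model)
    pred = model_broadcast.value.predict_proba(x_train)
    return pd.DataFrame({'primary_id': ids, 'prediction': pred[:, 1]})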
sample_df is the input DataFrame.
When I test it, the code runs fine, as shown below:
a = apply_model.func(df)
Printing a.dtypes gives:
prediction    float64
primary_id      int64
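Note that these dtypes are wider than the declared schema: Spark's IntegerType and FloatType are 32-bit types, whereas the function returns int64 and float64. This is not necessarily related to the error reported further down, but a minimal sketch of aligning the two might look like the following. The sample frame and the helper name `match_schema_dtypes` are hypothetical stand-ins, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by apply_model.func(df)
a = pd.DataFrame({"primary_id": [1, 2, 3], "prediction": [0.1, 0.7, 0.4]})

def match_schema_dtypes(out):
    """Cast the output columns to the 32-bit types declared in the schema."""
    return out.astype({"primary_id": np.int32, "prediction": np.float32})

aligned = match_schema_dtypes(a)
print(aligned.dtypes)
```

An alternative with the same effect would be to leave the pandas output untouched and declare LongType and DoubleType in the schema instead.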
When I run the following code:
results = df.groupby('partition_id').apply(apply_model)
the statement above fails with this error:
TypeError: Invalid argument, not a string or column:
[26 rows x 32 columns] of type <class 'pandas.core.frame.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
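The error text names a `pandas.core.frame.DataFrame`, and `apply_model.func(df)` ran directly on `df`, so one plausible reading is that `df` is a pandas DataFrame rather than a Spark DataFrame. In that case `df.groupby(...).apply(...)` would be the pandas method, which hands each group to the UDF wrapper as a pandas frame where Spark expects column arguments. A minimal check, using a hypothetical helper `frame_kind` and made-up sample data, could be:

```python
import pandas as pd

def frame_kind(obj):
    """Return 'pandas', 'spark', or 'unknown' for the given object (hypothetical helper)."""
    if isinstance(obj, pd.DataFrame):
        return "pandas"
    # Compare by module and class name so the sketch runs even without pyspark installed.
    cls = type(obj)
    if cls.__module__.startswith("pyspark.sql") and cls.__name__ == "DataFrame":
        return "spark"
    return "unknown"

# Made-up stand-in for the question's df
df = pd.DataFrame({"partition_id": [0, 0, 1], "primary_id": [1, 2, 3]})
print(frame_kind(df))
```

If this reports `pandas` for the real `df`, converting it first with `spark.createDataFrame(df)` (assuming an active SparkSession named `spark`) and then calling `groupby('partition_id').apply(apply_model)` on the result should route the call through Spark's grouped-map path.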
[Discussion]:
Tags: pandas pyspark user-defined-functions sklearn-pandas