Voting classifier UDF in pyspark: answers

【Question title】: Voting classifier UDF in pyspark
【Posted】: 2021-12-26 21:56:33
【Question description】:

I am trying to implement a voting classifier in pyspark.

I wrote the function predictestimator. The arguments passed to it are estimators1 (pipeline models trained and fitted in pyspark), X (the test dataframe), the possible class labels, and the weight values.
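For reference, here is a minimal sketch of what soft voting computes, kept separate from the function in the question. It assumes each model's class probabilities are already available as plain Python lists. The name soft_vote and the sample numbers are hypothetical and only serve the illustration.

```python
def soft_vote(probas, weights, labels):
    """Weighted soft voting.

    probas  -- one entry per model; each entry is a list of rows, and each
               row is a list of class probabilities (hypothetical layout)
    weights -- one weight per model
    labels  -- class labels, in the same order as the probability columns
    """
    total = float(sum(weights))
    n_rows = len(probas[0])
    n_classes = len(labels)
    predictions = []
    for row in range(n_rows):
        # Weighted average of every model's probability for each class.
        avg = [
            sum(w * model[row][c] for model, w in zip(probas, weights)) / total
            for c in range(n_classes)
        ]
        # Pick the class with the highest averaged probability.
        best = max(range(n_classes), key=avg.__getitem__)
        predictions.append(labels[best])
    return predictions


# Made-up probabilities for two models, two rows, three classes.
model_a = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
model_b = [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]
result = soft_vote([model_a, model_b], weights=[1, 1], labels=[0, 1, 2])
print(result)
```

With equal weights this is simply the mean of the two probability tables followed by an argmax per row.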

Then I tried to turn this function into a pyspark UDF and call it on the features column of the test dataframe qa to predict the class labels.

estimators1 = [S1, S2]

# where S1 and S2 are Spark pipeline models: pipeline(featurizer, PCA, logistic regression and naive Bayes)

w = [1,1]

label_encoder = [0, 1, 2]

def predictestimator(X, label_encoder, estimators=estimators1, weights=w):

    # Predict 'soft' voting with probabilities

    p1 = np.asarray([clf.predict_proba(X) for clf, X in zip(estimators, X_list)])
    p2 = np.average(p1, axis=0, weights=weights)
    p = np.argmax(p2, axis=1)

    # Convert integer predictions to original labels:
    return label_encoder.inverse_transform(p)

from pyspark.sql.functions import udf

udf1 = udf(predictestimator)

qa = featurizer.transform(test)  # qa is a pyspark dataframe holding the image features:
# DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>, features: vector]

qa.withColumn("predictedlabel", udf1("features")).show()  # running this statement produces the error

The error I get:

PicklingError: Could not serialize object: TypeError: can't pickle dict_keys objects

【Question comments】:

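I think one of your edits may have already fixed the problem? (If there is an error message, could you add it to the post as well?)
I've added the error to the post: PicklingError: Could not serialize object: TypeError: can't pickle dict_keys objects
Hey Anu, it looks like the problem is in how dict_keys gets serialized. Please check the workaround in the answer below.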

Tags: python apache-spark pyspark user-defined-functions voting

【Solution 1】:

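I found out why this isn't working. Python 3 changed how dict_keys works: dict.keys() now returns a view object instead of a list, and a view object cannot be pickled. Check out this very good explanation.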
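The behaviour can be reproduced without Spark. The sketch below uses only the standard pickle module; the dictionary and the helper try_pickle are hypothetical names for the illustration. The same limitation presumably applies when Spark serializes a UDF together with the objects it references.

```python
import pickle

# Hypothetical mapping standing in for whatever object holds a dict view.
mapping = {"a": 0, "b": 1, "c": 2}


def try_pickle(obj):
    """Round-trip obj through pickle; return the TypeError if it fails."""
    try:
        return pickle.loads(pickle.dumps(obj))
    except TypeError as exc:
        return exc


# In Python 3, keys() returns a dict_keys view, which pickle rejects.
view_result = try_pickle(mapping.keys())

# Materializing the view as a list gives pickle an ordinary object.
list_result = try_pickle(list(mapping.keys()))

print(type(view_result).__name__)
print(list_result)
```

So one workaround, assuming you control the object that stores the view, is to wrap it in list(...) before anything that gets shipped to the executors refers to it.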

【Comments】: