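[Posted]: 2021-12-26 21:56:33
[Question]:
I am trying to implement a voting classifier in pyspark.
I wrote a function, predict_from_multiple_estimator. Its arguments are estimators1, a list of pipeline models trained and fitted in pyspark; X, the test DataFrame; the possible class labels; and the weight values.
I then tried to convert this function into a pyspark UDF and call it on the features column of the test DataFrame qa to predict the class labels.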
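For reference, the weighted soft-voting step described above can be sketched outside Spark on plain Python lists. This is only an illustration of the arithmetic, not the code from the question: soft_vote and the probability values are made up, and the loop mirrors what np.average followed by np.argmax would compute.

```python
def soft_vote(probas, weights, labels):
    """Weighted soft voting.

    probas[m][i][k] is the probability that model m assigns to class k for row i.
    Returns one label per row: the class with the highest weighted mean probability.
    """
    total = float(sum(weights))
    n_rows = len(probas[0])
    n_classes = len(labels)
    predictions = []
    for i in range(n_rows):
        # Weighted average of the per-model probabilities for row i
        avg = [
            sum(w * model[i][k] for w, model in zip(weights, probas)) / total
            for k in range(n_classes)
        ]
        # Index of the largest averaged probability, mapped back to a label
        best = max(range(n_classes), key=avg.__getitem__)
        predictions.append(labels[best])
    return predictions


# Made-up probabilities for two models, two rows, three classes
p_s1 = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
p_s2 = [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]
print(soft_vote([p_s1, p_s2], weights=[1, 1], labels=[0, 1, 2]))  # [0, 2]
```

The sketch only needs per-row probabilities as input, so it does not have to hold a reference to the fitted models themselves.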
import numpy as np
from pyspark.sql.functions import udf

estimators1 = [S1, S2]
# where S1 and S2 are Spark pipeline models: pipeline(featurizer, pca, logistic regression and naive Bayes)
w = [1, 1]
label_encoder = [0, 1, 2]

def predictestimator(X, label_encoder, estimators=estimators1, weights=w):
    # Predict 'soft' voting with probabilities
    p1 = np.asarray([clf.predict_proba(X) for clf, X in zip(estimators, X_list)])
    p2 = np.average(p1, axis=0, weights=weights)
    p = np.argmax(p2, axis=1)
    # Convert integer predictions to original labels:
    return label_encoder.inverse_transform(p)

udf1 = udf(predictestimator)
qa = featurizer.transform(test)  # qa is a pyspark DataFrame that holds the image features
# qa is DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>, features: vector]
qa.withColumn("predictedlabel", udf1("features")).show()  # running this statement produces the error
The error I get:
PicklingError: Could not serialize object: TypeError: can't pickle dict_keys objects
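The TypeError inside this message can be reproduced without Spark. The sketch below uses only the standard pickle module; FakeModel and can_pickle are hypothetical stand-ins, and it is an assumption (not confirmed by the traceback) that the fitted pipeline models captured by the UDF hold a dict_keys view somewhere. Spark serializes a UDF together with its default arguments, so anything reachable from estimators=estimators1 has to be picklable.

```python
import pickle


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


class FakeModel:
    """Hypothetical stand-in for an object that keeps a dict view internally."""

    def __init__(self):
        self.params = {"pca_k": 10, "max_iter": 100}
        self.param_names = self.params.keys()  # dict_keys view, not picklable


# Default arguments are stored on the function object and travel with it
def predict(x, estimators=(FakeModel(), FakeModel())):
    return x


print(can_pickle({"a": 1}.keys()))        # False: a dict_keys view cannot be pickled
print(can_pickle(list({"a": 1}.keys())))  # True: a plain list of the same keys can
print(can_pickle(predict.__defaults__))   # False: the defaults drag the models along
```

If this is indeed the cause, one direction to try is to keep the models out of the UDF entirely: call each pipeline model's transform on the DataFrame first, and pass only the resulting probability columns to the function that does the voting.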
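[Comments]:
-
I think one of your edits may have already fixed the problem? (If there is an error message, could you add it to the post as well?)
-
Added the error to the post: PicklingError: Could not serialize object: TypeError: can't pickle dict_keys objects
-
Hey Anu, it looks like the problem is how dict_keys gets serialized. Please check the workaround in the answer below.
Tags: python apache-spark pyspark user-defined-functions voting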