[Posted]: 2020-08-23 09:01:50
[Question]:
I have an RDD that looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
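For each distinct value of the second field (City/Metro), I want to compute the mean and the maximum of the last value in each row (1000, 2000, etc.). I am using the following code to collect the values for "City":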
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
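However, this raises an error, probably because of the string values in that column (e.g. "Unkown").
How can I filter out the rows whose last value is a non-numeric string or is missing (i.e. keep only the rows that can be converted to a number), and then compute the maximum and the mean?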
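One possible approach, sketched below, is to wrap the conversion in a `try`/`except` so that rows with a non-numeric or missing last field are dropped, and then to aggregate a (max, sum, count) triple per key. The helper names (`parse_row`, `add_value`, `merge`, `finish`, `summarize`) are made up for this illustration; only `flatMap`, `aggregateByKey`, `mapValues` and `collectAsMap` are PySpark RDD methods. The sample at the bottom runs the same helpers on a plain Python list, so it can be tried without a Spark session.

```python
def parse_row(row):
    """Return [(category, value)] if the last field is numeric, else []."""
    try:
        return [(row[1], float(row[3]))]
    except (IndexError, TypeError, ValueError):
        # Missing field, None, or a non-numeric string such as "Unkown"/"XXXX"
        return []

# Per-key accumulator: (running max, running sum, count)
ZERO = (float("-inf"), 0.0, 0)

def add_value(acc, value):
    """Fold one numeric value into an accumulator."""
    return (max(acc[0], value), acc[1] + value, acc[2] + 1)

def merge(a, b):
    """Combine two accumulators (e.g. from different partitions)."""
    return (max(a[0], b[0]), a[1] + b[1], a[2] + b[2])

def finish(acc):
    """Turn an accumulator into the final statistics."""
    return {"max": acc[0], "mean": acc[1] / acc[2]}

def summarize(rdd):
    """Assumed usage on a PySpark RDD of list-like rows."""
    return (rdd.flatMap(parse_row)          # drops unparseable rows
               .aggregateByKey(ZERO, add_value, merge)
               .mapValues(finish)
               .collectAsMap())

# Local check of the same logic on a plain list (no Spark needed).
rows = [
    ["3331/587", "Metro", "1235", "1000"],
    ["1234/232", "City", "8479", "2000"],
    ["5987/215", "Metro", "1111", "Unkown"],
    ["8794/215", "Metro", "1112", "1000"],
    ["1254/951", "City", "6598", "XXXX"],
    ["1584/951", "City", "1548", "Unkown"],
    ["1833/331", "Metro", "1009", "2000"],
    ["2213/987", "City", "1197"],  # last field missing
]

accumulators = {}
for row in rows:
    for key, value in parse_row(row):
        accumulators[key] = add_value(accumulators.get(key, ZERO), value)

result = {key: finish(acc) for key, acc in accumulators.items()}
print(result)
```

With an actual RDD, `summarize(rdd)` should return the same kind of dictionary keyed by "City" and "Metro". Note that Python's `float()` also accepts strings such as "nan" and "inf"; if those can occur in the data, an extra check (for example `math.isfinite`) would be needed.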
[Discussion]:
Tags: apache-spark filter types pyspark rdd