在 pyspark 上使用 map reduce答案

【问题标题】：Using map reduce on pyspark在 pyspark 上使用 map reduce
【发布时间】：2020-10-15 17:58:03
【问题描述】：

假设我在 pyspark 中有这个：

data = [{"age":1,"count":10},{"age":2,"count":20},{"age":3,"count":30}]

rdd = sc.parallelize( data )

如果“年龄”大于 2，我想让“计数”+ 10。像这样：

data = [{"age":1,"count":10},{"age":2,"count":20},{"age":3,"count":40}]

如何使用 map reduce 实现这一点？

【问题讨论】：

【解决方案1】：

可能有更好的解决方案。这个对我有用。

def add_count(x):
    x['count']+=10
    return x
    
new_data = list(map(lambda x: x if x['age']<=2 else add_count(x), data))

【讨论】：

【解决方案2】：

您可以转换为Dataframe，这更容易

df = rdd.toDF()

df.withColumn("count", when(df['age'] > 2, df['count'] + 10).otherwise(df['count'])).show(truncate=False)

输出：

+---+-----+
|age|count|
+---+-----+
|1  |10   |
|2  |30   |
|3  |40   |
+---+-----+

【讨论】：