PySpark sort values: answers

[Question title]: PySpark sort values
[Posted]: 2021-02-14 16:06:53
[Question]:

I have the following data:

[(u'ab', u'cd'),
 (u'ef', u'gh'),
 (u'cd', u'ab'),
 (u'ab', u'gh'),
 (u'ab', u'cd')]

I want to run a map-reduce over this data and count how often each pair occurs.

This is the result I get:

[((u'ab', u'cd'), 2),
 ((u'cd', u'ab'), 1),
 ((u'ab', u'gh'), 1),
 ((u'ef', u'gh'), 1)]

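As you can see, this is not right: (u'ab', u'cd') should be counted 3 times rather than 2, because (u'cd', u'ab') is the same pair.

My question is how to make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair. I was thinking of sorting the values within each row, but I could not find a way to do it.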
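
To make the mismatch concrete, here is a minimal plain-Python sketch (no Spark involved) of why the two orderings end up under different keys. The variable names and the `canonical` helper are illustrative assumptions, not part of the original post.

```python
from collections import Counter

# Same sample data as above (plain str instead of u'' literals).
pairs = [("ab", "cd"), ("ef", "gh"), ("cd", "ab"), ("ab", "gh"), ("ab", "cd")]

# Tuples compare element by element, so order matters.
print(("cd", "ab") == ("ab", "cd"))  # False

# Counting the raw tuples reproduces the result shown above.
raw_counts = Counter(pairs)
print(raw_counts[("ab", "cd")], raw_counts[("cd", "ab")])  # 2 1

# Hypothetical helper: sorting a pair gives both orderings the same key.
def canonical(pair):
    return tuple(sorted(pair))

print(canonical(("cd", "ab")) == canonical(("ab", "cd")))  # True
```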

[Question comments]:

Tags: apache-spark sorting pyspark mapreduce rdd

[Solution 1]:

You can sort the values within each pair and then use reduceByKey to count the pairs:

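# rdd is the RDD of pairs from the question.
# Sort each pair so both orderings share one key, then sum the 1s per key.
rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1))\
    .reduceByKey(lambda a, b: a + b)

rdd1.collect()
# The order of the collected elements may vary between runs.
# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]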
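
For readers without a Spark session at hand, the same map-then-reduce logic can be mimicked locally. This is only a rough sketch: `reduce_by_key` is a hypothetical stand-in written for illustration, not PySpark's implementation, and it ignores partitioning entirely.

```python
# Same sample data as in the question (plain str instead of u'' literals).
pairs = [("ab", "cd"), ("ef", "gh"), ("cd", "ab"), ("ab", "gh"), ("ab", "cd")]

def reduce_by_key(records, func):
    """Hypothetical local stand-in for reduceByKey: fold values that share a key."""
    merged = {}
    for key, value in records:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# "map" step: canonical (sorted) pair as the key, 1 as the value.
mapped = [(tuple(sorted(p)), 1) for p in pairs]

# "reduce" step: add up the 1s for each key.
counts = reduce_by_key(mapped, lambda a, b: a + b)
print(counts)
# {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1}
```

As with reduceByKey itself, the combining function should be associative and commutative, since Spark may merge partial results in any order.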

[Comments]:

[Solution 2]:

You can key each record by its sorted elements and then count by key:

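# keyBy pairs each record with its sorted form as the key;
# countByKey returns the per-key counts to the driver as a dict.
result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()

print(result)
# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})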

To convert the result to a sorted list, you can do:

result2 = sorted(result.items())

print(result2)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]
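
If the list should be ordered by frequency rather than by key, one possible variation is to sort on the negated count and fall back to the key for ties. The sketch below assumes a plain dict with the same contents as the countByKey output above, and `by_frequency` is an illustrative name.

```python
# Plain dict standing in for the countByKey result shown above.
result = {("ab", "cd"): 3, ("ef", "gh"): 1, ("ab", "gh"): 1}

# Most frequent pairs first; ties are broken by the pair itself.
by_frequency = sorted(result.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_frequency)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]
```

For this particular sample the output happens to match the key-sorted list, because the most frequent pair is also the smallest key.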

[Comments]: