【Posted】: 2017-04-21 10:19:32
【Question】:
Consider the following snippet (Spark 2.1 running on Python 2.7):
from pyspark import SparkContext

nums = range(0, 10)
with SparkContext("local[2]") as sc:
    # Distribute the ten numbers across the default number of partitions
    rdd = sc.parallelize(nums)
    print("Number of partitions: {}".format(rdd.getNumPartitions()))
    print("Partitions structure: {}".format(rdd.glom().collect()))
    # Reshuffle the data into five partitions
    rdd2 = rdd.repartition(5)
    print("Number of partitions: {}".format(rdd2.getNumPartitions()))
    print("Partitions structure: {}".format(rdd2.glom().collect()))
The output is:
Number of partitions: 2
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
Number of partitions: 5
Partitions structure: [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [], [], []]
Why isn't the data distributed across all partitions after repartitioning?
【Discussion】:
Tags: python apache-spark pyspark