[Posted]: 2019-02-25 03:55:43
[Question]:
When I print the first element of my RDD:
print("input = {}".format(input.take(1)[0]))
The result I get is: (u'motor', [0.001,..., 0.9])
[0.001,..., 0.9] is a list.
The input RDD contains 53304100 elements.
My problem arises when I try to broadcast the input RDD as follows:
brod = sc.broadcast(input.collect())
This raises the following exception (I only show the first part of it):
WARN TaskSetManager: Lost task 56.0 in stage 1.0 (TID 176, 172.16.140.144, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
TypeError: <lambda>() missing 1 required positional argument: 'document'
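A note on reading this traceback: the TypeError is raised in worker.py while a task is running, so it most likely comes from a lambda in an earlier transformation rather than from sc.broadcast itself. Transformations are lazy, and take(1) may only evaluate a few partitions, whereas collect() evaluates all of them. The sketch below reproduces the same message in plain Python, using the built-in lazy map as a stand-in for an RDD transformation. The sample record, the two-parameter lambda, and all names are assumptions for illustration; the real upstream code is not shown in the question.

```python
# Hypothetical record, shaped like the element printed in the question.
records = [("motor", [0.001, 0.9])]

# Assumed upstream step: a lambda declared with two parameters.
# map() passes each record as ONE argument (the whole tuple).
two_params = lambda word, document: (word, len(document))

# Built-in map is lazy, so nothing fails on this line.
lazy = map(two_params, records)

try:
    # Forcing evaluation, comparable to calling collect() on an RDD.
    list(lazy)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# A single parameter that is unpacked in the body accepts the tuple as-is.
one_param = lambda pair: (pair[0], len(pair[1]))
fixed = list(map(one_param, records))
print(fixed)
```

If the real pipeline contains a lambda of this shape, rewriting it to take one argument and index into it (or unpack it inside a named function) would be the corresponding change.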
[Discussion]:
Tags: apache-spark pyspark