【发布时间】:2022-01-15 23:26:06
【问题描述】:
我是学习 py spark 的新手,所以请原谅这个非常基本但不详细的问题。我正在尝试 sc 读取 .tsv 文件,然后解析该文件。但是,当我尝试对其执行 .take() 读取文件后,它给了我以下错误,我无法理解。我在 Windows 上运行它。下面是代码:
print("TEST 1")
rdd = sc.textFile(tsv_path)
print("TEST 2", rdd.take(1))
rdd = rdd.map(lambda line: (line.split('\t')[0], line.split('\t')[1], line.split('\n')[2]))
print("TEST 3")
rdd = rdd.collect()
print("TEST 4")
print("Test:", rdd.take(1))
print(type(rdd))
这是我得到的错误:
TEST 1
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-5-231258a7df7b> in <module>
----> 1 outdegree("graph.tsv", "q1_out/")
<ipython-input-4-c31bb2bfe1ae> in outdegree(tsv_path, out_dir)
5 print("TEST 1")
6 rdd = sc.textFile(tsv_path)
----> 7 print("TEST 2", rdd.take(1))
8 rdd = rdd.map(lambda line: (line.split('\t')[0], line.split('\t')[1], line.split('\n')[2]))
9 print("TEST 3")
~\anaconda3\lib\site-packages\pyspark\rdd.py in take(self, num)
1566
1567 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1568 res = self.context.runJob(self, takeUpToNumLeft, p)
1569
1570 items += res
~\anaconda3\lib\site-packages\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
1225 # SparkContext#runJob.
1226 mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1227 sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
1228 return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
1229
~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1307
1308 answer = self.gateway_client.send_command(command)
-> 1309 return_value = get_return_value(
1310 answer, self.gateway_client, self.target_id, self.name)
1311
~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOP-BI16OVUR executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\Users\user\anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 481, in main
RuntimeError: Python in worker has different version 3.9 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
任何帮助将不胜感激!谢谢!
【问题讨论】:
标签: python windows apache-spark pyspark