【问题标题】:Getting an error for PySpark when I try to run .take()当我尝试运行 .take() 时出现 PySpark 错误
【发布时间】:2022-01-15 23:26:06
【问题描述】:

我是学习 py spark 的新手,所以请原谅这个非常基本但不详细的问题。我正在尝试 sc 读取 .tsv 文件,然后解析该文件。但是,当我尝试对其执行 .take() 读取文件后,它给了我以下错误,我无法理解。我在 Windows 上运行它。下面是代码:

print("TEST 1")
rdd = sc.textFile(tsv_path)
print("TEST 2", rdd.take(1))
rdd = rdd.map(lambda line: (line.split('\t')[0], line.split('\t')[1], line.split('\n')[2]))
print("TEST 3")

rdd = rdd.collect()
print("TEST 4")

print("Test:", rdd.take(1))
print(type(rdd))

这是我得到的错误:

TEST 1
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-5-231258a7df7b> in <module>
----> 1 outdegree("graph.tsv", "q1_out/")

<ipython-input-4-c31bb2bfe1ae> in outdegree(tsv_path, out_dir)
      5     print("TEST 1")
      6     rdd = sc.textFile(tsv_path)
----> 7     print("TEST 2", rdd.take(1))
      8     rdd = rdd.map(lambda line: (line.split('\t')[0], line.split('\t')[1], line.split('\n')[2]))
      9     print("TEST 3")

~\anaconda3\lib\site-packages\pyspark\rdd.py in take(self, num)
   1566 
   1567             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1568             res = self.context.runJob(self, takeUpToNumLeft, p)
   1569 
   1570             items += res

~\anaconda3\lib\site-packages\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
   1225         # SparkContext#runJob.
   1226         mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1227         sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   1228         return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
   1229 

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1307 
   1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
   1310             answer, self.gateway_client, self.target_id, self.name)
   1311 

~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOP-BI16OVUR executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Users\user\anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 481, in main
RuntimeError: Python in worker has different version 3.9 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

任何帮助将不胜感激!谢谢!

【问题讨论】:

标签: python windows apache-spark pyspark


【解决方案1】:

似乎您在驱动程序和工作程序上有不同的 python 版本。 驱动程序和工作人员需要具有相同的 python 版本。 在您的驱动机器中,您需要使用 Python 3.9 版本而不是 3.8

在您的驱动程序/工作人员上找到 Python 版本 3.9 并导出以下变量

export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9

【讨论】:

  • 所以我能够在 nootbook 中使用: %set_env PYSPARK_PYTHON=/usr/bin/python3.9 %set_env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9 但是,我仍然有那个第一个错误:调用 z:org.apache.spark.api.python.PythonRDD.runJob 时出错。 :org.apache.spark.SparkException:作业因阶段失败而中止:阶段 0.0 中的任务 0 失败 1 次,最近一次失败:阶段 0.0 中丢失任务 0.0 (TID 0)。
猜你喜欢
  • 2020-02-11
  • 1970-01-01
  • 2022-11-25
  • 2017-08-30
  • 2018-11-23
  • 2019-09-23
  • 2018-09-03
  • 2023-02-04
  • 2012-05-26
相关资源
最近更新 更多