当我尝试运行 .take() 时出现 PySpark 错误答案

【问题标题】：Getting an error for PySpark when I try to run .take()当我尝试运行 .take() 时出现 PySpark 错误
【发布时间】：2022-01-15 23:26:06
【问题描述】：

我是学习 py spark 的新手，所以请原谅这个非常基本但不详细的问题。我正在尝试 sc 读取 .tsv 文件，然后解析该文件。但是，当我尝试对其执行 .take() 读取文件后，它给了我以下错误，我无法理解。我在 Windows 上运行它。下面是代码：

print("TEST 1")
rdd = sc.textFile(tsv_path)
print("TEST 2", rdd.take(1))
rdd = rdd.map(lambda line: (line.split('\t')[0], line.split('\t')[1], line.split('\n')[2]))
print("TEST 3")

rdd = rdd.collect()
print("TEST 4")

print("Test:", rdd.take(1))
print(type(rdd))

这是我得到的错误：

TEST 1
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-5-231258a7df7b> in <module>
----> 1 outdegree("graph.tsv", "q1_out/")

<ipython-input-4-c31bb2bfe1ae> in outdegree(tsv_path, out_dir)
      5     print("TEST 1")
      6     rdd = sc.textFile(tsv_path)
----> 7     print("TEST 2", rdd.take(1))
      8     rdd = rdd.map(lambda line: (line.split('\t')[0], line.split('\t')[1], line.split('\n')[2]))
      9     print("TEST 3")

~\anaconda3\lib\site-packages\pyspark\rdd.py in take(self, num)
   1566 
   1567             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1568             res = self.context.runJob(self, takeUpToNumLeft, p)
   1569 
   1570             items += res

~\anaconda3\lib\site-packages\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
   1225         # SparkContext#runJob.
   1226         mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1227         sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   1228         return list(_load_from_socket(sock_info, mappedRDD._jrdd_deserializer))
   1229 

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1307 
   1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
   1310             answer, self.gateway_client, self.target_id, self.name)
   1311 

~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOP-BI16OVUR executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Users\user\anaconda3\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 481, in main
RuntimeError: Python in worker has different version 3.9 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

任何帮助将不胜感激！谢谢！

【问题讨论】：

这可能会帮助how-do-i-set-the-drivers-python-version-in-spark

标签： python windows apache-spark pyspark

【解决方案1】：

似乎您在驱动程序和工作程序上有不同的 python 版本。驱动程序和工作人员需要具有相同的 python 版本。在您的驱动机器中，您需要使用 Python 3.9 版本而不是 3.8

在您的驱动程序/工作人员上找到 Python 版本 3.9 并导出以下变量

export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9

【讨论】：

所以我能够在 nootbook 中使用： %set_env PYSPARK_PYTHON=/usr/bin/python3.9 %set_env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9 但是，我仍然有那个第一个错误：调用 z:org.apache.spark.api.python.PythonRDD.runJob 时出错。：org.apache.spark.SparkException：作业因阶段失败而中止：阶段 0.0 中的任务 0 失败 1 次，最近一次失败：阶段 0.0 中丢失任务 0.0 (TID 0)。