【Question title】: Why does PySpark fail with random "Socket is closed" error?
【Posted】: 2017-12-18 13:12:10
【Question】:

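I just went through a PySpark training course, and I'm compiling a script of the example lines of code (which explains why the code block does nothing useful). Every time I run this code, I get this error once or twice. The line that throws it changes between runs. I've tried setting spark.executor.memory and spark.executor.heartbeatInterval, but the error persists. I've also tried putting .cache() at the end of various lines, with no change.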

Error:

16/09/21 10:29:32 ERROR Utils: Uncaught exception in thread stdout writer for python
java.net.SocketException: Socket is closed
        at java.net.Socket.shutdownOutput(Socket.java:1551)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply$mcV$sp(PythonRDD.scala:344)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply(PythonRDD.scala:344)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply(PythonRDD.scala:344)
        at org.apache.spark.util.Utils$.tryLog(Utils.scala:1870)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:344)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

Code:

from pyspark import SparkConf, SparkContext

def parseLine(line):
    fields = line.split(',')
    return (int(fields[0]), float(fields[2]))

def parseGraphs(line):
    fields = line.split()
    return (fields[0]), [int(n) for n in fields[1:]]

# putting the [*] after local makes it run one executor on each core of your local PC
conf = SparkConf().setMaster("local[*]").setAppName("MyProcessName")

sc = SparkContext(conf = conf)

# parse the raw data and map it to an rdd.
# each item in this rdd is a tuple
# two methods to get the exact same data:
########## All of these methods can use lambda or full methods in the same way ##########
# read in a text file
customerOrdersLines = sc.textFile("file:///SparkCourse/customer-orders.csv")
customerOrdersRdd = customerOrdersLines.map(parseLine)
customerOrdersRdd = customerOrdersLines.map(lambda l: (int(l.split(',')[0]), float(l.split(',')[2])))
print customerOrdersRdd.take(1)

# countByValue groups identical values and counts them
salesByCustomer = customerOrdersRdd.map(lambda sale: sale[0]).countByValue()
print salesByCustomer.items()[0]

# use flatMap to cut everything up by whitespace
bookText = sc.textFile("file:///SparkCourse/Book.txt")
bookRdd = bookText.flatMap(lambda l: l.split())
print bookRdd.take(1)

# create key/value pairs that will allow for more complex uses
names = sc.textFile("file:///SparkCourse/marvel-names.txt")
namesRdd = names.map(lambda line: (int(line.split('\"')[0]), line.split('\"')[1].encode("utf8")))
print namesRdd.take(1)

graphs = sc.textFile("file:///SparkCourse/marvel-graph.txt")
graphsRdd = graphs.map(parseGraphs)
print graphsRdd.take(1)

# this will append "extra text" to each name.
# this is faster than a normal map because it doesn't give you access to the keys
extendedNamesRdd = namesRdd.mapValues(lambda heroName: heroName + "extra text")
print extendedNamesRdd.take(1)

# not the best example because the costars is already a list of integers
# but this should return a list, which will update the values
flattenedCostarsRdd = graphsRdd.flatMapValues(lambda costars: costars)
print flattenedCostarsRdd.take(1)

# put the heroes in ascending index order
sortedHeroes = namesRdd.sortByKey()
print sortedHeroes.take(1)

# to sort heroes by alphabetical order, we switch key/value to value/key, then sort
alphabeticalHeroes = namesRdd.map(lambda (key, value): (value, key)).sortByKey()
print alphabeticalHeroes.take(1)

# make sure that "spider" is in the name of the hero
spiderNames = namesRdd.filter(lambda (id, name): "spider" in name.lower())
print spiderNames.take(1)

# reduce by key keeps the key and performs aggregation methods on the values.  in this example, taking the sum
combinedGraphsRdd = flattenedCostarsRdd.reduceByKey(lambda value1, value2: value1 + value2)
print combinedGraphsRdd.take(1)

# broadcast: this is accessible from any executor
sentData = sc.broadcast(["this can be accessed by all executors", "access it using sentData"])

# accumulator:  this is synced across all executors
hitCounter = sc.accumulator(0)

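【Comments】:

  • Can you tell at which step it returns the error? Do any of the print statements work?
  • You may be confusing the source port and the destination port. The default connection pattern is Any (available) >> Target Port; if the default port is 80, you may not be able to connect to port 80. I strongly recommend checking the client and server connection with Wireshark.
  • What is the Spark version? Can you start pyspark and type some commands without errors? It's Windows, isn't it? How is the code above executed?
  • Is Python installed on your machine?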

Tags: apache-spark pyspark


【Answer 1】:

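Disclaimer: I have not spent enough time on that part of the Spark codebase, but let me give you some hints that may lead to a solution. What follows explains where to look for more information rather than how to fix the problem.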


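The exception you are facing is caused by some other issue, as seen in the code here (as you can see from the line java.net.Socket.shutdownOutput(Socket.java:1551), i.e. it is thrown when worker.shutdownOutput() is executed).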

16/09/21 10:29:32 ERROR Utils: Uncaught exception in thread stdout writer for python
java.net.SocketException: Socket is closed
        at java.net.Socket.shutdownOutput(Socket.java:1551)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply$mcV$sp(PythonRDD.scala:344)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply(PythonRDD.scala:344)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply(PythonRDD.scala:344)
        at org.apache.spark.util.Utils$.tryLog(Utils.scala:1870)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:344)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

That leads me to believe that the ERROR is a follow-up to some other, earlier error.
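To illustrate why this kind of error tends to be a symptom rather than a cause, here is a minimal sketch in plain Python (not Spark code). It assumes a local socket pair stands in for the connection between Spark and the Python worker, and the helper name shutdown_after_close is made up for this example. Shutting down the output side of a socket that has already been closed fails, which is roughly the situation the stack trace reports.

```python
import socket


def shutdown_after_close():
    """Return True if shutting down an already-closed socket raises an error.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    near_end, far_end = socket.socketpair()
    # Stand-in for an earlier failure that already closed the connection.
    near_end.close()
    try:
        # Rough analogue of worker.shutdownOutput() on the JVM side.
        near_end.shutdown(socket.SHUT_WR)
    except socket.error:
        # The socket was closed before we got here, so this call fails.
        return True
    finally:
        far_end.close()
    return False
```

Calling shutdown_after_close() shows the failure on the shutdown call, even though the real problem (the close) happened earlier. That is why the earlier log lines are the interesting ones.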

stdout writer for python is the name of the thread that (uses the EvalPythonExec physical operator and) is responsible for the communication between Spark and pyspark (so you can execute Python code without many changes).

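In fact, the scaladoc of EvalPythonExec gives quite a lot of information about the underlying communication infrastructure that pyspark uses internally, which relies on sockets to talk to an external Python process.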

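Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process, and combining the result from the Python process with the original row.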
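The round trip described in that scaladoc can be sketched in plain Python. This is only an illustration under loud assumptions: a thread plays the role of the external Python process, the names recv_all, fake_python_worker and evaluate are invented for this example, and the "evaluation" simply doubles each value. It is not how Spark serializes data.

```python
import socket
import threading


def recv_all(sock):
    """Read from the socket until the other side signals end of stream."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def fake_python_worker(conn):
    """Hypothetical stand-in for the external Python process: doubles each number."""
    try:
        numbers = recv_all(conn).decode("ascii").split()
        results = [str(int(n) * 2) for n in numbers]
        conn.sendall("\n".join(results).encode("ascii"))
    finally:
        conn.close()


def evaluate(rows):
    """Send the projected values over a socket and join results with the rows."""
    driver_side, worker_side = socket.socketpair()
    worker = threading.Thread(target=fake_python_worker, args=(worker_side,))
    worker.start()
    try:
        # Only the needed column (the value) is sent, not the whole row.
        payload = "\n".join(str(value) for _, value in rows)
        driver_side.sendall(payload.encode("ascii"))
        # Signal end of input; the analogue of shutdownOutput() in the trace.
        driver_side.shutdown(socket.SHUT_WR)
        outputs = recv_all(driver_side).decode("ascii").split()
    finally:
        driver_side.close()
        worker.join()
    # Combine each result with its original row.
    return [(key, value, int(out)) for (key, value), out in zip(rows, outputs)]
```

For example, evaluate([("a", 1), ("b", 2)]) returns [("a", 1, 2), ("b", 2, 4)]. If the socket were closed before the shutdown call, the writer side would fail at that point, much like the error in the question.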

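Moreover, python is used by default unless overridden using PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON (as you can see in the pyspark shell script here and here). That is the name that appears in the name of the failing thread.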

16/09/21 10:29:32 ERROR Utils: Uncaught exception in thread stdout writer for python
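As a rough sketch of that fallback order, the helper below reports which interpreter names would be picked for a given environment. It is a simplification and an assumption on my part: the function name resolve_pyspark_interpreters is hypothetical, the real shell script handles more cases, and other Spark releases may use a different default than "python".

```python
import os


def resolve_pyspark_interpreters(env=None):
    """Return (driver_python, worker_python) for the given environment mapping.

    Simplified, assumed fallback order: PYSPARK_PYTHON defaults to "python",
    and PYSPARK_DRIVER_PYTHON defaults to whatever PYSPARK_PYTHON resolved to.
    """
    env = os.environ if env is None else env
    worker_python = env.get("PYSPARK_PYTHON") or "python"
    driver_python = env.get("PYSPARK_DRIVER_PYTHON") or worker_python
    return driver_python, worker_python
```

Calling resolve_pyspark_interpreters() with no arguments inspects the current environment, which is a quick way to see whether either variable is set before launching pyspark.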

I'd recommend checking the version of python on your system with the following command.

python -c 'import sys; print(sys.version_info)'

That should be Python 2.7+, but it could be that you are using the very latest Python, which is not well tested with Spark. Just guessing...
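The same check can be done from inside a script. The sketch below is only a convenience wrapper around sys.version_info; the function name python_version_summary is made up here, and the 2.7 minimum is taken from the guess above rather than from any official compatibility table.

```python
import sys


def python_version_summary(version_info=None, minimum=(2, 7)):
    """Return (version_string, meets_minimum) for the given version tuple.

    The default minimum of (2, 7) mirrors the guess in this answer and is an
    assumption, not an official Spark requirement.
    """
    info = sys.version_info if version_info is None else version_info
    major, minor = info[0], info[1]
    return "%d.%d" % (major, minor), (major, minor) >= tuple(minimum)
```

Run it with the same interpreter that pyspark picks up; a different python on the PATH can give a different result than the one Spark actually launches.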


You should include the entire log of the pyspark application's execution; that is where I'd expect to find the answer.

【Comments】:
