【Posted】: 2018-05-21 18:43:04
【Question】:
I am running standalone spark-2.3.0-bin-hadoop2.7 in a Docker container.
- df1 = 5 rows
- df2 = 10 rows
Both datasets are very small.
df1 schema: DataFrame[id: bigint, name: string]
df2 schema: DataFrame[id: decimal(12,0), age: int]
Inner join:
df3 = df1.join(df2, df1.id == df2.id, 'inner')
df3 schema: Dataframe[id:bigint, name:string, age: int]
When I run df3.show(5), I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/apache/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 466, in collect
    port = self._jdf.collectToPython()
  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/apache/spark-2.3.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o43.collectToPython.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
Following this suggestion, I tried setting the broadcast timeout to -1, but I got the same error:
conf = SparkConf().set("spark.sql.broadcastTimeout","-1")
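The line above only builds a SparkConf object; the question does not show it being passed to the running session. As a rough sketch, one way to rule that out is to apply the settings through the session builder and then read them back. The names `SPARK_SQL_SETTINGS`, `build_session`, and `check_settings` are hypothetical helpers, and `spark.sql.autoBroadcastJoinThreshold` is an assumption added here (it is not in the question); setting it to -1 turns off automatic broadcast joins, which may help tell whether the broadcast step itself is the problem.

```python
# Settings to try. Only the first one comes from the question.
SPARK_SQL_SETTINGS = {
    # From the question: a negative value is meant to disable the broadcast timeout.
    "spark.sql.broadcastTimeout": "-1",
    # Assumption, not in the question: -1 disables automatic broadcast joins.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}


def build_session(settings, app_name="broadcast-debug"):
    """Create (or reuse) a SparkSession with the given SQL settings applied."""
    # Imported lazily so the settings can be inspected without starting Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def check_settings(spark, settings):
    """Return the settings whose live value differs from what was requested."""
    return {
        key: spark.conf.get(key)
        for key, expected in settings.items()
        if spark.conf.get(key) != expected
    }
```

Calling `spark = build_session(SPARK_SQL_SETTINGS)` and then `check_settings(spark, SPARK_SQL_SETTINGS)` should return an empty dict if both settings actually reached the session; any entry that comes back shows the value the session is really using.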
【Comments】:
- Can you call df1.show() and df2.show() without errors?
- Yes, df1.show() and df2.show() work just fine.
Tags: python apache-spark pyspark apache-spark-sql