【Question title】: How to fix Py4JJavaError while using .toPandas() function?
【Posted】: 2019-12-30 20:16:17
【Question】:

I'm new to PySpark and I'm trying to use the word_tokenize() function. Here is my code:

import nltk
from nltk import word_tokenize
import pandas as pd

df_pd = df2.select("*").toPandas()
df2.select('text').apply(word_tokenize)
df_pd.show()

I'm using JDK 1.8, Python 3.7, and Spark 2.4.3.

Can you tell me what I'm doing wrong and how to fix it? The code below this part runs fine without any errors.

I got this message:


Py4JJavaError: An error occurred while calling o106.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 330, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


 and more....

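【Comments】:

  • It looks like your executor/driver doesn't have enough memory. How large is the dataset, and can you share the command you use to submit the job?

Tags: java python apache-spark pyspark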


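【Solution 1】:

toPandas() collects the entire DataFrame into driver memory, so it is only suitable for small datasets. As suggested in the comments, you are most likely getting this error because the JVM ran out of heap space (java.lang.OutOfMemoryError: Java heap space).

Try limiting the size of your dataset: df_pd = df2.limit(10).select("*").toPandas()

Then apply your function and run .head(10) to check whether the memory error goes away.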
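The steps above can be sketched as a small helper. This is only an illustration, not the asker's final code: the name tokenize_sample and its parameters are hypothetical, and it assumes df2 has a string column named text, as in the question. Two things in the original snippet would also fail once the memory error is gone: a Spark DataFrame has no .apply() method, and a pandas DataFrame has no .show() method, so the tokenizer is applied to the pandas column and the result is inspected with .head().

```python
def tokenize_sample(spark_df, tokenize, text_col="text", n_rows=10):
    """Collect at most n_rows of one column to the driver, then tokenize in pandas.

    spark_df : a Spark DataFrame (df2 in the question)
    tokenize : any callable mapping a string to a list of tokens,
               e.g. nltk.word_tokenize
    """
    # select() and limit() run inside Spark, so only n_rows values of a
    # single column are shipped to the driver by toPandas().
    pdf = spark_df.select(text_col).limit(n_rows).toPandas()

    # pdf is a pandas DataFrame: use Series.apply here, not the Spark API.
    pdf["tokens"] = pdf[text_col].apply(tokenize)
    return pdf


# Usage, assuming df2 from the question and that NLTK's tokenizer data
# is already installed:
#
#   from nltk import word_tokenize
#   df_pd = tokenize_sample(df2, word_tokenize, n_rows=10)
#   print(df_pd.head(10))
```

If this runs cleanly on 10 rows but fails on the full DataFrame, the problem is the amount of data being collected rather than the tokenizer itself.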
