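[Posted]: 2017-11-02 15:49:43
[Question]:
I have a Spark cluster with two workers; every node has 16 GB of RAM. I am reading data from S3 into Spark using sparklyr's spark_read_csv with memory = TRUE (code below), but most of the data spills to disk even though there should be enough memory. RStudio Server is installed on the same node as the Spark master. Any idea why this happens, and whether this is optimal? How can I tune it? Thanks!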
flightsFull <- spark_read_csv(sc, "flights_spark",
                              path = "/s3fs/mypath/multipleFiles",
                              header = TRUE,
                              memory = TRUE,
                              columns = list(
                                Year = "character",
                                Month = "character",
                                DayofMonth = "character",
                                DayOfWeek = "character",
                                DepTime = "character",
                                CRSDepTime = "character",
                                ArrTime = "character",
                                CRSArrTime = "character",
                                UniqueCarrier = "character",
                                FlightNum = "character",
                                TailNum = "character",
                                ActualElapsedTime = "character",
                                CRSElapsedTime = "character",
                                AirTime = "character",
                                ArrDelay = "character",
                                DepDelay = "character",
                                Origin = "character",
                                Dest = "character",
                                Distance = "character",
                                TaxiIn = "character",
                                TaxiOut = "character",
                                Cancelled = "character",
                                CancellationCode = "character",
                                Diverted = "character",
                                CarrierDelay = "character",
                                WeatherDelay = "character",
                                NASDelay = "character",
                                SecurityDelay = "character",
                                LateAircraftDelay = "character"),
                              infer_schema = FALSE)
Edit: adding the contents of the configuration files
spark-defaults.conf
spark.master=spark://ip-host.eu-west-1.compute.internal:7077
spark.jars=/opt/bluedata/bluedata-dtap.jar
spark.executor.extraClassPath=/opt/bluedata/bluedata-dtap.jar
spark.driver.extraClassPath=/opt/bluedata/bluedata-dtap.jar
spark-env.sh
SPARK_MASTER_HOST=ip-host.eu-west-1.compute.internal
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=32768m
[Comments]:
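-
I can't reproduce the problem. Could you add your Spark configuration files, spark-defaults.conf and spark-env.sh?
-
Could you also describe your environment? How do you spark-submit this application?
-
sparklyr simply calls CACHE TABLE, so this is probably an environment or defaults issue.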
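One default that may be worth checking, offered as a sketch rather than a confirmed diagnosis: on a standalone cluster, SPARK_WORKER_MEMORY only limits how much memory a worker can hand out to applications, while each executor's heap is set by spark.executor.memory, which defaults to 1g when it is not set. The posted spark-defaults.conf does not set it, so the cached table could be spilling for that reason. Separately, the posted spark-env.sh gives each worker 32768m although the nodes are described as having 16 GB. The value below is an assumption for a 16 GB node, not a tested setting, and would be added to spark-defaults.conf:

```
# Assumed value for illustration only; tune for the actual cluster.
# Heap size per executor. Spark falls back to 1g when this is unset.
spark.executor.memory=10g
```

After restarting the application, the Executors and Storage tabs of the Spark UI should show how much memory each executor actually has and how much of the cached table sits in memory versus on disk.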
Tags: r apache-spark apache-spark-sql sparklyr