【Question title】: SparkR: Stage X contains a task of very large size
【Posted】: 2018-03-18 10:33:51
【Question】:

I get this warning when calling createOrReplaceTempView on an R data frame:

createOrReplaceTempView(as.DataFrame(products), "prod")

Should I ignore this warning? Is this inefficient?

Thanks!

【Question comments】:

    Tags: apache-spark sparkr


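    【Solution 1】:

    These are only warnings. If you want to try to avoid them, repartition the data and call an action on it before you register the temp table and run anything against it. Note that repartitioning will cause a shuffle.

    For example: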

    set.seed(123)
    df <- data.frame(thing1 = rnorm(100000), thing2 = rep("ThisIsAString", 100000), stringsAsFactors = FALSE)
    sdf <- SparkR::createDataFrame(df)  # produces the warning for me
    SparkR::getNumPartitions(sdf)  # 1 partition
    sdf <- SparkR::repartition(sdf, numPartitions = 4L)  # repartition; will cause a shuffle
    SparkR::getNumPartitions(sdf)  # Spark now knows to repartition the data; the shuffle runs once an action (e.g. counting the rows) is called
    SparkR::cache(sdf)  # lazy: nothing has happened yet
    SparkR::nrow(sdf)  # action: triggers the repartition and a count; the warning appears here
    SparkR::createOrReplaceTempView(sdf, "sdfTable")  # register a temp view, as in your example
    
    res <- SparkR::sql("SELECT thing1, thing2 FROM sdfTable WHERE thing1 > 0.5")  # SQL query
    SparkR::nrow(res)  # no warnings; 31002 observations found
    SparkR::getNumPartitions(res)  # 4 partitions in the result
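
    As a possible alternative to repartitioning afterwards, the local data can be split into several partitions when it is first handed to Spark, so that each task carries less data. The sketch below is only an illustration under assumptions: `suggest_partitions` and `register_local_df` are hypothetical helpers (not part of SparkR), the 100 KB target mirrors the recommended task size that Spark 2.x mentions in this warning, `object.size` is only a rough proxy for the serialized size, and your SparkR version is assumed to accept the `numPartitions` argument of `createDataFrame`.

```r
# Hypothetical helper (not part of SparkR): choose a partition count so that
# each partition of a local data.frame stays near a target in-memory size.
suggest_partitions <- function(df, target_bytes = 100 * 1024) {
  # In-memory size of the R object; a rough stand-in for serialized size.
  total_bytes <- as.numeric(utils::object.size(df))
  n <- ceiling(total_bytes / target_bytes)
  # Never fewer than 1 partition, never more partitions than rows.
  as.integer(max(1, min(n, nrow(df))))
}

# Hypothetical wrapper: nothing touches Spark until this function is called.
# Assumes an active SparkR session and a createDataFrame() that supports
# the numPartitions argument.
register_local_df <- function(df, view_name) {
  n <- suggest_partitions(df)
  sdf <- SparkR::createDataFrame(df, numPartitions = n)
  SparkR::createOrReplaceTempView(sdf, view_name)
  invisible(sdf)
}
```

    With a running session, `register_local_df(df, "sdfTable")` would register the view in one step. Whether this removes the warning entirely depends on the data and the Spark version, so treat the partition count as a starting point rather than a tuned value.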
    

    【Comments】:
