【Question title】: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)] after import
【Posted】: 2021-04-12 11:21:06
【Question description】:

I created this RDD:

scala> val data=sc.textFile("sparkdata.txt")

Then I tried to return the contents of the file:

scala> data.collect

I split the existing data into individual words:

scala> val splitdata = data.flatMap(line => line.split(" "));
scala> splitdata.persist()
scala> splitdata.collect;

Now I am performing the map and reduce operations:

scala> val mapdata = splitdata.map(word => (word,1));
scala> mapdata.collect;
scala> val reducedata = mapdata.reduceByKey(_+_);

To get the result:

scala> reducedata.collect;

When I try to display the first 10 rows:

splitdata.groupByKey(identity).count().show(10)

I get the following error:

<console>:38: error: value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
       splitdata.groupByKey(identity).count().show(10)
                 ^
<console>:38: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
       splitdata.groupByKey(identity).count().show(10)
                            ^

【Comments】:

    Tags: scala apache-spark


    【Solution 1】:

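    Like reduceByKey(), groupByKey() is a method of pair RDDs, that is, RDDs of type RDD[(K, V)], not of RDDs in general. That is why the call fails on splitdata, which is an RDD[String]. While reduceByKey() uses the supplied binary function to reduce an RDD[(K, V)] to another RDD[(K, V)], groupByKey() transforms an RDD[(K, V)] into an RDD[(K, Iterable[V])]. To further transform the Iterable[V] for each key, one would typically apply mapValues() (or flatMapValues()) with a supplied function.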

    For example:

    val rdd = sc.parallelize(Seq(
      "apple", "apple", "orange", "banana", "banana", "orange", "apple", "apple", "orange"
    ))
    
    rdd.map((_, 1)).reduceByKey(_ + _).collect
    // res1: Array[(String, Int)] = Array((apple,4), (banana,2), (orange,3))
    
    rdd.map((_, 1)).groupByKey().mapValues(_.sum).take(2)
    // res2: Array[(String, Int)] = Array((apple,4), (banana,2))
    

    If you are only interested in the number of groups after applying groupByKey():

    rdd.map((_, 1)).groupByKey().count()
    // res3: Long = 3
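
    The exact chain from the question, groupByKey(identity).count().show(10), looks like Dataset API usage rather than RDD API usage: Dataset has a groupByKey that takes a key function and a show method, while RDD has neither. The following is a minimal sketch of both routes, assuming a local SparkSession; the object name WordCountTop10 and the sample lines are made up for illustration and are not from the question. In spark-shell, spark and its implicits are already in scope, so only the two marked routes would be needed.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Hypothetical standalone app; the name and sample data are assumptions.
    object WordCountTop10 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WordCountTop10")
          .master("local[*]") // assumption: run locally
          .getOrCreate()
        import spark.implicits._ // needed for toDS() and the String encoder

        // Stand-in for sc.textFile("sparkdata.txt").flatMap(_.split(" "))
        val splitdata = spark.sparkContext
          .parallelize(Seq("apple banana apple", "orange banana apple"))
          .flatMap(line => line.split(" "))

        // Route 1 (RDD): build (word, 1) pairs first, then use pair-RDD methods.
        // take(10) returns at most 10 elements to the driver.
        splitdata.map(word => (word, 1))
          .reduceByKey(_ + _)
          .take(10)
          .foreach(println)

        // Route 2 (Dataset): convert the RDD[String] to a Dataset[String],
        // after which groupByKey(identity).count().show(10) compiles.
        splitdata.toDS()
          .groupByKey(identity)
          .count()
          .show(10)

        spark.stop()
      }
    }
    ```

    The order of the printed rows is not guaranteed in either route, since neither one sorts the result.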
    

    【Discussion】:
