【Posted】: 2016-02-09 17:13:46
【Question】:
I have a .tsv file, pageviews_by_second, made up of the fields timestamp, site, and requests:
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
"2015-03-16T00:19:39" "desktop" 2460
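I want to get rid of the first row, because it causes errors in the operations I have to run on the data.
I tried the following approaches:
1. Filtering the RDD before splitting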
val RDD1 = sc.textFile("pageviews_by_second")
val top_row = RDD1.first()
//returns: top_row: String = "timestamp" "site" "requests"
val RDD2 = RDD1.filter(x => x!= top_row)
RDD2.first()
//returns: "2015-03-16T00:09:55" "mobile" 1595
2. Filtering the RDD after splitting
val RDD1 = sc.textFile("pageviews_by_second").map(_.split("\t"))
RDD1.first() //returns res0: Array[String] = Array("timestamp", "site", "requests")
val top_row = RDD1.first()
val RDD2 = RDD1.filter(x => x!= top_row)
RDD2.first() //returns: res1: Array[String] = Array("timestamp", "site" ,"requests")
val RDD2 = RDD1.filter(x => x(0)!="timestamp" && x(1)!="site" && x(2)!="requests")
RDD2.first() //returns: res1: Array[String] = Array("timestamp", "site" ,"requests")
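A possible explanation for approach 2, sketched in plain Scala collections rather than Spark (the names lines, rows, header and headerCopy are made up for illustration, and tab-separated fields that keep their literal double quotes are assumed): Scala arrays compare by reference with == and !=, and every record in the RDD is a freshly created array, so no row is ever "equal" to top_row. The field-by-field test likely fails for a different reason: the first field is the quoted text "timestamp", quotes included, not the bare word timestamp.

```scala
// Plain-Scala stand-in for the split RDD; no Spark required.
// The sample lines keep their literal double quotes (an assumption about the file).
val lines = Seq(
  "\"timestamp\"\t\"site\"\t\"requests\"",
  "\"2015-03-16T00:09:55\"\t\"mobile\"\t1595",
  "\"2015-03-16T00:10:39\"\t\"mobile\"\t1544"
)
val rows: Seq[Array[String]] = lines.map(_.split("\t"))
val header: Array[String] = rows.head

// A separately split copy of the header, imitating the fresh arrays Spark creates per record.
val headerCopy: Array[String] = lines.head.split("\t")

// != on arrays is a reference comparison, so nothing is filtered out here.
val byReference = rows.filter(x => x != headerCopy)

// sameElements compares the contents, so the header row is dropped.
val byContents = rows.filter(x => !x.sameElements(header))

// The first field still carries its quotes, so it never equals the bare word timestamp.
val firstField = rows.head(0)
```

If this reading is right, replacing x != top_row with !x.sameElements(top_row), or comparing x(0) against "\"timestamp\"", should remove the header in approach 2.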
3. Converting to a DataFrame with a case class and filtering that
case class Wiki(timestamp: String, site: String, requests: String)
val DF = sc.textFile("pageviews_by_second").map(_.split("\t")).map(w => Wiki(w(0), w(1), w(2))).toDF()
val top_row = DF.first()
//returns: top_row: org.apache.spark.sql.Row = ["timestamp","site","requests"]
DF.filter(_ => _ != top_row)
//returns: error: missing parameter type
val DF2 = DF.filter(_ => _ != top_row2)
Why does only the first approach filter out the first row, while the other two do not? In approach 3, why do I get the error, and how can I fix it?
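On the syntax error in approach 3, a minimal sketch in plain Scala (records, topRow, named and placeholder are illustrative names, and a Seq of Wiki values stands in for the DataFrame): `_ => _ != top_row` most likely parses as a function that ignores its argument and returns a second anonymous function, whose parameter type the compiler cannot infer. Naming the parameter, or using a single placeholder without the leading `_ =>`, avoids that.

```scala
// Mirrors the case class from approach 3; a Seq stands in for the DataFrame.
case class Wiki(timestamp: String, site: String, requests: String)

val records = Seq(
  Wiki("\"timestamp\"", "\"site\"", "\"requests\""),
  Wiki("\"2015-03-16T00:09:55\"", "\"mobile\"", "1595"),
  Wiki("\"2015-03-16T00:19:39\"", "\"desktop\"", "2460")
)
val topRow = records.head

// Either name the lambda parameter explicitly...
val named = records.filter(r => r != topRow)

// ...or use the placeholder on its own, with no leading `_ =>`.
val placeholder = records.filter(_ != topRow)
```

Case classes compare by value, so both forms drop the header here. Whether a function-style filter is accepted on a DataFrame depends on the Spark version; where filter only takes a Column or an SQL expression string, the condition would have to be written as a column expression instead.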
【Comments】:
Tags: scala apache-spark apache-spark-sql spark-dataframe