【发布时间】:2017-08-20 07:15:00
【问题描述】:
我是 SparkSQL 的新手。请任何人帮助我。
我的具体问题是,如果我们可以将 RDD hospitalDataText 转换为 DataFrame(使用 .toDF()),其中 hospitalDataText 已经使用 Spark Context 读取了 csv 文件(不使用 sqlContext.read.csv("path"))。
为什么我们不能写 header.toDF() ?如果我试图将变量header RDD 转换为 DataFrame,则会引发错误:value toDF is not a member of String。 为什么? 我的主要目的是我想使用.show()函数查看变量header RDD的数据无法将 RDD 转换为 DataFrame?请检查下面给出的代码! 看起来像双标准 :'(
scala> val hospitalDataText = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = /Users/TheBhaskarDas/Desktop/services.csv MapPartitionsRDD[39] at textFile at <console>:33
scala> val header = hospitalDataText.first() //Remove the header
header: String = uhid,locationid,doctorid,billdate,servicename,servicequantity,starttime,endtime,servicetype,servicecategory,deptname
scala> header.toDF()
<console>:38: error: value toDF is not a member of String header.toDF() ^
scala> val hospitalData = hospitalDataText.filter(a => a != header)
hospitalData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[40] at filter at <console>:37
scala> val m = hospitalData.toDF()
m: org.apache.spark.sql.DataFrame = [value: string]
scala> println(m)
[value: string]
scala> m.show()
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
|32d84f8b9c5193838...|
|213d66cb9aae532ff...|
|222f8f1766ed4e7c6...|
|222f8f1766ed4e7c6...|
|993f608405800f97d...|
|993f608405800f97d...|
|fa14c3845a8f1f6b0...|
|6e2899a575a534a1d...|
|6e2899a575a534a1d...|
|1f1603e3c0a0db5e6...|
|508a4fbea4752771f...|
|5f33395ae7422c3cf...|
|5f33395ae7422c3cf...|
|4ef07783ce800fc5d...|
|70c13902c9c9ccd02...|
|70c13902c9c9ccd02...|
|a950feff6911ab5e4...|
|b1a0d427adfdc4f7e...|
|b1a0d427adfdc4f7e...|
+--------------------+
only showing top 20 rows
scala> m.show(1)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,true)
+--------------------+
| value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row
scala> m.show(1,2)
+-----+
|value|
+-----+
| 32|
+-----+
only showing top 1 row
【问题讨论】:
标签: scala dataframe apache-spark-sql spark-dataframe