Converting an RDD to a DataFrame using .toDF() when the CSV data is read using SparkContext (not sqlContext)

【Question title】: Conversion of RDD to DataFrame using .toDF() When CSV data read using SparkContext (Not sqlContext)
【Posted】: 2017-08-20 07:15:00
【Question description】:

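I am new to SparkSQL; could anyone please help me? My specific question: can the RDD hospitalDataText be converted to a DataFrame (using .toDF()) when hospitalDataText was read from the CSV file with the SparkContext (not with sqlContext.read.csv("path"))? Why can't we write header.toDF()? When I try to convert the header variable (an RDD) to a DataFrame, it throws an error: value toDF is not a member of String. Why? My main goal is to view the data in the header variable with .show(). Why can't this RDD be converted to a DataFrame? Please check the code below. It looks like a double standard :'(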

scala> val hospitalDataText = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = /Users/TheBhaskarDas/Desktop/services.csv MapPartitionsRDD[39] at textFile at <console>:33

scala> val header = hospitalDataText.first() //Remove the header
header: String = uhid,locationid,doctorid,billdate,servicename,servicequantity,starttime,endtime,servicetype,servicecategory,deptname

scala> header.toDF()

<console>:38: error: value toDF is not a member of String
       header.toDF()

              ^

scala> val hospitalData = hospitalDataText.filter(a => a != header)
hospitalData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[40] at filter at <console>:37

scala> val m = hospitalData.toDF()
m: org.apache.spark.sql.DataFrame = [value: string]

scala> println(m)
[value: string]

scala> m.show()
+--------------------+
|               value|
+--------------------+
|32d84f8b9c5193838...|
|32d84f8b9c5193838...|
|213d66cb9aae532ff...|
|222f8f1766ed4e7c6...|
|222f8f1766ed4e7c6...|
|993f608405800f97d...|
|993f608405800f97d...|
|fa14c3845a8f1f6b0...|
|6e2899a575a534a1d...|
|6e2899a575a534a1d...|
|1f1603e3c0a0db5e6...|
|508a4fbea4752771f...|
|5f33395ae7422c3cf...|
|5f33395ae7422c3cf...|
|4ef07783ce800fc5d...|
|70c13902c9c9ccd02...|
|70c13902c9c9ccd02...|
|a950feff6911ab5e4...|
|b1a0d427adfdc4f7e...|
|b1a0d427adfdc4f7e...|
+--------------------+
only showing top 20 rows


scala> m.show(1)
+--------------------+
|               value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row


scala> m.show(1,true)
+--------------------+
|               value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row


scala> m.show(1,2)
+-----+
|value|
+-----+
|   32|
+-----+
only showing top 1 row

【Comments】:

Tags: scala dataframe apache-spark-sql spark-dataframe

【Solution 1】:

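You keep saying that header is an RDD, but the output you posted clearly shows that header is a String. first() does not return an RDD; it returns the first element of the RDD, which here is a String. You cannot call show() on a String, but you can use println.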
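
If you still want to look at the header through .show(), one option is to wrap the String in a collection first, since .toDF() is defined (via the implicits) for RDDs and local Seqs, not for a bare String. The following is only a minimal sketch, assuming Spark 2.x or later with a SparkSession. The object name HeaderToDF, the helpers headerAsRow and headerAsColumns, the column names "header" and "column", and the small in-memory sample standing in for services.csv are all hypothetical and not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object HeaderToDF {

  // Hypothetical helper: a one-row DataFrame holding the whole header line.
  def headerAsRow(spark: SparkSession, header: String): DataFrame = {
    import spark.implicits._ // brings .toDF() into scope for Seq and RDD
    Seq(header).toDF("header")
  }

  // Hypothetical helper: one row per column name found in the header line.
  def headerAsColumns(spark: SparkSession, header: String): DataFrame = {
    import spark.implicits._
    header.split(",").toSeq.toDF("column")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; spark-shell already provides `spark` and `sc`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("HeaderToDF")
      .getOrCreate()

    // Assumed stand-in for sc.textFile(".../services.csv"): a tiny in-memory RDD[String].
    val hospitalDataText = spark.sparkContext.parallelize(Seq(
      "uhid,locationid,doctorid",
      "a1,10,d7",
      "b2,11,d8"
    ))

    // first() returns the first element of the RDD, i.e. a plain String.
    val header: String = hospitalDataText.first()

    // Simplest way to inspect a String.
    println(header)

    // Wrapping the String makes .toDF() and .show() available.
    headerAsRow(spark, header).show(false)
    headerAsColumns(spark, header).show(false)

    spark.stop()
  }
}
```

In spark-shell the same idea reduces to Seq(header).toDF("header").show(false), because the shell normally imports the implicits for you.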

【Comments】:

Oh my god! Thanks for opening my eyes. Thank you :) Sorry for my mistake :( Thanks again. :)