【Question title】: Errors After Converting RDD to Dataframe: "java.lang.String is not a valid external type for schema of int"
【Posted】: 2021-07-05 05:17:27
【Question】:

I am trying to convert an RDD to a DataFrame without using a case class. The CSV file looks like this:

3,193080,De Gea
0,158023,L. Messi
4,192985,K. De Bruyne
1,20801,Cristiano Ronaldo
2,190871,Neymar Jr


val players = sc.textFile("/Projects/Downloads/players.csv").map(line => line.split(',')).map(r => Row(r(1),r(2),r(3)))
# players: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[230] at map at <console>:34

val schema = StructType(List(StructField("id",IntegerType),StructField("age",IntegerType),StructField("name",StringType)))
# schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(age,IntegerType,true), StructField(name,StringType,true))

val playersDF = spark.createDataFrame(players,schema)
# playersDF: org.apache.spark.sql.DataFrame = [id: int, age: int ... 1 more field]

Everything goes fine until I call playersDF.show:

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int 

What can I do?

【Question comments】:

    Tags: apache-spark


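    【Solution 1】:

    You have two problems:

    1) Your indexes are off; Scala arrays are 0-based. Row(r(1),r(2),r(3)) should be Row(r(0),r(1),r(2)).

    2) line.split returns Array[String], while your schema declares the first and second fields as integers. Spark only checks the Row values against the schema when the DataFrame is actually evaluated, which is why the error appears at show rather than at createDataFrame. You need to convert those fields to integers before creating the DataFrame.

    Basically, this is how you should create players: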

    // rdd is the RDD[String] returned by sc.textFile(...); Row is org.apache.spark.sql.Row
    val players = rdd.map(line => line.split(","))
                     .map(r => Row(r(0).toInt, r(1).toInt, r(2)))
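
The toInt calls above throw a NumberFormatException on the first non-numeric value, which would fail the whole job. The following is a minimal sketch of a more defensive parser, assuming that malformed lines should simply be skipped. PlayerParser and parsePlayer are hypothetical names introduced for illustration; they are not part of the original answer or of Spark.

```scala
import scala.util.Try

object PlayerParser {
  // Hypothetical helper: parse one CSV line into typed fields.
  // Returns None when the line has fewer than three fields or when
  // either numeric field cannot be converted to Int.
  def parsePlayer(line: String): Option[(Int, Int, String)] = {
    // Limit the split to 3 parts so a comma inside the name stays in the name.
    line.split(",", 3).map(_.trim) match {
      case Array(id, age, name) =>
        // toInt throws NumberFormatException on non-numeric text; Try captures it.
        Try((id.toInt, age.toInt, name)).toOption
      case _ =>
        None
    }
  }
}
```

In the Spark shell this could be applied as rdd.flatMap(line => PlayerParser.parsePlayer(line).toList).map { case (id, age, name) => Row(id, age, name) }, so that only well-formed rows reach createDataFrame.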
    

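    【Comments】:

      【Solution 2】:

      I think the best option is to supply the schema and read the CSV file with the existing facilities, so that Spark performs the type conversion for you.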

      import org.apache.spark.sql.types._
      
      val playerSchema = StructType(Array(
          StructField("id", IntegerType, true),
          StructField("age", IntegerType, true),
          StructField("name", StringType, true)
      ))
      
      val players = spark
          .sqlContext
          .read
          .format("csv")
          .option("delimiter", ",")
          .schema(playerSchema)
          .load("/mypath/players.csv")
      

      The result is as follows:

      scala> players.show
      +---+------+-----------------+
      | id|   age|             name|
      +---+------+-----------------+
      |  3|193080|           De Gea|
      |  0|158023|         L. Messi|
      |  4|192985|     K. De Bruyne|
      |  1| 20801|Cristiano Ronaldo|
      |  2|190871|        Neymar Jr|
      +---+------+-----------------+
      
      scala> players.printSchema()
      root
       |-- id: integer (nullable = true)
       |-- age: integer (nullable = true)
       |-- name: string (nullable = true)
      
      scala>
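
When the schema is supplied this way, how a value that does not fit its declared type is handled depends on the reader's mode option. The following is a sketch, assuming you would rather have the read fail than silently continue; PlayersCsv and readPlayers are hypothetical names, and the path is whatever you pass in.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object PlayersCsv {
  // Same three columns as the schema in the answer above.
  val playerSchema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true)
  ))

  // Hypothetical helper: read a headerless CSV file with the explicit schema.
  // mode = FAILFAST asks the CSV reader to throw on malformed records;
  // the other documented modes are PERMISSIVE (the default) and DROPMALFORMED.
  def readPlayers(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(playerSchema)
      .option("mode", "FAILFAST")
      .csv(path)
}
```

Calling PlayersCsv.readPlayers(spark, "/mypath/players.csv").show in the Spark shell should then either print the table or raise an error that points at the offending record, since the file is only parsed when an action such as show runs.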
      

      【Comments】:

        【Solution 3】:
        //Input
        StudentId,Name,Address
        101,Shoaib,Anwar Layout
        102,Shahbaz,Sara padlya
        103,Fahad,Munredy padlya
        104,Sana,Tannery Road
        105,Zeeshan,Muslim colony
        106,Azeem,Khusal nagar
        107,Nazeem,KR puram
        
        import org.apache.spark.sql.{Row, SQLContext, types}
        import org.apache.spark.sql.types._
        import org.apache.spark.{SparkConf, SparkContext}
        
        object SparkCreateDFWithRDD {
        
        
          def main(args: Array[String]): Unit = {
        
        
        
            val conf = new SparkConf().setAppName("Creating DF WITH RDD").setMaster("local")
        
            val sc = new SparkContext(conf)
        
            val sqlcontext = new SQLContext(sc)
        
            val rdd = sc.textFile("/home/cloudera/Desktop/inputs/studentDetails1.csv")
        
            val header = rdd.first()
        
            val rddData = rdd.filter(x => x != header).map(x => {
              val arr = x.split(",")
              Row(arr(0).toInt, arr(1), arr(2))
            })
        
            val schemas = StructType(Array(StructField("StudentId",IntegerType,false),
                               StructField("StudentName",StringType,false),StructField("StudentAddress",StringType,true)))
        
        
            val df = sqlcontext.createDataFrame(rddData,schemas)
        
            df.printSchema()
            df.show()
        
          }
        
        }
        
        +---------+-----------+--------------+
        |StudentId|StudentName|StudentAddress|
        +---------+-----------+--------------+
        |      101|     Shoaib|  Anwar Layout|
        |      102|    Shahbaz|   Sara padlya|
        |      103|      Fahad|Munredy padlya|
        |      104|       Sana|  Tannery Road|
        |      105|    Zeeshan| Muslim colony|
        |      106|      Azeem|  Khusal nagar|
        |      107|     Nazeem|      KR puram|
        +---------+-----------+--------------+
        

        【Comments】:
