【Question title】: How to manipulate my DataFrame in Spark?
【Posted】: 2016-10-16 21:59:23
【Question description】:

I have a stream of nested JSON RDDs coming from a Kafka topic. The data looks like this:

{ 
   "time":"sometext1","host":"somehost1","event":
   {"category":"sometext2","computerName":"somecomputer1"}
}

I converted it into a DataFrame, and its schema looks like this:

root
 |-- event: struct (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- computerName: string (nullable = true)
 |-- time: string (nullable = true)
 |-- host: string (nullable = true)

I am trying to save it to a Hive table on HDFS that has a schema like this:

category:string
computerName:string
time:string
host:string

This is my first time using Spark and Scala. I would appreciate it if someone could help me. Thanks.

【Question discussion】:

    Tags: scala hadoop apache-spark dataframe rdd


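    【Solution 1】:
    // Types needed to declare the schema explicitly
    import org.apache.spark.sql.types.{StringType, StructType}

    // Create an RDD holding one sample JSON record
    val vals = sc.parallelize(
      """{"time":"sometext1","host":"somehost1","event":  {"category":"sometext2","computerName":"somecomputer1"}}""" ::
        Nil)
    
    // Declare the schema, including the nested "event" struct
    val schema = (new StructType)
      .add("time", StringType)
      .add("host", StringType)
      .add("event", (new StructType)
        .add("category", StringType)
        .add("computerName", StringType))
    
    // Enables the $"column" syntax used below
    import sqlContext.implicits._
    // Parse the JSON strings into a DataFrame using the declared schema
    val jsonDF = sqlContext.read.schema(schema).json(vals)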
    

    jsonDF.printSchema

    root
     |-- time: string (nullable = true)
     |-- host: string (nullable = true)
     |-- event: struct (nullable = true)
     |    |-- category: string (nullable = true)
     |    |-- computerName: string (nullable = true)
    
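    // Flatten the struct: "event.*" expands to the fields of event
    // (category, computerName), followed by the top-level columns
    val df = jsonDF.select($"event.*", $"time", $"host")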
    

    df.printSchema

    root
     |-- category: string (nullable = true)
     |-- computerName: string (nullable = true)
     |-- time: string (nullable = true)
     |-- host: string (nullable = true)
    

    df.show

    +---------+-------------+---------+---------+
    | category| computerName|     time|     host|
    +---------+-------------+---------+---------+
    |sometext2|somecomputer1|sometext1|somehost1|
    +---------+-------------+---------+---------+
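
The question also asks how to save the flattened result to a Hive table, which the snippet above stops short of. The following is a minimal sketch of that last step, not a definitive implementation. It assumes Spark 2.2 or later with Hive support available on the classpath, launched through spark-submit. The object name `SaveFlattenedEvents` and the table name `events_flat` are hypothetical, and `SaveMode.Append` is only one possible choice of write mode.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveFlattenedEvents {
  def main(args: Array[String]): Unit = {
    // Assumption: Hive support is configured, so saveAsTable registers
    // the table in the Hive metastore rather than a local catalog.
    val spark = SparkSession.builder()
      .appName("SaveFlattenedEvents")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // One sample record shaped like the data in the question
    val json = Seq(
      """{"time":"sometext1","host":"somehost1","event":{"category":"sometext2","computerName":"somecomputer1"}}"""
    ).toDS()

    // Parse the JSON strings; the schema is inferred here for brevity
    val jsonDF = spark.read.json(json)

    // Select the nested fields explicitly so the column order matches
    // the target layout: category, computerName, time, host
    val flat = jsonDF.select(
      $"event.category",
      $"event.computerName",
      $"time",
      $"host")

    // "events_flat" is a hypothetical table name. Append adds rows if
    // the table already exists and creates it otherwise.
    flat.write.mode(SaveMode.Append).saveAsTable("events_flat")

    spark.stop()
  }
}
```

With the older `sqlContext` API used in the answer, the same `write.mode(...).saveAsTable(...)` call applies, provided the `sqlContext` is a `HiveContext`. For the Kafka stream mentioned in the question, this flatten-and-write step would typically run inside `foreachRDD` on each micro-batch.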
    
