【Question title】:How to flatten an RDD that contains a sub-list into the main list
【Posted】:2019-08-08 16:33:37
【Question description】:
val rdd = df.rdd.map(
  line => Row(
    "BNK",
    format.format(Calendar.getInstance().getTime()),
    line(0),
    scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(1)).child.map(_.text).filter(_.nonEmpty)
  )
)

This produces the following output:

 values = {Object[4]@9906} 
 0 = "BNK"
 1 = "18-3-2019"
 2 = "185687194277431.060001"
 3 = {$colon$colon@9910} "::" size = 20
  0 = "KH0010001"
  1 = "-1171035537.00"
  2 = "9"
  3 = "65232"
  4 = "1"
  5 = "KHR"
  6 = "TR"
  7 = "6-54-10-1-005-004"
  8 = "1"
  9 = "1"
  10 = "DC183050001002108"
  11 = "DC"
  12 = "20181101"
  13 = "185687194277431.06"
  14 = "1"
  15 = "1"
  16 = "5022_DMUSER__OFS_DM.OFS.SRC.VAL"
  17 = "1811012130"
  18 = "6012_DMUSER"
  19 = "PL.65232.......1.....KH0010001"

How can I flatten values[3], which is a sub-list with 20 items, into the main list?

So the expected output is:

 values = 
 0 = "BNK"
 1 = "18-3-2019"
 2 = "185687194277431.060001"
 3 = "KH0010001"
 4 = "-1171035537.00"
 5 = "9"
 6 = "65232"
 7 = "1"
 ..

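【Question discussion】:

    Tags: scala apache-spark flatten scala-xml


    【Solution 1】:

    Trying again now that the question has been updated. I think the schema needs to be generated manually, because the values come from a list. Assuming the list always has 20 items: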

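    // Assumes: import org.apache.spark.sql.Row and org.apache.spark.sql.types._
    // 3 fixed columns + 20 list items = 23 columns (0 to 22); the values are strings, hence StringType.
    val schema = StructType((0 to 22).map(x => StructField(x.toString, StringType)))
    val rows = df.rdd.map { line =>
      val xml = scala.xml.XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(1))
      Row.fromSeq("BNK" :: format.format(Calendar.getInstance().getTime()) :: line(0) :: xml.child.map(_.text).filter(_.nonEmpty).toList)
    }
    spark.createDataFrame(rows, schema)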
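
The nesting in the question comes from passing the list as a single argument to `Row(...)`, whereas prepending the fixed fields to the list (as `::` does in the answer) yields one flat sequence. A minimal plain-Scala sketch of that difference, with no Spark involved; the values `fixed` and `items` are hypothetical stand-ins for the three leading fields and the parsed XML child texts:

```scala
// Hypothetical stand-ins for the three leading fields and the parsed XML texts.
val fixed: List[Any] = List("BNK", "18-3-2019", "185687194277431.060001")
val items: List[String] = List("KH0010001", "-1171035537.00", "9")

// Appending the list as ONE element keeps it nested (what the question observes).
val nested: List[Any] = fixed :+ items

// Concatenating the two lists flattens the items into the main list.
val flat: List[Any] = fixed ++ items

println(nested.size) // 4: three fields plus one nested list
println(flat.size)   // 6: three fields plus three items
```

A flat sequence like `flat` is the shape that `Row.fromSeq` expects when each item should become its own column.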
    

    If the list size is not always 20, it needs to be truncated or padded to that length. Hope this helps.
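
One possible way to truncate or pad the list, sketched under assumptions: the helper name `fitToSize` and the empty-string filler are hypothetical and not part of the original answer. A `null` filler may suit a nullable Spark column better.

```scala
// Hypothetical helper: force a list to exactly `size` entries.
// Longer lists are cut off; shorter lists are padded with `filler`.
def fitToSize(items: List[String], size: Int = 20, filler: String = ""): List[String] =
  items.take(size).padTo(size, filler)

val short = fitToSize(List("a", "b"), size = 4)      // padded up to 4 entries
val long  = fitToSize(List("a", "b", "c"), size = 2) // truncated to 2 entries
```

Applying such a helper to the parsed list before calling `Row.fromSeq` would keep every row at the 23 columns the schema declares.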

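    【Discussion】:

    • Could you provide the complete code?
    • I think I misunderstood your question. If you want to flatten the values into different columns, that is not possible, because the schema is specific to the table, not to individual rows.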