【Question title】: Array[Array[String]] to String in a column with Scala and Spark
【Posted】: 2021-02-03 05:43:36
【Question description】:

This is my DataFrame:

+----------+--------------------+--------------------+
|    NewsId|             newsArr|            transArr|
+----------+--------------------+--------------------+
|        26|[Republicans, Sto...|[[R, IH0, P, AH1,...|
|        29|[ISIS, Claims, Re...|[[AY1, S, AH0], [...|
|       474|[Concert, for, Tr...|[[K, AA1, N, S, E...|
|       964|[How, a, Fractiou...|[[HH, AW1], [AH0]...|
|      1677|[Review:, ‘Kong:,...|[[n/a], [n/a], [S...|
|      1697|[The, Rice-Size, ...|[[DH, AH0], [n/a]...|
|      1806|[Populists, Appea...|[[P, AA1, Y, AH0,...|
|      1950|[Uber, Board, Sta...|[[Y, UW1, B, ER0]...|
|      2040|[Health, Bill’s, ...|[[HH, EH1, L, TH]...|
|      2214|[Unmasking, the, ...|[[n/a], [DH, AH0]...|
+----------+--------------------+--------------------+

I want to turn each cell of the "transArr" column into a single string, like this:

+----------+--------------------+--------------+
|    NewsId|             newsArr|      transArr|
+----------+--------------------+--------------+
|        26|[Republicans, Sto...|R IH0 P AH1...|
|        29|[ISIS, Claims, Re...|AY1 S AH0...  |
|       474|[Concert, for, Tr...|K AA1 N S E...|
|       964|[How, a, Fractiou...|HH AW1 AH0... |
|      1677|[Review:, ‘Kong:,...|n/a n/a S...  |
|      1697|[The, Rice-Size, ...|DH AH0 n/a... |
|      1806|[Populists, Appea...|P AA1 Y AH0...|
|      1950|[Uber, Board, Sta...|Y UW1 B ER0...|
|      2040|[Health, Bill’s, ...|HH EH1 L TH...|
|      2214|[Unmasking, the, ...|n/a DH AH0... |
+----------+--------------------+--------------+

Is there a relatively simple solution?

【Question discussion】:

    Tags: arrays scala dataframe apache-spark


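    【Solution 1】:

    Use concat_ws together with flatten: flatten collapses the nested array into a single array of strings, and concat_ws joins its elements with a separator. See the code below.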

    scala> df.printSchema
    root
     |-- data: array (nullable = true)
     |    |-- element: array (containsNull = true)
     |    |    |-- element: string (containsNull = true)
    
    
    scala> df.withColumn("flatten", concat_ws(" ", flatten($"data"))).show(false)
    
    +------------+-------+
    |data        |flatten|
    +------------+-------+
    |[[abc, cdf]]|abc cdf|
    +------------+-------+
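
Applied to the DataFrame in the question, the same idea might look like the sketch below. The object name FlattenTransArr, the helper flattenToString, and the sample rows are made up for illustration (the real transArr values are truncated in the question), and it assumes Spark 2.4 or later, where flatten is available in org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, flatten}

object FlattenTransArr {
  // Replace an array<array<string>> column with one space-separated string.
  // flatten removes one level of nesting; concat_ws joins the elements.
  def flattenToString(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, concat_ws(" ", flatten(col(column))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-transArr")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows shaped like the question's DataFrame (values invented).
    val df = Seq(
      (26, Seq("Republicans", "Stop"), Seq(Seq("R", "IH0", "P"), Seq("S", "T", "AA1", "P"))),
      (1677, Seq("Review:", "Kong"), Seq(Seq("n/a"), Seq("K", "AO1", "NG")))
    ).toDF("NewsId", "newsArr", "transArr")

    // Overwrites transArr in place; the other columns are left untouched.
    flattenToString(df, "transArr").show(false)

    spark.stop()
  }
}
```

Passing the existing column name to withColumn overwrites that column, which matches the desired output in the question; use a different name to keep the original array alongside the string.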
    

    【Discussion】:

    • I got the same result without flatten. Try using only concat_ws and wrapping the array column in col.
    • Sure, I've posted my example.
    【Solution 2】:

    Use concat_ws:

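    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat_ws}
    import spark.implicits._  // assumes an existing SparkSession named spark

    // Sample data: "values" is a flat Array[String] column
    val df: DataFrame = Seq(
      ("a1", Array("2", "3", "5")),
      ("b2", Array("1", "6", "23")),
      ("b1", Array("df", "l2", "14")),
      ("c1", Array("te", "3pa", "gw"))
    ).toDF("key", "values")
    df.show()

    // Join the array elements with a single space, overwriting "values"
    val newDF = df.withColumn("values", concat_ws(" ", col("values")))
    newDF.show()
    newDF.printSchema()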
    

    Output:

    +---+-------------+
    |key|       values|
    +---+-------------+
    | a1|    [2, 3, 5]|
    | b2|   [1, 6, 23]|
    | b1| [df, l2, 14]|
    | c1|[te, 3pa, gw]|
    +---+-------------+
    
    +---+---------+
    |key|   values|
    +---+---------+
    | a1|    2 3 5|
    | b2|   1 6 23|
    | b1| df l2 14|
    | c1|te 3pa gw|
    +---+---------+
    
    root
     |-- key: string (nullable = true)
     |-- values: string (nullable = false)
    

    【Discussion】:

    • Your values column has type Array[String], not Array[Array[String]].
    • @Srinivas Yes, you're right. I missed a pair of brackets. In that case, flatten is needed.
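
The built-in flatten function was only added in Spark 2.4. On an older version, one possible fallback is a small UDF that does the flattening in plain Scala. The sketch below is an assumption-laden illustration, not part of either answer: the names NestedJoin, joinNested, joinNestedUdf and withJoined are hypothetical, the null handling is an arbitrary choice, and it assumes a Scala 2.11/2.12 build of Spark, where array columns reach a UDF as Seq.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object NestedJoin {
  // Plain Scala: remove one level of nesting and join with spaces.
  // Null arrays and null elements are skipped; this is an illustrative
  // choice and not necessarily what flatten + concat_ws would do.
  def joinNested(xs: Seq[Seq[String]]): String =
    if (xs == null) null
    else xs.filter(_ != null).flatten.filter(_ != null).mkString(" ")

  // Wrap the function so it can be applied to an array<array<string>> column.
  val joinNestedUdf = udf((xs: Seq[Seq[String]]) => joinNested(xs))

  // Overwrite the given column with its space-joined string form.
  def withJoined(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, joinNestedUdf(col(column)))
}
```

With the question's DataFrame this would be called as NestedJoin.withJoined(df, "transArr"). Where flatten is available, the built-in functions are generally preferable, since Spark cannot optimize through a UDF.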