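【Question title】: Better way to concatenate many columns?
【Posted】: 2019-11-27 14:23:00
【Question description】:

I have 30 columns. 26 of the column names are the letters of the alphabet. I want to combine those 26 columns into a single string column.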

price  dateCreate  volume  country  A  B  C  D  E ..... Z
19     20190501    25      US       1  2  5  6  19      30
49     20190502    30      US       5  4  5  0  34      50

I want this:

price  dateCreate  volume  country  new_col
19     20190501    25      US       "1,2,5,6,19,....30"
49     20190502    30      US       "5,4,5,0,34,50"

I know I can do this:

df.withColumn("new_col", concat($"A", $"B", ...$"Z"))

However, I'd like to know an easier way to concatenate many columns for the next time I run into this. Is there one?

【Comments】:

    Tags: scala apache-spark


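    【Solution 1】:

    As of Spark 2.3.0, you can do this directly in Spark SQL with the concatenation operator `||`. Note that `||` joins its operands without a separator, so the commas have to be written explicitly (for example `A||','||B`).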

    spark.sql("select A||B||C from table");
    

    https://issues.apache.org/jira/browse/SPARK-19951
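Writing out 26 operands by hand is exactly what the question wants to avoid, so the `||` expression can be generated instead. The following is a minimal sketch in plain Python; the helper name `concat_sql_expr` and the table name `my_table` are hypothetical, and the separator is assumed not to contain a single quote.

```python
import string


def concat_sql_expr(columns, sep=","):
    """Build a Spark SQL expression such as A||','||B||','||C."""
    # Place the quoted separator between each pair of column names.
    joiner = "||'{}'||".format(sep)
    return joiner.join(columns)


# The 26 letter-named columns from the question.
expr = concat_sql_expr(list(string.ascii_uppercase))

# `my_table` is a placeholder for a registered table or temp view.
query = (
    "select price, dateCreate, volume, country, "
    "{} as new_col from my_table".format(expr)
)

print(concat_sql_expr(["A", "B", "C"]))  # A||','||B||','||C
```

The resulting `query` string can then be passed to `spark.sql(...)`.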

    【Comments】:

      【Solution 2】:

      Just apply the following to however many columns you want to concatenate:

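      import org.apache.spark.sql.functions.{array, col, concat_ws}
      import spark.implicits._  // needed for toDF and the $"..." column syntax

      val df = Seq((19, 20190501, 24, "US", 1, 2, 5, 6, 19), (49, 20190502, 30, "US", 5, 4, 5, 0, 34))
        .toDF("price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E")

      // Every column after the first four is one to concatenate
      val exprs = df.columns.drop(4).map(col)

      df.select($"price", $"dataCreate", $"volume", $"country",
        concat_ws(",", array(exprs: _*)).as("new_col"))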
      
      
      +-----+----------+------+-------+----------+
      |price|dataCreate|volume|country|   new_col|
      +-----+----------+------+-------+----------+
      |   19|  20190501|    24|     US|1,2,5,6,19|
      |   49|  20190502|    30|     US|5,4,5,0,34|
      +-----+----------+------+-------+----------+
      

      For completeness, here is the PySpark equivalent:

      import pyspark.sql.functions as F

      df = spark.createDataFrame(
          [[19, 20190501, 24, "US", 1, 2, 5, 6, 19], [49, 20190502, 30, "US", 5, 4, 5, 0, 34]],
          ["price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E"])

      # Every column after the first four is one to concatenate
      exprs = df.columns[4:]

      df.select("price", "dataCreate", "volume", "country", F.concat_ws(",", F.array(*exprs)).alias("new_col"))
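Slicing by position only works while the four leading columns stay where they are. As a hedged alternative, the columns to concatenate can be chosen by excluding the ones to keep. The sketch below is plain Python over a list of column names; the helper name `columns_to_concat` is hypothetical.

```python
def columns_to_concat(all_columns, keep):
    """Return every column not listed in `keep`, preserving the original order."""
    keep_set = set(keep)
    return [c for c in all_columns if c not in keep_set]


# Column names from the example DataFrame above.
all_columns = ["price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E"]
keep = ["price", "dataCreate", "volume", "country"]

exprs = columns_to_concat(all_columns, keep)
print(exprs)  # ['A', 'B', 'C', 'D', 'E']
```

With a real DataFrame, `df.columns` would take the place of `all_columns`, and the result can be passed on as `F.concat_ws(",", *exprs)`.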
      

      【Comments】:

        【Solution 3】:

        Maybe you have something like the following in mind:

        Scala

        import org.apache.spark.sql.functions.{col, concat_ws}
        
        // 'A' to 'Z' yields Chars, so convert each one to a String for col()
        val cols = ('A' to 'Z').map(c => col(c.toString))
        
        df.withColumn("new_col", concat_ws(",", cols: _*))
        

        Python

        from pyspark.sql.functions import col, concat_ws
        import string
        
        cols = [col(x) for x in string.ascii_uppercase]
        
        df.withColumn("new_col", concat_ws(",", *cols))
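Both versions assume that all 26 letter columns exist; if one is missing, Spark fails when it tries to resolve that column. A small guard, sketched here in plain Python with the hypothetical helper `letter_columns`, keeps only the letters that are actually present.

```python
import string


def letter_columns(all_columns):
    """Return the single uppercase-letter column names present, in A..Z order."""
    present = set(all_columns)
    return [c for c in string.ascii_uppercase if c in present]


# Example column list where only some of the letters exist.
names = letter_columns(["price", "country", "B", "A", "E"])
print(names)  # ['A', 'B', 'E']
```

With a real DataFrame, `letter_columns(df.columns)` would replace `string.ascii_uppercase` in the list comprehension above.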
        

        【Comments】:
