【Question title】: Match pyspark dataframe column to list and create a new column
【Posted】: 2022-01-26 20:20:02
【Question description】:

I have the following list.

lst=['name','age','country']

I have the following PySpark DataFrame.

column_a   column_b
Aaaa       name,age,subject
Bbbb       name,age,country,subject
Cccc       name,subject,percentage

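I need to compare the list against column_b, check which of the list's values appear in that column, and then create a new column filled with the list values that are present in column_b.

The expected output is shown below.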

column_a column_b                 column_c              
Aaaa     name,age,subject         name,age
Bbbb     name,age,country,subject name,age,country
Cccc     name,subject,percentage  name

【Question discussion】:

    Tags: python dataframe pyspark


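    【Solution 1】:

    Without duplicates

    array_intersect does what you are trying to achieve. Note that the example below stores column_b as an array of strings, not as the comma-separated string shown in the question.

    array_intersect does not keep duplicates: if column_b holds ["name", "name"], column_c will contain "name" only once, i.e. ["name"].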

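    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    
    # Reuse the active session (e.g. in the pyspark shell) or create one
    spark = SparkSession.builder.getOrCreate()
    
    # column_b is an array of strings; "Dddd" is added to show duplicate handling
    data = [("Aaaa", ["name", "age", "subject"],),
            ("Bbbb", ["name", "age", "country", "subject"],),
            ("Cccc", ["name", "subject", "percentage"],),
            ("Dddd", ["name", "name"],),]
    
    df = spark.createDataFrame(data, ("column_a", "column_b",))
    
    # Turn the Python list into literal columns so it can be used as an array column
    lst=['name','age','country']
    lit_lst = [F.lit(v) for v in lst]
    
    # Keep only the elements of column_b that also appear in lst (duplicates dropped)
    df.withColumn("column_c", F.array_intersect(F.col("column_b"), F.array(lit_lst))).show(truncate=False)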
    

    Output

    +--------+-----------------------------+--------------------+
    |column_a|column_b                     |column_c            |
    +--------+-----------------------------+--------------------+
    |Aaaa    |[name, age, subject]         |[name, age]         |
    |Bbbb    |[name, age, country, subject]|[name, age, country]|
    |Cccc    |[name, subject, percentage]  |[name]              |
    |Dddd    |[name, name]                 |[name]              |
    +--------+-----------------------------+--------------------+
    

    Keeping duplicates

    To keep duplicates, you can apply the filter higher-order function instead.

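    from pyspark.sql import functions as F
    
    data = [("Aaaa", ["name", "age", "subject"],),
            ("Bbbb", ["name", "age", "country", "subject"],),
            ("Cccc", ["name", "subject", "percentage"],),
            ("Dddd", ["name", "name"],),]
    
    df = spark.createDataFrame(data, ("column_a", "column_b",))
    
    # Same list as before, as literal columns (needed by F.array below)
    lst=['name','age','country']
    lit_lst = [F.lit(v) for v in lst]
    
    # First put the list into column_c, then overwrite column_c with the
    # elements of column_b found in it; filter keeps repeated elements
    df.withColumn("column_c", F.array(lit_lst))\
      .withColumn("column_c", F.expr("filter(column_b, element -> array_contains(column_c, element))"))\
      .show(truncate=False)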
    

    Output

    +--------+-----------------------------+--------------------+
    |column_a|column_b                     |column_c            |
    +--------+-----------------------------+--------------------+
    |Aaaa    |[name, age, subject]         |[name, age]         |
    |Bbbb    |[name, age, country, subject]|[name, age, country]|
    |Cccc    |[name, subject, percentage]  |[name]              |
    |Dddd    |[name, name]                 |[name, name]        |
    +--------+-----------------------------+--------------------+
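The only row that differs between the two outputs is Dddd. As a rough illustration, the two behaviours can be modelled in plain Python (this is not Spark code, and the function names intersect_distinct and filter_keep_duplicates are hypothetical). The element order of the distinct version is an assumption chosen to match the output shown above.

```python
lst = ['name', 'age', 'country']
allowed = set(lst)

def intersect_distinct(values):
    """Model of the array_intersect result: matching elements, each kept once."""
    seen = set()
    out = []
    for v in values:
        if v in allowed and v not in seen:
            seen.add(v)
            out.append(v)
    return out

def filter_keep_duplicates(values):
    """Model of the filter + array_contains result: every matching element is kept."""
    return [v for v in values if v in allowed]

row = ["name", "name"]
print(intersect_distinct(row))      # ['name']
print(filter_keep_duplicates(row))  # ['name', 'name']
```

For rows without repeated values, such as Aaaa, Bbbb and Cccc, both functions return the same list.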
    

    【Discussion】:
