【Question title】: Replace column values in a PySpark DataFrame based on multiple conditions
【Posted】: 2021-12-13 12:44:40
【Question】:

I have the following PySpark DataFrame:

a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']


df = pd.DataFrame(dict(dcode=a, zone=b))



   dcode   zone
0   480s  North
1   480s   West
2   499s   East
3   499s  North
4   650s   East
5   650s  North
6   702s  North
7   702s   West
8   736s  North
9   736s  South
10  736s   West
11  737s  North
12  737s   West

I want my DataFrame to look like this:

   dcode   zone output
0   480s  North     NW
1   480s   West     NW
2   499s   East       
3   499s  North     NW
4   650s   East       
5   650s  North     NW
6   702s  North       
7   702s   West       
8   736s  North       
9   736s  South       
10  736s   West       
11  737s  North       
12  737s   West  

To get there I am using the following logic, but it does not give the desired result.

  df_ = df.withColumn("output", F.when((F.col("Zone") == "North") | (F.col("Zone") == "West") & (F.col("dcode") != "702s") | (F.col("dcode") != "736s") | (F.col("dcode") != "737s"), "NW"))

I want NW in the output column only when zone is North or West and dcode is not one of 736s, 737s, 702s.

【Question comments】:

    Tags: apache-spark pyspark


    【Solution 1】:

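    Consider converting your pandas df to a Spark DataFrame first, since you are using PySpark syntax. I would then suggest rewriting your code in a more concise and clearer way using isin: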

    from pyspark.sql import functions as F
    df = spark.createDataFrame(df)
    
    df_ = df.withColumn(
        "output",
        F.when(
            F.col("Zone").isin("North", "West") & ~F.col("dcode").isin("736s", "737s", "702s"),
            "NW",
        ).otherwise(""),
    )
    

    >>> df_.show(truncate=False)
    
    +-----+-----+------+
    |dcode|zone |output|
    +-----+-----+------+
    |480s |North|NW    |
    |480s |West |NW    |
    |499s |East |      |
    |499s |North|NW    |
    |650s |East |      |
    |650s |North|NW    |
    |702s |North|      |
    |702s |West |      |
    |736s |North|      |
    |736s |South|      |
    |736s |West |      |
    |737s |North|      |
    |737s |West |      |
    +-----+-----+------+
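
The table above can be cross-checked without Spark. The following is a minimal sketch, not part of the original answer, that restates the same rule in plain Python over the question's data; the names `ALLOWED_ZONES`, `EXCLUDED_DCODES`, and `expected_output` are made up for illustration.

```python
# Plain-Python restatement of the rule, with no Spark dependency.
dcodes = ['480s', '480s', '499s', '499s', '650s', '650s', '702s',
          '702s', '736s', '736s', '736s', '737s', '737s']
zones = ['North', 'West', 'East', 'North', 'East', 'North', 'North',
         'West', 'North', 'South', 'West', 'North', 'West']

ALLOWED_ZONES = {"North", "West"}            # hypothetical name
EXCLUDED_DCODES = {"702s", "736s", "737s"}   # hypothetical name


def expected_output(dcode, zone):
    """Return "NW" when the zone qualifies and the dcode is not excluded, else ""."""
    return "NW" if zone in ALLOWED_ZONES and dcode not in EXCLUDED_DCODES else ""


expected = [expected_output(d, z) for d, z in zip(dcodes, zones)]
print(expected)
```

A list like `expected` could serve as a reference when comparing against the values collected from the Spark column.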
    

    【Comments】:

      【Solution 2】:

      You can use a SQL-style expression directly (the expr function).

      import pyspark.sql.functions as F
      ......
      df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
      ......
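
Note that a SQL `case when` with no `else` branch yields NULL for rows that do not match, so this variant produces null rather than an empty string in those rows. The sketch below, which is an illustration and not part of the original answer, builds the same expression string from Python lists and adds an `else ''` branch; `quote`, `sql_list`, and `build_case_expr` are hypothetical helpers, and the quoting assumes the values contain no single quotes or backslashes.

```python
def quote(value):
    """Wrap a string in single quotes; reject characters this sketch does not escape."""
    if "'" in value or "\\" in value:
        raise ValueError(f"unsupported character in {value!r}")
    return f"'{value}'"


def sql_list(values):
    """Render Python strings as a SQL IN-list such as ('a', 'b')."""
    return "(" + ", ".join(quote(v) for v in values) + ")"


def build_case_expr(zones, excluded_dcodes, label="NW", default=""):
    """Build the CASE expression text, with an explicit ELSE branch."""
    return (
        f"case when zone in {sql_list(zones)} "
        f"and dcode not in {sql_list(excluded_dcodes)} "
        f"then {quote(label)} else {quote(default)} end"
    )


case_expr = build_case_expr(["North", "West"], ["736s", "737s", "702s"])
print(case_expr)
# Intended use, assuming an existing Spark DataFrame named df:
# df = df.withColumn("output", F.expr(case_expr))
```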
      

      【Comments】:

        【Solution 3】:

        Check your parentheses.
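
Why the parentheses matter: in Python, `&` binds more tightly than `|`, and the same precedence applies when these operators combine Spark columns. The sketch below is an illustration using plain booleans in place of columns; it shows how the question's expression is actually grouped. The function names are made up for this example.

```python
def as_written(zone, dcode):
    # The question's condition with its original parentheses.
    return (zone == "North") | (zone == "West") & (dcode != "702s") | (dcode != "736s") | (dcode != "737s")


def as_parsed(zone, dcode):
    # The grouping Python applies: `&` is evaluated before any `|`.
    return (zone == "North") | ((zone == "West") & (dcode != "702s")) | (dcode != "736s") | (dcode != "737s")


zones = ["North", "West", "East", "South"]
dcodes = ["480s", "499s", "702s", "736s", "737s"]

# Both forms agree on every combination, confirming the grouping.
same = all(as_written(z, d) == as_parsed(z, d) for z in zones for d in dcodes)
print(same)

# An East row still satisfies the condition, which is not what was intended.
print(as_written("East", "499s"))
```

Wrapping the two zone comparisons in their own parentheses makes the zone test a single operand of `&`.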

        By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark.

        from pyspark.sql import functions as F
        import pandas as pd
        
        
        a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
        b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
        
        
        df = pd.DataFrame(dict(dcode=a, zone=b))
        
        df_ = spark.createDataFrame(df)
        
        # Group the zone test, then require dcode to differ from every excluded value (& rather than |).
        df_ = df_.withColumn("output", F.when(
            ((F.col("Zone") == "North") | (F.col("Zone") == "West"))
            & ((F.col("dcode") != "702s") & (F.col("dcode") != "736s") & (F.col("dcode") != "737s")),
            "NW"))
        
        df_.show()
        
        +-----+-----+------+
        |dcode| zone|output|
        +-----+-----+------+
        | 480s|North|    NW|
        | 480s| West|    NW|
        | 499s| East|  null|
        | 499s|North|    NW|
        | 650s| East|  null|
        | 650s|North|    NW|
        | 702s|North|  null|
        | 702s| West|  null|
        | 736s|North|  null|
        | 736s|South|  null|
        | 736s| West|  null|
        | 737s|North|  null|
        | 737s| West|  null|
        +-----+-----+------+
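
A note on the dcode part of the condition: whether the inequalities are joined with `|` or `&` changes the result. A value can equal at most one of the three codes, so "differs from 702s OR differs from 736s OR differs from 737s" is true for every dcode, while "not in the list" corresponds to joining the inequalities with AND (De Morgan's law). The following plain-Python sketch, added here for illustration with made-up function names, compares the two.

```python
EXCLUDED = ("702s", "736s", "737s")


def any_differs(dcode):
    # OR of inequalities: true as soon as dcode differs from one excluded value.
    return (dcode != "702s") | (dcode != "736s") | (dcode != "737s")


def all_differ(dcode):
    # AND of inequalities: true only when dcode differs from every excluded value.
    return (dcode != "702s") & (dcode != "736s") & (dcode != "737s")


samples = ["480s", "499s", "702s", "736s", "737s"]
for dcode in samples:
    print(dcode, any_differs(dcode), all_differ(dcode), dcode not in EXCLUDED)
```

In this sketch only `all_differ` agrees with the `not in` test, which is the behaviour the question asks for.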
        
        

        【Comments】:
