【Question title】: Data frame segmentation and dropping
【Posted】: 2021-11-21 21:34:13
【Question】:

I have the following DataFrame in pandas:

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29,'BW',49,59,72,'BW',9,183,17,'txt',2,49,'BW',479,'BW']

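I want to create a new column that takes values from column A based on a condition on column B. The condition: if there is no 'txt' between two consecutive 'BW' entries, those values go into column C. But if there is a 'txt' between two consecutive 'BW' entries, I want to drop all of those values. So the expected output should look like this: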

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29,'BW',49,59,72,'BW',9,183,17,'txt',2,49,'BW',479,'BW']
C = [1,10,23,'BW',24,24,55,'BW',nan,nan,nan,nan,nan,nan,'BW',43,'BW']

I don't know how to do this. Any help is much appreciated.

【Comments】:

    Tags: python pandas dataframe iteration


    【Solution 1】:

    Edit:

    Updated the answer, which was missing the BW values in the final df.

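    import pandas as pd
    import numpy as np
    
    # Numeric stand-ins for the 'BW' and 'txt' markers
    BW = 999
    txt = -999
    A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
    B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
    
    df = pd.DataFrame({'A': A, 'B': B})
    # Label each run of non-BW rows with a group number; BW rows get NaN
    df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
    # Blank out the group that contains txt (.values[0] looks only at the first such group)
    df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.A)
    # Put the BW marker back on the BW rows
    df['C'] = np.where(df['B'] == BW, df['B'], df['C'])
    # Nullable integer dtype, so missing values show as <NA>
    df['C'] = df['C'].astype('Int64')
    df = df.drop('group', axis=1)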
    In [435]: df
    Out[435]: 
         A    B     C
    0    1   24     1
    1   10   23    10
    2   23   29    23
    3   45  999   999 <-- BW
    4   24   49    24
    5   24   59    24
    6   55   72    55
    7   67  999   999 <-- BW
    8   73    9  <NA>
    9   26  183  <NA>
    10  13   17  <NA>
    11  96 -999  <NA> <-- txt is in the middle of BW
    12  53    2  <NA>
    13  23   49  <NA>
    14  24  999   999 <-- BW
    15  43  479    43
    16  90  999   999 <-- BW
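
The `astype('Int64')` call in the code above relies on pandas' nullable integer dtype. As a minimal sketch (the three-value series is made up purely for illustration), this is how a float column containing NaN behaves under that conversion:

```python
import numpy as np
import pandas as pd

# Illustrative float column with one missing value, similar in shape to
# what np.where(..., np.nan, ...) produces.
c = pd.Series([1.0, 999.0, np.nan])

# "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA>
# and the remaining values are stored as integers.
as_int = c.astype("Int64")

print(c.dtype, as_int.dtype)
print(as_int)
```

If plain `np.nan` values are preferred, the column can simply stay as `float64`.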
    

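    You can do it like this, assuming BW and txt are specific values; I have simply filled them with arbitrary numbers so they can be told apart.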

    In [277]: BW = 999
    
    In [278]: txt = -999
    
    In [293]: A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
         ...: B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
    
    In [300]: df = pd.DataFrame({'A': A, 'B': B})
    
    In [301]: df
    Out[301]: 
         A    B
    0    1   24
    1   10   23
    2   23   29
    3   45  999
    4   24   49
    5   24   59
    6   55   72
    7   67  999
    8   73    9
    9   26  183
    10  13   17
    11  96 -999
    12  53    2
    13  23   49
    14  24  999
    15  43  479
    16  90  999
    

    First, let's split the values into groups. Each group gets a unique number and contains the values of B that lie between one BW and the next BW.

    In [321]: df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
    
    In [322]: df
    Out[322]: 
         A    B      group
    0    1   24 0.00000000
    1   10   23 0.00000000
    2   23   29 0.00000000
    3   45  999        NaN
    4   24   49 1.00000000
    5   24   59 1.00000000
    6   55   72 1.00000000
    7   67  999        NaN
    8   73    9 2.00000000
    9   26  183 2.00000000
    10  13   17 2.00000000
    11  96 -999 2.00000000
    12  53    2 2.00000000
    13  23   49 2.00000000
    14  24  999        NaN
    15  43  479 3.00000000
    16  90  999        NaN
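
The grouping idea above can also be sketched with a running count of BW rows, which extends to data with more than one txt. This is an illustrative variant, not the answer's own code: the helper name `drop_txt_segments` is hypothetical, and, like the `group` column above, it assumes the rows before the first BW and after the last BW also count as segments.

```python
import pandas as pd

def drop_txt_segments(df, bw, txt):
    """Return column C: A's values, BW markers kept, txt segments blanked."""
    is_bw = df["B"].eq(bw)
    # Every BW row starts a new segment, so a cumulative count of BW rows
    # gives each segment its own label.
    segment = is_bw.cumsum()
    # True for every row whose segment contains at least one txt.
    has_txt = df["B"].eq(txt).groupby(segment).transform("any")
    c = df["A"].astype("float64")
    c = c.mask(has_txt)    # blank out segments that contain txt
    c = c.mask(is_bw, bw)  # BW rows always carry the BW marker
    return c.astype("Int64")

# Same numeric stand-ins as in the answer
BW = 999
txt = -999
A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, BW, 49, 59, 72, BW, 9, 183, 17, txt, 2, 49, BW, 479, BW]
df = pd.DataFrame({"A": A, "B": B})

result = drop_txt_segments(df, BW, txt)
print(result.tolist())
```

Because the txt check runs per segment, no segment has to be picked out by position.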
    

    Next, using np.where(), we can replace the values according to the condition you set.

    In [360]: df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.A)
    
    In [432]: df
    Out[432]: 
         A    B      group           C
    0    1   24 0.00000000  1.00000000
    1   10   23 0.00000000 10.00000000
    2   23   29 0.00000000 23.00000000
    3   45  999        NaN 45.00000000
    4   24   49 1.00000000 24.00000000
    5   24   59 1.00000000 24.00000000
    6   55   72 1.00000000 55.00000000
    7   67  999        NaN 67.00000000
    8   73    9 2.00000000         NaN
    9   26  183 2.00000000         NaN
    10  13   17 2.00000000         NaN
    11  96 -999 2.00000000         NaN
    12  53    2 2.00000000         NaN
    13  23   49 2.00000000         NaN
    14  24  999        NaN 24.00000000
    15  43  479 3.00000000 43.00000000
    16  90  999        NaN 90.00000000
    

    Here we need to set C back to the value of B wherever B equals BW.

    In [488]: df['C'] = np.where(df['B'] == BW, df['B'], df['C'])
    
    In [489]: df
    Out[489]: 
         A    B      group            C
    0    1   24 0.00000000   1.00000000
    1   10   23 0.00000000  10.00000000
    2   23   29 0.00000000  23.00000000
    3   45  999        NaN 999.00000000
    4   24   49 1.00000000  24.00000000
    5   24   59 1.00000000  24.00000000
    6   55   72 1.00000000  55.00000000
    7   67  999        NaN 999.00000000
    8   73    9 2.00000000          NaN
    9   26  183 2.00000000          NaN
    10  13   17 2.00000000          NaN
    11  96 -999 2.00000000          NaN
    12  53    2 2.00000000          NaN
    13  23   49 2.00000000          NaN
    14  24  999        NaN 999.00000000
    15  43  479 3.00000000  43.00000000
    16  90  999        NaN 999.00000000
    

    Finally, just convert the float column to int and drop the group column, which we no longer need. If you want to keep the missing values as np.nan, skip the conversion to Int64.

    In [396]: df.C = df.C.astype('Int64')
    
    In [397]: df
    Out[397]: 
         A    B      group     C
    0    1   24 0.00000000     1
    1   10   23 0.00000000    10
    2   23   29 0.00000000    23
    3   45  999        NaN   999
    4   24   49 1.00000000    24
    5   24   59 1.00000000    24
    6   55   72 1.00000000    55
    7   67  999        NaN   999
    8   73    9 2.00000000  <NA>
    9   26  183 2.00000000  <NA>
    10  13   17 2.00000000  <NA>
    11  96 -999 2.00000000  <NA>
    12  53    2 2.00000000  <NA>
    13  23   49 2.00000000  <NA>
    14  24  999        NaN   999
    15  43  479 3.00000000    43
    16  90  999        NaN   999
    
    In [398]: df = df.drop('group', axis=1)
    
    In [435]: df
    Out[435]: 
         A    B     C
    0    1   24     1
    1   10   23    10
    2   23   29    23
    3   45  999   999
    4   24   49    24
    5   24   59    24
    6   55   72    55
    7   67  999   999
    8   73    9  <NA>
    9   26  183  <NA>
    10  13   17  <NA>
    11  96 -999  <NA>
    12  53    2  <NA>
    13  23   49  <NA>
    14  24  999   999
    15  43  479    43
    16  90  999   999
    

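    【Comments】:

      【Solution 2】:

      I don't know if this is the most efficient way, but you can create a new column called mask by mapping the values in column B as follows: 'BW' to True, 'txt' to False, and all other values to np.nan.

      Then, if you forward-fill the NaNs in mask, backward-fill the NaNs in mask, and combine the two results logically (set to True whenever either the forward-filled or the backward-filled column is False), you get a column called final_mask in which every value between consecutive BWs that enclose a txt is True.

      You can then use .apply to pick the value of column A only when final_mask is False and column B is not 'BW', pick the value of column B when final_mask is False and column B is 'BW', and use np.nan otherwise.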

      import numpy as np
      import pandas as pd
      
      A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
      B = [24,23,29, 'BW',49,59,72, 'BW',9,183,17, 'txt',2,49,'BW',479,'BW']
      df = pd.DataFrame({'A':A,'B':B})
      
      # True marks a 'BW' row, False marks a 'txt' row, NaN marks everything else
      df["mask"] = df["B"].apply(lambda x: True if x == 'BW' else False if x == 'txt' else np.nan)
      # Propagate the nearest marker forwards and backwards
      # (.ffill()/.bfill() replace the deprecated fillna(method=...))
      df["ffill"] = df["mask"].ffill()
      df["bfill"] = df["mask"].bfill()
      # A row is dropped if the nearest marker on either side is 'txt'
      df["final_mask"] = (df["ffill"] == False) | (df["bfill"] == False)
      
      # Keep A outside dropped segments, keep the 'BW' marker itself, NaN otherwise
      df["C"] = df.apply(lambda x: x['A'] if (
          (x['final_mask'] == False) and (x['B'] != 'BW')) 
          else x['B'] if ((x['final_mask'] == False) and (x['B'] == 'BW')) 
          else np.nan, axis=1
      )
      

      >>> df
           A    B   mask  ffill  bfill  final_mask    C
      0    1   24    NaN    NaN   True       False    1
      1   10   23    NaN    NaN   True       False   10
      2   23   29    NaN    NaN   True       False   23
      3   45   BW   True   True   True       False   BW
      4   24   49    NaN   True   True       False   24
      5   24   59    NaN   True   True       False   24
      6   55   72    NaN   True   True       False   55
      7   67   BW   True   True   True       False   BW
      8   73    9    NaN   True  False        True  NaN
      9   26  183    NaN   True  False        True  NaN
      10  13   17    NaN   True  False        True  NaN
      11  96  txt  False  False  False        True  NaN
      12  53    2    NaN  False   True        True  NaN
      13  23   49    NaN  False   True        True  NaN
      14  24   BW   True   True   True       False   BW
      15  43  479    NaN   True   True       False   43
      16  90   BW   True   True   True       False   BW
      

      Drop the columns we created along the way:

      df.drop(columns=['mask','ffill','bfill','final_mask'])
      
           A    B    C
      0    1   24    1
      1   10   23   10
      2   23   29   23
      3   45   BW   BW
      4   24   49   24
      5   24   59   24
      6   55   72   55
      7   67   BW   BW
      8   73    9  NaN
      9   26  183  NaN
      10  13   17  NaN
      11  96  txt  NaN
      12  53    2  NaN
      13  23   49  NaN
      14  24   BW   BW
      15  43  479   43
      16  90   BW   BW
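
The row-wise `.apply` above can also be written with column-level operations. The following is a rough sketch of that idea under the same forward-fill/backward-fill logic, not a tested replacement. The names `marker` and `in_txt_segment` are introduced here for illustration only, and the markers are encoded as 1.0 and 0.0 rather than True and False so the helper series stays numeric.

```python
import numpy as np
import pandas as pd

A = [1, 10, 23, 45, 24, 24, 55, 67, 73, 26, 13, 96, 53, 23, 24, 43, 90]
B = [24, 23, 29, 'BW', 49, 59, 72, 'BW', 9, 183, 17, 'txt', 2, 49, 'BW', 479, 'BW']
df = pd.DataFrame({'A': A, 'B': B})

is_bw = df['B'] == 'BW'
is_txt = df['B'] == 'txt'

# 1.0 marks a 'BW' row, 0.0 marks a 'txt' row, NaN marks everything else.
marker = pd.Series(np.where(is_bw, 1.0, np.where(is_txt, 0.0, np.nan)),
                   index=df.index)

# A row lies in a dropped segment if the nearest marker before it or
# after it is a 'txt'.
in_txt_segment = marker.ffill().eq(0.0) | marker.bfill().eq(0.0)

# Start from A as an object column so integers and strings can coexist,
# write the 'BW' marker first, then blank out the dropped segments.
c = df['A'].astype(object).mask(is_bw, 'BW')
df['C'] = c.mask(in_txt_segment)

print(df)
```

Writing the 'BW' strings before the NaNs keeps the column mixed-type throughout, so the integers taken from A should not be converted to floats along the way.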
      

      【Comments】:
