【Question title】: Add new columns to a pandas df after filtering
【Posted】: 2018-06-13 07:09:56
【Question】:

I have a df that contains information for different locations.

import pandas as pd

d = ({
    'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
    'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
    'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
    'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],   
    'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],           
    'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
    })

df = pd.DataFrame(data=d)

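I want to export this data split by location, where the location is given in Column D. I also want to add new columns based on the functions labelled in Column B.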

df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()

incl = ['Home', 'Away']    

for k, g in df[df.D.isin(incl)].groupby('D'):
    Stop = g.loc[df['B'] == 'Stop'].reset_index()['C']
    Start = g.loc[df['B'] == 'Start'].reset_index()['C']
    Res = g.loc[df['B'] == 'Res'].reset_index()['C']

    g['Start_diff'] = Start - Stop
    g['Res_diff'] = Start - Res

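The problem is that these functions occur more than once, and each occurrence is numbered in Column F. So if we look at the export for Home, we only get the differences for the first value in Column F.

Output: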

    A   B       C       D       E       F   Start_diff  Res_diff
0   A   Stop    28800   Home    Num:    1   3000        2400
2   A   Res     29400   Home    Num:    1       
6   A   Start   31800   Home    Num:    1       
10  A   Stop    33300   Home    Num:    2       
12  A   Res     33660   Home    Num:    2       
16  A   Start   34800   Home    Num:    2       

Whereas my expected output is:

    A   B       C       D       E       F   Start_diff  Res_diff
0   A   Stop    28800   Home    Num:    1   3000        2400
2   A   Res     29400   Home    Num:    1       
6   A   Start   31800   Home    Num:    1       
10  A   Stop    33300   Home    Num:    2   1500        1140
12  A   Res     33660   Home    Num:    2       
16  A   Start   34800   Home    Num:    2       
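
The gap between the two tables can be traced to index alignment. The sketch below is a minimal illustration, assuming only the Home rows and using the variable names `home`, `stop`, `start` and `diff`, which are not in the original post. After `reset_index`, the differences carry the labels 0 and 1, and assigning them to a frame whose labels are 0, 2, 6, 10, 12, 16 keeps only the value at label 0.

```python
import pandas as pd

# Times (in seconds) for the Home rows only, keeping the original row labels.
home = pd.DataFrame(
    {
        "B": ["Stop", "Res", "Start", "Stop", "Res", "Start"],
        "C": [28800.0, 29400.0, 31800.0, 33300.0, 33660.0, 34800.0],
    },
    index=[0, 2, 6, 10, 12, 16],
)

# reset_index relabels each selection 0, 1, ... so the subtraction pairs by position.
stop = home.loc[home["B"] == "Stop", "C"].reset_index(drop=True)
start = home.loc[home["B"] == "Start", "C"].reset_index(drop=True)
diff = start - stop  # labels 0 and 1

# Column assignment aligns on labels: label 0 exists in home, label 1 does not,
# so the second difference is dropped and the remaining rows become NaN.
home["Start_diff"] = diff
print(home)
```

Both differences are computed correctly; the second one is lost only at the assignment step.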

I tried changing for k, g in df[df.D.isin(incl)].groupby('D'): to for k, g in df[df.D.isin(incl)].groupby('D').F.nunique():

But I get the error TypeError: 'numpy.int64' object is not iterable
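
A minimal sketch of why that loop fails, using a cut-down frame that is an assumption for illustration only: `nunique()` reduces each group to a single count, so iterating the result yields integers rather than (key, sub-frame) pairs. Grouping by both columns is one way to keep the pairs the loop expects.

```python
import pandas as pd

# Hypothetical cut-down frame with only the two grouping columns.
df = pd.DataFrame({
    "D": ["Home", "Home", "Away", "Home"],
    "F": ["1", "1", "1", "2"],
})

# nunique() collapses each group to one number, so the result is a Series of counts.
counts = df.groupby("D").F.nunique()
print(counts)

# Iterating a Series yields its values, i.e. integers, which cannot be unpacked into k, g.
values = list(counts)

# Iterating the groupby object itself yields (key, sub-frame) pairs.
keys = [k for k, g in df.groupby(["D", "F"])]
print(keys)
```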

【Discussion】:

    Tags: python pandas dataframe group-by unique


    【Solution 1】:

    I think you need groupby on columns D and F with a custom function, and then mask to replace the duplicated values:

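    def f(g):
        g = g.copy()  # work on a copy of the group
        # Select within the group itself (g), not the outer df
        Stop = g.loc[g['B'] == 'Stop', 'C']
        Start = g.loc[g['B'] == 'Start', 'C']
        Res = g.loc[g['B'] == 'Res', 'C']
        # Assumes each (D, F) group has at least one Stop, Start and Res row
        g['Start_diff'] = Start.values[0] - Stop.values[0]
        g['Res_diff'] = Start.values[0] - Res.values[0]
        return g

    # group_keys=False keeps the original row index on newer pandas versions
    df = df[df.D.isin(incl)].groupby(['D', 'F'], group_keys=False).apply(f)

    # Keep the differences only on the first row of each (D, F) group
    df[['Start_diff', 'Res_diff']] = df[['Start_diff', 'Res_diff']].mask(df.duplicated(['D','F']))
    print (df)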
              C     D     E  F  A      B  Start_diff  Res_diff
    0   28800.0  Home  Num:  1  A   Stop      3000.0    2400.0
    2   29400.0  Home  Num:  1  A    Res         NaN       NaN
    4   31302.0  Away  Num:  1  A   Stop      2898.0    1800.0
    6   31800.0  Home  Num:  1  A  Start         NaN       NaN
    8   32400.0  Away  Num:  1  A    Res         NaN       NaN
    10  33300.0  Home  Num:  2  A   Stop      1500.0    1140.0
    12  33660.0  Home  Num:  2  A    Res         NaN       NaN
    14  34200.0  Away  Num:  1  A  Start         NaN       NaN
    16  34800.0  Home  Num:  2  A  Start         NaN       NaN
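
As a cross-check on the numbers above, here is a sketch of a different technique that avoids `groupby.apply`: pivot the times to one row per (D, F) block, subtract the columns, and join the result back onto the first row of each block. It is not the answer's method. The names `wide`, `first`, `firsts` and `out` are illustrative, and the frame is rebuilt by hand from the nine filtered rows with C already in seconds.

```python
import pandas as pd

# The nine rows that survive the Home/Away filter, with C already in seconds.
df = pd.DataFrame(
    {
        "C": [28800.0, 29400.0, 31302.0, 31800.0, 32400.0,
              33300.0, 33660.0, 34200.0, 34800.0],
        "D": ["Home", "Home", "Away", "Home", "Away",
              "Home", "Home", "Away", "Home"],
        "F": ["1", "1", "1", "1", "1", "2", "2", "1", "2"],
        "B": ["Stop", "Res", "Stop", "Start", "Res",
              "Stop", "Res", "Start", "Start"],
    },
    index=range(0, 18, 2),
)

# One row per (D, F) block, one column per label found in B.
wide = df.pivot_table(index=["D", "F"], columns="B", values="C", aggfunc="first")
wide["Start_diff"] = wide["Start"] - wide["Stop"]
wide["Res_diff"] = wide["Start"] - wide["Res"]

# Look up the differences for the first row of each block only.
first = ~df.duplicated(["D", "F"])
firsts = df.loc[first, ["D", "F"]].join(wide[["Start_diff", "Res_diff"]], on=["D", "F"])

# Joining on the row index leaves NaN on every other row.
out = df.join(firsts[["Start_diff", "Res_diff"]])
print(out)
```

Because the subtraction happens on the pivoted frame, a block that lacks one of the three labels ends up with NaN instead of raising an IndexError.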
    
