【Question title】: Is there a way to condense data into one row of a Pandas DataFrame when they share one column value?
【Posted】: 2020-12-24 16:58:47
【Question】:

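I have a DataFrame with a few thousand rows. The DF holds unit identifiers and response times for units in my organization. It is structured with the columns ["Event#", "UnitID", "First UnitEnroute", "First UnitArrived", "First UnitAtHospital"].

There are many different rows for the same Event#. In the end I want only one row per Event#, with ["First UnitEnroute", "First UnitArrived", "First UnitAtHospital"] filled in from the other rows that share the same Event#.

The reason for this is a billing failure at the end of the quarter: we need to know whether these events have those 3 times spread across different units. I don't need the units listed, only the first non-0 value pulled from the other rows of the same event.

Here is some sample data: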

Event#      Unit    First UnitEnroute           First UnitArrived           First UnitAtHospital
2020000394    37    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']
2020000394    38    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']     
2020000394    36    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']     
2020000394    39    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']     
2020000617    58    ['1/2/2020', '12:06:13 PM'] ['1/2/2020', '12:07:39 PM'] ['1/2/2020', '12:43:10 PM']     
2020000849    74    ['1/2/2020', '6:42:19 PM']  ['1/2/2020', '6:53:53 PM']  ['1/2/2020', '7:28:32 PM']      
2020000849    75    ['0']                       ['0']                       ['0']
2020000927    81    ['0']                       ['0']                       ['0']
2020000927    80    ['0']                       ['0']                       ['0']
2020000997    86    ['0']                       ['0']                       ['0']
2020000997    87    ['0']                       ['0']                       ['0']
2020001218    99    ['1/3/2020', '11:50:39 AM'] ['1/3/2020', '11:52:40 AM'] ['1/3/2020', '12:29:37 PM']     
2020001218    98    ['0']                       ['1/3/2020', '11:52:40 AM'] ['0']
2020001255    102   ['1/3/2020', '12:44:30 PM'] ['0']                       ['0']
2020001255    103   ['1/3/2020', '12:40:19 PM'] ['0']                       ['0']
2020001258    98    ['1/3/2020', '12:49:00 PM'] ['1/3/2020', '12:57:22 PM'] ['1/3/2020', '1:39:03 PM']      
2020001258    103   ['0']                       ['0']                       ['0']
2020001258    104   ['0']                       ['0']                       ['0']
2020001258    105   ['0']                       ['0']                       ['0']

Here is what I have tried:

  1. Brute-force iterating over every row to find a non-0 value, then appending that value to the row.

     for row in DF:
         compare = list()
         for i in DF:
             if i[0] == row[0]:
                 addition = list(i)
                 compare = compare.append(addition)
                 print("Compare: {}".format(compare))
                 return compare
    
         for el in row.index:
             whatisit = row[el]
             if whatisit == 0:
                 for item in compare.index:
                     if item[el] == 0:
                         return
                     else:
                         replacement = item[el]
                         print("Replacement: {}".format(replacement))
                         return replacement
             row[el] = replacement
             return DF
    
    
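  2. Using pandas.groupby().fillna() to group rows with matching Event#s and backfill the times. I did not get anywhere near an acceptable result. Most of the time it cut my DF into some odd shape. I may be misunderstanding how to use it.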

    DF = DF.groupby("Event#")["Unit", "First UnitEnroute", "First UnitArrived", "First UnitAtHospital"].fillna(method="bfill")

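Any direction is appreciated, and I apologize if this has been posted before; I have spent a lot of time looking for possible answers. I think I just have not yet developed the intuition to look at code and see how to apply it to my project. I am not a professional developer, more of a hands-on, heavy-lifting kind of employee, haha.

【Comments】:

    Tags: python pandas dataframe indexing pandas-groupby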


    【Solution 1】:

    Here is one approach, IIUC.

    from io import StringIO
    import pandas as pd
    
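    # create data frame (the `data` string is defined further below);
    # columns are separated by two or more whitespace characters
    df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')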
    
    # drop the column `Unit`
    df = df.drop(columns='Unit')
    
    # re-shape
    df = df.melt(id_vars='Event#', var_name='first_unit', value_name='timestamp')
    
    # drop timestamp == ['0']
    mask = df['timestamp'].astype(str) != "['0']"
    df = df[mask]
    
    # drop duplicates
    df = df.drop_duplicates()
    
    # get min value for each group (timestamps are still strings here,
    # so this is a text comparison, not a chronological one) -- and re-shape
    df = (df.groupby(['Event#', 'first_unit'])['timestamp'].min()
          .unstack(level='first_unit')
          .reset_index()
         )
    
    print(df)
    
    first_unit      Event#            First UnitArrived  \
    0           2020000394  ['1/1/2020', '10:48:33 PM']   
    1           2020000617  ['1/2/2020', '12:07:39 PM']   
    2           2020000849   ['1/2/2020', '6:53:53 PM']   
    3           2020001218  ['1/3/2020', '11:52:40 AM']   
    4           2020001255                          NaN   
    5           2020001258  ['1/3/2020', '12:57:22 PM']   
    
    first_unit         First UnitAtHospital            First UnitEnroute  
    0           ['1/1/2020', '11:45:01 PM']  ['1/1/2020', '10:45:34 PM']  
    1           ['1/2/2020', '12:43:10 PM']  ['1/2/2020', '12:06:13 PM']  
    2            ['1/2/2020', '7:28:32 PM']   ['1/2/2020', '6:42:19 PM']  
    3           ['1/3/2020', '12:29:37 PM']  ['1/3/2020', '11:50:39 AM']  
    4                                   NaN  ['1/3/2020', '12:40:19 PM']  
    5            ['1/3/2020', '1:39:03 PM']  ['1/3/2020', '12:49:00 PM'] 
    

    Here is the raw data (i.e. what was used to create the data frame):

    data = '''Event#      Unit    First UnitEnroute           First UnitArrived           First UnitAtHospital
    2020000394    37    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']
    2020000394    38    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']     
    2020000394    36    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']     
    2020000394    39    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']     
    2020000617    58    ['1/2/2020', '12:06:13 PM']  ['1/2/2020', '12:07:39 PM']  ['1/2/2020', '12:43:10 PM']     
    2020000849    74    ['1/2/2020', '6:42:19 PM']   ['1/2/2020', '6:53:53 PM']   ['1/2/2020', '7:28:32 PM']      
    2020000849    75    ['0']                        ['0']                        ['0']
    2020000927    81    ['0']                        ['0']                        ['0']
    2020000927    80    ['0']                        ['0']                        ['0']
    2020000997    86    ['0']                        ['0']                        ['0']
    2020000997    87    ['0']                        ['0']                        ['0']
    2020001218    99    ['1/3/2020', '11:50:39 AM']  ['1/3/2020', '11:52:40 AM']  ['1/3/2020', '12:29:37 PM']     
    2020001218    98    ['0']                        ['1/3/2020', '11:52:40 AM']  ['0']
    2020001255    102   ['1/3/2020', '12:44:30 PM']  ['0']                        ['0']
    2020001255    103   ['1/3/2020', '12:40:19 PM']  ['0']                        ['0']
    2020001258    98    ['1/3/2020', '12:49:00 PM']  ['1/3/2020', '12:57:22 PM']  ['1/3/2020', '1:39:03 PM']      
    2020001258    103   ['0']                        ['0']                        ['0']
    2020001258    104   ['0']                        ['0']                        ['0']
    2020001258    105   ['0']                        ['0']                        ['0']
    '''
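
One caveat about the `min()` step above: it compares the timestamp cells as text. A minimal sketch of a chronological comparison follows, assuming every non-zero cell looks like `['1/3/2020', '12:44:30 PM']` (month/day/year, 12-hour clock). The helper name `parse_cell` and the small frame with its values are made up for illustration and are not part of the answer.

```python
from datetime import datetime

import pandas as pd


def parse_cell(cell):
    """Turn "['1/3/2020', '9:05:00 AM']" into a Timestamp; "['0']" becomes NaT."""
    text = str(cell).strip("[]").replace("'", "").replace(",", "")
    if text == "0":
        return pd.NaT
    return pd.Timestamp(datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p"))


# Hypothetical values chosen so that text order and time order disagree.
sample = pd.DataFrame({
    "Event#": [1, 1, 2],
    "First UnitEnroute": [
        "['1/3/2020', '10:00:00 AM']",
        "['1/3/2020', '9:05:00 AM']",
        "['0']",
    ],
})

# Text comparison: '1' sorts before '9', so the 10:00 entry is picked.
as_text = sample.groupby("Event#")["First UnitEnroute"].min()

# Chronological comparison on parsed values; NaT is skipped by min().
parsed = pd.to_datetime(sample["First UnitEnroute"].map(parse_cell))
as_time = parsed.groupby(sample["Event#"]).min()
```

If the earliest time matters, parsing before grouping is the safer route. If any non-zero value will do, the string version is enough.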
    

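    【Comments】:

    • Thanks, this looks like what I need. I am getting a KeyError on 'Event#' at the line df = df.melt(id_vars='Event#', var_name='first_unit', value_name='timestamp'). Any idea why that would be?
    • Can you run 'Event#' in df.columns -- are there embedded spaces in the column names? A KeyError probably means that what you passed as id_vars is not in df.columns.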
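
The check suggested in the comment above can be sketched as follows. The frame is a made-up stand-in whose header has a trailing space, which is only one possible cause of such a KeyError; the asker's actual column names are not known.

```python
import pandas as pd

# Hypothetical frame: the first column name carries a trailing space.
df = pd.DataFrame({"Event# ": [2020000394], "Unit": [37]})

# Membership test from the comment, plus the raw names for inspection.
print("Event#" in df.columns)
print(df.columns.tolist())

# Strip surrounding whitespace from every column name, then re-check.
df.columns = df.columns.str.strip()
has_event_column = "Event#" in df.columns
```

Printing `df.columns.tolist()` shows the names with quotes, so stray spaces become visible.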
    【Solution 2】:

    Since you have a few thousand rows, I suggest processing each column separately and then merging the results back together:

    df1 = df[ df['First UnitEnroute']!="['0']" ][['Event#', 'First UnitEnroute']]
    df1 = df1[~df1.duplicated(['Event#'])]
    df2 = df[ df['First UnitArrived']!="['0']" ][['Event#', 'First UnitArrived']]
    df2 = df2[~df2.duplicated(['Event#'])]
    df3 = df[ df['First UnitAtHospital']!="['0']" ][['Event#', 'First UnitAtHospital']]
    df3 = df3[~df3.duplicated(['Event#'])]
    df_result = df1.merge(df2, on = 'Event#', how='left').merge(df3, on = 'Event#', how='left')
    

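    This way (if I understand the question correctly) you can find the events that have no timestamp for one or more of the First Unit statistics. In your example, that is event 2020001255.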
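
A more compact variant of the same idea, offered as a sketch rather than as the answerer's method: turn the `['0']` placeholders into missing values and let `groupby(...).first()` pick the first non-missing entry per column. It assumes the cells are stored as strings exactly equal to `['0']`; if they are real Python lists, the comparison would need adjusting. The small frame is loosely adapted from the sample data and only uses two of the three time columns.

```python
import pandas as pd

# Illustrative rows; values are adapted from the sample, not copied exactly.
df = pd.DataFrame({
    "Event#": [2020001218, 2020001218, 2020001255, 2020001255, 2020000927],
    "Unit": [99, 98, 102, 103, 81],
    "First UnitEnroute": [
        "['1/3/2020', '11:50:39 AM']", "['0']",
        "['1/3/2020', '12:44:30 PM']", "['1/3/2020', '12:40:19 PM']",
        "['0']",
    ],
    "First UnitArrived": [
        "['0']", "['1/3/2020', '11:52:40 AM']",
        "['0']", "['0']",
        "['0']",
    ],
})

time_cols = ["First UnitEnroute", "First UnitArrived"]

# Replace the "['0']" placeholder with NaN so it counts as missing.
cleaned = df[time_cols].mask(df[time_cols] == "['0']")
cleaned.insert(0, "Event#", df["Event#"])

# first() returns the first non-missing value per column within each group.
result = cleaned.groupby("Event#", sort=False).first().reset_index()
```

Here "first" means first in row order, not earliest in time. Events with no times at all stay in the result with missing values in every time column.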

    【Comments】:
