【Question title】: Pandas dataframe random row selection per group with a boolean condition
【Posted】: 2016-03-21 20:38:55
【Question】:

Suppose I have the following pandas dataframe:

import pandas as pd

df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],
'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})

DATAFRAME1

            date            name
0   2015-01-31 07:14:39     Dave
1   2014-12-16 22:50:55     Lisa
2   2015-04-12 23:29:11     John
3   2015-04-08 17:57:29     Lisa
4   2015-01-30 03:51:12     Simon
5   2015-02-20 10:33:48     Simon
6   2014-12-15 23:54:03     Simon
7   2014-12-16 19:53:53     Simon
8   2014-12-18 00:15:02     Lisa
9   2015-04-01 21:36:55     Dave
10  2015-04-13 23:25:55     Dave
11  2015-02-18 14:10:40     John
12  2015-02-27 04:56:33     Lisa

DATAFRAME2

    name           datemax
0   Dave    2015-04-13 23:25:55
1   John    2015-04-12 23:29:11
2   Lisa    2015-04-08 17:57:29
3   Simon   2015-02-20 10:33:48

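The 'date' and 'datemax' columns are filled with datetime objects.

Grouping DATAFRAME1 by 'name', I need to randomly select one date per name, but the selected date must be earlier than the 'datemax' for that name in the second dataframe (DATAFRAME2).

The real dataframes I'm working with are much larger than this example, so I need a fast way to do this.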

【Comments】:

  • Does it need to be random, or could it just be the first valid date?
  • It has to be random :)

Tags: python datetime pandas group-by


【Solution 1】:

You can use pd.DataFrame.sample, like this:

In [697]: idx = df2.set_index('name').datemax

In [698]: (df1.groupby('name')
              .apply(lambda x: x.loc[x.date < idx[x.name]].sample(1))
              .reset_index(drop=True))
Out[698]:
                 date   name
0 2015-04-01 21:36:55   Dave
1 2015-02-18 14:10:40   John
2 2014-12-18 00:15:02   Lisa
3 2014-12-16 19:53:53  Simon
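
For reference, here is a self-contained sketch of the same idea that builds both frames from the question's data. It iterates over the groups explicitly instead of using `apply`, and it skips any group with no date before its cutoff (an assumption about the desired behaviour, since `sample(1)` raises an error on an empty frame). The names `rng`, `picks` and `result` and the fixed seed are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33",
    ]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48",
    ]),
})

# Lookup Series: name -> cutoff timestamp
idx = df2.set_index("name")["datemax"]

# Fixed seed only so the sketch is reproducible; drop it for real randomness
rng = np.random.RandomState(0)

picks = []
for name, group in df1.groupby("name"):
    # Keep only rows strictly earlier than this name's cutoff
    valid = group[group["date"] < idx[name]]
    if valid.empty:
        continue  # assumption: silently skip names with no valid date
    picks.append(valid.sample(n=1, random_state=rng))

result = pd.concat(picks, ignore_index=True)
print(result)
```

Which rows come out depends on the seed, but each name should appear once with a date strictly before its `datemax`.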

【Comments】:

    【Solution 2】:

    Here is my suggestion:

    import random
    import pandas as pd
    
    df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
    
    df['date'] = pd.to_datetime(df['date'])
    
    df2 = pd.DataFrame([['Dave','2015-04-13 23:25:55'],['John','2015-04-12 23:29:11'],['Lisa','2015-04-08 17:57:29'],['Simon','2015-02-20 10:33:48']])
    
    df2.columns = ['name','datemax']
    
    df2['datemax'] = pd.to_datetime(df2['datemax'])
    
    df = df.merge(df2,how='left')
    
    grouped = df.groupby('name')
    
    grouped.apply(lambda x: random.choice([a for a in x['date'].values if a<x['datemax'].values[0]]))
    

    This took 18 ms, and I'd guess it scales linearly.
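
A possible variation on the merge approach, sketched here under the assumption that pandas 1.1 or newer is available (that is when `DataFrameGroupBy.sample` was added, so it postdates this answer): merge, filter with one vectorized comparison, then let pandas draw one row per group. The names `merged`, `valid` and `picked` and the fixed `random_state` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33",
    ]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48",
    ]),
})

# Attach each row's cutoff, then filter with a single vectorized comparison
merged = df.merge(df2, on="name", how="left")
valid = merged[merged["date"] < merged["datemax"]]

# One random row per name; random_state is fixed only for reproducibility
picked = valid.groupby("name").sample(n=1, random_state=0)[["name", "date"]]
print(picked)
```

This avoids the Python-level list comprehension inside `apply`, which may matter on large frames, though no timings were measured for this sketch.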

    【Comments】:

      【Solution 3】:

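      I would first filter out all the dates that don't satisfy the condition (df2a is not defined in this answer; it is presumably df2 indexed by name, e.g. df2.set_index("name")):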

      In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
      Out[11]:
      0    2015-04-13 23:25:55
      1    2015-04-08 17:57:29
      2    2015-04-12 23:29:11
      3    2015-04-08 17:57:29
      4    2015-02-20 10:33:48
      5    2015-02-20 10:33:48
      6    2015-02-20 10:33:48
      7    2015-02-20 10:33:48
      8    2015-04-08 17:57:29
      9    2015-04-13 23:25:55
      10   2015-04-13 23:25:55
      11   2015-04-12 23:29:11
      12   2015-04-08 17:57:29
      Name: date, dtype: datetime64[ns]
      
      In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
      Out[12]:
      0      True
      1      True
      2     False
      3     False
      4      True
      5     False
      6      True
      7      True
      8      True
      9      True
      10    False
      11     True
      12     True
      Name: date, dtype: bool
      
      In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]
      
      In [14]: df_old
      Out[14]:
                        date   name
      0  2015-01-31 07:14:39   Dave
      1  2014-12-16 22:50:55   Lisa
      4  2015-01-30 03:51:12  Simon
      6  2014-12-15 23:54:03  Simon
      7  2014-12-16 19:53:53  Simon
      8  2014-12-18 00:15:02   Lisa
      9  2015-04-01 21:36:55   Dave
      11 2015-02-18 14:10:40   John
      12 2015-02-27 04:56:33   Lisa
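
The same filter can also be written with `Series.map` instead of a per-group `transform`, which avoids calling a Python lambda for each group. This is a sketch that assumes `df2a` is `df2` indexed by name (the answer does not show its definition); `cutoff` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33",
    ]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48",
    ]),
})

# Assumption: df2a is df2 indexed by name
df2a = df2.set_index("name")

# Look up each row's cutoff by name; the result is aligned with df's index
cutoff = df["name"].map(df2a["datemax"])

# Keep rows strictly earlier than their name's cutoff
df_old = df[df["date"] < cutoff]
print(df_old)
```

On the question's data this keeps the same nine rows as `Out[14]` (index 0, 1, 4, 6, 7, 8, 9, 11, 12).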
      

      Now the problem is much easier: pick a random row per name:

      df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
      
      In [21]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
      Out[21]:
                           date
      name
      Dave  2015-04-01 21:36:55
      John  2015-02-18 14:10:40
      Lisa  2014-12-16 22:50:55
      Simon 2014-12-15 23:54:03
      
      In [22]: df_old.groupby("name").agg(lambda x: x.iloc[np.random.randint(0,len(x))])
      Out[22]:
                           date
      name
      Dave  2015-01-31 07:14:39
      John  2015-02-18 14:10:40
      Lisa  2014-12-18 00:15:02
      Simon 2014-12-16 19:53:53
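
Another way to draw one random row per name, sketched here as an alternative rather than as the answer's method: shuffle the filtered frame once and keep the first row seen for each name. In a uniformly random ordering every row of a group is equally likely to come first, so the pick is uniform within each group. `df_old` is rebuilt from `Out[14]` so the block runs on its own; `shuffled`, `picked` and the fixed seed are illustrative.

```python
import pandas as pd

# The filtered frame from Out[14], rebuilt by hand
df_old = pd.DataFrame(
    {
        "date": pd.to_datetime([
            "2015-01-31 07:14:39", "2014-12-16 22:50:55",
            "2015-01-30 03:51:12", "2014-12-15 23:54:03",
            "2014-12-16 19:53:53", "2014-12-18 00:15:02",
            "2015-04-01 21:36:55", "2015-02-18 14:10:40",
            "2015-02-27 04:56:33",
        ]),
        "name": ["Dave", "Lisa", "Simon", "Simon", "Simon",
                 "Lisa", "Dave", "John", "Lisa"],
    },
    index=[0, 1, 4, 6, 7, 8, 9, 11, 12],
)

# Shuffle all rows once; random_state is fixed only for reproducibility
shuffled = df_old.sample(frac=1, random_state=0)

# Keep the first shuffled row of each name, then sort for readability
picked = shuffled.drop_duplicates("name").sort_values("name")
print(picked)
```

Unlike the `agg` version, this keeps the original row index, which can be handy for joining back to the source frame.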
      

      【Comments】:
