【Question title】: Merging two pandas dataframes with complex conditions
【Posted】: 2017-12-27 10:36:11
【Question】:

I want to merge two dataframes. Consider the following two dfs:

df1:

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5
id1, 2017-04-27 01:36:05, cotton,      3.5
id1, 2017-04-27 01:36:55, cotton,      3.5
id1, 2017-04-27 01:37:20, cotton,      3.5
id2, 2017-04-27 02:35:35, cotton blue, 5.0
id2, 2017-04-27 02:36:00, cotton blue, 5.0
id2, 2017-04-27 02:36:35, cotton blue, 5.0
id2, 2017-04-27 02:37:20, cotton blue, 5.0

df2:

id_B,  ts_B,                 value
id1,   2017-03-27 01:25:40,  100
id1,   2017-03-27 01:25:50,  200
id1,   2017-03-27 01:25:50,  230
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350
id1,   2017-04-27 01:36:10,  400
id1,   2017-04-27 01:36:20,  500
id1,   2017-04-27 01:36:30,  600
id1,   2017-04-27 01:36:40,  700
id1,   2017-04-27 01:36:50,  800
id1,   2017-04-27 01:37:00,  900
id1,   2017-04-27 01:37:10, 1000
id2,   2017-04-27 02:35:40,  1000
id2,   2017-04-27 02:35:50,  2000
id2,   2017-04-27 02:36:00,  4500
id2,   2017-04-27 02:36:10,  3000
id2,   2017-04-27 02:36:20,  6000
id2,   2017-04-27 02:36:30,  5000
id2,   2017-04-27 02:36:40,  5022
id2,   2017-04-27 02:36:50,  5040
id2,   2017-04-27 02:37:00,  3200
id2,   2017-04-27 02:37:10,  9000

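df1 should be merged with df2 so that the following holds: taking the time interval between two consecutive rows of df1, I want to merge each df1 row with the average value of all df2 rows that fall within that interval. For example,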

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5

should be merged with

id_B,  ts_B,                 value
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350

to obtain

id_A,           ts_A,    course,     weight, avgValue
id1, 2017-04-27 01:35:30, cotton,      3.5,    263.3
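
The expected value above can be checked by hand with a boolean mask. This is only a sketch for the single example row: it assumes the interval is closed on the left and open on the right (from the row's ts_A up to, but not including, the next row's ts_A), which the question does not state explicitly. The names start, end and in_interval are illustrative.

```python
import pandas as pd

# Subset of df2 for id1, copied from the question
df2 = pd.DataFrame({
    "id_B": ["id1"] * 4,
    "ts_B": pd.to_datetime([
        "2017-04-27 01:35:40", "2017-04-27 01:35:50",
        "2017-04-27 01:36:00", "2017-04-27 01:36:10",
    ]),
    "value": [240, 200, 350, 400],
})

start = pd.Timestamp("2017-04-27 01:35:30")  # ts_A of the df1 row
end = pd.Timestamp("2017-04-27 01:36:05")    # ts_A of the next df1 row

# Assumption: interval is [start, end), so 01:36:10 is excluded
in_interval = (df2["id_B"] == "id1") & (df2["ts_B"] >= start) & (df2["ts_B"] < end)
avg_value = df2.loc[in_interval, "value"].mean()
print(round(avg_value, 1))  # 263.3
```

Doing this row by row would be slow on large frames, which is why a vectorised join is the goal.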

I tried to approach the problem from the other direction by using merge_asof, which should bring the missing df2 rows into df1, but I did not get the right result:

pd.merge_asof(df2_sorted, df1, left_on='ts_B', right_on='ts_A', left_by='id_B', right_by='id_A', direction='backward')

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    I think you need merge_asof, but first call reset_index on df1 so that each of its rows gets a unique value to serve as a counter:

    df1 = df1.reset_index(drop=True)
    print (df1.index)
    RangeIndex(start=0, stop=8, step=1)
    
    df = pd.merge_asof(df2_sorted, 
                       df1.reset_index(), 
                       left_on='ts_B', 
                       right_on='ts_A', 
                       left_by='id_B', 
                       right_by='id_A')
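
The call above uses df2_sorted, which the post never defines. Presumably it is df2 sorted by ts_B, since merge_asof requires both frames to be sorted on their time keys and the keys to share a dtype. A minimal preparation sketch, using hypothetical two-row inputs whose timestamps are still strings:

```python
import pandas as pd

# Hypothetical miniature inputs with timestamps stored as strings, out of order
df1 = pd.DataFrame({"id_A": ["id1", "id1"],
                    "ts_A": ["2017-04-27 01:36:05", "2017-04-27 01:35:30"]})
df2 = pd.DataFrame({"id_B": ["id1", "id1"],
                    "ts_B": ["2017-04-27 01:36:10", "2017-04-27 01:35:40"],
                    "value": [400, 240]})

# Convert both keys to datetimes so they have the same dtype
df1["ts_A"] = pd.to_datetime(df1["ts_A"])
df2["ts_B"] = pd.to_datetime(df2["ts_B"])

# Sort both frames on their time keys, as merge_asof requires
df1 = df1.sort_values("ts_A").reset_index(drop=True)
df2_sorted = df2.sort_values("ts_B")
```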
    

    Then group by the output columns (do not forget the index column) and aggregate with mean:

    df = (df.groupby(['id_A','ts_A', 'course', 'weight', 'index'], as_index=False)['value']
            .mean()
            .drop('index', axis=1))
    print (df)
      id_A                ts_A       course  weight        value
    0  id1 2017-04-27 01:35:30       cotton     3.5   263.333333
    1  id1 2017-04-27 01:36:05       cotton     3.5   600.000000
    2  id1 2017-04-27 01:36:55       cotton     3.5   950.000000
    3  id2 2017-04-27 02:35:35  cotton blue     5.0  1500.000000
    4  id2 2017-04-27 02:36:00  cotton blue     5.0  4625.000000
    5  id2 2017-04-27 02:36:35  cotton blue     5.0  5565.500000
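
For reference, the two steps can be put together into one self-contained script. This is a sketch rather than the answerer's exact code: it rebuilds df1 and df2 from the question's data with read_csv, and the variable names merged and result are illustrative. With that data it should reproduce the six rows shown above.

```python
import io
import pandas as pd

df1_csv = """id_A,ts_A,course,weight
id1,2017-04-27 01:35:30,cotton,3.5
id1,2017-04-27 01:36:05,cotton,3.5
id1,2017-04-27 01:36:55,cotton,3.5
id1,2017-04-27 01:37:20,cotton,3.5
id2,2017-04-27 02:35:35,cotton blue,5.0
id2,2017-04-27 02:36:00,cotton blue,5.0
id2,2017-04-27 02:36:35,cotton blue,5.0
id2,2017-04-27 02:37:20,cotton blue,5.0
"""
df2_csv = """id_B,ts_B,value
id1,2017-03-27 01:25:40,100
id1,2017-03-27 01:25:50,200
id1,2017-03-27 01:25:50,230
id1,2017-04-27 01:35:40,240
id1,2017-04-27 01:35:50,200
id1,2017-04-27 01:36:00,350
id1,2017-04-27 01:36:10,400
id1,2017-04-27 01:36:20,500
id1,2017-04-27 01:36:30,600
id1,2017-04-27 01:36:40,700
id1,2017-04-27 01:36:50,800
id1,2017-04-27 01:37:00,900
id1,2017-04-27 01:37:10,1000
id2,2017-04-27 02:35:40,1000
id2,2017-04-27 02:35:50,2000
id2,2017-04-27 02:36:00,4500
id2,2017-04-27 02:36:10,3000
id2,2017-04-27 02:36:20,6000
id2,2017-04-27 02:36:30,5000
id2,2017-04-27 02:36:40,5022
id2,2017-04-27 02:36:50,5040
id2,2017-04-27 02:37:00,3200
id2,2017-04-27 02:37:10,9000
"""
df1 = pd.read_csv(io.StringIO(df1_csv), parse_dates=["ts_A"])
df2 = pd.read_csv(io.StringIO(df2_csv), parse_dates=["ts_B"])

# merge_asof requires both frames to be sorted by their time keys
df1 = df1.sort_values("ts_A").reset_index(drop=True)
df2_sorted = df2.sort_values("ts_B")

# For each df2 row, attach the latest df1 row (same id) with ts_A <= ts_B
merged = pd.merge_asof(
    df2_sorted,
    df1.reset_index(),  # adds an "index" column identifying each df1 row
    left_on="ts_B",
    right_on="ts_A",
    left_by="id_B",
    right_by="id_A",
)

# Average df2 values per df1 row; unmatched df2 rows have NaN keys and are dropped
result = (
    merged.groupby(["id_A", "ts_A", "course", "weight", "index"], as_index=False)["value"]
    .mean()
    .drop("index", axis=1)
)
print(result)
```

Note that df1 rows with no df2 rows in their interval (here the 01:37:20 and 02:37:20 rows) do not appear in the result, and the March rows of df2 are dropped because they precede every df1 row. If the empty df1 rows are needed, a left merge of df1 with result would bring them back with NaN.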
    

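    【Comments】:

    • Thank you very much. I am applying it to my case. Give me a few minutes and I will be back.
    • Running df = df.groupby(schema2, as_index=False)['value'].mean().drop('index', axis=1) raises DataError('No numeric types to aggregate'): pandas.core.base.DataError: No numeric types to aggregate
    • I think you need df2['value'] = df2['value'].astype(float) if the values are floats, or df2['value'] = df2['value'].astype(int) if they are ints, as a first step.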
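
The DataError in the comment above typically means the value column holds strings rather than numbers. As an alternative to astype, pd.to_numeric can do the conversion and cope with stray non-numeric entries. A small sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame whose "value" column was read in as strings
df2 = pd.DataFrame({"id_B": ["id1", "id1"], "value": ["240", "200"]})

# Convert to a numeric dtype; errors="coerce" turns unparseable entries into NaN
df2["value"] = pd.to_numeric(df2["value"], errors="coerce")

# The aggregation now has a numeric column to work on
mean_value = df2.groupby("id_B")["value"].mean()
print(mean_value)
```

Whether coercing bad entries to NaN is acceptable depends on the data; astype(float) would raise instead, which may be preferable when silent data loss is a concern.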