【Question title】: LEFT JOIN on time interval and timestamp in pandas
【Posted】: 2021-02-10 21:34:44
【Question】:

I have 2 data frames:

The labels:

import numpy as np  # needed for np.nan below
import pandas as pd
marker_labels = pd.DataFrame({'cohort_id':[1,1, 1], 'marker_type':['a', 'b', 'a'], 'start':['2020-01-2', '2020-01-04 05', '2020-01-06'], 'end':[np.nan, '2020-01-05 16', np.nan]})
marker_labels['start'] = pd.to_datetime(marker_labels['start'])
marker_labels['end'] = pd.to_datetime(marker_labels['end'])
marker_labels.loc[marker_labels['end'].isnull(), 'end'] =  marker_labels.start + pd.Timedelta(days=1) - pd.Timedelta(seconds=1)

and the data:

import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'hour': {36: Timestamp('2020-01-04 04:00:00'), 37: Timestamp('2020-01-04 04:00:00'), 38: Timestamp('2020-01-04 04:00:00'), 39: Timestamp('2020-01-04 04:00:00'), 40: Timestamp('2020-01-04 04:00:00'), 41: Timestamp('2020-01-04 04:00:00'), 42: Timestamp('2020-01-04 04:00:00'), 43: Timestamp('2020-01-04 04:00:00'), 44: Timestamp('2020-01-04 04:00:00'), 45: Timestamp('2020-01-04 05:00:00'), 46: Timestamp('2020-01-04 05:00:00'), 47: Timestamp('2020-01-04 05:00:00'), 48: Timestamp('2020-01-04 05:00:00'), 49: Timestamp('2020-01-04 05:00:00'), 50: Timestamp('2020-01-04 05:00:00'), 51: Timestamp('2020-01-04 05:00:00'), 52: Timestamp('2020-01-04 05:00:00'), 53: Timestamp('2020-01-04 05:00:00')}, 'metrik_0': {36: -0.30098661551885625, 37: -0.6402837079024638, 38: -2.6953511655638778, 39: 0.4036062912674384, 40: -0.035627996627399204, 41: -0.06510225503176624, 42: -1.9745426914329782, 43: 1.4112111331287631, 44: 0.18641277342651516, 45: 0.10780795451690242, 46: 0.31822895003286417, 47: -1.0804164740649171, 48: -1.6676697601556636, 49: -1.0354359757914047, 50: 1.8570215568670299, 51: 0.9055795225472866, 52: -0.020539970820695173, 53: -0.7975048293123836}, 'cohort_id': {36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1, 48: 1, 49: 1, 50: 1, 51: 1, 52: 1, 53: 1}, 'device_id': {36: 6, 37: 5, 38: 11, 39: 20, 40: 18, 41: 1, 42: 14, 43: 9, 44: 12, 45: 9, 46: 14, 47: 11, 48: 20, 49: 5, 50: 1, 51: 12, 52: 6, 53: 18}})
df

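I want to perform a LEFT JOIN on the cohort_id column and on the time interval (hour BETWEEN start AND end).

Similar questions are:

So far I have tried several approaches, but none of them gives me a complete solution:

First: slow, and the result is not fully expanded or accessible as plain pandas columns: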

def join_on_matching_interval(x):
    result = marker_labels[(marker_labels.cohort_id == x.cohort_id) & (x.hour >= marker_labels.start) & (x.hour <= marker_labels.end)]
    if len(result) == 0:
        result = []
    return result
    
df['marker_labels'] = df.apply(join_on_matching_interval, axis=1)
print(df.shape[0])
#df = df.explode('marker_labels') # this fails to work
df['size'] = df.marker_labels.apply(lambda x: len(x))
df[(df['size'] > 0)].head()

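How can I access the results as columns?

Second: correct columns but a wrong number of rows (although it is fast):

Following the link I shared above: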

print(len(df))
print(len(marker_labels))
merged_res = df.merge(marker_labels, left_on=['cohort_id'], right_on=['cohort_id'], how='left')
print(len(merged_res)) # the number of rows has increased
merged_res = merged_res[(merged_res.hour.between(merged_res.start,merged_res.end)) | (merged_res.start.isnull())]
print(len(merged_res)) # but now not enough rows are left over.
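  1. Case 1: no match (handled correctly)
  2. Case 2: full match (handled correctly)
  3. Case 3: partial match (not handled -> records are dropped)

In particular, for case 3 this means:

  • I do not want any duplicate rows
  • all rows from the left side must be kept
  • rows are matched where the timestamp falls within the time interval

How can I include the third case in the condition?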

【Discussion】:

    Tags: python pandas join time-series intervals


    【Solution 1】:

    Do you mean merge, query, and then join back onto the original frame:

    tmp = (df.reset_index()
             .merge(marker_labels, on='cohort_id', how='left')
             .query('start <= hour <= end')
             .set_index('index')
             .reindex(df.index)
          )
    
    out = tmp.combine_first(df)
    

    Output:

          cohort_id    device_id  end                  hour                 marker_type      metrik_0  start
    --  -----------  -----------  -------------------  -------------------  -------------  ----------  -------------------
    36            1            6  NaT                  2020-01-04 04:00:00  nan            -0.300987   NaT
    37            1            5  NaT                  2020-01-04 04:00:00  nan            -0.640284   NaT
    38            1           11  NaT                  2020-01-04 04:00:00  nan            -2.69535    NaT
    39            1           20  NaT                  2020-01-04 04:00:00  nan             0.403606   NaT
    40            1           18  NaT                  2020-01-04 04:00:00  nan            -0.035628   NaT
    41            1            1  NaT                  2020-01-04 04:00:00  nan            -0.0651023  NaT
    42            1           14  NaT                  2020-01-04 04:00:00  nan            -1.97454    NaT
    43            1            9  NaT                  2020-01-04 04:00:00  nan             1.41121    NaT
    44            1           12  NaT                  2020-01-04 04:00:00  nan             0.186413   NaT
    45            1            9  2020-01-05 16:00:00  2020-01-04 05:00:00  b               0.107808   2020-01-04 05:00:00
    46            1           14  2020-01-05 16:00:00  2020-01-04 05:00:00  b               0.318229   2020-01-04 05:00:00
    47            1           11  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -1.08042    2020-01-04 05:00:00
    48            1           20  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -1.66767    2020-01-04 05:00:00
    49            1            5  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -1.03544    2020-01-04 05:00:00
    50            1            1  2020-01-05 16:00:00  2020-01-04 05:00:00  b               1.85702    2020-01-04 05:00:00
    51            1           12  2020-01-05 16:00:00  2020-01-04 05:00:00  b               0.90558    2020-01-04 05:00:00
    52            1            6  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -0.02054    2020-01-04 05:00:00
    53            1           18  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -0.797505   2020-01-04 05:00:00
    

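    【Comments】:

    • Not quite. You are not performing a full LEFT JOIN. In the query (as pointed out in my code above) the third case is ignored, i.e. the filter is too restrictive.
    • I did not know about combine_first. That is a really neat solution! I am impressed.
    • However, this only works if there is no more than one label per time interval.
    • This can fail if multiple rows in marker_labels correspond to a given row in df. Simply replace set_index().reindex().combine_first() with another merge.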