pandas: merging on timestamps that do not match exactly

【Question title】: pandas merging based on a timestamp which do not match exactly
【Posted】: 2016-04-25 03:32:37
【Question】:

What methods are available for merging on columns whose timestamps do not match exactly?

DF1:

date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:13  7261824      871631182

DF2:

date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:37  7261824      871631182

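I can join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has multiple identical sessions on the same date, which causes duplicates. I can drop the rows where this happens, but if I do, I lose valid sessions.

I would like to join when the timestamp in DF1 is less than 5 minutes from the timestamp in DF2, matching on:

['employee_id', 'session_id', 'timestamp<5minutes']

Edit - I assumed someone would have run into this issue before.

I was thinking of doing this:

Take the timestamp on each dataframe
Create a column which is the timestamp + 5 minutes (rounded)
Create a column which is the timestamp - 5 minutes (rounded)

Create a 10-minute interval string to join the files on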

df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)

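Does anyone know how to round these 5-minute intervals to the nearest 5-minute mark?

02:59:37 - 5 minutes = 02:55:00

02:59:37 + 5 minutes = 03:05:00

interval_string = '02:55:00-03:05:00'

pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])

Does anyone know how to round the time like this? It seems like it could work. You would still match on date, employee, and session, and then look for times that fall in essentially the same 10-minute interval or range.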
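One possible way to get the rounded bounds described above is the `.dt.round` accessor on a datetime column. The sketch below is only an illustration: the one-row frame and the `'5min'` frequency are assumptions chosen to match the 02:59:37 example, and the column names follow the snippet in the question.

```python
import pandas as pd

# Hypothetical one-row frame matching the 02:59:37 example above.
df1 = pd.DataFrame({'start_time': pd.to_datetime(['2016-01-01 02:59:37'])})

five_min = pd.Timedelta(minutes=5)

# Shift by 5 minutes in each direction, then round to the nearest 5-minute mark.
df1['low_time'] = (df1['start_time'] - five_min).dt.round('5min')
df1['high_time'] = (df1['start_time'] + five_min).dt.round('5min')

# Build the interval string from the time-of-day parts of both bounds.
df1['interval_string'] = (df1['low_time'].dt.strftime('%H:%M:%S') + '-'
                          + df1['high_time'].dt.strftime('%H:%M:%S'))

print(df1['interval_string'].iloc[0])  # 02:55:00-03:05:00
```

Two timestamps that sit on opposite sides of a rounding boundary would still produce different interval strings, so this reduces missed matches rather than eliminating them.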

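【Comments on the question】:

Interesting problem. The naive solution is to merge on timestamps rounded to the nearest 5 minutes, but this will leave some sessions as separate rows if they happen to fall on opposite sides of a 5-minute mark. You could apply the procedure iteratively with random offsets, up to some number of iterations, which would give better results. The most robust solution would be a clustering algorithm, but that is harder to implement.
This may provide some inspiration.
Ideally, you want an SQL-style where clause on the join that uses between on one of the dates, with the two bounds derived from the other date. If doing this directly in a database is at all feasible, or with an in-memory database like SQLite, I would recommend it. The hack you would need in pandas will be ugly, and if you do it the database way you can still pull the result into pandas afterwards for interactive work or anything else.
@Lance Is it guaranteed that each of the two dataframes contains truly unique sessions on its own? That is, does deduplication only apply when you merge them? Or could a single dataframe contain two rows for the "same" session with slightly different timestamps?
Sorry, I still don't understand. Within a single dataframe, does session deduplication (allowing for small differences in timestamps) need to be performed?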
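The SQL-style join suggested in the comments could look roughly like the sketch below, using an in-memory SQLite database. The sample frames, the table names `s1` and `s2`, and the 300-second limit are assumptions made for illustration, and the condition uses an absolute difference in seconds in place of a literal BETWEEN clause.

```python
import sqlite3
import pandas as pd

# Hypothetical sample sessions; start_time is kept as ISO-8601 text so that
# SQLite's date functions can parse it.
df1 = pd.DataFrame({
    'employee_id': [7261824, 7261824],
    'session_id': [871631182, 871631185],
    'start_time': ['2016-01-01 06:03:13', '2016-01-01 14:01:00'],
})
df2 = pd.DataFrame({
    'employee_id': [7261824, 7261824],
    'session_id': [871631182, 871631185],
    'start_time': ['2016-01-01 06:03:37', '2016-01-01 14:10:00'],
})

con = sqlite3.connect(':memory:')
df1.to_sql('s1', con, index=False)
df2.to_sql('s2', con, index=False)

# Join on the ids and require the start times to be within 300 seconds.
query = """
SELECT a.employee_id, a.session_id,
       a.start_time AS start_time_1,
       b.start_time AS start_time_2
FROM s1 AS a
JOIN s2 AS b
  ON a.employee_id = b.employee_id
 AND a.session_id = b.session_id
 AND ABS(strftime('%s', a.start_time) - strftime('%s', b.start_time)) <= 300
"""
matched = pd.read_sql_query(query, con)
con.close()
print(matched)
```

With this sample data only the first session pairs up (24 seconds apart); the second pair is 9 minutes apart and is left out of the inner join.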

Tags: python pandas

【Solution 1】:

Consider the following mini-version of your problem:

from io import StringIO
from pandas import read_csv, to_datetime

# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

which gives

>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185

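You want to treat df2[0:3] as duplicates of df1[0:3] when merging (since each pair is less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.

Solution 1: Interval matching

This is basically what you suggested in your edit. You want to map the timestamps in both tables to a 10-minute interval centered on the timestamp, rounded to the nearest 5 minutes.

Each interval can be uniquely represented by its midpoint, so you can merge the dataframes on the timestamp rounded to the nearest 5 minutes. For example: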

import numpy as np

# threshold in nanoseconds (the rounding step)
threshold_ns = threshold * 60 * 1e9

# compute the "interval" to which each session belongs:
# the timestamp rounded to the nearest multiple of the threshold
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185

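Note that this is not entirely correct. Sessions df1[2] and df2[2] are not treated as duplicates even though they are only 3 minutes apart, because they fall on opposite sides of an interval boundary.

Solution 2: One-to-one matching

Here is another approach, which relies on the condition that a session in df1 has either zero or one duplicate in df2.

We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is less than 5 minutes away.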

from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's within the threshold)
def closest(row):
    # candidate rows in "df2" with the same employee and session
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]

    # absolute differences, so earlier and later candidates are treated alike
    deltas = (matches.date_start_time - row.date_start_time).abs()
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185

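This approach is much slower, because you have to search the whole of df2 for every row in df1. What I have written could probably be optimized further, but it will still take a long time on large datasets.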
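A different technique that avoids the per-row search, and is not part of the answer above, is to merge on the id columns first and then filter the resulting pairs by time difference. The sketch below assumes small hypothetical frames shaped like the example data; the suffixes `_1` and `_2` are arbitrary choices.

```python
import pandas as pd

threshold = pd.Timedelta(minutes=5)

# Hypothetical frames shaped like the example data above.
df1 = pd.DataFrame({
    'date_start_time': pd.to_datetime(['2016-01-01 06:03:00', '2016-01-01 14:01:00']),
    'employee_id': [7261824, 7261824],
    'session_id': [871631183, 871631185],
})
df2 = pd.DataFrame({
    'date_start_time': pd.to_datetime(['2016-01-01 06:05:00', '2016-01-01 14:10:00']),
    'employee_id': [7261824, 7261824],
    'session_id': [871631183, 871631185],
})

# Pair every df1 row with every df2 row that shares the same ids ...
pairs = df1.merge(df2, on=['employee_id', 'session_id'], suffixes=('_1', '_2'))

# ... then keep only the pairs whose timestamps are close enough.
delta = (pairs['date_start_time_1'] - pairs['date_start_time_2']).abs()
close = pairs[delta <= threshold]
print(close)
```

The intermediate `pairs` frame can grow large when many rows share the same ids, so this trades memory for speed.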

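【Comments】:

Looks like a good start to me. Regarding your first solution, could we include a plus-and-minus interval range to keep events from landing on the wrong side of an interval? The interval would be a string, as in the example I posted. I'm not sure the logic is 100% correct, but I got it working on test data in Excel.
I think yours would run into the same problem. Consider mapping a continuous time range onto discrete intervals. That means you can always find a pair of timestamps that are close enough on the continuous range but fall into different intervals. I'm not sure my approach is exactly equivalent to yours (though I think it might be), but the general idea holds.
Haha, what a headache. Thanks, I'll test this later and let you know. It should at least improve my match rate.
Just realized that the interval-binning solution could be improved by using DatetimeIndex.snap. I didn't know about this method until now.
That looks interesting, but the documentation is not helpful at all. I haven't used DatetimeIndex before, so I'm not quite sure how to proceed. df1 = df1.set_index(pd.DatetimeIndex(df1['call_start'], drop = False)) seems to have created the index, but my attempts at adding .snap to it were unsuccessful. I couldn't find a good example when searching online.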

【Solution 2】:

I would try this method in pandas:

pandas.merge_asof()

The parameters you are interested in are direction, tolerance, left_on, and right_on.

Building on @Igor's answer:

import pandas as pd
from pandas import read_csv
from io import StringIO

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)



df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

Output:

date_start_time      date_start_time_x    employee_id_x  session_id_x  date_start_time_y    employee_id_y  session_id_y
2016-01-01 02:03:00  2016-01-01 02:03:00  7261824        871631182     2016-01-01 02:03:00  7261824.0      871631182.0
2016-01-01 06:03:00  2016-01-01 06:03:00  7261824        871631183     2016-01-01 06:05:00  7261824.0      871631183.0
2016-01-01 11:01:00  2016-01-01 11:01:00  7261824        871631184     2016-01-01 11:04:00  7261824.0      871631184.0
2016-01-01 14:01:00  2016-01-01 14:01:00  7261824        871631185     NaT                  NaN            NaN

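【Comments】:

Very cool. This is years after the fact, so the code I used this in is very much legacy and not something I plan to update (for fear of breaking something I haven't looked at in a long time), but it's a nice feature and I'll keep it in mind for other problems.
How do you join on ('employee_id', 'session_id') first, rather than only on the nearest date_start_time? I think you need a by= argument in merge_asof.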
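The `by=` idea raised in the last comment might be sketched as follows. The frames reproduce the example data by hand, and the extra `matched_time` column is a hypothetical helper added here only so that the timestamp picked from df2 stays visible after the merge.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date_start_time': pd.to_datetime([
        '2016-01-01 02:03:00', '2016-01-01 06:03:00',
        '2016-01-01 11:01:00', '2016-01-01 14:01:00']),
    'employee_id': [7261824] * 4,
    'session_id': [871631182, 871631183, 871631184, 871631185],
})
df2 = pd.DataFrame({
    'date_start_time': pd.to_datetime([
        '2016-01-01 02:03:00', '2016-01-01 06:05:00',
        '2016-01-01 11:04:00', '2016-01-01 14:10:00']),
    'employee_id': [7261824] * 4,
    'session_id': [871631182, 871631183, 871631184, 871631185],
})

# Keep a copy of df2's timestamp so the matched value stays visible.
df2 = df2.assign(matched_time=df2['date_start_time'])

# merge_asof needs both frames sorted by the "on" key; "by" restricts
# candidates to rows with the same employee_id and session_id.
merged = pd.merge_asof(
    df1.sort_values('date_start_time'),
    df2.sort_values('date_start_time'),
    on='date_start_time',
    by=['employee_id', 'session_id'],
    direction='nearest',
    tolerance=pd.Timedelta(minutes=5),
)
print(merged)
```

Rows of df1 with no df2 row inside the tolerance get NaT in `matched_time`, which is what happens to the 14:01 session here.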

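【Solution 3】:

I would suggest using the built-in pandas Series dt rounding function to round both dataframes to a common time, for example rounding to every 5 minutes. The times will then always be of the form 01:00:00, then 01:05:00, and so on. That way both dataframes will have similar time indexes on which to perform the merge.

See the documentation and examples here: pandas.Series.dt.round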
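As a rough illustration of the rounding suggested here, the sketch below applies `.dt.round('5min')` to two hypothetical series whose values are taken from the example data in Solution 1. It also shows the boundary caveat raised in that answer.

```python
import pandas as pd

# Hypothetical timestamps taken from the example data in Solution 1.
a = pd.Series(pd.to_datetime(['2016-01-01 06:03:00', '2016-01-01 11:01:00']))
b = pd.Series(pd.to_datetime(['2016-01-01 06:05:00', '2016-01-01 11:04:00']))

a_rounded = a.dt.round('5min')
b_rounded = b.dt.round('5min')

# The first pair lands on the same mark; the second pair, only 3 minutes
# apart, straddles a boundary and rounds to different marks.
print(a_rounded == b_rounded)
```

The second pair rounds to 11:00 and 11:05, so a merge on the rounded column would miss it even though the raw timestamps are within 5 minutes of each other.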

【Comments】: