How should I handle duplicate times in time series data with pandas?

【Question title】: How should I handle duplicate times in time series data with pandas?
【Posted】: 2017-10-23 00:57:48
【Question】:

As part of a larger data set, I get the following back from an API call:

{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'}

{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'}

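Ideally I would use the timestamp as the index of the pandas DataFrame, but this fails when converting to JSON because the index contains duplicates:

new_df = df.set_index(pd.to_datetime(df['Timestamp']))
print(new_df.to_json(orient='index'))

ValueError: DataFrame index must be unique for orient='index'.

Any guidance on the best way to handle this? Discard one of the data points? The timestamps are no finer than one second, and the price clearly changes within that second.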

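【Comments】:

Then you need to tell us how you want to handle multiple price ticks that share the same timestamp: keep the first, the last, or all of them? The first price? The average price? The highest and lowest prices? ...? It depends on what you ultimately want to do with the data. You need to give us more context.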
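To make the options in this comment concrete, here is a minimal sketch of collapsing same-second ticks into one row per timestamp. The sample frame (including the third row) and the names `ticks` and `per_second` are assumptions made for illustration; which aggregate is appropriate depends on what the data is used for.

```python
import pandas as pd

# Hypothetical sample: two ticks share a second, a third is one second later.
ticks = pd.DataFrame({
    "Time": pd.to_datetime([
        "2017-05-21 18:18:01",
        "2017-05-21 18:18:01",
        "2017-05-21 18:18:02",
    ]),
    "Price": ["0.052600", "0.052500", "0.052700"],
})

# The API returns prices as strings; convert them before aggregating.
ticks["Price"] = ticks["Price"].astype(float)

# One row per distinct timestamp, with several candidate summaries.
per_second = ticks.groupby("Time")["Price"].agg(
    ["first", "last", "mean", "max", "min"]
)
print(per_second)
```

The resulting index is unique, so any single column of `per_second` could be serialized with `orient='index'`, at the cost of no longer keeping every individual tick.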

Tags: python pandas time-series data-processing

【Solution 1】:

I think you can make the duplicate datetimes unique by adding milliseconds to them, using cumcount and to_timedelta:

import datetime
import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
     {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)
print (df)
      Price                Time
0  0.052600 2017-05-21 18:18:01
1  0.052500 2017-05-21 18:18:01

print (pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms'))
0          00:00:00
1   00:00:00.001000
dtype: timedelta64[ns]

df['Time'] = df['Time'] + pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms')
print (df)
      Price                    Time
0  0.052600 2017-05-21 18:18:01.000
1  0.052500 2017-05-21 18:18:01.001

new_df = df.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052600"},"1495390681001":{"Price":"0.052500"}}

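【Comments】:

Yes, that works - thanks. The change in price matters more than having the time accurate to the millisecond.
Super. By the way, it only shifts the duplicates by milliseconds; all unique values stay unchanged, because 0 ms is added to them. Good luck!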
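A small sketch of that last point, assuming a hypothetical series `times` with one duplicated second and one unique one. `cumcount` numbers the rows within each group of equal values starting from 0, so the first row of every group, and therefore every unique timestamp, gets an offset of 0 ms.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2017-05-21 18:18:01",  # duplicated
    "2017-05-21 18:18:01",  # duplicated
    "2017-05-21 18:18:05",  # unique
]))

# Position of each row within its group of equal timestamps: 0, 1, 0
position = times.groupby(times).cumcount()

# Turn the positions into millisecond offsets and apply them.
shifted = times + pd.to_timedelta(position, unit="ms")
print(shifted)
```

Only the second row moves; the first occurrence of the duplicated second and the unique timestamp are left exactly as they were.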

【Solution 2】:

You can use .duplicated to keep only the first or the last entry for each timestamp. Take a look at pandas.DataFrame.duplicated
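A minimal sketch of what this could look like, assuming the two sample rows from the question; the names `mask_first`, `kept_first`, and `kept_last` are illustrative. Note that either choice discards the other row, and with it that row's price.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["2017-05-21 18:18:01", "2017-05-21 18:18:01"]),
    "Price": ["0.052600", "0.052500"],
})

# duplicated() flags every repeat of a Time value after the first as True.
mask_first = df.duplicated(subset="Time", keep="first")
kept_first = df[~mask_first]

# drop_duplicates() with keep="last" retains the final row per Time instead.
kept_last = df.drop_duplicates(subset="Time", keep="last")

# With a unique index, orient="index" no longer raises.
print(kept_first.set_index("Time").to_json(orient="index"))
```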

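【Comments】:

No, because that loses a price, and the price is the thing that changes.

【Solution 3】:

Just expanding on the accepted answer: adding a loop helps handle any new duplicates introduced by the first pass.

The isnull check is important for catching any NaT in the data, because any timedelta + NaT is still NaT.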

import logging

import pandas

LOGGER = logging.getLogger(__name__)


def deduplicate_start_times(frame, col='start_time', max_iterations=10):
    """
    Fuzz duplicate start times from a frame so we can stack and unstack
    this frame.
    """

    for _ in range(max_iterations):
        dups = frame.duplicated(subset=col) & ~pandas.isnull(frame[col])

        if not dups.any():
            break

        LOGGER.debug("Removing %i duplicates", dups.sum())

        # Add several ms of time to each time
        frame[col] += pandas.to_timedelta(frame.groupby(col).cumcount(),
                                          unit='ms')

    else:
        LOGGER.error("Exceeded max iterations removing duplicates. "
                     "%i duplicates remain", dups.sum())

    return frame
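To illustrate why a single pass may not be enough, here is a small sketch under assumed data: two rows share a timestamp and a third row already sits 1 ms later. The helper `add_ms_offsets` is hypothetical and only reproduces one pass of the offset step; it is not part of the answer's function.

```python
import pandas as pd

start_times = pd.Series(pd.to_datetime([
    "2017-05-21 18:18:01.000",
    "2017-05-21 18:18:01.000",  # duplicate of the row above
    "2017-05-21 18:18:01.001",  # already 1 ms later
    None,                       # becomes NaT
]))


def add_ms_offsets(times):
    """One pass: add 0, 1, 2, ... ms within each group of equal times."""
    return times + pd.to_timedelta(times.groupby(times).cumcount(), unit="ms")


once = add_ms_offsets(start_times)  # row 1 now collides with row 2
twice = add_ms_offsets(once)        # the second pass separates them

print(once.dropna().duplicated().any())
print(twice.dropna().duplicated().any())
```

After the first pass the shifted row lands on the existing `.001` value, so a duplicate remains; the second pass resolves it. The NaT row stays NaT throughout, which is why the loop has to exclude nulls when checking for remaining duplicates.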

【Comments】: