Answers: TimeSeries conversion from CET / CEST to UTC

【Question title】: TimeSeries conversion from CET / CEST to UTC
【Posted】: 2022-01-11 01:07:35
【Question】:

I have two time series files, both in CET / CEST. In the bad one, the values were not written correctly. The good csv looks like this:

#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...

...and everything works fine with:

df['utc_time'] = (pd.to_datetime(df[local_time_column])
                    .dt.tz_localize('CET', ambiguous="infer")
                    .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))

When converting test_bad.csv to UTC, I get an AmbiguousTimeError because 2 hours are missing in October.

# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017   # everything is as it should be
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7223
2017-10-29 02:00,7224   # the value of 2 am should actually be repeated PLUS 3 am is missing
2017-10-29 04:00,7226
2017-10-29 05:00,7227
...
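
A minimal reproduction of both cases, assuming only the October rows of each file (the variable names `good`, `bad` and `bad_failed` are illustrative and not from the original post). With the repeated 02:00 present, ambiguous="infer" can tell the two occurrences apart; with a single 02:00 there is nothing to infer from, so the call raises. The exact exception type may depend on the pandas version, which is why the sketch catches broadly:

```python
import pandas as pd

# October rows from test_good.csv: 02:00 appears twice (CEST first, then CET)
good = pd.Series(pd.to_datetime(
    ["2017-10-29 01:00", "2017-10-29 02:00", "2017-10-29 02:00", "2017-10-29 03:00"]))
# October rows from test_bad.csv: 02:00 appears once and 03:00 is missing
bad = pd.Series(pd.to_datetime(
    ["2017-10-29 01:00", "2017-10-29 02:00", "2017-10-29 04:00", "2017-10-29 05:00"]))

good_utc = good.dt.tz_localize("CET", ambiguous="infer").dt.tz_convert("UTC")
print(good_utc.dt.strftime("%Y-%m-%d %H:%M").tolist())

try:
    bad.dt.tz_localize("CET", ambiguous="infer")
    bad_failed = False
except Exception as exc:  # AmbiguousTimeError in pytz-based pandas versions
    bad_failed = True
    print(type(exc).__name__, exc)
```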

Does anyone know an elegant way to convert the time series file to UTC and fill the dates missing from the new index with NaN? Thanks for your help.

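【Comments】:

To make sure I understand: your second example (the "bad" data) has missing entries, and you want to fill them with NaN? Also, is the frequency constant (hourly)?
@MrFuppes, yes, I am working with hourly data. Ideally, I would like to fill the missing entries with the corresponding CEST / CET datetime values so that the conversion succeeds as in the first example.
Unfortunately, if you did not record whether a time was taken before or after the changeover, the ambiguity cannot be resolved reliably. Is #7225 missing from the CSV file, or did Pandas drop it?
Side note: if you format a UTC datetime to a string, pass that information along with a format like '%Y-%m-%d %H:%M:%SZ', where Z denotes UTC.
@MarkRansom, the values for 2017-10-29 02:00 (second occurrence) and 2017-10-29 03:00 are missing from the initial file. They were not dropped. I also thought about just creating a new CET index and then reindexing, but because of the duplicated value at the October DST change I got a "cannot reindex" error.

Tags: python pandas datetime pytz datetimeoffset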

【Solution 1】:

To elaborate on Mark Ransom's comment:

2017-10-29 02:00,7224

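is ambiguous; it could be 2017-10-29 00:00 UTC or 2017-10-29 01:00 UTC. That is why pandas refuses to infer anything here (the error comes from tz_localize with ambiguous="infer", which needs the repeated hour to decide).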
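
The two readings can be made explicit with pytz's `is_dst` flag. This is a small illustration rather than part of the original answer; the variable names are made up for the example:

```python
from datetime import datetime

import pytz

berlin = pytz.timezone("Europe/Berlin")
naive = datetime(2017, 10, 29, 2, 0)

as_cest = berlin.localize(naive, is_dst=True)   # first 02:00, DST still active
as_cet = berlin.localize(naive, is_dst=False)   # second 02:00, after the clocks went back

print(as_cest.astimezone(pytz.utc))  # 2017-10-29 00:00:00+00:00
print(as_cet.astimezone(pytz.utc))   # 2017-10-29 01:00:00+00:00
```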

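With a bit of native Python, you can work around the problem. Assuming you have just loaded the csv into df without parsing anything to datetime, you can continue with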

from datetime import datetime
import pytz

df['local_time'] = [pytz.timezone('Europe/Berlin').localize(datetime.fromisoformat(t)) for t in df['local_time']]

# so you can make a UTC index:
df.set_index(df['local_time'].dt.tz_convert('UTC'), inplace=True)

# Now you can create a new, hourly index from that and re-index:
dti = pd.date_range(df.index[0], df.index[-1], freq='H')
df2 = df.reindex(dti)

# for comparison, the "re-created" local_time column:
df2['local_time'] = df2.index.tz_convert('Europe/Berlin').strftime('%Y-%m-%d %H:%M:%S').values

This should give you something like

df2
                            value           local_time
2017-03-25 23:00:00+00:00  2016.0  2017-03-26 00:00:00
2017-03-26 00:00:00+00:00  2017.0  2017-03-26 01:00:00
2017-03-26 01:00:00+00:00  2018.0  2017-03-26 03:00:00
2017-03-26 02:00:00+00:00  2019.0  2017-03-26 04:00:00
2017-03-26 03:00:00+00:00     NaN  2017-03-26 05:00:00
                          ...                  ...
2017-10-29 00:00:00+00:00     NaN  2017-10-29 02:00:00
2017-10-29 01:00:00+00:00  7224.0  2017-10-29 02:00:00 # note: value randomly attributed to "second" 2 am
2017-10-29 02:00:00+00:00     NaN  2017-10-29 03:00:00
2017-10-29 03:00:00+00:00  7226.0  2017-10-29 04:00:00
2017-10-29 04:00:00+00:00  7227.0  2017-10-29 05:00:00
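
For anyone who wants to try this without the full file, here is a self-contained sketch of the same steps on the October rows of test_bad.csv only. It deviates slightly from the code above: the UTC index is built with `pd.to_datetime(..., utc=True)` and the hourly step is passed as a `pd.Timedelta`. Both are choices made for this example, not something the answer prescribes:

```python
import io
from datetime import datetime

import pandas as pd
import pytz

# October rows of test_bad.csv as quoted in the question
csv_text = """local_time,value
2017-10-29 01:00,7223
2017-10-29 02:00,7224
2017-10-29 04:00,7226
2017-10-29 05:00,7227
"""
df = pd.read_csv(io.StringIO(csv_text))

berlin = pytz.timezone("Europe/Berlin")
# localize() defaults to is_dst=False, so the lone 02:00 is read as CET
localized = [berlin.localize(datetime.fromisoformat(t)) for t in df["local_time"]]
df.index = pd.to_datetime(localized, utc=True)

# complete hourly UTC index; hours absent from the file become NaN after reindexing
full = pd.date_range(df.index[0], df.index[-1], freq=pd.Timedelta(hours=1))
df2 = df[["value"]].reindex(full)
print(df2)
```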

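As noted above, the value 7224 is attributed to 2017-10-29 01:00:00 UTC, but it could just as well belong to 2017-10-29 00:00:00 UTC. If you don't care, you're fine. If that is a problem, I think the best you can do is discard the value. You can do that with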

df['local_time'] = pd.to_datetime(df['local_time']).dt.tz_localize('Europe/Berlin', ambiguous='NaT')

instead of the native Python part in the code above.
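
A short sketch of what ambiguous='NaT' does to the lone 02:00, again on a few October rows only (the names `local`, `localized` and `utc` are illustrative):

```python
import pandas as pd

local = pd.Series(pd.to_datetime(
    ["2017-10-29 01:00", "2017-10-29 02:00", "2017-10-29 04:00"]))

# the ambiguous 02:00 becomes NaT instead of raising
localized = local.dt.tz_localize("Europe/Berlin", ambiguous="NaT")
print(localized.isna().tolist())  # [False, True, False]

# drop the unresolved row, then convert what is left to UTC
utc = localized.dropna().dt.tz_convert("UTC")
print(utc.tolist())
```

It is probably cleanest to drop the NaT rows before building the UTC index, so that the index used for reindexing contains only valid timestamps.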

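【Comments】:

@MrFuppes, thank you very much, I see. I think it would be nice if the value could be attributed to 2017-10-29 00:00:00 UTC instead, but I have not found a way to get there yet.
@GregorJohnen You can force this by setting is_dst=True when calling the localize method (see the pytz docs). If you set this keyword, pytz localizes to the time at which DST is active whenever the date/time is ambiguous (as in your case). The problem I see: this may work for this specific case, but it may not always be the desired behavior...
@MrFuppes, what do you think about using timedeltas instead and then reindexing against a complete CET vector? That would avoid the trouble caused by the duplicated axis in the latter and would insert NaN for the entries missing from the value vector. Afterwards, the timedeltas could be converted back to datetime (CET), and the local time column could then be converted to UTC. What do you think?
@MachineYogi At the moment, I am not sure how this differs from having a (strictly monotonically increasing) UTC datetime column. But if you think it is a viable approach and it gives you the result you need, why not add it as an answer?
@MachineYogi You can also create the localized datetime column with a list comprehension without parsing from strings, e.g. df['local_time'] = [pytz.timezone('Europe/Berlin').localize(t) for t in df['local_time']]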
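
The is_dst=True idea from the comments also has a vectorized counterpart in pandas: `tz_localize` accepts a boolean array for `ambiguous`, where True marks an ambiguous wall time as DST. The sketch below assumes that every ambiguous entry should be read as DST, which, as noted in the comments, may not always be what you want:

```python
import numpy as np
import pandas as pd

local = pd.Series(pd.to_datetime(
    ["2017-10-29 01:00", "2017-10-29 02:00", "2017-10-29 04:00"]))

# True = "treat an ambiguous wall time as DST"; the flag only matters for ambiguous entries
as_dst = np.ones(len(local), dtype=bool)
utc = local.dt.tz_localize("Europe/Berlin", ambiguous=as_dst).dt.tz_convert("UTC")
print(utc.dt.strftime("%Y-%m-%d %H:%M").tolist())
```

With this choice the lone 02:00 lands on 2017-10-29 00:00 UTC rather than 01:00 UTC.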

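【Solution 2】:

Just to share the workaround I ended up using: it relies on try: / except: to handle the case where an ambiguous-time error is raised. It converts the time vector to UTC and, at the same time, fills missing values by reindexing. Suggestions for improvement are welcome.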

import numpy as np
import pandas as pd
import pytz

freq = 'H'  # assumption: hourly data, as stated in the question

try:  # here everything is as expected: one hour is missing in Mar and one hour is repeated in Oct

    # Localize the tz-naive datetime column (first column) to the local time zone, then convert to UTC.
    df['time'] = df.iloc[:, 0].dt.tz_localize('CET', ambiguous='infer').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
    df = df.set_index(pd.to_datetime(df['time'], utc=True))

    # Create a complete time vector in UTC for later reindexing
    idx = pd.date_range(df.index.min(), df.index.max(), freq=freq, tz='UTC')

    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print the dates missing from the sequence, then add them; the data columns get NaN there
        print(f'These dates are not in the sequence:{idx.difference(df.index)}')
        df = df.reindex(idx).rename_axis('time')

except pytz.exceptions.AmbiguousTimeError:  # pandas cannot infer DST because the ambiguous hour is not repeated

    # create the localized datetime column with a list comprehension
    df['time'] = [pytz.timezone('Europe/Berlin').localize(t, is_dst=True) for t in df.iloc[:, 0]]

    # make a UTC index:
    df.set_index(df['time'].dt.tz_convert('UTC'), inplace=True)

    # create a new index of the desired frequency from that and re-index:
    idx = pd.date_range(df.index[0], df.index[-1], freq=freq, tz='UTC')

    # Verify that the time vector is complete
    if len(np.unique(np.diff(df.index))) == 1:
        print('Time vector is complete!')
    else:
        # print the dates missing from the sequence, then add them; the data columns get NaN there
        print(f'These were the dates which were not in the sequence:{pd.Series(idx.difference(df.index))}')
        df = df.reindex(idx).rename_axis('time')
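
The completeness check used in both branches can be isolated into a small helper. This is only a sketch: `missing_timestamps` and its parameters are hypothetical names introduced for this example, and it assumes the index is already tz-aware UTC:

```python
import pandas as pd


def missing_timestamps(index: pd.DatetimeIndex, step: pd.Timedelta) -> pd.DatetimeIndex:
    """Return the timestamps a gap-free index with spacing `step` would contain but `index` lacks."""
    full = pd.date_range(index.min(), index.max(), freq=step)
    return full.difference(index)


# UTC index with two missing hours (00:00 and 02:00 on 2017-10-29)
utc_index = pd.to_datetime(
    ["2017-10-28 23:00", "2017-10-29 01:00", "2017-10-29 03:00", "2017-10-29 04:00"],
    utc=True)
gaps = missing_timestamps(utc_index, pd.Timedelta(hours=1))
print(gaps)
```

An empty result means the time vector is complete. Otherwise the returned timestamps are exactly the rows that reindexing will fill with NaN.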
