Pandas read_csv and remove daylight saving: answers

【Question title】: Pandas read_csv and remove daylight saving
【Posted】: 2012-12-23 15:34:17
【Question description】:

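I have a 312.5 MB CSV file containing 1-minute OHLC data for EURUSD from 27 July 2003 to the present, but the timestamps are all adjusted for daylight saving time, which means I get duplicates and gaps.

Because the file is so large, the default date parser is too slow, so I did this: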

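from datetime import datetime

import dateutil.tz
from pandas import read_csv

tizo = dateutil.tz.tzfile('/usr/share/zoneinfo/GB')

def date_parse_1min(s):
    # Fixed-width slices: day s[0:2], month s[3:5], year s[6:10], hour s[11:13], minute s[14:16]
    return datetime(int(s[6:10]),
                    int(s[3:5]),
                    int(s[0:2]),
                    int(s[11:13]),
                    int(s[14:16]), tzinfo=tizo)

df = read_csv("EURUSD_1m_clean_w_header.csv", index_col=0, parse_dates=True, date_parser=date_parse_1min)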

#verify that it's got the tz right:
df.index
Exception AttributeError: "'NoneType' object has no attribute 'toordinal'" in 'pandas.tslib._localize_tso' ignored
Exception AttributeError: "'NoneType' object has no attribute 'toordinal'" in 'pandas.tslib._localize_tso' ignored
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-07-26 23:00:00, ..., 2012-12-15 23:59:00]
Length: 4938660, Freq: None, Timezone: tzfile('/usr/share/zoneinfo/GB')

I don't know why the AttributeError appears.

df.index.get_duplicates()
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-10-26 01:00:00, ..., 2012-10-28 01:59:00]
Length: 600, Freq: None, Timezone: None
df1 = df.tz_convert('GMT')
df1.index.get_duplicates()
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-10-26 01:00:00, ..., 2012-10-28 01:59:00]
Length: 600, Freq: None, Timezone: None

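How can I get pandas to remove the daylight saving offset? Obviously I could work out the correct integer indices that need changing and do it that way, but there must be a better way.

【Comments on the question】:

Could you set the index to a date_range with one-minute frequency and then check whether any of the discrepancies are off by exactly the one hour of daylight saving?
I might be able to do something like that, although I would have to account for all the minutes missing from the data (every weekend, etc.)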
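
For what it's worth, the weekend gaps can be sidestepped by not comparing against a full date_range at all. The following is a minimal sketch of a different check, assuming the rows are still in file order and the index holds naive local times: the autumn clock change then shows up as a timestamp that is earlier than the one before it, which an ordinary gap never produces. The helper name find_backward_jumps is made up for illustration.

```python
import pandas as pd

def find_backward_jumps(index):
    """Return the timestamps at which local time steps backwards."""
    # Difference between each timestamp and the previous one, by position.
    steps = index.to_series().diff()
    # A negative step means the clock was set back; weekend gaps are positive.
    went_back = (steps < pd.Timedelta(0)).to_numpy()
    return index[went_back]

# Hypothetical sample: the 01:00-01:59 hour is repeated, then a long gap follows.
sample = pd.to_datetime([
    "2012-10-28 00:59", "2012-10-28 01:00", "2012-10-28 01:59",
    "2012-10-28 01:00", "2012-10-28 01:59", "2012-10-28 02:00",
    "2012-10-29 09:00",
])
print(find_backward_jumps(sample))
```

This only locates the autumn change; the spring change is still an ordinary one-hour gap and cannot be told apart from missing data this way.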

Tags: python pandas python-dateutil

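【Solution 1】:

If you take the first and last duplicate values of each year and shift the data between them by one hour, that should be the simplest way to correct the problem. You obviously have to take into account that the first data point starts in daylight saving time.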

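【Discussion】:

The first one isn't a duplicate but a 1-hour gap, because that is when the clocks go forward. I could do this, but it wouldn't be very robust, because it would latch onto any 1-hour gap that happens to exist in the data.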
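
One possible way around the robustness concern, sketched here rather than taken from the thread: parse the timestamps as naive local times and let pandas resolve the clock changes itself, using tz_localize with ambiguous="infer" (available in current pandas versions) followed by tz_convert to UTC. This assumes the rows are in file order and that each repeated autumn hour really appears twice in the data. The layout DD.MM.YYYY HH:MM is a guess; only the field positions come from the question's parser, not the separators. Europe/London is the same zone as the GB file used in the question, and local_to_utc is a hypothetical helper name.

```python
import pandas as pd

def local_to_utc(stamps, tz="Europe/London", fmt="%d.%m.%Y %H:%M"):
    """Convert naive local-time strings (in file order) to a UTC DatetimeIndex."""
    # Parse with an explicit format instead of calling a Python function per row.
    naive = pd.to_datetime(pd.Index(stamps), format=fmt)
    # "infer" uses the row order to decide which copy of a repeated autumn
    # hour is summer time and which is winter time.
    local = naive.tz_localize(tz, ambiguous="infer")
    # UTC has no daylight saving, so the repeated hour becomes two distinct hours.
    return local.tz_convert("UTC")

# Hypothetical sample around the 2012 autumn clock change in the UK.
sample = [
    "28.10.2012 00:59", "28.10.2012 01:00", "28.10.2012 01:59",
    "28.10.2012 01:00", "28.10.2012 01:59", "28.10.2012 02:00",
]
utc_index = local_to_utc(sample)
print(utc_index.is_unique)
```

If the index has been read in as plain strings, something like df.index = local_to_utc(df.index) should then give a UTC index without the repeated hour. The spring gap needs no special handling, since those local times never appear in the data.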