我按照@Bahman Engheta 的回答创建了一个函数来从数据框中省略日期。
import pandas as pd
from datetime import datetime, timedelta
def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
'''
Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
near a given list of dates and/or a date on a weekend.
Parameters:
----------
df : Pandas dataframe
list_years : list of str
Contains a list of years in string form
list_dates : list of str
Contains a list of dates in string form encoded as MM-DD
omit_days_near : int
Threshold of days away from list_dates to remove. For example, if
omit_days_near=3, then omit all days that are 3 days away from
any date in list_dates.
omit_weekends : bool
If true, omit dates that are on weekends.
Returns:
-------
Pandas dataframe
New resulting dataframe with dates omitted.
'''
if not isinstance(df, pd.core.frame.DataFrame):
raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)
if not isinstance(df.index, pd.tseries.index.DatetimeIndex):
raise ValueError("Dataframe is expected to have an index of DateTimeIndex, not %s" %
type(df.index).__name__)
if not isinstance(list_years, list):
list_years = [list_years]
if not isinstance(list_dates, list):
list_dates = [list_dates]
result = df.copy()
if omit_weekends:
result = result.loc[result.index.dayofweek < 5]
omit_dates = [ '%s-%s' % (year, date) for year in list_years for date in list_dates ]
for date in omit_dates:
result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date()) > timedelta(omit_days_near)]
return result
这里是示例用法。假设您有一个包含 DateTimeIndex 和其他列的数据框,如下所示:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index = range)
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
生成的数据框如下所示:
value
2017-12-01 42
2017-12-02 35
2017-12-03 49
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-08 57
2017-12-09 3
2017-12-10 57
2017-12-11 46
2017-12-12 20
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-16 57
2017-12-17 4
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-23 45
2017-12-24 34
2017-12-25 42
2017-12-26 33
2017-12-27 17
2017-12-28 2
2017-12-29 2
2017-12-30 51
2017-12-31 19
2018-01-01 6
2018-01-02 43
2018-01-03 11
2018-01-04 45
2018-01-05 45
现在,让我们指定要删除的日期。我想删除日期“12-10”、“12-25”、“12-31”和“01-01”(遵循 MM-DD 表示法)以及这些日期 2 天内的所有日期。此外,我想从“2016”和“2017”这两个年份中删除这些日期。我还想删除周末日期。
我会这样调用我的函数:
years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)
结果是:
value
2017-12-01 42
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-28 2
2018-01-03 11
2018-01-04 45
2018-01-05 45
这个答案正确吗?以下是 2017 年 12 月和 2018 年 1 月的日历:
December 2017
Su Mo Tu We Th Fr Sa
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
January 2018
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
看起来很有效。