Drop all rows that aren't the beginning or end of a week

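【Question title】: Drop All Rows That Aren't The Beginning or End of Week
【Posted】: 2020-07-12 11:08:10
【Question】:

I have a dataframe of stock data with dates as the index column. What I want to do is drop all rows that aren't the beginning or end of a week, effectively leaving me with a dataframe of (mostly) Mondays and Fridays. The trick is, I don't want to look only for Mondays and Fridays, because some weeks are short and start on a Tuesday or end on a Thursday (or something else; maybe some weeks have a Wednesday off, too?).

My current logic (with reproducible code) for dropping all rows that aren't the beginning of a week is below: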

import pandas_datareader.data as web
import pandas as pd
from datetime import datetime, timedelta

# Import a stock dataset from Yahoo
ticker = 'SPY'
start = datetime(2010, 1, 1)
end = datetime.today().strftime('%Y-%m-%d')
# Download the df
df = web.DataReader(ticker, 'yahoo', start, end)

# Drop the Adj Close column for now
df = df.drop(['Adj Close'], axis=1)
print(df)

# Keep only the first trading day of each week
print('Checking for beginnings of weeks...')
df = df.reset_index()  # Make the date index an actual column again for now
df['week_day_objects'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')  # make sure the dates are datetime objects
for i in range(len(df)-1, 0, -1):  # start at the bottom of the DF and work backwards (row 0 is never tested, so it is always kept)
    if df['week_day_objects'].iloc[i] > df['week_day_objects'].iloc[i-1] + timedelta(days=2):  # first day of a week is always > 2 days after the previous date, holidays included
        continue  # if this row is the start of a week, keep it...
    else:
        df = df.drop([df.index[i]])  # ...else, drop it

df = df.set_index(['Date'])  # make the date column the index again
df = df.drop(['week_day_objects'], axis=1)  # drop the helper datetime column now

# For review
df.to_csv('./Check_Week_Days.csv', index=True)

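...However, I am stuck trying to work Fridays (or rather, the ends of weeks) into this solution as well. I'm not even sure this is the best approach, so I'm open to suggestions. The logic above simply looks for any date that is at least 3 days after the previous row. Such a row is the beginning of a week, because a new trading week always starts at least 3 days after the last trading day of the previous week.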
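The gap test described above can be applied in both directions without a loop. The following is only a minimal sketch, not the asker's code: the function name `week_edges_by_gap`, the `min_gap_days` parameter, and the sample frame are made up for illustration, and it assumes a sorted `DatetimeIndex` in which any gap of 3 or more calendar days marks a week boundary.

```python
import pandas as pd

def week_edges_by_gap(df, min_gap_days=3):
    """Keep rows that start or end a run of dates separated by a long gap.

    Assumes df has a sorted DatetimeIndex (hypothetical helper, for illustration).
    """
    dates = df.index.to_series()
    gap = pd.Timedelta(days=min_gap_days)
    gap_before = dates - dates.shift(1)   # NaT for the first row
    gap_after = dates.shift(-1) - dates   # NaT for the last row
    # The first and last rows have no neighbour on one side; treat them as edges.
    is_start = gap_before.isna() | (gap_before >= gap)
    is_end = gap_after.isna() | (gap_after >= gap)
    return df[(is_start | is_end).to_numpy()]

# Made-up sample: three weeks of business days, with Monday 2011-01-17 removed.
sample_dates = pd.bdate_range("2011-01-10", "2011-01-28")
sample_dates = sample_dates[sample_dates != pd.Timestamp("2011-01-17")]
sample = pd.DataFrame({"Close": range(len(sample_dates))}, index=sample_dates)
print(week_edges_by_gap(sample))
```

A midweek closure of two or more consecutive trading days would also produce a 3-day gap, so under this assumption such a closure is treated as a week boundary.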

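Some clarification, as requested. As I mentioned above, I don't want to simply drop all rows that aren't a Friday or a Monday, because some weeks are short: a week can start on a Tuesday or end on a Thursday, and I don't want to lose those rows. What I want to end up with is a dataframe that contains, for each week, the row for its first trading day and the row for its last trading day, whether those fall on a Monday and a Friday, on a Tuesday, or on a Thursday. So the final dataset should look like this:

Note that most weeks run Monday to Friday, but the 18th is a Tuesday because the 17th was a holiday that year. I don't want to sync against a holiday calendar; I just want to drop all the in-between days between whatever trading day starts the week and whatever trading day ends it. Hope that helps?

Thanks!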
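One way to get the result described in the question without a holiday calendar is to label every row with the Monday of its calendar week and keep the first and last row carrying each label. This is a hedged sketch rather than anything from the thread: the helper name `first_last_per_week` and the sample data are hypothetical, and it assumes a sorted `DatetimeIndex` and Monday-to-Sunday weeks.

```python
import pandas as pd

def first_last_per_week(df):
    """Keep the first and last row of each Monday-to-Sunday calendar week.

    Assumes df has a sorted DatetimeIndex (hypothetical helper, for illustration).
    """
    idx = df.index
    # Monday of the week each date falls in (dayofweek: Monday=0 ... Sunday=6).
    week_start = idx.normalize() - pd.to_timedelta(idx.dayofweek, unit="D")
    is_first = ~week_start.duplicated(keep="first")  # first row seen for each week
    is_last = ~week_start.duplicated(keep="last")    # last row seen for each week
    return df[is_first | is_last]

# Made-up sample: three weeks of business days, with Monday 2011-01-17 removed.
sample_dates = pd.bdate_range("2011-01-10", "2011-01-28")
sample_dates = sample_dates[sample_dates != pd.Timestamp("2011-01-17")]
sample = pd.DataFrame({"Close": range(len(sample_dates))}, index=sample_dates)
print(first_last_per_week(sample))
```

Because the grouping depends only on the calendar week, a week that starts on a Tuesday or ends on a Thursday is handled the same way as a full week, and a week with a single trading day keeps that one row.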

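【Comments】:

Wouldn't you want to merge it with a trading holiday calendar?
@rpanai No. The details are in the question.
Could you please provide an mcve? In particular, the data must be posted as text. It would also be great if you could show us the expected output.
@rpanai See my edit above.

Tags: python pandas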

【Solution 1】:

You can use the dayofweek attribute of the datetime values (through the .dt accessor) to select rows and then drop them by index.

import numpy as np
import pandas as pd

dates_df = pd.DataFrame(np.arange(np.datetime64('2000-01-03'), np.datetime64('2000-01-25')), columns=['date'])
dates_df.drop(dates_df[dates_df['date'].dt.dayofweek == 6].index)

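The snippet above drops all the Sunday values (dayofweek == 6; Monday is 0).

But you can also select the data that matches the first or last day of the week instead of dropping it:

# Monday is 0 and Friday is 4
dates_df[(dates_df['date'].dt.dayofweek == 0) | (dates_df['date'].dt.dayofweek == 4)]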

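【Comments】:

Thanks, but like I mentioned, I'm not looking only for Mondays/Fridays, because some weeks have no Friday or Monday when the market is closed that day for a holiday. So some weeks start on a Tuesday, and some even start on a Wednesday (e.g. around Christmas). That means I can't just drop all rows that aren't a Monday or a Friday. Does that make sense? The details are in the question. Thanks for your input!
@MattWilson People are trying to help, and you just say the details are in the question. Would you consider that perhaps your question is not that clear?
@rpanai The question has been updated above with more clarification and reproducible code for you. Thanks.

【Solution 2】:

I have solved this using the day-of-week numbers, with the following code: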

# Keep only the first and last trading day of each week
print('Checking for beginnings and ends of weeks...')
df = df.reset_index()  # Make the date index an actual column again for now
df['week_day_objects'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').dt.dayofweek  # day-of-week number (Monday=0 ... Sunday=6)
for i in range(len(df)-2, 1, -1):  # start near the bottom of the DF and work backwards. Need to trim the top/bottom rows accordingly later.
    day = df['week_day_objects'].iloc[i]
    prev_day = df['week_day_objects'].iloc[i-1]
    next_day = df['week_day_objects'].iloc[i+1]
    # A beginning of the week always has a day-of-week number lower than both the row before it and the row after it...
    is_week_start = day < prev_day and day < next_day
    # ...and an end of the week always has a number greater than both the row before it and the row after it.
    is_week_end = day > prev_day and day > next_day
    if is_week_start or is_week_end:
        continue  # if this row is the start or end of a week, keep it...
    else:
        df = df.drop([df.index[i]])  # ...else, drop all rows that aren't at the beginning/end of a week

df = df.set_index(['Date']) # make the date column the index again
df = df.drop(['week_day_objects'], axis=1) # drop the datetime column now

# For review
df.to_csv('./Check_Week_Days.csv', index=True)

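So the start of a week always has a lower day-of-week number than the previous row, and also a lower number than the next row. Reverse that for the end of a week. This works no matter what the first or last day of the week is, whether the week ends on a Thursday or starts on a Tuesday.

The loop does not cover the very top and bottom rows of the dataframe, so some cleanup is still needed there, but I will write separate code to handle that.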
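The neighbour comparison described above can also be written with `shift` instead of a loop, which takes care of the top and bottom rows at the same time. This is a simplified variant, not the answer's exact rule: it flags a row as a week start when its day-of-week number is lower than the previous row's, and as a week end when it is higher than the next row's. The name `week_edges_by_weekday` and the sample data are hypothetical, and it assumes a sorted `DatetimeIndex` with no calendar week missing entirely.

```python
import pandas as pd

def week_edges_by_weekday(df):
    """Keep rows whose weekday number drops vs. the previous row or exceeds the next row.

    Assumes df has a sorted DatetimeIndex (hypothetical helper, for illustration).
    """
    dow = pd.Series(df.index.dayofweek, index=df.index)  # Monday=0 ... Sunday=6
    prev_dow = dow.shift(1)    # NaN for the first row
    next_dow = dow.shift(-1)   # NaN for the last row
    # The first row has no previous day and the last row has no next day; keep both.
    is_start = prev_dow.isna() | (dow < prev_dow)
    is_end = next_dow.isna() | (dow > next_dow)
    return df[(is_start | is_end).to_numpy()]

# Made-up sample: three weeks of business days, with Monday 2011-01-17 removed.
sample_dates = pd.bdate_range("2011-01-10", "2011-01-28")
sample_dates = sample_dates[sample_dates != pd.Timestamp("2011-01-17")]
sample = pd.DataFrame({"Close": range(len(sample_dates))}, index=sample_dates)
print(week_edges_by_weekday(sample))
```

Comparing against one neighbour at a time, rather than requiring both comparisons to hold, also keeps a week that contains only a single trading day.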

【Comments】: