【Question title】: Filter pandas stock-ticker dataframe by first day in month of Jan
【Posted】: 2018-01-28 18:12:56
【Question】:

Apologies, I'm very new to Python.

I currently have this code:

# Put data into a dataframe
df = pd.DataFrame(ZACKSP_raw_data)

""" Reformat dataframe data """    
# Change exchange from NSDQ to NASDAQ
df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')

# Change date format to DD/MM/YYYY
df['date'] = df['date'].dt.strftime('%d/%m/%Y')

# Round closing share price to 2 digits
df['close'] = df['close'].round(2)

# Filter data for Jan 
ZACKSP_data_StartOfJanYearMinus1 = df[df['date'] == '05/01/%s' % CurrentYearMinus1]

# Test
print(ZACKSP_data_StartOfJanYearMinus1.head())

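It returns data in the following format:

Now I want the array to keep only the first closing price recorded in January and the last closing price recorded in December (for each ticker). I thought about using a wildcard for the day and then something like head() or tail() to achieve this, but I'm struggling. Any ideas?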

【Comments】:

    Tags: python pandas group-by ticker quandl


    【Solution 1】:

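    Solution when all datetimes are already sorted:

    I think you need concat with drop_duplicates to get the first and last rows for each ticker.

    You also need to add a new year column, so that the first and last values are taken per year and ticker.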

    df['year'] = pd.to_datetime(df['date']).dt.year
    
    df1 = pd.concat([df.drop_duplicates(['ticker', 'year']), 
                     df.drop_duplicates(['ticker', 'year'], keep='last')])  
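
    The snippet above relies on the rows already being in date order within each ticker. A minimal sketch of how that assumption could be checked before using drop_duplicates is given here; the sample data and the name sorted_per_ticker are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: BA is in date order, AAPL is not.
df = pd.DataFrame({
    'ticker': ['BA', 'BA', 'AAPL', 'AAPL'],
    'date': pd.to_datetime(['2017-01-04', '2017-12-27',
                            '2017-12-22', '2017-01-05']),
    'close': [4.56, 4.78, 3.45, 9.32],
})

# True for each ticker whose dates are in non-decreasing order.
sorted_per_ticker = df.groupby('ticker')['date'].apply(
    lambda s: s.is_monotonic_increasing)
print(sorted_per_ticker)

# If any ticker is out of order, sort first so that drop_duplicates
# keeps the earliest (keep='first') or latest (keep='last') row.
if not sorted_per_ticker.all():
    df = df.sort_values(['ticker', 'date'])
```

    If every ticker is reported as sorted, the shorter solution should be enough; otherwise the more general one with sort_values applies.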
    

    A more general solution for unsorted datetimes:

    import pandas as pd

    c = ['ticker','exchange','date','close']
    df = pd.DataFrame({'date':pd.to_datetime(['2017-01-04','2017-01-12',
                                              '2017-01-05',
                               '2018-01-02','2018-12-27','2017-12-27',
                               '2018-01-05','2018-01-12','2017-01-05',
                               '2017-01-12','2018-12-22','2017-12-22']),
                       'close':[4.56,5.45,4.32,5.67,5.23,4.78,7.43,8.67,
                                9.32,4.73,2.42,3.45],
                       'ticker':['BA','BA','BA','BA','BA','BA',
                                 'AAPL','AAPL','AAPL','AAPL','AAPL','AAPL'],
                        'exchange':['NYSE'] * 6 + ['NSDQ'] * 6}, columns=c)
    
    print (df)
       ticker exchange       date  close
    0      BA     NYSE 2017-01-04   4.56
    1      BA     NYSE 2017-01-12   5.45
    2      BA     NYSE 2017-01-05   4.32
    3      BA     NYSE 2018-01-02   5.67
    4      BA     NYSE 2018-12-27   5.23
    5      BA     NYSE 2017-12-27   4.78
    6    AAPL     NSDQ 2018-01-05   7.43
    7    AAPL     NSDQ 2018-01-12   8.67
    8    AAPL     NSDQ 2017-01-05   9.32
    9    AAPL     NSDQ 2017-01-12   4.73
    10   AAPL     NSDQ 2018-12-22   2.42
    11   AAPL     NSDQ 2017-12-22   3.45
    

    """ Reformat dataframe data """    
    # Change exchange from NSDQ to NASDAQ
    df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
    
    # Round closing share price to 2 digits
    df['close'] = df['close'].round(2)
    
    #sort by date so that, per ticker and year, the first row is the first day in Jan and the last row is the last day in Dec
    df = df.sort_values('date')
    
    #extract years from dates
    df['year'] = pd.to_datetime(df['date']).dt.year
    
    #get first rows per tickers and year
    df1 = df.drop_duplicates(['ticker', 'year'])
    print (df1)
      ticker exchange       date  close  year
    0     BA     NYSE 2017-01-04   4.56  2017
    8   AAPL   NASDAQ 2017-01-05   9.32  2017
    3     BA     NYSE 2018-01-02   5.67  2018
    6   AAPL   NASDAQ 2018-01-05   7.43  2018
    
    #get last row per year and ticker
    df2 = df.drop_duplicates(['ticker', 'year'], keep='last')
    print (df2)
       ticker exchange       date  close  year
    11   AAPL   NASDAQ 2017-12-22   3.45  2017
    5      BA     NYSE 2017-12-27   4.78  2017
    10   AAPL   NASDAQ 2018-12-22   2.42  2018
    4      BA     NYSE 2018-12-27   5.23  2018
    

    #join DataFrames together and sorting if necessary
    df = pd.concat([df1, df2]).sort_values(['ticker','date'])
    print (df)
       ticker exchange       date  close  year
    8    AAPL   NASDAQ 2017-01-05   9.32  2017
    11   AAPL   NASDAQ 2017-12-22   3.45  2017
    6    AAPL   NASDAQ 2018-01-05   7.43  2018
    10   AAPL   NASDAQ 2018-12-22   2.42  2018
    0      BA     NYSE 2017-01-04   4.56  2017
    5      BA     NYSE 2017-12-27   4.78  2017
    3      BA     NYSE 2018-01-02   5.67  2018
    4      BA     NYSE 2018-12-27   5.23  2018
    

    Another solution, with a different output format, that aggregates with first and last:

    """ Reformat dataframe data """    
    # Change exchange from NSDQ to NASDAQ
    df['exchange'] = df['exchange'].str.replace('NSDQ','NASDAQ')
    
    # Round closing share price to 2 digits
    df['close'] = df['close'].round(2)
    
    #sort by date so that, per ticker and year, the first row is the first day in Jan and the last row is the last day in Dec
    df = df.sort_values('date')
    
    #extract years from dates
    df['year'] = pd.to_datetime(df['date']).dt.year
    
    df = (df.groupby(['ticker','year'])['close']
           .agg(['first','last'])
           .reset_index())
    print (df)
      ticker  year  first  last
    0   AAPL  2017   9.32  3.45
    1   AAPL  2018   7.43  2.42
    2     BA  2017   4.56  4.78
    3     BA  2018   5.67  5.23
    

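    【Comments】:

    • OK, this is different from what I expected, but I like the resulting output. I have two follow-up questions: 1. How do I filter by year? I want to include all years greater than one defined by a variable, or all years that match an array containing 5 years.
    • 1. You can create the column with df['year'] = pd.to_datetime(df['date']).dt.year and then filter with df[df['year'] > 2016]. This is called boolean indexing.
    • 2. I would like a column for each year's first and last value, named 2017-first, 2017-last, 2018-first, 2018-last. What method would I use to achieve this?
    • The year is defined as follows: now = datetime.datetime.now(); current_year = str(now.year)  # find current year
    • 2. For the output of solution 1, after df1 = df.drop_duplicates(['ticker', 'year']) add df1['year'] = df1['year'].astype(str) + 'first', and do the same for df2.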
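
    The two follow-up questions in the comments above can be sketched as follows. This is only an illustration under assumptions: the frame agg mimics the aggregated first/last output, and the names min_year, current_year and wanted_years are hypothetical. Note that the year column holds integers, so the year to compare against should be an int rather than the str(now.year) used in the comment. The wide layout uses DataFrame.pivot, which is one possible approach and not the one suggested in the comments.

```python
import pandas as pd

# Hypothetical data in the shape of the aggregated first/last output.
agg = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'BA', 'BA'],
    'year': [2017, 2018, 2017, 2018],
    'first': [9.32, 7.43, 4.56, 5.67],
    'last': [3.45, 2.42, 4.78, 5.23],
})

# 1a. Keep years greater than a variable (boolean indexing).
min_year = 2017  # assumed threshold
after = agg[agg['year'] > min_year]

# 1b. Keep years that appear in a list of five years.
current_year = 2018  # e.g. datetime.datetime.now().year, kept as an int
wanted_years = list(range(current_year - 4, current_year + 1))
recent = agg[agg['year'].isin(wanted_years)]

# 2. One row per ticker, one column per year and statistic.
wide = agg.pivot(index='ticker', columns='year', values=['first', 'last'])
# The columns are (statistic, year) pairs; flatten them to '2017-first' etc.
wide.columns = ['%s-%s' % (year, stat) for stat, year in wide.columns]
wide = wide.sort_index(axis=1).reset_index()
print(wide)
```

    With this sample data, wide has the columns ticker, 2017-first, 2017-last, 2018-first and 2018-last, with one row per ticker.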
    【Solution 2】:

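    You want df.groupby('ticker'), then group by month, filter for month == 'Dec' and take tail(), filter for month == 'Jan' and take head(), then ungroup().

    (If you post reproducible data, I'll post code that does this.)

    Read the pandas documentation on Group By: split-apply-combine, one of the key paradigms in data science. For examples on SO, see the tag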
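
    The answer gives no code, so the following is only one possible reading of the steps above, with made-up sample data. It assumes a datetime date column and uses numeric months (1 and 12) instead of 'Jan' and 'Dec'. pandas has no ungroup() method; head() and tail() on a groupby already return ordinary rows, so no separate ungroup step is needed here.

```python
import pandas as pd

# Hypothetical sample data, deliberately unsorted.
df = pd.DataFrame({
    'ticker': ['BA', 'BA', 'BA', 'BA', 'AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'date': pd.to_datetime(['2017-01-12', '2017-01-04', '2017-12-27', '2017-06-01',
                            '2017-01-05', '2017-12-22', '2017-12-01', '2017-03-15']),
    'close': [5.45, 4.56, 4.78, 5.00, 9.32, 3.45, 3.10, 8.00],
})

# Sort by date so head(1) is the earliest row and tail(1) the latest.
df = df.sort_values('date')
df['year'] = df['date'].dt.year
month = df['date'].dt.month

# First January row and last December row per ticker and year.
jan_first = df[month == 1].groupby(['ticker', 'year']).head(1)
dec_last = df[month == 12].groupby(['ticker', 'year']).tail(1)

result = (pd.concat([jan_first, dec_last])
            .sort_values(['ticker', 'date'])
            .reset_index(drop=True))
print(result)
```

    Unlike taking the first and last row of each year, this filters on the month explicitly, so a ticker with no January or December rows simply contributes nothing for that month.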

    【Comments】:
