【Question title】: Remove outlier from time series data using pandas
【Posted】: 2020-10-12 05:28:19
【Question description】:

I have one-minute data:

# Import data
import yfinance as yf
data = yf.download(tickers="MSFT", period="7d", interval="1m")
print(data.tail())

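I want to remove observations whose minute-to-minute difference is greater than the daily difference, where the day is the date of the minute bar. I want to apply this rule to every column except Volume. Start of the code: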

minute_diff = data.diff()
daily_diff = data.resample('D').last().diff().median()

# here, remove rows from data where minute_diff is greater than daily_diff

【Question discussion】:

    Tags: python pandas outliers


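    【Solution 1】:
    # Minute-to-minute differences, with the datetime index moved into a column
    minute_diff = data.diff().reset_index()
    # Median of the day-over-day differences of each column's last daily value
    daily_diff = data.resample('D').last().diff().median()
    
    # Columns to test; assumes the index column is named 'Datetime'
    cols = minute_diff.columns.to_list()
    cols.remove('Datetime')
    
    # Keep rows whose difference is within the daily value (NaN differences are kept)
    for c in cols:
      minute_diff = minute_diff[(minute_diff[c] <= daily_diff[c]) | (minute_diff[c].isnull())]
    
    # Select the surviving timestamps from the original data
    data = data.loc[minute_diff['Datetime']]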
    

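    【Discussion】:

    • I may have mixed up your code with mine. It should test whether the minute difference is greater than the daily difference for that specific date, so the daily difference is a vector indexed by date.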
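The per-date comparison described in this comment could be sketched roughly as follows. This is only an illustration on synthetic data, not the asker's or the answerer's code: the `Close` values, the dates, and the names `threshold` and `keep` are made up for the example, and keeping rows that have no previous day to compare against is an assumption.

```python
import pandas as pd

# Synthetic one-minute closes over three days (a stand-in for the yfinance frame).
days = ["2020-10-05", "2020-10-06", "2020-10-07"]
idx = pd.DatetimeIndex(
    [pd.Timestamp(d) + pd.Timedelta(minutes=570 + m) for d in days for m in range(5)]
)
close = [100.0, 100.1, 100.2, 100.1, 100.0,   # day 1
         100.5, 100.6, 103.0, 100.7, 101.0,   # day 2, with a spike at 09:32
         101.2, 101.1, 101.3, 101.2, 101.5]   # day 3
data = pd.DataFrame({"Close": close}, index=idx)

# Absolute minute-to-minute change.
minute_diff = data["Close"].diff().abs()

# Absolute day-over-day change of the last close: one value per date.
daily_diff = data["Close"].resample("D").last().dropna().diff().abs()

# Look up each minute bar's own date in the daily vector.
threshold = pd.Series(
    daily_diff.reindex(data.index.normalize()).to_numpy(), index=data.index
)

# Assumption: rows without a usable difference or threshold are kept.
keep = minute_diff.isna() | threshold.isna() | (minute_diff <= threshold)
filtered = data[keep]
print(filtered)
```

Note that removing a spike this way also removes the following bar, because the move back to the normal level is just as large as the spike itself.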
    【Solution 2】:
    import pandas as pd
    import yfinance as yf
    
    # Import data
    data = yf.download(tickers="MSFT", period="7d", interval="1m")
    
    data_minute = data.copy()
    # Calendar date (midnight timestamp) of each minute bar
    data_minute['Date'] = data_minute.index.normalize()
    # Difference of current close minus previous close
    data_minute['Minute Close Difference'] = data_minute['Close'].diff()
    
    # Convert minute data to daily data
    data_daily = data_minute.resample('D').agg({'Open': 'first',
                                                'High': 'max',
                                                'Low': 'min',
                                                'Close': 'last',
                                                'Adj Close': 'last',
                                                'Volume': 'sum'
                                               })
    # Drop days without trades (weekends, holidays) so the difference spans trading days
    data_daily = data_daily.dropna(subset=['Close'])
    
    # Difference of current daily close minus previous daily close
    data_daily['Daily Close Difference'] = data_daily['Close'].diff()
    
    # Attach each day's difference to its minute bars
    data_minute = pd.merge(data_minute, data_daily['Daily Close Difference'], how='left', left_on='Date', right_index=True)
    # Keep rows whose absolute minute change is within the absolute daily change
    data_minute = data_minute[data_minute['Minute Close Difference'].abs() <= data_minute['Daily Close Difference'].abs()]
    data_minute
    

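    【Discussion】:

    • A difference is the difference between the current value and the lagged value. Also, the final filter should keep only the observations whose deviation is below the absolute value of the daily deviation. And no, it should be applied to all columns.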
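One way to read this comment is as a filter over every price column at once, leaving Volume out as the question asks. The following is a hedged sketch on synthetic data, not code from the thread: the `Open`/`Close`/`Volume` values and the names `price_cols`, `threshold` and `within` are invented for the example, and it assumes a row is dropped as soon as any price column exceeds its own day's absolute difference.

```python
import pandas as pd

# Synthetic one-minute bars over three days (a stand-in for the yfinance frame).
days = ["2020-10-05", "2020-10-06", "2020-10-07"]
idx = pd.DatetimeIndex(
    [pd.Timestamp(d) + pd.Timedelta(minutes=570 + m) for d in days for m in range(4)]
)
data = pd.DataFrame(
    {
        "Open":   [9.9, 10.0, 10.1, 10.0, 10.2, 10.4, 10.5, 10.6, 10.7, 10.8, 12.5, 10.9],
        "Close":  [10.0, 10.1, 10.0, 10.2, 10.4, 10.5, 10.6, 10.7, 10.8, 12.5, 10.9, 11.0],
        "Volume": [100, 5000, 120, 90, 80, 7000, 60, 75, 110, 9000, 95, 105],
    },
    index=idx,
)

# Every column except Volume takes part in the rule.
price_cols = data.columns.drop("Volume")

# Absolute lagged differences per minute and per trading day.
minute_diff = data[price_cols].diff().abs()
daily_diff = data[price_cols].resample("D").last().dropna(how="all").diff().abs()

# Broadcast each date's daily difference onto that date's minute bars.
threshold = daily_diff.reindex(data.index.normalize())
threshold.index = data.index

# Assumption: missing differences or thresholds do not disqualify a row.
within = minute_diff.le(threshold) | minute_diff.isna() | threshold.isna()

# A row survives only if all price columns are within their daily bound.
filtered = data[within.all(axis=1)]
print(filtered)
```

Large jumps in `Volume` have no effect here because that column never enters `minute_diff`; it is still carried along in `filtered`.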
    【Solution 3】:

    I found a solution:

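    import numpy as np
    import pandas as pd
    
    # Daily differences of the last value per day, scaled by a factor of 25
    daily_diff = data.resample('D').last().dropna().diff() * 25
    daily_diff['diff_date'] = daily_diff.index.strftime('%Y-%m-%d')
    # Minute differences, tagged with the same date key
    data_test = data.diff()
    data_test['diff_date'] = data_test.index.strftime('%Y-%m-%d')
    # Join on the date; '_x' columns are minute values, '_y' columns are daily values
    # (assumes the price columns were renamed to lowercase, e.g. 'close')
    data_test_diff = pd.merge(data_test, daily_diff, on='diff_date')
    data_test_final = data_test_diff.loc[(np.abs(data_test_diff['close_x']) < np.abs(data_test_diff['close_y']))]
    data_test_final['close_x'].plot()
    # Boolean mask applied positionally to the original data
    indexer = (np.abs(data_test_diff['close_x']) < np.abs(data_test_diff['close_y']))
    data_final = data.loc[indexer.values, :]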
    

    【Discussion】:
