【Question title】: Time Series Dataframe Groupby 3d Array - observation/row count - For LSTM
【Posted】: 2021-01-23 01:37:21
【Question description】:

I have a time series structured as shown below, with an identifier column and two value columns (floats).

The dataframe, simply called df:

Date          Id    Value1    Value2
2014-10-01     A      1.1       1.2
2014-10-01     B      1.3       1.4
2014-10-02     A      1.5       1.6
2014-10-02     B      1.7       1.8
2014-10-03     A      3.2       4.8
2014-10-03     B      8.2       10.1
2014-10-04     A      6.1       7.2
2014-10-04     B      4.3       4.1 
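
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: parsing `Date` with `pd.to_datetime` is an assumption, since the real column may just as well hold plain strings.

```python
import pandas as pd

# Rebuild the example table; the datetime parsing of Date is an assumption.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-10-01", "2014-10-01", "2014-10-02", "2014-10-02",
        "2014-10-03", "2014-10-03", "2014-10-04", "2014-10-04",
    ]),
    "Id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "Value1": [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    "Value2": [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
})

print(df)
```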

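What I want to do is turn it into an array that is grouped by the identifier column and uses rolling windows of 3 observations, so I would end up with something like this: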

[[[1.1 1.2]
  [1.5 1.6]   '----> ID A 10/1 to 10/3'
  [3.2 4.8]]

 [[1.3  1.4]
  [1.7  1.8]   '----> ID B 10/1 to 10/3'
  [8.2 10.1]]

 [[1.5 1.6]
  [3.2 4.8]   '----> ID A 10/2 to 10/4'
  [6.1 7.2]] 
  
 [[1.7  1.8]
  [8.2 10.1]  '----> ID B 10/2 to 10/4'
  [4.3  4.1]]]

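Please ignore the quoted parts in the array above, of course, but hopefully you get the idea. My real dataset is much larger and has more identifiers, and the number of observations per window may need to change, so I can't hard-code row counts. So far, the direction I have been leaning toward is to get the unique values of the Id column, create a temporary df for each one, and iterate over it 3 rows at a time. It seems like there should be a better and faster way to do this.

Pseudocode:

unique_ids = df['Id'].unique().tolist()

for id in unique_ids:
    temp_df = df.loc[df['Id'] == id]

The part I am stuck on is the best way to iterate over temp_df.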

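The final output will feed an LSTM model; however, most other solutions I have found don't need to handle the groupby aspect the way the 'Id' column requires here.

【Question comments】:

    Tags: python arrays pandas time-series lstm


    【Solution 1】:

    This is what I ended up doing for a solution. It's not the prettiest, but then again my question wasn't winning any beauty contests to begin with.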

    import numpy as np

    # array_steps_df is the dataframe from the question (df)
    id_list = array_steps_df['Id'].unique().tolist()
    
    # change number of steps as needed
    step = 3
    
    column_list = ['Value1', 'Value2']
    
    master_list = []
    
    for id in id_list:
        master_dict = {}
        for column in column_list:
            array_steps_id_df = array_steps_df.loc[array_steps_df['Id'] == id]
            array_steps_id_df = array_steps_id_df[[column]].values
    
            master_dict[column] = []
    
            for obs in range(len(array_steps_id_df)-step+1):
                start_obs = obs + step
                master_dict[column].append(array_steps_id_df[obs:start_obs,])
        master_list.append(master_dict)
    
    
    
    # concatenate the per-id window lists for each column
    value1_array_init = []
    value2_array_init = []
    for dic in master_list:
        value1_array_init += dic['Value1']
        value2_array_init += dic['Value2']
            
    value1_array = np.array(value1_array_init)
    value2_array = np.array(value2_array_init)
    
    # one row per window; take the window count from the arrays themselves
    all_array = np.hstack((value1_array, value2_array)).reshape((len(value1_array),
                                                                 len(column_list),
                                                                 step)).transpose(0, 2, 1)
    

    Fixed: my mistake. I added a transpose at the end and reordered the features and steps in the reshape.
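
To illustrate why the reshape has to go to (windows, features, steps) before the transpose, here is a small stand-alone sketch. The arrays are hand-built stand-ins for what the loops above produce for Id A; the names `value1`, `value2` and `wrong` are only for illustration.

```python
import numpy as np

step, n_features = 3, 2

# Two windows per column, each of shape (step, 1), mimicking the loop output for Id A.
value1 = np.array([[[1.1], [1.5], [3.2]], [[1.5], [3.2], [6.1]]])
value2 = np.array([[[1.2], [1.6], [4.8]], [[1.6], [4.8], [7.2]]])

# hstack joins along axis 1: each window holds all Value1 steps, then all Value2 steps.
stacked = np.hstack((value1, value2))                         # shape (2, 6, 1)

# Reshape to (window, feature, step) first, then swap the last two axes.
by_feature = stacked.reshape(len(value1), n_features, step)   # shape (2, 2, 3)
windows = by_feature.transpose(0, 2, 1)                       # shape (2, 3, 2)

# Reshaping straight to (window, step, feature) mixes the two columns together.
wrong = stacked.reshape(len(value1), step, n_features)

print(windows[0])
print(wrong[0])
```

The first printed window pairs each Value1 with its Value2 per time step, while the second shows Value1 entries leaking into the Value2 position.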

    Thanks to this site for some extra help:

    https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/

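    【Comments】:

      【Solution 2】:

      I ended up redoing this a bit to make it more dynamic with respect to the columns and to keep the time series in order, and I also added a target array so the predictions stay in order. For anyone who needs this: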

      import numpy as np

      def data_to_array_steps(array_steps_df, time_steps, columns_to_array, id_column):
          """
          https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
          :param array_steps_df: the dataframe from the csv (needs 'date' and 'target' columns, sorted by date)
          :param time_steps: how many time steps per window
          :param columns_to_array: which columns to convert to the array
          :param id_column: which column to use as the identifier
          :return: (features, targets): features has shape (samples, time_steps, len(columns_to_array)),
                   targets has shape (samples, 1); samples are ordered by date range, then by identifier
          """
      
          id_list = array_steps_df[id_column].unique().tolist()
          date_list = array_steps_df['date'].unique().tolist()
      
          master_list = []
          target_list = []
      
          missing_counter = 0
          total_counter = 0
      
          # grab date size = time steps at a time and iterate through all of them
          for date in range(len(date_list) - time_steps + 1):
              date_range_test = date_list[date:time_steps+date]
      
              date_range_df = array_steps_df.loc[(array_steps_df['date'] <= date_range_test[-1]) &
                                                 (array_steps_df['date'] >= date_range_test[0])
                                                ]
      
              # for each id do it separately so time series data doesn't get mixed up
              for identifier in id_list:
      
                  # get id in here and then skip if not the required time steps/observations for the id
      
                  date_range_id = date_range_df.loc[date_range_df[id_column] == identifier]
      
                  master_dict = {}
      
                  # skip this id when it doesn't have exactly time_steps observations in the date range
                  if len(date_range_id) != time_steps:
      
                      # the counter isn't strictly needed, but it helps when debugging and does no harm
                      missing_counter += 1
      
                  else:
                      # add the target for the last date in the date range for this id or ticker
                      target = array_steps_df['target'].\
                               loc[(array_steps_df['date'] == date_range_test[-1])
                                 & (array_steps_df[id_column] == identifier)                                     
                                  ].iloc[0]
      
                      target_list.append(target)
      
                      total_counter += 1
      
                      # loop through each column in dataframe
                      for column in columns_to_array:
      
                          date_range_id_value = date_range_id[[column]].values
      
                          master_dict[column] = []
                          master_dict[column].append(date_range_id_value)
      
                      master_list.append(master_dict)
      
          # redo columns to arrays, after they have been ordered and grouped by Id above
          array_list = []
      
          # for each column go through the values in the array create an array for the column then append to list
          for column in columns_to_array:
      
              # collect this column's windows across all samples without mutating master_list
              value_arrays = []
              for dic in master_list:
                  value_arrays += dic[column]
      
              array_list.append(np.array(value_arrays))
      
          # for each value in the array list, horizontally stack each value
          all_array = np.hstack(array_list).reshape((total_counter,
                                                     len(columns_to_array),
                                                     time_steps
                                                     )
                                                   ).transpose(0, 2, 1)
      
          target_array_all = np.array(target_list
                                      ).reshape(len(target_list),
                                                1)
      
          # should probably make this an if condition later after a few more tests
          print('check of length of arrays', len(all_array), len(target_array_all))
      
          return all_array, target_array_all
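
A more compact variant is sketched below, as a rough alternative rather than a drop-in replacement. It assumes NumPy 1.20 or later (for `sliding_window_view`). The helper name `rolling_windows_by_id`, its parameters, and the toy `target` values are all made up for illustration. Unlike the function above, it windows over consecutive rows per id and does not check for gaps in the dates.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_windows_by_id(df, time_steps, feature_cols, id_col, date_col, target_col):
    """Hypothetical helper: per-id rolling windows plus the target at each window's last row."""
    features, targets, end_dates = [], [], []
    # Sort so the rows inside each id are in time order.
    for _, group in df.sort_values([id_col, date_col]).groupby(id_col):
        if len(group) < time_steps:
            continue  # not enough observations for even one window
        values = group[feature_cols].to_numpy()
        # The window axis is appended last: (n_windows, n_features, time_steps).
        windows = sliding_window_view(values, time_steps, axis=0)
        features.append(windows.transpose(0, 2, 1))  # (n_windows, time_steps, n_features)
        targets.append(group[target_col].to_numpy()[time_steps - 1:])
        end_dates.append(group[date_col].to_numpy()[time_steps - 1:])

    if not features:
        return np.empty((0, time_steps, len(feature_cols))), np.empty((0, 1))

    # Order samples by window end date; the stable sort keeps ids sorted within a date.
    order = np.argsort(np.concatenate(end_dates), kind="stable")
    x = np.concatenate(features)[order]
    y = np.concatenate(targets)[order].reshape(-1, 1)
    return x, y


# Toy data matching the question's table; the target values are invented.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-01", "2014-10-01", "2014-10-02", "2014-10-02",
                            "2014-10-03", "2014-10-03", "2014-10-04", "2014-10-04"]),
    "Id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "Value1": [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    "Value2": [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

x, y = rolling_windows_by_id(df, 3, ["Value1", "Value2"], "Id", "date", "target")
print(x.shape, y.shape)
```

On the toy frame this yields four samples in the order asked for in the question (A 10/1 to 10/3, B 10/1 to 10/3, A 10/2 to 10/4, B 10/2 to 10/4). If ids can have missing dates, the per-date-range check in the function above is the safer choice.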
      

      【Comments】:
