【Question title】: Generate missing blocks of data in pandas dataframe
【Posted】: 2018-06-29 22:25:22
【Question description】:

So I have a dataframe like this:

[5232 rows x 2 columns]
                                   0       2
0                                               
2018-02-01 00:00:00  2018-02-01 00:00:00  435.24
2018-02-01 00:30:00  2018-02-01 00:30:00  357.12
2018-02-01 01:00:00  2018-02-01 01:00:00  301.32
2018-02-01 01:30:00  2018-02-01 01:30:00  256.68
2018-02-01 02:00:00  2018-02-01 02:00:00  245.52
2018-02-01 02:30:00  2018-02-01 02:30:00  223.20
2018-02-01 03:00:00  2018-02-01 03:00:00  212.04
2018-02-01 03:30:00  2018-02-01 03:30:00  212.04
2018-02-01 04:00:00  2018-02-01 04:00:00  212.04
2018-02-01 04:30:00  2018-02-01 04:30:00  212.04
2018-02-01 05:00:00  2018-02-01 05:00:00  223.20
2018-02-01 05:30:00  2018-02-01 05:30:00  234.36

What I can currently do is replace a fraction of the values with NaN at random (say, 10%):

df_missing.loc[df_missing.sample(frac=0.1, random_state=100).index, 2] = np.nan

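What I would like to do is the same thing, but with random blocks of size x, so that, say, 10% of the data is replaced by blocks of NaN.

For example, with a block size of 4 and a proportion of 30%, the dataframe above might look like this: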

[5232 rows x 2 columns]
                                   0       2
0                                               
2018-02-01 00:00:00  2018-02-01 00:00:00  435.24
2018-02-01 00:30:00  2018-02-01 00:30:00  357.12
2018-02-01 01:00:00  2018-02-01 01:00:00  NaN
2018-02-01 01:30:00  2018-02-01 01:30:00  NaN
2018-02-01 02:00:00  2018-02-01 02:00:00  NaN
2018-02-01 02:30:00  2018-02-01 02:30:00  NaN
2018-02-01 03:00:00  2018-02-01 03:00:00  212.04
2018-02-01 03:30:00  2018-02-01 03:30:00  212.04
2018-02-01 04:00:00  2018-02-01 04:00:00  212.04
2018-02-01 04:30:00  2018-02-01 04:30:00  212.04
2018-02-01 05:00:00  2018-02-01 05:00:00  223.20
2018-02-01 05:30:00  2018-02-01 05:30:00  234.36

I worked out that I can get the number of blocks with:

number_of_samples = int((df.shape[0] * proportion) / block_size)

But I don't know how to actually create the missing blocks.
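As a quick check of that formula, here is the arithmetic for the 5232-row dataframe above with a 10% target and blocks of 4. The variable names `n_rows`, `rows_covered` and `actual_share` are only illustrative:

```python
n_rows = 5232       # rows in the dataframe shown above
proportion = 0.1    # target share of rows to blank out
block_size = 4      # rows per block

# int() truncates, so the blocks can cover slightly less than the target
number_of_samples = int((n_rows * proportion) / block_size)
rows_covered = number_of_samples * block_size
actual_share = rows_covered / n_rows

print(number_of_samples, rows_covered, round(actual_share, 4))
```

That gives 130 blocks covering 520 rows, so roughly 9.9% of the data rather than exactly 10%.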

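I have looked at this question, which was helpful, but it comes with two caveats:

  1. It does not modify the original dataframe with NaN values; it only returns the samples.
  2. There is no guarantee that the samples will not overlap (I would like to avoid overlap).

Can someone explain how to adapt that answer to address the points above (or explain a different solution)?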

【Question discussion】:

    Tags: python pandas missing-data


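    【Solution 1】:

    This code does the job in a rather inelegant way, using an if statement to check for overlap between blocks. It also uses chain from itertools with argument unpacking (*) to flatten the list of lists into a single list: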

    import pandas as pd
    import random
    import numpy as np
    from itertools import chain
    
    # Example dataframe
    df = pd.DataFrame({0: pd.date_range(start = '2018-02-01 00:00:00',
                                        end = '2018-02-01 10:00:00', freq = '30min'),
                       2: np.random.randn(21)})
    
    # Set basic parameters
    proportion = 0.4
    block_size = 4
    number_of_samples = int((df.shape[0] * proportion) / block_size)
    
    # This will hold all indexes to be set to NaN
    block_indexes = []
    
    i = 0 
    
    # Iterate until number of samples are found
    while i < number_of_samples:
        
        # Choose a potential start and end
        potential_start = random.sample(list(df.index), 1)[0]
        potential_end = potential_start + block_size
        
        # Flatten the list of lists
        flattened_indexes = list(chain(*block_indexes))
        
        # Check to make sure potential start and potential end are not already in the indexes
        if potential_start not in flattened_indexes \
        and potential_end not in flattened_indexes:
            
            # If they are not, append the block indexes
            block_indexes.append(list(range(potential_start, potential_end)))
            
            i += 1
            
    # Flatten the list of lists
    block_indexes = list(chain(*block_indexes))
    
    # Set the blocks to nan accounting for end of dataframe
    df.loc[[x for x in block_indexes if x in df.index], 2] = np.nan
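The flattening step used in the code above can be seen in isolation in this small sketch. The block values are made up for illustration; `chain.from_iterable` is shown as an equivalent that avoids unpacking the outer list:

```python
from itertools import chain

# Two example blocks of row labels (illustrative values only)
block_indexes = [[3, 4, 5, 6], [12, 13, 14, 15]]

# Argument unpacking passes each inner list to chain() as its own argument
flat_unpacked = list(chain(*block_indexes))

# Same result without unpacking the outer list
flat_iterable = list(chain.from_iterable(block_indexes))

print(flat_unpacked)
```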
    

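    Result of applying this to the example dataframe:

    I'm not sure how you want to handle blocks at the end of the dataframe, but this code ignores any indexes that fall outside the dataframe's index range. I'm sure there is a more Pythonic way to write this code, and any comments would be appreciated!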
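One possible alternative that avoids the retry loop is sketched below. It is not taken from either answer. The idea is to draw distinct start positions in a shortened range and then shift the i-th block right by i * (block_size - 1), so blocks cannot overlap or run past the end of the dataframe. The function name `nan_blocks` and its parameters are hypothetical, and the sketch assumes the target column is numeric and that `proportion` is at most 1:

```python
import numpy as np
import pandas as pd


def nan_blocks(df, column, proportion, block_size, seed=None):
    """Set non-overlapping blocks of consecutive rows to NaN, in place."""
    rng = np.random.default_rng(seed)
    n_rows = len(df)
    n_blocks = int(n_rows * proportion / block_size)
    # Length of the range once every block is shrunk to a single row
    slack = n_rows - n_blocks * (block_size - 1)
    # Distinct, sorted positions in the shrunken range
    compact = np.sort(rng.choice(slack, size=n_blocks, replace=False))
    # Expand again: the i-th block is pushed right by i * (block_size - 1)
    starts = compact + np.arange(n_blocks) * (block_size - 1)
    # Every row position covered by some block
    rows = (starts[:, None] + np.arange(block_size)).ravel()
    df.iloc[rows, df.columns.get_loc(column)] = np.nan
    return starts


# Example data shaped like the question's dataframe (values are random)
df = pd.DataFrame({
    0: pd.date_range("2018-02-01", periods=48, freq="30min"),
    2: np.random.default_rng(0).normal(size=48),
})
starts = nan_blocks(df, column=2, proportion=0.3, block_size=4, seed=100)
print(starts)
```

Because the rows are addressed by position with `iloc`, this works whether the index is a datetime index or the default integer one. Blocks may still end up directly next to each other, which shows up as one longer NaN run; if a gap is required, the shift per block would need to be block_size instead of block_size - 1.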

    【Discussion】:

      【Solution 2】:

      @caseWestern provided a nice solution, and I built my own on top of it:

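      import random

      import numpy as np

      def block_sample(df_length: int, number_of_samples: int, block_size: int):
          """Generate the start positions of non-overlapping blocks of block_size.

          After a start x is drawn, every candidate start from
          x-(block_size-1) to x+(block_size-1) is removed, so that any later
          start is at least block_size away from x.

          Raises
          ------
          ValueError: If more samples are requested than can be placed.
          """
          candidates = list(range(df_length))
          for _ in range(number_of_samples):
              # random.sample raises ValueError once no candidates remain
              x = random.sample(candidates, 1)[0]
              yield x
              # Filter by value, not by list position: positions stop matching
              # values after the first removal, and a negative slice start wraps
              candidates = [i for i in candidates if abs(i - x) >= block_size]

      # df_missing, number_of_samples and block_size are defined as in the question
      col = df_missing.columns.get_loc(2)
      try:
          for x in block_sample(len(df_missing), number_of_samples, block_size):
              # iloc slices exclude the end, so exactly block_size rows are set
              # (fewer if the block runs past the end of the dataframe)
              df_missing.iloc[x:x + block_size, col] = np.nan
      except ValueError:
          # Ran out of room: keep the blocks that were already placed
          pass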
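To check the result of a block-based approach like this one, one option is to measure the runs of consecutive NaN values in the column. The helper name `nan_run_lengths` is hypothetical, and the example series is made up for illustration:

```python
import numpy as np
import pandas as pd


def nan_run_lengths(s):
    """Return the length of each run of consecutive NaN values in a Series."""
    is_nan = s.isna().astype(int)
    # A new run starts wherever the NaN flag differs from the previous row
    run_id = is_nan.diff().ne(0).cumsum()
    # Summing the flag per run gives its length for NaN runs and 0 otherwise
    lengths = is_nan.groupby(run_id).sum()
    return lengths[lengths > 0].tolist()


example = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 2.0, 3.0, np.nan, np.nan])
print(nan_run_lengths(example))
```

A run longer than block_size means two blocks landed next to each other, and a shorter run at the very end means a block was cut off by the end of the data.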
      

      【Discussion】:
