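【Question Title】: How to time-efficiently remove values next to 'NaN' values?
【Posted】: 2017-08-25 06:51:57
【Question】:

I'm trying to remove wrong values from my data (a series of 15 million values, 700 MB). The values to be removed are the ones next to 'nan' values, e.g.:

Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9 The numbers surrounded by slashes, i.e. /1/,/2/,/4/,/8/, are the values that should be removed.

The problem is that computing this with the code below takes far too long: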

%%time

import numpy as np
import pandas as pd

# sample data
speed = np.random.uniform(0,25,15000000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
            'next_speed': next_speed}

df = pd.DataFrame(data_dict)


# calculate difference between the current speed and the next speed
list_of_differences = []

for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)

df['difference'] = list_of_differences

# add 'nan' to data in form of a string. 

for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'

#########################################
# THE TIME-INEFFICIENT LOOP

# remove wrong values before and after 'nan'.
for i in  range(len(df)):

    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue

    # case 1: where there's only one 'nan' surrounded by values. 
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1]= 'wrong'
        df.difference[i+1]= 'wrong'

    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1]= 'wrong'

    # case 3: where next value is NOT 'nan'  wrong, nan,nan,4 
        # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1]= 'wrong'

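How can I make it more time-efficient?

【Comments】:

  • Why do you replace the unwanted elements with the string wrong? Is that a requirement, or do you simply want to remove the unwanted elements?
  • There is a difference between the string 'nan' and the value nan. If your series really does contain nan (most likely), those are actually np.nan values, also known as "not a number". You cannot compare against nan with ==; you have to test with isnan.
  • @Rohanil I replace them with 'wrong' because I want to remove all strings from that column later. For the same reason I use the string 'nan' rather than an actual np.nan; it is part of my idea for removing the wrong values from the column. I tried a more direct approach before, but I ran into problems with loops and indexing, so I built a workaround and replaced the wrong values with strings. @Austin Yes, I'm well aware that there is a difference between 'nan' and np.nan (i.e. NaN). I replaced them with strings on purpose as part of the workaround.

Tags: python performance for-loop time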


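【Solution 1】:

This is still a work in progress for me. I cut your dummy data size down by a factor of 100 to get it to something I could stand to wait for.

I also added this code at the top of my version: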

import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

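This just prints a string with a timestamp in front of it, to see what is taking so long.

With that done, in your 'difference' column calculation you can replace the manual list building with a vector operation. This code: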

df = pd.DataFrame(data_dict)

mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []

for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)

df['difference'] = list_of_differences
mark("difference 1")

df['difference2'] = df['next_speed'] - df['speed']
mark('difference 2')

print(df[:10])

produces this output:

[1490943913.921] Got DataFrame
[1490943922.094] difference 1
[1490943922.096] difference 2
   next_speed      speed  difference  difference2
0   18.008314  20.182982   -2.174669    -2.174669
1   14.736095  18.008314   -3.272219    -3.272219
2    5.352993  14.736095   -9.383102    -9.383102
3    5.854199   5.352993    0.501206     0.501206
4    2.003826   5.854199   -3.850373    -3.850373
5   12.736061   2.003826   10.732236    10.732236
6    2.512623  12.736061  -10.223438   -10.223438
7   18.224716   2.512623   15.712093    15.712093
8   14.023848  18.224716   -4.200868    -4.200868
9   15.991590  14.023848    1.967741     1.967741

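Note that the two difference columns are identical, but the loop took about 8 seconds while the vectorized version took about 2 milliseconds. (With 100 times more data, the loop would presumably take about 800 seconds.)

I did the same thing for the "nanify" step: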
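As a side note, the same differences can be taken straight from the raw array. The following is a minimal sketch on a short, hypothetical sample (the five values are made up for illustration); `np.diff` returns `speed[i+1] - speed[i]` for each position, which matches `next_speed - speed`:

```python
import numpy as np
import pandas as pd

# hypothetical short sample standing in for the 15-million-value array
speed = np.array([20.0, 18.0, 14.5, 5.0, 6.0])

df = pd.DataFrame({'speed': speed[:-1], 'next_speed': speed[1:]})

# vectorized column subtraction, as in the answer
df['difference'] = df['next_speed'] - df['speed']

# the same numbers computed directly from the array
diff_from_array = np.diff(speed)

print(df)
print(np.allclose(df['difference'].to_numpy(), diff_from_array))
```

Either form avoids the Python-level loop; which one reads better probably depends on whether the `next_speed` column is needed elsewhere.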

df.loc[df.difference2 < -2, 'difference2'] = np.nan

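The idea here is that many binary operators actually produce a placeholder, Series, or vector. That result can be used as an index, so df.difference2 < -2 becomes (essentially) a boolean mask marking the positions where the condition is true, and you can then index df (the whole table) or any column of df, such as df.difference2, with that mask. It is fast shorthand for what would otherwise be a slow Python for loop.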
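A small sketch of the masking idea on hypothetical data (the five values are made up): the comparison yields a boolean Series, and that Series then selects the rows to overwrite. The assignment is written with `.loc`, which selects rows and the column in a single step.

```python
import numpy as np
import pandas as pd

# hypothetical sample values
df = pd.DataFrame({'difference2': [1.5, -3.0, 0.2, -2.5, 4.0]})

# the comparison yields a boolean Series, one True/False per row
mask = df['difference2'] < -2
print(mask.tolist())

# assign NaN only in the rows where the condition holds
df.loc[mask, 'difference2'] = np.nan
print(df)
```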

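Update

OK, finally, here is a version that vectorizes the "time-inefficient loop". I'll just paste the whole thing at the bottom so it can be copied.

The premise is that the Series.isnull() method returns a boolean Series (column) that is True wherever the content is "missing" or "invalid" or "bogus". Usually that means NaN, but it also recognizes Python None and the like.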
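To illustrate what `isnull()` treats as missing, here is a short sketch on two hypothetical Series. It also shows why the string 'nan' from the question is not detected: it is an ordinary string, not a missing value.

```python
import numpy as np
import pandas as pd

# None is stored as NaN in a float Series
s = pd.Series([1.0, np.nan, None, 4.0])

# object Series mixing strings with real missing values
obj = pd.Series(['a', None, np.nan, 'nan'], dtype=object)

print(s.isnull().tolist())
print(obj.isnull().tolist())   # the string 'nan' is not reported as missing
```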

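In pandas, the tricky part is shifting that column up or down by 1 to reflect the "around"-ness.

That is, I want another boolean column where col[n-1] is True if col[n] is null. That is my "before nan" column. Likewise, I want another column where col[n+1] is True if col[n] is null. That is my "after nan" column.

It turns out I had to take the damn thing apart! I had to go in and extract the underlying numpy array with the Series.values attribute, so that the pandas index is discarded. Then a new index is created, starting from 0, and everything works again. (If you don't strip the index, the column "remembers" what its row numbers are supposed to be. So even if you drop element [0], the column does not shift down. Instead, it knows "I'm missing my [0] value, but everyone else is still in the right place!")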
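The index "memory" described above can be seen on a tiny hypothetical example. Assigning a sliced Series back to the frame aligns on labels, so nothing moves; going through `.values` yields a plain array that really is shifted by one position.

```python
import numpy as np
import pandas as pd

# hypothetical mask: only row 1 is missing
missing = pd.Series([False, True, False, False])
df = pd.DataFrame({'is_nan': missing})

# the slice keeps labels 1..3, so assignment aligns on them;
# label 0 has no partner and ends up as NaN
df['aligned'] = missing[1:]

# dropping the index via .values gives positions, not labels:
# element n now holds the old element n+1
df['shifted'] = np.append(missing[1:].values, False)

print(df)
```

In the `shifted` column, row 0 is True because row 1 is missing, which is the "before nan" marker.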

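Anyway, having figured that out, I was able to build three columns (unnecessarily; they could just be parts of one expression) and then merge them into a fourth column that indicates what you want: the column is True when the row is before, on, or after a nan value.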

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
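As a check, the same three masks can be applied to the example series from the question. This is a sketch, not part of the original answer: the final filtering step and the `Series.shift` variant (with `fill_value=False`) are added here as an assumption about how one might finish the job.

```python
import numpy as np
import pandas as pd

# the example series from the question
s = pd.Series([1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9], dtype=float)

missing = s.isnull()
before_nan = np.append(missing[1:].values, False)      # the next value is NaN
after_nan = np.insert(missing[:-1].values, 0, False)   # the previous value is NaN
around_nan = missing.values | before_nan | after_nan

# keep only the values that are neither NaN nor next to a NaN
cleaned = s[~around_nan]
print(cleaned.tolist())

# alternative spelling of the same mask using Series.shift
around_shift = (missing
                | missing.shift(-1, fill_value=False)
                | missing.shift(1, fill_value=False))
print(around_shift.tolist() == around_nan.tolist())
```

On this series only 3 and 9 survive, which matches the slashed values /1/, /2/, /4/, /8/ being removed along with the NaNs.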

Here is the whole thing:

import numpy as np
import pandas as pd

import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

# sample data
speed = np.random.uniform(0,25,150000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
            'next_speed': next_speed}

df = pd.DataFrame(data_dict)

mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []

#for i in df.index:
    #difference = df.next_speed[i]-df.speed[i]
    #list_of_differences.append(difference)

#df['difference'] = list_of_differences
#mark("difference 1")

df['difference'] = df['next_speed'] - df['speed']
mark('difference 2')

df['difference2'] = df['next_speed'] - df['speed']

# add 'nan' to data in form of a string.

#for i in range(len(df.difference)):
    ## arbitrary condition
    #if df.difference[i] < -2:
        #df.difference[i] = 'nan'

df.loc[df.difference < -2, 'difference'] = np.nan
mark('nanify')

df.loc[df.difference2 < -2, 'difference2'] = np.nan
mark('nanify 2')

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
mark('looped')

#########################################
# THE TIME-INEFFICIENT LOOP

# remove wrong values before and after 'nan'.
for i in  range(len(df)):

    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue

    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1]= 'wrong'
        df.difference[i+1]= 'wrong'

    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1]= 'wrong'

    # case 3: where next value is NOT 'nan'  wrong, nan,nan,4
        # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1]= 'wrong'

mark('time-inefficient loop done')

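【Comments】:

    【Solution 2】:

    I'm assuming you want neither the 'nan' values nor the wrong values, and that the nan values are few compared with the size of the data. Please try this: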

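    from copy import deepcopy

    nan_idx = df[df['difference'] == 'nan'].index.tolist()
    drop_list = deepcopy(nan_idx)

    for i in nan_idx:
        # also drop the value right after each 'nan', if there is one
        if (i+1) not in drop_list and (i+1) < len(df):
            drop_list.append(i+1)
        # also drop the value right before each 'nan', if there is one
        if (i-1) not in drop_list and (i-1) >= 0:
            drop_list.append(i-1)

    # drop() returns a new DataFrame, so keep the result
    df = df.drop(df.index[drop_list])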
    

    If nan is not a string but the NaN used for missing values, use this to get its indices:

    nan_idx = df[pd.isnull(df['difference'])].index.tolist()
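If the number of NaN values turns out to be large, the membership tests against a Python list may become slow. A possible loop-free variant of the same idea, sketched here on the example series from the question and assuming real NaN values (the names `nan_pos` and `drop_pos` are hypothetical):

```python
import numpy as np
import pandas as pd

# the example series from the question, with real NaN values
df = pd.DataFrame({'difference': [1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9]})

# integer positions of the missing values
nan_pos = np.flatnonzero(df['difference'].isnull().to_numpy())

# the positions themselves plus both neighbours, de-duplicated
drop_pos = np.unique(np.concatenate([nan_pos - 1, nan_pos, nan_pos + 1]))

# discard neighbours that fall outside the frame
drop_pos = drop_pos[(drop_pos >= 0) & (drop_pos < len(df))]

cleaned = df.drop(df.index[drop_pos])
print(cleaned)
```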
    

    【Comments】:
