How to time-efficiently remove values next to 'NaN' values? Answers

【Question title】: How to time-efficiently remove values next to 'NaN' values?
【Posted】: 2017-08-25 06:51:57
【Question】:

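I'm trying to remove wrong values from my data (a series of 15 million values, 700 MB). The values to be removed are the ones next to 'nan' values, e.g.:

Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9. The numbers surrounded by slashes, i.e. /1/,/2/,/4/,/8/, are the values that should be removed.

The problem is that the computation takes far too long with the following code: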

%%time

import numpy as np
import pandas as pd

# sample data
speed = np.random.uniform(0,25,15000000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
            'next_speed': next_speed}

df = pd.DataFrame(data_dict)


# calculate difference between the current speed and the next speed
list_of_differences = []

for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)

df['difference'] = list_of_differences

# add 'nan' to data in form of a string. 

for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'

#########################################
# THE TIME-INEFFICIENT LOOP

# remove wrong values before and after 'nan'.
for i in  range(len(df)):

    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue

    # case 1: where there's only one 'nan' surrounded by values. 
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1]= 'wrong'
        df.difference[i+1]= 'wrong'

    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1]= 'wrong'

    # case 3: where next value is NOT 'nan'  wrong, nan,nan,4 
        # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1]= 'wrong'

How can I make it more time-efficient?

【Comments】:

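Why do you replace the unwanted elements with the string 'wrong'? Is that a requirement, or do you simply want to remove the unwanted elements?
There is a difference between the string 'nan' and the value nan. If your series really does contain nan (which is likely), those are actually np.nan values, also known as "not a number". You can't compare against nan with ==; you have to test with isnan.
@Rohanil I replace them with 'wrong' because I want to remove all strings from that column later. For the same reason I use the string 'nan' rather than an actual np.nan; it's part of my idea for removing the wrong values from the column. I tried a more direct approach before, but I ran into problems with loops and indexing, so I built a workaround and replaced the wrong values with strings. @Austin Yes, I'm well aware that 'nan' and np.nan (i.e. NaN) are different. I intentionally replaced them with strings as part of the workaround.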
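
The distinction raised in the comments can be checked directly. The following is a minimal sketch on a made-up four-element Series (the variable names are illustrative and not from the thread); it shows that an equality test never matches a real NaN, while pd.isna never matches the string 'nan':

```python
import numpy as np
import pandas as pd

# object dtype so that a float NaN and the string 'nan' can coexist
values = pd.Series([1.0, np.nan, 2.0, "nan"], dtype=object)

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# pd.isna finds the real NaN but not the string 'nan'
real_nan_mask = pd.isna(values)

# an equality test finds the string 'nan' but not the real NaN
string_nan_mask = values == "nan"
```

Whichever representation is used, the test for "missing" has to match it; mixing the two silently finds nothing.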

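Tags: python performance for-loop time

【Solution 1】:

This is still a work in progress for me. I reduced your dummy data size by a factor of 100 to get down to something I could stand to wait for.

I also added this code at the top of my version:

import time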

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

This just prints a string with a time stamp in front of it, to see what is taking so long.
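
If elapsed time per step is more convenient than absolute time stamps, a small context manager can do the subtraction. This is only a sketch of an alternative, not part of the answer's code; the name `timed` and the optional `results` dictionary are hypothetical:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally record it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("[{:.3f} s] {}".format(elapsed, label))

# usage: wrap the step you want to measure
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(1000))
```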

With that done, in your 'difference' column computation you can replace the manual list generation with a vector operation. This code:

df = pd.DataFrame(data_dict)

mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []

for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)

df['difference'] = list_of_differences
mark("difference 1")

df['difference2'] = df['next_speed'] - df['speed']
mark('difference 2')

print(df[:10])

produces this output:

[1490943913.921] Got DataFrame
[1490943922.094] difference 1
[1490943922.096] difference 2
   next_speed      speed  difference  difference2
0   18.008314  20.182982   -2.174669    -2.174669
1   14.736095  18.008314   -3.272219    -3.272219
2    5.352993  14.736095   -9.383102    -9.383102
3    5.854199   5.352993    0.501206     0.501206
4    2.003826   5.854199   -3.850373    -3.850373
5   12.736061   2.003826   10.732236    10.732236
6    2.512623  12.736061  -10.223438   -10.223438
7   18.224716   2.512623   15.712093    15.712093
8   14.023848  18.224716   -4.200868    -4.200868
9   15.991590  14.023848    1.967741     1.967741

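Note that the two difference columns are identical, but the vectorized version took about 8 seconds less (roughly 8.17 s for the loop versus 0.002 s for the subtraction, going by the time stamps above). With 100 times more data, that would presumably be around 800 seconds.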
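
To see that the two approaches agree without waiting on a large array, here is a small sketch on five made-up speed values. The np.diff line is an addition of mine, not something the answer uses; it computes the same differences straight from the raw array:

```python
import numpy as np
import pandas as pd

speed = np.array([20.0, 18.0, 14.5, 5.0, 6.0])
df = pd.DataFrame({"speed": speed[:-1], "next_speed": speed[1:]})

# element-by-element loop, as in the question
looped = [df.next_speed[i] - df.speed[i] for i in df.index]

# one vectorized subtraction over whole columns
df["difference"] = df["next_speed"] - df["speed"]

# np.diff computes the same thing directly from the raw array
direct = np.diff(speed)
```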

I did the same thing for the 'nanify' step:

# .loc avoids chained assignment, which may not write back to df
df.loc[df.difference2 < -2, 'difference2'] = np.nan

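The idea here is that many binary operators, applied to a column, produce a whole Series rather than a single value. A comparison produces a boolean Series, and that can be used as an index: df.difference2 < -2 becomes (essentially) a list of the positions where the condition is true, and you can then index df (the whole table) or any column of df, such as df.difference2, with it. It is a fast shorthand for what would otherwise be a slow Python for loop.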
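
A minimal sketch of that masking step on four made-up values follows. It assigns through `.loc`, which pandas documents as the way to set values selected by a boolean mask without running into chained-assignment problems:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"difference2": [-2.5, 0.5, -9.0, 1.0]})

# the comparison produces a boolean Series, one entry per row
mask = df["difference2"] < -2

# .loc with a mask and a column label assigns in a single step
df.loc[mask, "difference2"] = np.nan
```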

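Update

OK, finally, here is a version that vectorizes the "time-inefficient loop". I'm just pasting the whole thing at the bottom, for copying.

The premise is that the Series.isnull() method returns a boolean Series (column) that is True wherever the content is "missing" or "invalid" or "bogus". Generally this means NaN, but it also recognizes Python None, etc.

In pandas, the tricky part is shifting that column up or down by 1 to reflect "around"-ness.

That is, I want another boolean column where col[n-1] is True if col[n] is null. That's my "before_nan" column. Likewise, I want another column where col[n+1] is True if col[n] is null. That's my "after_nan" column.

It turns out I had to take the damn thing apart! I had to go in and extract the underlying numpy array with the Series.values attribute so that the pandas index would be discarded. Then a new index is created, starting at 0, and everything works again. (If you don't strip the index, the columns "remember" what their numbers are supposed to be. So even if you delete column[0], the column doesn't shift down. Instead, it knows "I'm missing my [0] value, but everyone else is still in the right place!")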
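
The index behaviour described here can be reproduced on a tiny example. This is an illustrative sketch with made-up data; the column names `aligned` and `shifted` are hypothetical:

```python
import numpy as np
import pandas as pd

missing = pd.Series([False, True, False, False])
df = pd.DataFrame({"is_nan": missing})

# slicing keeps the original labels 1, 2, 3, so assignment aligns on them
# and row 0 is left without a value
df["aligned"] = missing[1:]

# .values drops the index, so the data really moves up by one position
df["shifted"] = np.append(missing[1:].values, False)
```

In `aligned`, row 0 ends up missing and the other rows keep their old values, which is the "remembering" effect; only `shifted` actually moves the flags.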

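Anyway, with that figured out, I was able to build three columns (needlessly, since they could probably be parts of a single expression) and then merge them into a fourth column that indicates what you want: the column is True when a row is before, on, or after a nan value.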

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
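
As a possible alternative to the np.append / np.insert pair, Series.shift with fill_value moves the flags by position and keeps the index intact. The sketch below is my own variation, not the answer's code; it runs on the example series from the question and also performs the final removal, which should leave only 3 and 9:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9], dtype=float)

missing = s.isna()
# shift(-1) pulls the next row's flag up; shift(1) pushes the previous row's flag down
before_nan = missing.shift(-1, fill_value=False)
after_nan = missing.shift(1, fill_value=False)
around_nan = missing | before_nan | after_nan

# keep only the rows that are neither NaN nor adjacent to one
cleaned = s[~around_nan]
```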

Here is the whole thing:

import numpy as np
import pandas as pd

import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

# sample data
speed = np.random.uniform(0,25,150000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
            'next_speed': next_speed}

df = pd.DataFrame(data_dict)

mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []

#for i in df.index:
    #difference = df.next_speed[i]-df.speed[i]
    #list_of_differences.append(difference)

#df['difference'] = list_of_differences
#mark("difference 1")

df['difference'] = df['next_speed'] - df['speed']
mark('difference 2')

df['difference2'] = df['next_speed'] - df['speed']

# add 'nan' to data in form of a string.

#for i in range(len(df.difference)):
    ## arbitrary condition
    #if df.difference[i] < -2:
        #df.difference[i] = 'nan'

df.loc[df.difference < -2, 'difference'] = np.nan
mark('nanify')

df.loc[df.difference2 < -2, 'difference2'] = np.nan
mark('nanify 2')

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
mark('looped')

#########################################
# THE TIME-INEFFICIENT LOOP

# remove wrong values before and after 'nan'.
for i in  range(len(df)):

    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue

    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1]= 'wrong'
        df.difference[i+1]= 'wrong'

    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1]= 'wrong'

    # case 3: where next value is NOT 'nan'  wrong, nan,nan,4
        # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1]= 'wrong'

mark('time-inefficient loop done')

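【Comments】:

【Solution 2】:

I'm assuming you want neither the 'nan' values nor the wrong values, and that the nan values are few compared with the size of the data. Please try this: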

nan_idx = df[df['difference'] == 'nan'].index.tolist()

# a set gives fast membership checks and avoids duplicate entries
drop_set = set(nan_idx)

for i in nan_idx:
    # also drop the neighbour on each side, staying within bounds
    if (i + 1) < len(df):
        drop_set.add(i + 1)
    if (i - 1) >= 0:
        drop_set.add(i - 1)

# drop() returns a new DataFrame, so keep the result
df = df.drop(df.index[sorted(drop_set)])

If nan is not a string but the NaN used for missing values, use this to get its indices:

nan_idx = df[pd.isnull(df['difference'])].index.tolist()
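
For real NaN values, the same neighbour-collecting idea can be written without a Python loop. This is a sketch under the assumption that the DataFrame has the default 0..n-1 index; the names `nan_pos`, `candidates` and `drop_pos` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"difference": [1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9]})

# positions of the real NaN values
nan_pos = np.flatnonzero(df["difference"].isna().to_numpy())

# each NaN position together with its left and right neighbours
candidates = np.concatenate([nan_pos - 1, nan_pos, nan_pos + 1])

# discard out-of-range positions and duplicates
drop_pos = np.unique(candidates[(candidates >= 0) & (candidates < len(df))])

cleaned = df.drop(df.index[drop_pos])
```

On the question's example series this keeps only the rows holding 3 and 9.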

【Comments】: