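【Posted】:2017-08-25 06:51:57
【Problem description】:
I am trying to remove wrong values from my data (a series of 15 million values, 700 MB). The values to be removed are the ones next to 'nan' values, for example:
Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9
The numbers surrounded by slashes, i.e. /1/, /2/, /4/, /8/, are the values that should be removed.
The problem is that the computation takes far too long with the following code: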
%%time
import numpy as np
import pandas as pd

# sample data
speed = np.random.uniform(0,25,15000000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)

# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences

# add 'nan' to data in form of a string.
for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'

#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1]= 'wrong'
        df.difference[i+1]= 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1]= 'wrong'
    # case 3: where next value is NOT 'nan' wrong, nan,nan,4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1]= 'wrong'
How can I make this more time-efficient?
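For reference, the goal described above (drop every value that sits directly next to a 'nan') can probably be expressed without an explicit Python loop. The following is only a rough sketch, not a verified drop-in replacement for the loop: it assumes real `np.nan` markers instead of the strings 'nan' and 'wrong', the helper name `mask_neighbors_of_nan` is made up for illustration, and the sample size is reduced to keep it quick to run.

```python
import numpy as np
import pandas as pd


def mask_neighbors_of_nan(values: pd.Series) -> pd.Series:
    """Return a copy in which every non-NaN value directly next to a NaN becomes NaN.

    Hypothetical helper for illustration; not part of pandas.
    """
    is_nan = values.isna()
    # True where the previous element or the next element is NaN
    next_to_nan = is_nan.shift(1, fill_value=False) | is_nan.shift(-1, fill_value=False)
    # Blank out only real values; positions that are already NaN stay NaN
    return values.mask(next_to_nan & ~is_nan)


# The example series from the question: 1, 2, 4 and 8 should disappear
example = pd.Series([1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9])
cleaned = mask_neighbors_of_nan(example).dropna()

# Same idea on sample data (assumption: smaller size, for illustration only)
speed = np.random.uniform(0, 25, 1_000)
difference = pd.Series(np.diff(speed))           # next_speed - speed, no loop
difference = difference.where(difference >= -2)  # real NaN where difference < -2
difference_cleaned = mask_neighbors_of_nan(difference).dropna()
```

Everything here operates on whole columns at once, which is usually much faster in pandas than indexing element by element; how much faster on the full 15 million values has not been measured here.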
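【Comments】:
- Why are you replacing the unwanted elements with the string 'wrong'? Is that a requirement, or do you simply want to remove the unwanted elements?
- There is a difference between the string 'nan' and the value nan. If your series really does contain nan (very likely), those are actually np.nan values, also known as "not a number". You cannot test for nan with an equality comparison; you have to use isnan.
- @Rohanil I replace them with 'wrong' because I want to remove all strings from that column later. For the same reason I use the string 'nan' instead of an actual np.nan; this is part of my idea of how to remove the wrong values from the column. I tried a more direct approach earlier, but I ran into some problems with loops and indexing, so I built a workaround and replaced the wrong values with strings. @Austin Yes, I am well aware that there is a difference between 'nan' and np.nan (i.e. NaN). I intentionally replaced them with strings as part of the workaround.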
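To illustrate the point raised in the comments about 'nan' versus np.nan, here is a small sketch. It only shows how the two behave under comparison and under pandas' missing-value detection; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is never equal to anything, not even to itself
equal_to_itself = (value == value)

# Use isnan (or pandas' isna) to detect it instead
detected_by_isnan = bool(np.isnan(value))

# In an object column, the string 'nan' is just text and is NOT treated as missing
mixed = pd.Series([1.0, np.nan, 'nan'], dtype=object)
missing_flags = mixed.isna().tolist()
```

This is why a column that mixes floats with the strings 'nan' and 'wrong' has to be filtered by type, whereas a float column holding real np.nan values can be cleaned with isna and dropna.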
Tags: python performance for-loop time