Iterate over dask dataframe with if-else condition: answers

【Question title】: Iterate over dask dataframe with if-else condition
【Posted】: 2021-05-17 19:46:49
【Question】:

I have a dataset of about 15 million rows, and pandas struggles to run a for loop over it. I am trying a dask dataframe to speed up execution, but the iteration does not work.

Sample of the initial dataframe:

import numpy as np
import pandas as pd

cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 3, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
        [12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183]
]
bookdf = pd.DataFrame(data, columns = cols)

Desired output:

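# Each row below holds 10 values: the 8 original columns followed, presumably, by 'check' and 'egix'.
# 'expx' is computed the same way from EXP0 and TotExp and is not listed here.
cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp', 'check', 'egix']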
data = [[12003, 1, 446499.51923, 23.76, np.nan, np.nan, 0.00228, 0.00228, 0, 446499.51923],
        [12003, 2, 446499.51923, 32.76, np.nan, np.nan, 0.00228, 0.00228, 1, 447517.89163],
        [12003, 3, 446499.51923, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 2, 448338.21855],
        [12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 3, 449160.04918],
        [12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 4, 449983.38628],
        [12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 5, 450808.23260],
        [12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 6, 451634.59091],
        [12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 7, 452462.46399],
        [12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 8, 453294.43921],
        [12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 9, 454127.94424],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 0, 163392.40385],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 1, 163765.06788],
        [12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 2, 164065.25900],
        [12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 3, 164366.00038],
        [12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 4, 164667.29304]
]
bookdf = pd.DataFrame(data, columns = cols)

Working code in pandas, which is only practical for small datasets:

# The 'check' column marks the first row of each group with respect to 'id' (check == 0).
# I need to take the first row of each group and run the calculation below on the remaining rows of that group, but bookdf.groupby('id').first() does not work here, because the calculation carries the previous row's result forward.

bookdf['check'] = bookdf.groupby(bookdf['id']).cumcount()
bookdf['egix'] = np.where((bookdf.check == 0) & (bookdf.PEGI > 0), bookdf.PEGI, bookdf.EGI0)
bookdf['expx'] = np.where((bookdf.check == 0) & (bookdf.PExp > 0), bookdf.PExp, bookdf.EXP0)
for ind in bookdf.index:
    if bookdf.loc[ind, 'check'] != 0:
        # Each row builds on the already-updated value of the previous row.
        bookdf.loc[ind, 'egix'] = bookdf.loc[ind - 1, 'egix'] * (1 + bookdf.loc[ind, 'gEGI'])
        bookdf.loc[ind, 'expx'] = bookdf.loc[ind - 1, 'expx'] * (1 + bookdf.loc[ind, 'TotExp'])

If I try to run the same code with a dask dataframe, I get the following error:

# bookdf is a dask dataframe here
for ind in range(0, len(bookdf)):
    if bookdf['check'][ind] != 0:
        bookdf['egix'][ind] = bookdf['egix'][ind-1]*(1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind-1]*(1 + bookdf['TotExp'][ind])

**Error** : Series getitem in only supported for other series objects with matching partition structure.

Is there any way to implement this with a dask dataframe, or a better way to get the desired output on a large dataset?

【Comments on the question】:

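It would be easier if you provided a fully reproducible dataframe, so that users can run the code and see the error for themselves. At first glance, without knowing more, I would suggest trying the pandas shift() method together with apply(). See stackoverflow.com/questions/10982089/…
@HMReliable, hi, I have attached a reproducible dataframe as requested. Could you help me with this? I have already tried shift() and apply(), but they do not work, because I have to walk through the whole column to get the required calculation.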

Tags: python pandas dask dask-distributed dask-dataframe

【Solution 1】:

One option is to get rid of the loop entirely.

# this creates the mask of interest
mask = bookdf['check'] != 0

# now we can apply the mask with .loc (select Series, not one-column DataFrames, so the product aligns row by row)
bookdf.loc[mask, 'egix'] = bookdf.loc[mask, 'egix'].shift(-1) * bookdf.loc[mask, 'gEGI']

bookdf.loc[mask, 'expx'] = bookdf.loc[mask, 'expx'].shift(-1) * bookdf.loc[mask, 'TotExp']

This should work for both pandas and dask dataframes.

【Discussion】:

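Hi @SultanOrazbayev, thanks for the simplified code above. However, the expected output is basically the previous row multiplied by (1 + the current row). Is there any way I can get the desired output shown in the reproducible example above?
It is not clear how the 'check' column is generated.
The 'check' column is generated to distinguish the first row of each group, with respect to the 'id' column, so that the remaining rows in the group can be calculated.
I tried running your pandas code, but it does not work because the egix/expx columns are not defined.
I have updated the code above with details on how egix/expx are generated. Please take a look.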

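【Solution 2】:

Since you need access to the previous 'egix' and 'expx' values within each group, create two new columns to store those values so that the calculation can be done efficiently. Then use the apply method across the columns of the df: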

bookdf['egix_prev'] = bookdf.groupby('id')['egix'].shift(1)
bookdf['expx_prev'] = bookdf.groupby('id')['expx'].shift(1)

bookdf['egix'] = bookdf.apply(lambda x:x['egix_prev']*(1+x['gEGI']) if x['check']!=0 else x['egix'],axis=1)
bookdf['expx'] = bookdf.apply(lambda x:x['expx_prev']*(1+x['TotExp']) if x['check']!=0 else x['expx'],axis=1)

【Discussion】:

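Thanks for the code, @HMReliable. Shift does not work in this case, because 'egix' keeps being updated row by row: shift only uses the first-row value of 'egix' for the next row's calculation, whereas each subsequent row should use the updated 'egix' value. That is why I used a for loop to read the previous index and update the current index across the whole column.
Please see my latest answer, @Ankit Chaudhary.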
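
The point raised in this discussion, that each row depends on the already-updated previous row, means the recurrence unrolls into a running product: within a group, row t equals the group's starting value times the product of (1 + rate) over rows 1 to t. Below is a minimal pandas sketch of that idea, not code from the thread. The helper name `compound_by_group` and the small `demo` frame are made up for illustration, and it assumes the rows of each id are already in calculation order.

```python
import numpy as np
import pandas as pd

def compound_by_group(df, base_col, rate_col, group_col="id"):
    """Return base * running product of (1 + rate) within each group.

    The first row of a group keeps its base value; every later row equals
    the previous result times (1 + its own rate).
    """
    # Position of each row inside its group (0 for the first row).
    position = df.groupby(group_col).cumcount()
    # Growth factor per row; the first row of a group is not grown.
    factor = (1.0 + df[rate_col]).where(position != 0, 1.0)
    # Running product of the factors, restarted for every group.
    cumulative = factor.groupby(df[group_col]).cumprod()
    # Starting value of each group, broadcast to all of its rows.
    start = df.groupby(group_col)[base_col].transform("first")
    return start * cumulative

# Made-up data that reuses the column names from the question.
demo = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "EGI0": [100.0, 100.0, 100.0, 50.0, 50.0],
    "gEGI": [0.10, 0.10, 0.20, 0.50, 0.50],
})
demo["egix"] = compound_by_group(demo, "EGI0", "gEGI")
print(demo)
```

To honor the PEGI/PExp override from the question, one could seed the 'egix' and 'expx' columns with np.where first and pass those as `base_col`. Since the function only uses grouped, vectorized pandas operations, it might also be usable per partition with dask's `map_partitions`, but only under the assumption that all rows of an id live in the same partition; that is not tested here.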

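【Solution 3】:

If you want to iterate over the rows of a large table efficiently, using df.iterrows() is almost always better. The code below should be about 20 times faster than using an external for loop.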

row_list = []
for row in bookdf.iterrows():
    if row[1]['check']!=0:
        row[1]['egix'] = row_list[-1][1]['egix']*(1 + row[1]['gEGI'])
        row[1]['expx'] = row_list[-1][1]['expx']*(1 + row[1]['TotExp'])
    row_list.append(row)
rebuilt_df = pd.DataFrame([row[1] for row in row_list])
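
Because iterrows() builds a pandas Series for every row, the same row-by-row recurrence can also be run over plain NumPy arrays, which usually carries less per-row overhead. The following is a rough sketch under assumptions rather than part of the original answer: `compound_loop` and the `demo` frame are hypothetical, and the rows of each id are assumed to be contiguous and already in order.

```python
import numpy as np
import pandas as pd

def compound_loop(start, rate, is_first):
    """Compute out[i] = out[i-1] * (1 + rate[i]) over NumPy arrays.

    Rows flagged in is_first restart from their own start value.
    """
    out = np.empty(len(start), dtype=float)
    for i in range(len(start)):
        if is_first[i]:
            out[i] = start[i]
        else:
            out[i] = out[i - 1] * (1.0 + rate[i])
    return out

# Made-up data that reuses the column names from the question.
demo = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "EGI0": [100.0, 100.0, 100.0, 50.0, 50.0],
    "gEGI": [0.10, 0.10, 0.20, 0.50, 0.50],
})
# Same role as check == 0 in the question: True on the first row of each id.
is_first = (demo.groupby("id").cumcount() == 0).to_numpy()
demo["egix"] = compound_loop(
    demo["EGI0"].to_numpy(), demo["gEGI"].to_numpy(), is_first
)
print(demo)
```

The loop is still pure Python, so it remains linear in the number of rows; it only avoids constructing a Series object per row.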

【Discussion】: