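【Posted】: 2021-05-17 19:46:49
【Question】:
I have a dataset of about 15 million rows, and a pandas for loop over it is too slow to be practical. I am trying a dask dataframe to speed up execution, but the iteration does not work.
Sample of the initial dataframe: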
import numpy as np
import pandas as pd
cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
[12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
[12003, 3, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
[12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
[12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183]
]
bookdf = pd.DataFrame(data, columns = cols)
Expected output:
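# Each row below holds ten values: the eight input columns plus 'check' and 'egix'.
# The 'expx' values are not listed in the rows.
cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp', 'check', 'egix']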
data = [[12003, 1, 446499.51923, 23.76, np.nan, np.nan, 0.00228, 0.00228, 0, 446499.51923],
[12003, 2, 446499.51923, 32.76, np.nan, np.nan, 0.00228, 0.00228, 1, 447517.89163],
[12003, 3, 446499.51923, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 2, 448338.21855],
[12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 3, 449160.04918],
[12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 4, 449983.38628],
[12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 5, 450808.23260],
[12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 6, 451634.59091],
[12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 7, 452462.46399],
[12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 8, 453294.43921],
[12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 9, 454127.94424],
[12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 0, 163392.40385],
[12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 1, 163765.06788],
[12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 2, 164065.25900],
[12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 3, 164366.00038],
[12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 4, 164667.29304]
]
bookdf = pd.DataFrame(data, columns = cols)
Working pandas code, which is only practical for small datasets:
# 'check' numbers the rows within each 'id' group (0 marks the first row of a group).
# The first row of each group supplies the starting value; every later row is the
# previous row's computed value times (1 + growth rate). bookdf.groupby('id').first()
# alone does not work here, because each row depends on the value computed for the row before it.
bookdf['check'] = bookdf.groupby('id').cumcount()
bookdf['egix'] = np.where((bookdf.check == 0) & (bookdf.PEGI > 0), bookdf.PEGI, bookdf.EGI0)
bookdf['expx'] = np.where((bookdf.check == 0) & (bookdf.PExp > 0), bookdf.PExp, bookdf.EXP0)
for ind in bookdf.index:
    if bookdf.loc[ind, 'check'] != 0:
        # .loc avoids chained assignment; ind - 1 relies on the default 0..n-1 index
        bookdf.loc[ind, 'egix'] = bookdf.loc[ind - 1, 'egix'] * (1 + bookdf.loc[ind, 'gEGI'])
        bookdf.loc[ind, 'expx'] = bookdf.loc[ind - 1, 'expx'] * (1 + bookdf.loc[ind, 'TotExp'])
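Because each new value is the previous value times (1 + rate), the loop collapses into a closed form: the group's starting value multiplied by a running product of the growth factors. The sketch below is one way to write that with groupby(...).cumprod(). The function name compound_by_group and the five-row sample frame are illustrative and not part of the original post. It assumes the rows are already in the intended order within each id and that EGI0 and EXP0 are never NaN.

```python
import numpy as np
import pandas as pd

# (override column, fallback column, growth-rate column, output column)
COLUMNS = [('PEGI', 'EGI0', 'gEGI', 'egix'),
           ('PExp', 'EXP0', 'TotExp', 'expx')]

def compound_by_group(df):
    """Vectorized equivalent of the row loop: start * cumprod(1 + rate) per id."""
    out = df.copy()
    out['check'] = out.groupby('id').cumcount()
    first = out['check'] == 0
    for start_col, base_col, rate_col, new_col in COLUMNS:
        # Starting value: the override if it is positive, otherwise the base value.
        start = out[start_col].where(out[start_col] > 0, out[base_col])
        # Broadcast the first row's starting value to every row of its group.
        start = start.groupby(out['id']).transform('first')
        # The first row of each group is not grown, so its factor is 1.
        factor = (1 + out[rate_col]).where(~first, 1.0)
        out[new_col] = start * factor.groupby(out['id']).cumprod()
    return out

# Small illustrative sample (a subset of the frame in the question).
sample = pd.DataFrame({
    'id': [12003, 12003, 12003, 12014, 12014],
    'cur_age': [1, 2, 3, 1, 2],
    'EGI0': [446499.51] * 3 + [163392.40] * 2,
    'EXP0': [214319.76] * 3 + [78428.35] * 2,
    'PEGI': np.nan,
    'PExp': np.nan,
    'gEGI': [0.00228, 0.00228, 0.00183, 0.00183, 0.00183],
    'TotExp': [0.00228, 0.00228, 0.00183, 0.00183, 0.00183],
})
result = compound_by_group(sample)
print(result[['id', 'cur_age', 'check', 'egix', 'expx']])
```

Since nothing here iterates row by row in Python, this form may already be fast enough in plain pandas if the 15 million rows fit in memory.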
If I run the same code on a dask dataframe, I get the following error:
# bookdf is a dask dataframe here; this loop is the code that raises the error below
for ind in range(len(bookdf)):
    if bookdf['check'][ind] != 0:
        bookdf['egix'][ind] = bookdf['egix'][ind-1] * (1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind-1] * (1 + bookdf['TotExp'][ind])
**Error**: Series getitem is only supported for other series objects with matching partition structure.
Is there a way to implement this with a dask dataframe, or a better way to get the desired output on a large dataset?
【Comments】:
-
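It would be easier if you provided a fully reproducible dataframe, so that users can run the code and see the error for themselves. At first glance, and without knowing more, I would suggest trying the pandas shift() method together with apply(). See stackoverflow.com/questions/10982089/…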
-
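@HMReliable, hi, I have attached a reproducible dataframe as requested. Could you help me with this? I have already tried shift() and apply(), but they do not work, because I have to iterate through the whole column to get the required calculation.
Tags: python pandas dask dask-distributed dask-dataframe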