【Question title】: Iterating over multiple data frames row by row - any ways to increase speed?
【Posted】: 2022-01-04 01:44:04
【Question description】:

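I have 19 data frames with datetime indexes, and I want to iterate over them in parallel. I start with one df, slice it to a given time range, and do the same for the other dfs. That completes one full iteration of the while loop. In the next iteration, I want to create a new slice that starts at the end of the old slice and runs to the next closest timestamp across all data frames. I came up with the code below, which works, but because of the large number of iterations it is very time-consuming, and I wonder whether there is a faster way to do this.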

import numpy as np  # needed for np.searchsorted in the main loop
import pandas as pd
import datetime

# creating test data frames
df1 = pd.DataFrame({'A': range(9)})
df1.index = [pd.Timestamp('20130101 09:00:00'),
             pd.Timestamp('20130101 09:01:00'),
             pd.Timestamp('20130101 09:30:00'),
             pd.Timestamp('20130101 09:44:00'),
             pd.Timestamp('20130101 09:50:00'),
             pd.Timestamp('20130101 10:16:00'),
             pd.Timestamp('20130101 10:47:00'),
             pd.Timestamp('20130101 10:53:00'),
             pd.Timestamp('20130101 11:22:00')]

df2 = pd.DataFrame({'B': range(9)})
df2.index = [pd.Timestamp('20130101 09:00:00'),
             pd.Timestamp('20130101 09:01:00'),
             pd.Timestamp('20130101 09:04:00'),
             pd.Timestamp('20130101 09:05:00'),
             pd.Timestamp('20130101 09:09:00'),
             pd.Timestamp('20130101 10:10:00'),
             pd.Timestamp('20130101 10:15:00'),
             pd.Timestamp('20130101 10:16:00'),
             pd.Timestamp('20130101 11:18:00')]

db_dict = {"a": df1, "b": df2}


time_dict_start = {}
time_dict_end = {}
complete_list = []
start_time = datetime.datetime.now()

# starting the main loop
while True:
    # check if all data has been processed
    if len(complete_list) == len(db_dict):
        print(datetime.datetime.now() - start_time)
        break
    
    # iterate over every data frame
    for name in db_dict:
        
        # skip completed data frames
        if name in complete_list:
            continue

        db = db_dict[name]
        
        # first iteration
        if name not in time_dict_start:
            start = db.index[0]
            end = start + datetime.timedelta(seconds=10)
        # all other iterations
        else:
            start = time_dict_start[name]
            # use the earliest pending end time across all data frames
            end = min(time_dict_end.values())

        time_dict_start[name] = end + datetime.timedelta(seconds=1)

        split = db.loc[start: end]

        try:
            # find next closest index
            next_idx = db.index[np.searchsorted(db.index, end + datetime.timedelta(seconds=1))]
            time_dict_end[name] = next_idx
        except IndexError:
            del time_dict_end[name]
            complete_list.append(name)

        # do something with the sliced data frame

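【Comments】:

  • Hi, take a look at the multiprocessing module and its Pool class. Waiting for several parallel tasks to finish and passing their results on to another part of the code can be a bit tricky.

Tags: python-3.x pandas dataframe loops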


【Solution 1】:

Would combining the data frames help? For example, here is one way to combine them:

df1.index.name = 'time_stamp'
df1.columns.name = 'group'

df2.index.name = 'time_stamp'
df2.columns.name = 'group'

print(
    pd.concat((df1, df2), axis=1)
    .unstack()
    .loc[ lambda x: x.notna() ]
    .astype(int)
    .reset_index()
    .sort_values(['time_stamp', 'group'])
    .rename(columns = {0: 'value'})
)

The first 5 rows are:

   group          time_stamp  value
0      A 2013-01-01 09:00:00      0
9      B 2013-01-01 09:00:00      0
1      A 2013-01-01 09:01:00      1
10     B 2013-01-01 09:01:00      1
11     B 2013-01-01 09:04:00      2
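
Once the data is in this long format, one possible way to replace the nested while/for loop is a single `groupby` on the timestamp, so that each distinct timestamp is visited once and the rows from every source frame arrive together. The following is only a minimal sketch under assumptions: the helper name `combine_long` is hypothetical, each input frame is assumed to have exactly one value column, and the windowing here (one window per distinct timestamp) is a simplification that does not reproduce the 10-second first slice of the original loop.

```python
import pandas as pd


def combine_long(frames):
    """Stack single-column frames into one long, time-sorted frame.

    `frames` maps a group name to a DataFrame with a datetime index
    (hypothetical helper, not part of pandas).
    """
    parts = []
    for name, df in frames.items():
        value_col = df.columns[0]  # assumption: one value column per frame
        parts.append(pd.DataFrame({
            "time_stamp": df.index,
            "group": name,
            "value": df[value_col].to_numpy(),
        }))
    combined = pd.concat(parts, ignore_index=True)
    return combined.sort_values(["time_stamp", "group"]).reset_index(drop=True)


# small test frames in the style of the question
df1 = pd.DataFrame({"A": range(3)}, index=pd.to_datetime(
    ["2013-01-01 09:00:00", "2013-01-01 09:01:00", "2013-01-01 09:30:00"]))
df2 = pd.DataFrame({"B": range(3)}, index=pd.to_datetime(
    ["2013-01-01 09:00:00", "2013-01-01 09:01:00", "2013-01-01 09:04:00"]))

long_df = combine_long({"a": df1, "b": df2})

# one pass over the distinct timestamps; each chunk holds the rows
# from all source frames that share that timestamp
windows = {}
for ts, chunk in long_df.groupby("time_stamp", sort=True):
    windows[ts] = chunk
    # do something with the chunk here
```

Whether this is actually faster depends on what is done with each chunk; if the per-slice work can itself be expressed as a `groupby` aggregation, the Python-level loop can likely be dropped entirely.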

【Comments】:

  • Thanks for the suggestion. I can't believe I didn't think of that. I'll give it a try and let you know if it solves my problem. :)