如何使用 pandas 加速这个嵌套循环？答案

【问题标题】：How can I speed up this nested loop using pandas?如何使用 pandas 加速这个嵌套循环？
【发布时间】：2020-07-15 23:22:46
【问题描述】：

我是 python 和 pandas 的新手。我正在尝试根据时间戳之间的时间差为大约 2270 个用户分配新的会话 ID。如果时差超过 4 小时，我想要一个新的会话 ID。否则，它必须保持不变。最后，我想要一个带有新会话 ID 列的修改数据框。这是我到目前为止所拥有的：

Eh_2016["NewSessionID"] = 1 #Initialize 'NewSessionID' column in df with 1
Eh_2016['elapsed'] = datetime.time(0,0,0,0) #Create an empty elapsed to calculate Time diff later
users = Eh_2016['Username'].unique() #find the number of unique Usernames
for user in users: #start of the loop
    idx = Eh_2016.index[Eh_2016.Username == user] #Slice the original df
    temp = Eh_2016[Eh_2016.Username == user] #Create a temp placeholder for the slice
    counter = 1 # Initialize counter for NewSessionIDs
    for i,t in enumerate(temp['Timestamp']): #Looping for each timestamp value
        if i != 0 : 
            temp['elapsed'].iloc[i] = (t - temp['Timestamp'].iloc[i-1]) #Calculate diff btwn timestamps
            if temp['elapsed'].iloc[i] > datetime.timedelta(hours = 4): #If time diff>4
                counter +=1 #Increase counter value
                temp['NewSessionID'].iloc[i]=counter #Assign new counter value as NewSessionID
            else:
                temp['NewSessionID'].iloc[i] = counter #Retain previous sessionID
    Eh_2016.loc[idx,:]= temp #Replace original df with the updated slice

任何有关如何使这更快的帮助将不胜感激！如果您需要更多详细信息，请告诉我。提前致谢。

编辑：示例 DF

        Username               Timestamp  NewSessionID                  Elapsed
126842  1095513 2016-06-30 20:58:30.477             1                 00:00:00
126843  1095513 2016-07-16 07:54:47.986             2  15 days 10:56:17.509000
126844  1095513 2016-07-16 07:54:47.986             2          0 days 00:00:00
126845  1095513 2016-07-16 07:55:10.986             2          0 days 00:00:23
126846  1095513 2016-07-16 07:55:13.456             2   0 days 00:00:02.470000
        ...                     ...           ...                      ...
146920  8641894 2016-08-11 22:26:14.051            31   0 days 04:50:23.415000
146921  8641894 2016-08-11 22:26:14.488            31   0 days 00:00:00.437000
146922  8641894 2016-08-12 20:01:02.419            32   0 days 21:34:47.931000
146923  8641894 2016-08-23 10:19:05.973            33  10 days 14:18:03.554000
146924  8641894 2016-09-25 11:30:35.540            34  33 days 01:11:29.567000

【问题讨论】：

欢迎来到 SO。请提供mcve
请包含示例源数据帧和预期结果。
哦，是的，不。拉夫。如果您有一个包含日期时间的列，您可以使用pd.cut 将其拆分为月份并为它们分配唯一值，您可以使用这些值来制作随机 ID。看看这个帖子stackoverflow.com/questions/43500894/…还有另一种方法可以做到这一点，但我不记得了。

标签： python pandas nested-loops

【解决方案1】：

为每个用户过滤整个数据框是O(users*sessions)，这不是必需的，因为无论如何您都需要遍历整个数据。

一种更有效的方法是一次遍历数据帧，并将临时变量（计数器、前一行的位置等）存储在由用户索引的单独数据帧中。

Eh_2016["NewSessionID"] = 1 #Initialize 'NewSessionID' column in df with 1
Eh_2016['elapsed'] = datetime.time(0,0,0,0) #Create an empty elapsed to calculate Time diff later
# create new dataframe of unique users
users = pd.DataFrame({'Username': Eh_2016['Username'].unique()}).set_index('Username') 
# one column for the previous session looked at for each user
users['Previous'] = -1
# one column for the counter variable
users['Counter'] = 0

# iterate over each row
for index, row in Eh_2016.iterrows(): #start of the loop
    user = row['Username']
    previous = users[user, 'Previous']
    if previous >= 0: # if this is not the first row for this user
        Eh_2016.loc[index, 'elapsed'] = (row['Timestamp'] - Eh_2016.loc[previous, 'Timestamp']) #Calculate diff btwn timestamps
        if Eh_2016.loc[index, 'elapsed'] > datetime.timedelta(hours = 4): #If time diff>4
            users[user,'Counter'] += 1 #Increase counter value
        Eh_2016.loc[index, 'NewSessionID'] = users[user,'Counter'] # Assign new counter value as NewSessionID
        users[user, 'Previous'] = index # remember this row as the latest row for this user

【讨论】：