Find a subset of columns based on another dataframe? Answers

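【Question title】: Find a subset of columns based on another dataframe?
【Posted】: 2021-07-07 05:18:02
【Question description】:

I am collecting heart rate data for multiple subjects over time. Different events occur during data collection, so the start of each event is recorded elsewhere. The start time of each event differs slightly from subject to subject. I want to bridge the information between the two dataframes so that I can find each subject's average heart rate within each time period marked by an event. How do I get the average heart rate between certain time points that are marked as events in another dataframe? For example, how do I find the average heart rate between event 2 and event 3?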

import pandas as pd
import numpy as np

# Example heart rate data: one row per subject, columns 0-5 are time points
example_g = [["4/20/21 4:20", 302, 0, 1, 2, 3, 4, 5],
             ["2/17/21 9:20", 135, 1, 1.4, 1.8, 2, 8, 10],
             ["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]]
example_g_table = pd.DataFrame(example_g, columns=['Date_Time', 'CID', 0, 1, 2, 3, 4, 5])

# Example timestamps: the time point at which each event starts for each subject
example_s = [["4/20/21 4:20", 302, 0, 2, 3],
             ["2/17/21 9:20", 135, 0, 1, 4],
             ["2/17/21 9:20", 111, 3, 4, 5]]
example_s_table = pd.DataFrame(example_s, columns=['Date_Time', 'CID', "event_1", "event_2", "event_3"])

# Desired result: mean of the columns from event_2 through event_3, inclusive
desired_outcome = [["4/20/21 4:20", 302, 2.5],
                   ["2/17/21 9:20", 135, 3.3],
                   ["2/17/21 9:20", 111, 5.35]]

desired_outcome_table = pd.DataFrame(desired_outcome, columns=['Date_Time', 'CID', "Average of data between Event 2 and Event 3"])

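【Question comments】:

It is not clear what logic you are using to get the average. Could you explain a bit more and include the expected output for these dataframes?
Thanks. Does this make it clear?

Tags: python pandas

【Solution 1】:

I was able to put together a function that I think works for this, but it assumes that the columns will not change order and that no more columns are added. If the shape of the df changes, the function will need to be updated.

First, I merged your example_g_table and example_s_table so that everything is in one dataframe.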

df = pd.merge(left=example_g_table,right=example_s_table,on=['Date_Time','CID'],how='left')
      Date_Time  CID  0    1    2    3    4     5  event_1  event_2  event_3
0  4/20/21 4:20  302  0  1.0  2.0  3.0  4.0   5.0        0        2        3
1  2/17/21 9:20  135  1  1.4  1.8  2.0  8.0  10.0        0        1        4
2  2/17/21 9:20  111  4  5.0  5.1  5.2  5.3   5.4        3        4        5

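Now we write a function that reads the values of event_2 and event_3 and returns the average of the columns from the first through the second, inclusive. We will run df.apply with axis=1 on it later, so it only has to handle one row at a time, which arrives as a Series.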

def func(row):
    # row is a single row of the merged frame, passed in as a Series
    event_2 = row['event_2']
    event_3 = row['event_3']
    # The column called 0 is the third column of the merged frame (position 2),
    # the column called 1 is at position 3, and so on, hence the + 2 offset
    start = int(event_2 + 2)
    end = int(event_3 + 2)  # same offset as above
    # Key line: sum the values in positions start through end, inclusive
    total = sum(row.iloc[start:end+1])
    avg = total/(end-start+1)  # (end-start+1) is the number of columns in the range
    return avg
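The hard-coded + 2 is what ties the function to the column order. Here is a minimal sketch, assuming the time columns are still labelled 0, 1, 2, ... and sit next to each other, that looks the offset up with columns.get_loc instead. The names merged, mean_between_events and offset are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

# Same shape as the merged frame above, rebuilt so the snippet runs on its own.
merged = pd.DataFrame(
    [["4/20/21 4:20", 302, 0, 1, 2, 3, 4, 5, 0, 2, 3],
     ["2/17/21 9:20", 135, 1, 1.4, 1.8, 2, 8, 10, 0, 1, 4],
     ["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4, 3, 4, 5]],
    columns=["Date_Time", "CID", 0, 1, 2, 3, 4, 5, "event_1", "event_2", "event_3"],
)

def mean_between_events(row, offset):
    # offset is the position of the column labelled 0 (hypothetical parameter)
    start = int(row["event_2"]) + offset
    end = int(row["event_3"]) + offset
    # Positional slice, end inclusive; cast to float because the row is object dtype
    return row.iloc[start:end + 1].astype(float).mean()

# Look up where the first time column sits instead of assuming position 2
offset = merged.columns.get_loc(0)
result = merged.apply(mean_between_events, axis=1, offset=offset)
print(result)
```

This still assumes the time columns are contiguous and in order; it only removes the dependence on how many columns come before them.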

Finally, we run df.apply with this function to get our new column.

df['avg'] = df.apply(func,axis=1)
df
      Date_Time  CID  0    1    2    3    4     5  event_1  event_2  event_3   avg
0  4/20/21 4:20  302  0  1.0  2.0  3.0  4.0   5.0        0        2        3  2.50
1  2/17/21 9:20  135  1  1.4  1.8  2.0  8.0  10.0        0        1        4  3.30
2  2/17/21 9:20  111  4  5.0  5.1  5.2  5.3   5.4        3        4        5  5.35
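If the column order is likely to change, the same averages can be computed without any positional indexing. The sketch below is an alternative to the apply-based function, not the approach from the answer: it assumes the time columns are labelled with the integers 0 through 5 and builds a boolean mask with NumPy broadcasting. The names time_cols, mask and merged are illustrative.

```python
import numpy as np
import pandas as pd

example_g_table = pd.DataFrame(
    [["4/20/21 4:20", 302, 0, 1, 2, 3, 4, 5],
     ["2/17/21 9:20", 135, 1, 1.4, 1.8, 2, 8, 10],
     ["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]],
    columns=["Date_Time", "CID", 0, 1, 2, 3, 4, 5],
)
example_s_table = pd.DataFrame(
    [["4/20/21 4:20", 302, 0, 2, 3],
     ["2/17/21 9:20", 135, 0, 1, 4],
     ["2/17/21 9:20", 111, 3, 4, 5]],
    columns=["Date_Time", "CID", "event_1", "event_2", "event_3"],
)
merged = pd.merge(example_g_table, example_s_table, on=["Date_Time", "CID"], how="left")

# Assumption: these labels are the time points and match the event values
time_cols = [0, 1, 2, 3, 4, 5]
values = merged[time_cols].to_numpy(dtype=float)    # shape (subjects, time points)
time_points = np.array(time_cols)                   # shape (time points,)
start = merged["event_2"].to_numpy()[:, None]       # shape (subjects, 1)
end = merged["event_3"].to_numpy()[:, None]

# True where a time point lies between event_2 and event_3, both inclusive
mask = (time_points >= start) & (time_points <= end)

# Blank out everything outside the window, then average what is left per row
merged["avg"] = np.nanmean(np.where(mask, values, np.nan), axis=1)
print(merged[["Date_Time", "CID", "avg"]])
```

Because the selection is by label, extra columns before or after the time points do not affect the result.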

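【Discussion】:

Thank you so much! This is exactly what I wanted.
What would you suggest if I use this in my main dataframe code? When I try to apply it, I get: TypeError: cannot do positional indexing on Index with these indexer [66.0] of type float
My guess is that this error occurs because you are passing floats rather than integers to df.iloc[]. As the code above is written, we pass start and end+1 to iloc, and both should be integers because the two lines right above them convert them with int(). Did you remove the int() or modify it in some other way? Or is there now somewhere else where you pass another value to iloc[] that needs to be converted to int?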
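One way float markers can appear, offered as an assumption since the asker's real data is not shown: a left merge leaves NaN in the event columns for any subject without a matching row, and pandas then stores those columns as float64. A minimal sketch with a hypothetical frame, checking the dtypes and casting them before any positional slicing:

```python
import pandas as pd

# Hypothetical merged frame: CID 135 has no matching event row, so a left
# merge would leave NaN there and turn both event columns into float64.
merged = pd.DataFrame({
    "CID": [302, 135],
    0: [0.0, 1.0], 1: [1.0, 1.4], 2: [2.0, 1.8], 3: [3.0, 2.0],
    "event_2": [2.0, float("nan")],
    "event_3": [3.0, float("nan")],
})
event_cols = ["event_2", "event_3"]

dtypes_before = merged[event_cols].dtypes  # float, because of the NaN values

# Drop rows without event markers, then cast the markers to plain integers
complete = merged.dropna(subset=event_cols).astype({"event_2": int, "event_3": int})
dtypes_after = complete[event_cols].dtypes

print(dtypes_before, dtypes_after, sep="\n")
```

Whether dropping the incomplete rows is appropriate depends on the data; the point is only that the values reaching iloc should be integers.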