Filter dataframe1 with values from dataframe2 and select all rows in dataframe1 after a particular row value in Python

【Question title】: Filter dataframe1 with values from dataframe2 and select all rows in dataframe1 after a particular row value in Python
【Posted】: 2021-05-22 12:38:09
【Question description】:

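I have two dataframes (df1 and df2). For each row in df2, which gives a particular ID and Query, I want to get all rows from df1 for that ID that come after the row where the Query string is found. The only condition is that the "Sent by" value of the rows shown must be "Support Team".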

I tried something like df1 = df1.loc['can we find tigers in Amazon forest?':] but got a KeyError. Can anyone help me with this?

Note: the index in df1 is not sorted, because the dataframes are grouped by ID.
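A likely reason for the error, sketched under assumptions: `.loc` slices by index label, and the query text lives in the `Query` column, not in the index. The three-row frame below is a cut-down stand-in for df1, and the exact exception type may differ between pandas versions, so both plausible types are caught. One way around it is to find the row's position and slice with `.iloc`.

```python
import pandas as pd

# Cut-down stand-in for df1 from the question; index labels are unsorted integers.
df1 = pd.DataFrame(
    {
        "ID": [76649, 76649, 76649],
        "Query": [
            "this is Fred from support team",
            "can we find tigers in Amazon forest?",
            "yes tigers can be found there",
        ],
        "Sent by": ["Support Team", "Jack", "Support Team"],
    },
    index=[10, 5, 6],
)

query = "can we find tigers in Amazon forest?"

# .loc looks the string up among the index labels (10, 5, 6), where it does not exist.
try:
    df1.loc[query:]
    raised = False
except (KeyError, TypeError):
    raised = True

# Search the column instead, then slice by position so unsorted labels do not matter.
mask = (df1["Query"] == query).to_numpy()
if mask.any():
    pos = mask.argmax()          # position of the first matching row
    after = df1.iloc[pos + 1:]   # everything below that row
else:
    after = df1.iloc[0:0]        # no match: empty frame with the same columns

print(after)
```

This only covers a single query in a single conversation; restricting the result to one ID and to "Support Team" rows still has to be done on top of it.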

df1 =

Index  ID        Query                                    Sent by
0    76649  Hi                                           Jack
2    76649  Anyone there                                 Jack
3    76649  yes hi                                    Support Team
10   76649  this is Fred from support team            Support Team
5    76649  can we find tigers in Amazon forest?        Jack
6    76649  yes tigers can be found there             Support Team
7    76649  contact forest dept for more              Support Team
13   76649  thanks for reaching out                   Support Team
9    67209  Hello                                      Bianca
4    67209  Anyone there                              Bianca
11   67209  Hi this is Jim from support team          Support Team
12   67209  can we find lions in Amazon forest?       Bianca
8    67209  yes lions  can be found there             Support Team
14   67209  contact forest dept for more              Support Team
15   67209  thanks for reaching out                   Support Team
16   67209  sure that helps..thank you                Bianca

df2 =

Index        Query                                         ID
0      can we find tigers in Amazon forest?               76649
2      can we find lions in Amazon forest?                67209
3      can we find elephant in Amazon forest?             77832

Expected output:

76649  yes tigers can be found there             Support Team
76649  contact forest dept for more              Support Team
76649  thanks for reaching out                   Support Team
67209  yes lions  can be found there             Support Team
67209  contact forest dept for more              Support Team
67209  thanks for reaching out                   Support Team

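【Question comments】:

Tags: python python-3.x pandas dataframe merge

【Solution 1】:

I don't know if this is the most elegant way, but this would be my attempt. If anything is unclear, please ask for clarification.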

import pandas as pd

# Just example dfs for testing ('Sent by' is included so the final filter has a column to work on)
df1 = pd.DataFrame({'Query': ['this is Fred from support team', 'can we find tigers in Amazon forest?', 'yes tigers can be found there', 'can we find lions in Amazon forest?', 'yes lions can be found there'],
                    'ID':   [1, 1, 1, 2, 2],
                    'Sent by': ['Support Team', 'Jack', 'Support Team', 'Bianca', 'Support Team']})
df2 = pd.DataFrame({'Query': ['can we find tigers in Amazon forest?', 'can we find lions in Amazon forest?'],
                    'ID':   [1, 2]})

# Reset the index so that a greater index always means a later row
# (the old index is kept as a column named 'index')
df1.reset_index(inplace=True)

# Collect the matching slices of df1 here
parts = []

# Iterate over df2
for idx, row in df2.iterrows():

    # Find the index of the row in df1 with the matching string and ID
    index = df1.index[(df1['Query'] == row['Query']) & (df1['ID'] == row['ID'])]

    # index is a pandas Index (list-like), so check that the result is consistent with our logic
    # If the string was not found, skip this row of df2
    if len(index) == 0:
        continue

    # Add code here for the case where the same string appears more than once with the same ID in df1
    elif len(index) > 1:
        print("Oops, your string seems to appear more than once with the same ID!")

    else:
        # Keep all rows with the same ID that come after the matching row
        parts.append(df1[(df1['ID'] == row['ID']) & (df1.index > index[0])])

# Build the output (DataFrame.append was removed in pandas 2.0, so use pd.concat)
output = pd.concat(parts) if parts else df1.iloc[0:0]

# Filter by support team
output = output[output['Sent by'] == 'Support Team']
print(output)
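For comparison, here is a loop-free sketch of the same idea. It is not the method used in the answer above, just one possible alternative: flag the rows of df1 whose (ID, Query) pair occurs in df2, then use a per-ID running count to mark every row from that point on. The data is copied from the question; the names `pairs`, `is_trigger`, `seen` and `result` are made up for this example.

```python
import pandas as pd

# Data copied from the question (index labels are intentionally unsorted).
df1 = pd.DataFrame(
    {
        "ID": [76649] * 8 + [67209] * 8,
        "Query": [
            "Hi", "Anyone there", "yes hi", "this is Fred from support team",
            "can we find tigers in Amazon forest?", "yes tigers can be found there",
            "contact forest dept for more", "thanks for reaching out",
            "Hello", "Anyone there", "Hi this is Jim from support team",
            "can we find lions in Amazon forest?", "yes lions  can be found there",
            "contact forest dept for more", "thanks for reaching out",
            "sure that helps..thank you",
        ],
        "Sent by": [
            "Jack", "Jack", "Support Team", "Support Team",
            "Jack", "Support Team", "Support Team", "Support Team",
            "Bianca", "Bianca", "Support Team", "Bianca",
            "Support Team", "Support Team", "Support Team", "Bianca",
        ],
    },
    index=[0, 2, 3, 10, 5, 6, 7, 13, 9, 4, 11, 12, 8, 14, 15, 16],
)
df2 = pd.DataFrame(
    {
        "Query": [
            "can we find tigers in Amazon forest?",
            "can we find lions in Amazon forest?",
            "can we find elephant in Amazon forest?",
        ],
        "ID": [76649, 67209, 77832],
    },
    index=[0, 2, 3],
)

# Flag the rows of df1 whose (ID, Query) pair appears in df2.
pairs = set(zip(df2["ID"], df2["Query"]))
is_trigger = pd.Series(
    [pair in pairs for pair in zip(df1["ID"], df1["Query"])],
    index=df1.index,
)

# Within each ID, the running count of flagged rows is > 0 from the first flagged row onward.
seen = is_trigger.astype(int).groupby(df1["ID"]).cumsum() > 0

# Keep rows after the query row (not the query row itself) that the support team sent.
result = df1[seen & ~is_trigger & (df1["Sent by"] == "Support Team")]
print(result)
```

Because the running count follows row order rather than index labels, this version does not need reset_index. Note that it starts from the first occurrence of a query within an ID, whereas the loop above skips queries that appear more than once.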

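【Comments】:

output = output[output['Sent by'] == 'Support Team'] should do the job. I have added it to the answer as well. If it works, maybe you can give me feedback, because I have not tested it yet. There may be typos.
OK. First I iterate over df2. On each iteration I get the index (idx) and the contents of the row (row). Then I look up the index of every row in df1 whose Query and ID values match the corresponding values in row. To understand it, try print(df1[(df1['Query'] == row['Query']) & (df1['ID'] == row['ID'])]). The .index attribute just gives a list-like object with the index labels of the rows in that printed frame. Be careful with the new index I created with reset_index.
I think output = output[output['Sent by'] == 'Support Team'] is easier to understand. Basically I do the same thing in the loop, just joining two conditions with a logical AND. output.index[output['Sent by'] == 'Support Team'], on the other hand, does not return a new dataframe with the rows that meet the condition, only the indices of those rows.
Basically, index holds all index labels in df1 whose row matches the string (and ID) from df2. I check that the string is found exactly once, so index[0] is the index in df1 where the Query string matches the query from df2. Then I append all rows that have the same ID and an index greater than index[0].
You can print these for a better understanding: print(df1[(df1['ID'] == row['ID']) & (df1.index > index[0])]) or print(index)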
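The difference between the two expressions can be seen on a tiny frame. This is only an illustration; the two-row `output` frame is made up for the example.

```python
import pandas as pd

# Hypothetical two-row frame standing in for `output`.
output = pd.DataFrame(
    {
        "Query": ["can we find tigers in Amazon forest?", "yes tigers can be found there"],
        "Sent by": ["Jack", "Support Team"],
    }
)

mask = output["Sent by"] == "Support Team"

rows = output[mask]          # a DataFrame containing the matching rows
labels = output.index[mask]  # an Index containing only the labels of those rows

print(rows)
print(labels)
```

So `output[mask]` is what you want when you need the rows themselves, and `output.index[mask]` when you only need to know where they are.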