Linking two files by ID, and then removing data values from one file by referencing the other in Python using DataFrames

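【Question title】: Linking two files by ID, and then removing data values from one file by referencing the other in Python using DataFrames
【Posted】: 2020-07-02 10:41:14
【Question】:

I don't think this problem is that complicated; I'm probably just missing something obvious, and I don't know how to phrase my search.

I have two files that are linked by a common ID. One file (FileA) lists an upper year and a lower year on each row. The other file (FileB) contains a range of years. I don't need the years in FileB that fall within the intervals defined in FileA. How can I remove them by referencing the common ID? This has to be done for each ID group, which adds to the complexity.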

FileA:

ID, uyear, lyear

2341, 2005, 1995

2341, 2013, 2010

So for ID 2341 in FileB, I don't need the years 1995 to 2005 and 2010 to 2013.

Example FileB:

ID, year, price

4321, 1991, 2.45

4321, 1992, 2.47

4321, 1993, 3.4

4321, 1994, 3.4

4321, 1995, 2.34

4321, 1996, 2.44

3214, 1990, 2.33

3214, 1991, 2.44

3214, 1992, 2.55

【Question comments】:

Tags: python-3.x pandas

【Solution 1】:

I added some 2341 rows to your file_b example to show that they will be filtered out:

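import pandas as pd

# Ranges to exclude, one per row. The lower bound (lyear in the question)
# is named "year" here, which is why the merge below yields year_x / year_y.
file_a = pd.DataFrame(
    data=[[2341, 2005, 1995],
          [2341, 2013, 2010]],
    columns=["id", "uyear", "year"]
)
# Yearly prices; the 2341 rows were added to show the filtering.
file_b = pd.DataFrame(
    data=[[4321, 1991, 2.45],
          [4321, 1992, 2.47],
          [4321, 1993, 3.4],
          [4321, 1994, 3.4],
          [4321, 1995, 2.34],
          [4321, 1996, 2.44],
          [2341, 1994, 2.34],
          [2341, 1995, 2.34],
          [2341, 1996, 2.44],
          [3214, 1990, 2.33],
          [3214, 1991, 2.44],
          [3214, 1992, 2.55]],
    columns=["id", "year", "price"]
)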

Note that we expect exactly one of the 2341 rows to be kept: the one for 1994. The other two rows fall within one of the ranges in file_a.

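remove_indexes = (file_b
    # Carry each row's original file_b index through the merge
    .assign(file_b_index=lambda x: x.index)
    # Both frames have a "year" column, so merge suffixes them:
    # year_x comes from file_b, year_y (the lower bound) from file_a
    .merge(file_a, on="id", how="left")
    # Keep only the pairs where the file_b year lies inside the range
    .query("year_x >= year_y and year_x <= uyear")
    .file_b_index)
# Drop the matched rows, then renumber the remaining ones
file_b[~file_b.index.isin(remove_indexes)].reset_index()[["id", "year", "price"]]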

This yields:

      id    year    price
0   4321    1991    2.45
1   4321    1992    2.47
2   4321    1993    3.40
3   4321    1994    3.40
4   4321    1995    2.34
5   4321    1996    2.44
6   2341    1994    2.34
7   3214    1990    2.33
8   3214    1991    2.44
9   3214    1992    2.55

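The basic idea is to determine which indexes need to be removed from file_b (because they match on id and fall within at least one range), and then to remove those rows from the original frame by index.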
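The same idea can be sketched as a small helper. This is only a sketch, not the answerer's code: the function name `drop_years_in_ranges` is hypothetical, and it assumes file_a names its bounds `lyear` and `uyear` as in the question, so the merge produces no `_x`/`_y` suffixes. It also assumes the index of file_b is unique and treats both bounds as inclusive.

```python
import pandas as pd


def drop_years_in_ranges(file_b, file_a):
    """Hypothetical helper: drop file_b rows whose year lies inside
    any [lyear, uyear] range listed in file_a for the same id."""
    merged = (
        file_b
        .assign(file_b_index=file_b.index)  # remember the original index
        .merge(file_a, on="id", how="inner")  # one row per (row, range) pair
    )
    # Element-wise inclusive check: lyear <= year <= uyear
    in_range = merged["year"].between(merged["lyear"], merged["uyear"])
    remove_indexes = merged.loc[in_range, "file_b_index"]
    return file_b[~file_b.index.isin(remove_indexes)]


file_a = pd.DataFrame(
    {"id": [2341, 2341], "uyear": [2005, 2013], "lyear": [1995, 2010]}
)
file_b = pd.DataFrame(
    {
        "id": [4321, 2341, 2341, 2341, 3214],
        "year": [1995, 1994, 1995, 1996, 1990],
        "price": [2.34, 2.34, 2.34, 2.44, 2.33],
    }
)

result = drop_years_in_ranges(file_b, file_a)
print(result)
```

Because the ranges are matched per id, the 1995 row for 4321 is kept even though 1995 lies inside a range that belongs to 2341.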

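【Comments】:

.query("year_x >= year_y and year_x <= uyear") I'm confused about how this line works. I get the gist, that it sets the upper and lower bounds, but I don't know where year_x and year_y come from?
Actually, I figured it out! This works great, thank you so much. I think I can work out how it applies to the following steps as well. :)
Ah, it looks like I missed the "l" in lyear in the question. Sorry for the confusion.
Hmm, I still seem to be getting this error. I don't know what year_y and year_x are. I added another row to file_a: 4321, 1990, 1994. It doesn't exclude 1991 to 1994 as it should.
Update: I got it working; I had entered the row backwards: file_a: 4321, 1990, 1994. I'm trying to get this to work with a larger file, from which this is a sample I extracted (copied and pasted from here). For some reason, when I use my actual .csv file, I get the error: 'year_x is not defined'. This seems to come from a difference in how the "year" column is labeled. I'm not sure what adjustments to make to get the labels working. Sorry for the trouble!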
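On where year_x and year_y come from: when both frames passed to `merge` have a non-key column with the same name, pandas appends its default suffixes `_x` (left frame) and `_y` (right frame). If the column labels in the real CSV do not match exactly, there is no overlap, no suffix is added, and `query` cannot find `year_x`. The sketch below illustrates this. The ` Year` header (capitalized, with a leading space) is only an assumption about what the real file might contain; the thread does not show the actual header.

```python
import pandas as pd

file_a = pd.DataFrame({"id": [2341], "uyear": [2005], "year": [1995]})
file_b = pd.DataFrame(
    {"id": [2341, 2341], "year": [1994, 1995], "price": [2.34, 2.34]}
)

# Both frames have a "year" column, so merge disambiguates them with suffixes.
overlap_cols = list(file_b.merge(file_a, on="id", how="left").columns)
print(overlap_cols)

# Assumed variant: the CSV header reads " Year" instead of "year".
file_b_csv = file_b.rename(columns={"year": " Year"})
no_overlap_cols = list(file_b_csv.merge(file_a, on="id", how="left").columns)
print(no_overlap_cols)  # no year_x here, so query("year_x ...") would fail

# Normalizing the labels (strip whitespace, lowercase) restores the overlap.
file_b_csv.columns = file_b_csv.columns.str.strip().str.lower()
fixed_cols = list(file_b_csv.merge(file_a, on="id", how="left").columns)
print(fixed_cols)
```

Printing `df.columns` right after `pd.read_csv` is a quick way to check whether the labels match what the query expects.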