Python int comparison not working properly in pandas: answers

【Question title】: Python int comparison not working properly in pandas
【Posted】: 2020-12-23 18:54:00
【Question】:

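I'm developing a function that computes a figure for a document based on an evaluation of a dataset. I chose pandas because it seems to be the most efficient way to work with large datasets. My columns are: citing (identifier), cited (identifier), creation (string, YYYY-MM or YYYY).
I need to add to a set the identifiers of all citing objects that were created in year-1 or year-2. I found a neat trick for subsetting a DataFrame by index: I save the DataFrame indexed by 'citing' to a local variable ('citing'), then use .loc[identifier]['creation'] to get the value of the creation column for that row. The problem is that this can return either a Series (when several rows share the identifier) or a string (when there is only one value, so the creation date is returned directly).
Since the value can be in either str(YYYY-MM) or str(YYYY) format, I also have to slice it with [:4] before doing the actual comparison. I tried writing a conditional block based on the data type, but something must be wrong, because this is what my DEBUG lines print: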

DEBUG:  2014 is == to either 2015 or 2014
DEBUG:  2016 is == to either 2015 or 2014
DEBUG:  2018 is == to either 2015 or 2014
DEBUG:  2015 is == to either 2015 or 2014

I also tried a string comparison, converting the dates with str() and then comparing the strings; unfortunately I got the same result.

for identifier in ls:
    citing = data.set_index('citing')  # save data indexed by 'citing' column to local variable
    try:                               # handle KeyError exception

        creation = citing.loc[identifier]['creation']  # this can either be a str or a pandas series

        if type(creation) == pandas.core.series.Series:
            if int(creation.iloc[0][:4]) == (int(year))-1 or int(creation.iloc[0][:4]) == (int(year))-2:
                print('DEBUG: ', creation.iloc[0][:4], 'is == to either {} or {}'.format(str(int(year)-1), str(int(year)-2)))
                pub.add(identifier)

        elif type(creation) == str:
            if int(creation[:4]) == (int(year))-1 or (int(year))-2:
                print('DEBUG: ', creation[:4], 'is == to either {} or {}'.format(str(int(year)-1), str(int(year)-2)))
                pub.add(identifier)

    except KeyError:
        pass

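This is really my first complex function in Python, so some things may be obviously wrong, slow, or inefficient; please point them out so I can improve my function! Thanks!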

Edit: sample input as a pandas DataFrame:

   citing  cited creation
0    1234   1235  2018-11
1    1237   1234     2017
2    1236   1237  2011-01
3    1234   1248  2018-11
4    1235   1236  2018-11

If the input is this DataFrame and the year 2018, the resulting set should contain only {1237}, because it is the only one created in y-1 or y-2.
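A likely explanation for the DEBUG output in the question lies in the `elif type(creation) == str` branch of the posted code. There the condition is written as `int(creation[:4]) == (int(year))-1 or (int(year))-2`, which Python parses as `(a == b) or c`. Because `c` is a non-zero integer, the whole expression is always truthy. The sketch below reproduces this with plain strings; the value `year = "2016"` is an assumption inferred from the "2015 or 2014" in the DEBUG lines, and the helper names `buggy` and `fixed` are purely illustrative.

```python
year = "2016"  # assumed value, inferred from the DEBUG output in the question


def buggy(creation):
    # Parsed as (int(creation[:4]) == 2015) or 2014.
    # 2014 is a non-zero int, so the result is always truthy.
    return bool(int(creation[:4]) == int(year) - 1 or int(year) - 2)


def fixed(creation):
    # Test membership against both candidate years explicitly.
    return int(creation[:4]) in (int(year) - 1, int(year) - 2)


samples = ["2014", "2016", "2018-11", "2015"]
buggy_hits = [s for s in samples if buggy(s)]
fixed_hits = [s for s in samples if fixed(s)]

print(buggy_hits)  # every sample passes the buggy test
print(fixed_hits)  # only the 2014 and 2015 entries pass
```

The `Series` branch of the posted code spells out both comparisons, so only the `str` branch is affected by this particular problem.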

【Question discussion】:

Can you share sample data and the desired output? There may also be another way to do this.

Tags: python python-3.x pandas dataframe series

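【Solution 1】:

You can find (almost) all the rows that match your criteria at once. This is in fact more efficient, because the condition is evaluated for all rows in a single pass instead of looping over each value.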

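year = int(year)  # the arithmetic below needs an int, not a str

# Index labels of the rows whose creation year is year-1 or year-2
ix = df[
    df.creation.astype(str).str[:4].astype(int).isin({year - 1, year - 2})
].index
identifiers = set(df.loc[ix, 'citing'])  # citing identifiers of those rows
pub |= identifiers  # merge them into the existing result set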

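More explanation:

.astype(str) -> makes sure every value is of type str, even the year-only ones (just in case)

.str -> the pandas string accessor, which lets you use string methods on the column (see the pandas documentation for more)

[:4] -> string slicing, which captures the first 4 characters

.astype(int) -> converts the whole result to int (note that this may fail if you have rows with missing values; see the workaround below)

.isin(...) -> checks, for each row, whether the value is among those given in (...)

You get an index that can be used to filter the DataFrame in a single operation.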
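The chain described above can be traced on the sample data from the question. This is a minimal sketch, assuming `year` is already an int and using a boolean mask directly rather than the intermediate index; the variable names `creation_year` and `mask` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "citing": [1234, 1237, 1236, 1234, 1235],
    "cited": [1235, 1234, 1237, 1248, 1236],
    "creation": ["2018-11", "2017", "2011-01", "2018-11", "2018-11"],
})
year = 2018  # assumed to be an int here

# Step by step: force str, keep the first 4 characters, convert to int
creation_year = df["creation"].astype(str).str[:4].astype(int)

# One boolean per row: was the row created in year-1 or year-2?
mask = creation_year.isin({year - 1, year - 2})

# Collect the matching 'citing' identifiers as plain Python ints
pub = set(df.loc[mask, "citing"].tolist())
print(pub)  # {1237}
```

Selecting with the mask gives the same rows as going through `.index` first, and it avoids surprises if the DataFrame's index contains duplicate labels.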

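If you have missing values, you can fill them first, for example with df['creation'] = df['creation'].fillna("1000"). (The form df['creation'].fillna("1000", inplace=True) triggers a chained-assignment warning in recent pandas versions.)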
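As an alternative to filling in a sentinel year, missing values can be left as missing and simply never match. The sketch below swaps `.astype(int)` for `pd.to_numeric(..., errors="coerce")`, which is a different technique from the one in the answer; the small DataFrame, including the row with identifier 1238, is made up for illustration.

```python
import pandas as pd

# Illustrative data with one missing creation date
df = pd.DataFrame({
    "citing": [1234, 1237, 1236, 1238],
    "creation": ["2018-11", "2017", None, "2016-05"],
})
year = 2018  # assumed to be an int here

# errors="coerce" turns anything that cannot be parsed as a number
# (including missing values) into NaN instead of raising
creation_year = pd.to_numeric(df["creation"].str[:4], errors="coerce")

# NaN is not in the set of wanted years, so missing rows drop out
mask = creation_year.isin([year - 1, year - 2])
pub = set(df.loc[mask, "citing"].tolist())
print(pub)
```

With this approach no placeholder such as "1000" ends up in the data, so later steps cannot mistake it for a real year.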

【Discussion】: