Filter list of dataframes with another list of dataframes (answers)

【Question title】: Filter list of dataframes with another list of dataframes
【Posted】: 2020-01-18 05:30:18
【Question】:

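I have two lists of dataframes with the same structure, and I am trying to omit each dataframe in list_a if at least one value in list_a[df_a][col_a] exists in col_a of any list_b dataframe. I have made a few passes at it but have not found anything that actually gets it done. My approach is probably wrong, so please point me in the right direction!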

Approach:

    for df_a in list_a:
        for df_b in list_b:
            temp = df_a[~df_a['col_a'].isin([df_b['col_a']])] # error 'list indices must be integers or slices, not 
            if len(temp.index) > 0:
                list_a.remove(df_a)

list_a[0]

    col_a   temp
877 12/17/2019  0.300807486
886 12/31/2019  0.143508662

list_a[1]

    col_a   temp
651 7/27/2019   0.435680418
660 8/10/2019   0.229333215

list_b[0]

    col_a   temp
1   12/31/2019  0.843356517
10  1/14/2020   0.846720719

list_omit[0]

    col_a   temp
1   12/17/2019  0.600807486
2   12/31/2019  0.143508662

Result: because list_a[0] and list_b[0] overlap on the date 12/31/2019, list_a[0] should be removed from list_a and added to an "omit" list of dfs.

Reproducible example:

import numpy as np
import pandas as pd

temp = list(range(0, 2))
list_a = []
list_b = []

for l in temp:
    df = pd.DataFrame(np.random.randint(0,100,size=(2, 2)), columns=list(['col_a','temp']))
    list_a.append(df)

for l in temp:
    df = pd.DataFrame(np.random.randint(0,100,size=(2, 2)), columns=list(['col_a','temp']))
    list_b.append(df)

print(list_a)
print(list_b)

Thanks for your help.

【Question comments】:

Tags: python pandas

【Solution 1】:

You can use a modified version of this solution:

import numpy as np
import pandas as pd

np.random.seed(2020)
temp = list(range(0, 2))
list_a = []
list_b = []

for l in temp:
    df = pd.DataFrame(np.random.randint(0,20,size=(3, 2)), columns=list(['col_a','temp']))
    list_a.append(df)

for l in temp:
    df = pd.DataFrame(np.random.randint(0,30,size=(2, 2)), columns=list(['col_a','temp']))
    list_b.append(df)

print(list_a)
print(list_b)

Create a set of all the values that trigger exclusion:

b = set([y for x in list_b for y in x['col_a']])
print (b)
{3, 28, 5, 23}
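
As a side note, the same set can probably be built by concatenating the frames first. This is only a sketch: the two frames below are hand-written stand-ins (their `temp` values are made up) chosen so that `col_a` holds the values printed above.

```python
import pandas as pd

# Hand-written stand-ins for list_b (illustrative data, not the seeded output).
list_b = [
    pd.DataFrame({'col_a': [3, 28], 'temp': [1, 2]}),
    pd.DataFrame({'col_a': [5, 23], 'temp': [3, 4]}),
]

# Stack all frames, then collect the distinct col_a values into a set.
b = set(pd.concat(list_b, ignore_index=True)['col_a'])
print(b)
```

Either form gives a plain Python set, which `Series.isin` accepts directly.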

Then loop over list_a and append each DataFrame either to the exclusion list or to a new list of kept DataFrames:

exclude = []
a = []
for df_a in list_a:
    if df_a['col_a'].isin(b).any():
        exclude.append(df_a)
    else:
        a.append(df_a)


print (exclude)
[   col_a  temp
0      0     8
1      3     3
2      3     7]

print (a)
[   col_a  temp
0     16     0
1     10     9
2     19    11]

Another idea, using list comprehensions:

exclude = [df_a for df_a in list_a if df_a['col_a'].isin(b).any()]
print (exclude)
[   col_a  temp
0      0     8
1      3     3
2      3     7]

new_a = [df_a for df_a in list_a if not df_a['col_a'].isin(b).any()]
print (new_a)
[   col_a  temp
0     16     0
1     10     9
2     19    11]
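
The two comprehensions evaluate the same `isin` test twice for every DataFrame. If that matters, a single pass can fill both lists. The sketch below assumes the date strings from the question as sample data, and `partition_frames` is a hypothetical helper name, not something from pandas.

```python
import pandas as pd


def partition_frames(frames, blocked_values, column='col_a'):
    """Split frames into (kept, excluded) based on overlap with blocked_values."""
    kept, excluded = [], []
    for frame in frames:
        # A frame is excluded as soon as any value in `column` is blocked.
        overlaps = frame[column].isin(blocked_values).any()
        (excluded if overlaps else kept).append(frame)
    return kept, excluded


# Sample data modelled on the question (temp values rounded for brevity).
list_a = [
    pd.DataFrame({'col_a': ['12/17/2019', '12/31/2019'], 'temp': [0.30, 0.14]}),
    pd.DataFrame({'col_a': ['7/27/2019', '8/10/2019'], 'temp': [0.44, 0.23]}),
]
list_b = [
    pd.DataFrame({'col_a': ['12/31/2019', '1/14/2020'], 'temp': [0.84, 0.85]}),
]

# Every col_a value that appears in any list_b frame.
blocked = {value for df_b in list_b for value in df_b['col_a']}

new_a, list_omit = partition_frames(list_a, blocked)
```

Here `list_omit` ends up holding `list_a[0]`, because it shares 12/31/2019 with `list_b[0]`, and `new_a` holds `list_a[1]`.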

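【Comments】:

First of all, thank you. I had been banging my head against this before posting. Second, both solutions work on the sample provided. However, I noticed that when I expand the temp range I get the error 'The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()'. Any idea how to resolve this? That said, the second solution, with list comprehensions, works on the expanded list, and I suspect it will perform better on larger datasets, so I will go with that approach. Thanks again!
@Meowbits - Hmm, it looks like .any() is not being used? Or maybe there are duplicated column names?
Removing .any() from the first solution did not fix the problem. There are definitely duplicated column names across the dataframes, since they all have the same structure.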
@Meowbits - I mean duplicated column names within a single DataFrame, if that is possible.
[jezrael] - Your update to solution A, which works through separate lists, does the job :) Thanks for the update
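
A possible explanation for the error discussed in these comments, offered as an assumption because the thread does not show the earlier version of the answer: if that version called `list_a.remove(df_a)` as the question's code does, `list.remove` compares elements with `==`. For DataFrames that comparison yields another DataFrame, and Python cannot turn it into a single True/False. The sketch below reproduces that behaviour with two made-up frames.

```python
import pandas as pd

# Two made-up frames with identical structure, as in the question.
df1 = pd.DataFrame({'col_a': [1, 2], 'temp': [3, 4]})
df2 = pd.DataFrame({'col_a': [5, 6], 'temp': [7, 8]})
frames = [df1, df2]

try:
    # remove() checks frames[0] == df2 first; the result is a DataFrame,
    # and converting it to bool raises ValueError.
    frames.remove(df2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Building a new list with an identity test avoids the == comparison.
frames = [df for df in frames if df is not df2]
```

Removing the first element happens to work because `list.remove` checks identity before equality, which would explain why the small sample ran cleanly and the larger one did not. Collecting results in separate lists, as the updated answer does, avoids the comparison altogether.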