【Question Title】: How to check if all the elements in a list are present in a pandas column
【Posted】: 2019-04-18 11:39:16
【Question】:

I have a DataFrame and a list:

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8], 
    'char':[['a','b'],['a','b','c'],['a','c'],['b','c'],[],['c','a','d'],['c','d'],['a']]})

names = ['a','c']

I want to keep only the rows where both a and c appear in the char column. (Order does not matter here.)

Expected output:

       char  id
1  [a, b, c]   2
2     [a, c]   3
5  [c, a, d]   6

My attempt:

true_indices = []
for idx, row in df.iterrows():
    if all(name in row['char'] for name in names):
        true_indices.append(idx)


ids = df[df.index.isin(true_indices)]

This gives me the correct output, but it is too slow for large datasets, so I am looking for a more efficient solution.

【Comments】:

    Tags: python python-3.x pandas


    【Solution 1】:

    Use pd.Series.apply:

    df[df['char'].apply(lambda x: set(names).issubset(x))]
    

    Output:

       id       char
    1   2  [a, b, c]
    2   3     [a, c]
    5   6  [c, a, d]
    

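    【Comments】:

      【Solution 2】:

      You can build a set from the list of names for faster lookup, and use set.issubset to check whether every element of that set is contained in each list of the column: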

      names = set(['a','c'])
      df[df['char'].map(names.issubset)]
      
         id       char
      1   2  [a, b, c]
      2   3     [a, c]
      5   6  [c, a, d]
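
As a minimal, self-contained sketch of why this works: set.issubset accepts any iterable, so each list in the column can be passed to the bound method directly, without a lambda and without converting the list to a set first. The variable names `required`, `mask` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})

# Build the set once, outside the per-row work.
required = {'a', 'c'}

# required.issubset is a bound method; map calls it once per list.
mask = df['char'].map(required.issubset)
result = df[mask]

print(result['id'].tolist())  # [2, 3, 6]
```

Note that an empty `required` set is a subset of everything, so in that case every row would be kept.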
      

      【Comments】:

      • This one is faster than the rest. Thanks :-)
      【Solution 3】:

      Use issubset with a list comprehension:

      mask = [set(names).issubset(x) for x in df['char']]
      df = df[mask]
      print (df)
         id       char
      1   2  [a, b, c]
      2   3     [a, c]
      5   6  [c, a, d]
      

      Another solution with Series.map:

      df = df[df['char'].map(set(names).issubset)]
      print (df)
         id       char
      1   2  [a, b, c]
      2   3     [a, c]
      5   6  [c, a, d]
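
If the column may also contain missing values (None or NaN) instead of lists, the bare issubset call would raise a TypeError on those cells. The following is only a sketch under that assumption; the helper name `rows_containing_all` is hypothetical and does not come from the answers above.

```python
import pandas as pd


def rows_containing_all(frame, column, required):
    """Return the rows of `frame` whose cell in `column` holds every item of `required`."""
    required = set(required)
    # Cells that are not list-like (e.g. None or NaN) never match.
    mask = [isinstance(cell, (list, tuple, set)) and required.issubset(cell)
            for cell in frame[column]]
    return frame[mask]


sample = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'char': [['a', 'c'], None, ['c'], ('c', 'a', 'b')],
})

print(rows_containing_all(sample, 'char', ['a', 'c'])['id'].tolist())  # [1, 4]
```

The isinstance check runs first, so issubset is only ever called on cells that can be iterated.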
      

      Performance depends on the number of rows and the number of matched values:

      df = pd.concat([df] * 10000, ignore_index=True)
      
      In [270]: %timeit df[df['char'].apply(lambda x: set(names).issubset(x))]
      45.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
      
      In [271]: %%timeit
           ...: names = set(['a','c'])
           ...: [names.issubset(set(row)) for _,row in df.char.iteritems()]
           ...: 
      46.7 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
      
      
      In [272]: %%timeit
           ...: df[[set(names).issubset(x) for x in df['char']]]
           ...: 
      45.6 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
      
      In [273]: %%timeit
           ...: df[df['char'].map(set(names).issubset)]
           ...: 
      18.3 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
      
      In [274]: %%timeit
           ...: n = set(names)
           ...: df[df['char'].map(n.issubset)]
           ...: 
      16.6 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      
      In [279]: %%timeit
           ...: names = set(['a','c'])
           ...: m = [names.issubset(i) for i in df.char.values.tolist()]
           ...: 
      19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
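
Outside IPython, a comparison like the one above can be approximated with the standard-library timeit module. This is a rough sketch, not a reproduction of the measurements shown: the labels, the `candidates` dictionary and the choice of three runs are assumptions, and absolute numbers will vary with hardware and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})
# Same scaling as in the answer: repeat the 8 rows 10000 times.
big = pd.concat([base] * 10000, ignore_index=True)
names = ['a', 'c']
required = set(names)

# Each candidate returns the filtered frame so the results can be compared.
candidates = {
    'apply + lambda': lambda: big[big['char'].apply(lambda x: set(names).issubset(x))],
    'list comprehension': lambda: big[[required.issubset(x) for x in big['char']]],
    'map + bound method': lambda: big[big['char'].map(required.issubset)],
}

for label, func in candidates.items():
    seconds = timeit.timeit(func, number=3) / 3  # average over 3 runs
    print(f'{label}: {seconds * 1000:.1f} ms per run')
```

Building the set once and passing the bound method avoids both the per-row set construction and the extra lambda call, which is consistent with the map variants being fastest in the timings above.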
      

      【Comments】:

      • @yatu - Hmm, not for me, but real data seem to behave differently: %%timeit names = set(['a','c']) m = [names.issubset(i) for i in df.char.values.tolist()] 19.2 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      【Solution 4】:

      Try this.

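      import numpy as np
      # Replace non-matching lists with NaN, then drop those rows (this overwrites df['char'])
      df['char'] = df['char'].apply(lambda x: x if ("a" in x and "c" in x) else np.nan)
      print(df.dropna())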
      

      Output:

         id       char
      1   2  [a, b, c]
      2   3     [a, c]
      5   6  [c, a, d]
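
The snippet above hard-codes "a" and "c" and overwrites the char column. A possible variant, sketched here under the assumption that the original DataFrame should stay untouched, builds a boolean mask from the names list instead; `mask` and `filtered` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'char': [['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], [],
             ['c', 'a', 'd'], ['c', 'd'], ['a']],
})
names = ['a', 'c']

# True where every name occurs in the row's list; df itself is not modified.
mask = df['char'].apply(lambda chars: all(name in chars for name in names))
filtered = df[mask]

print(filtered['id'].tolist())  # [2, 3, 6]
```

This keeps the same membership test as the original answer while working for any list of names.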
      

      【Comments】:
