Comparing 2 tuples in a dataframe: answers

【Question title】: comparing 2 tuples in a dataframe
【Posted】: 2021-11-18 05:46:22
【Question】:

Based on the following DataFrame:

import json
import numpy as np
import pandas as pd
test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': [['red','blue'], ['white'], ['blue','yellow']]})
df['colors_new'] = df.colors.map(tuple)

I am trying to create a new column that is True for a row when at least one element of test_list appears in that row:

df['found'] = any((True for x in test_list if x in df['colors_new']))
df

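In the example above, rows 0 and 2 should be True, because red is in row 0 and yellow is in row 2.

What would be the most efficient and correct way to do this? I am currently getting wrong results.

The closest I have come to a correct result is: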

df['found'] = ['red' in x for x in df['colors_new']]

But this does not work when there are several items to check (test_list = ['purple', 'red', 'yellow']).
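
For reference, the first attempt appears to fail for two reasons: `in` applied to a pandas Series tests the index labels rather than the stored values, and `any(...)` produces a single scalar that is then broadcast to every row. The sketch below reproduces this on the question's data and shows a row-wise variant; the names `scalar` and `found` are only illustrative.

```python
import pandas as pd

test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3],
                   'colors': [['red', 'blue'], ['white'], ['blue', 'yellow']]})
df['colors_new'] = df.colors.map(tuple)

# `in` on a Series checks the index labels (0, 1, 2), not the tuples it holds
print('red' in df['colors_new'])  # False
print(0 in df['colors_new'])      # True

# So the original expression collapses to one scalar for the whole column
scalar = any(x in df['colors_new'] for x in test_list)

# Row-wise variant: evaluate any() once per tuple
found = [any(x in row for x in test_list) for row in df['colors_new']]
df['found'] = found
```

This row-wise list comprehension is correct but not necessarily the fastest option.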

【Comments】:

You could use sets and intersection instead.

Tags: python pandas dataframe tuples

【Solution 1】:

If performance matters, use a set with isdisjoint:

s = set(test_list)
df['colors_new'] = ~df.colors.map(s.isdisjoint)

Or:

s = set(test_list)
df['colors_new'] = df['colors'].map(s.intersection).astype(bool)

print (df)

   numbers          colors  colors_new
0        1     [red, blue]        True
1        2         [white]       False
2        3  [blue, yellow]        True
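
Note that the snippets above overwrite colors_new with the boolean result. A minimal sketch of the same isdisjoint idea that writes to a separate column instead (the column name `found` is taken from the question, not from this answer):

```python
import pandas as pd

test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3],
                   'colors': [['red', 'blue'], ['white'], ['blue', 'yellow']]})

s = set(test_list)

# set.isdisjoint accepts any iterable and is True when nothing is shared
no_overlap = s.isdisjoint(['white'])
overlap = not s.isdisjoint(['red', 'blue'])

# map() yields a boolean Series; ~ flips "disjoint" into "has a match"
df['found'] = ~df['colors'].map(s.isdisjoint)
```

The `~` negation relies on the mapped Series having boolean dtype, which assumes there are no missing values in the colors column.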

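Performance on test data. It is best to test on your real data, because the result depends on the length of the DataFrame, the length of the test list, and the number of matching values: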

df['colors_new'] = df.colors.map(tuple)

#3k rows
df = pd.concat([df] * 1000, ignore_index=True)

test_list = ['purple', 'red', 'yellow']

s = set(test_list)

In [46]: %timeit df['colors_new'] = ~df.colors.map(s.isdisjoint)
707 µs ± 20.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [47]: %timeit df['colors_new'] = df['colors'].map(s.intersection).astype(bool)
1.38 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [50]: %timeit df['found'] = df['colors_new'].apply(lambda x: len(s.intersection(x))>0)
1.68 ms ± 42.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [51]: %timeit df['found'] = df['colors_new'].explode().isin(test_list).groupby(level=0).max()
4.66 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit df['found'] = df['colors_new'].apply(lambda x: bool(max([1 if y in test_list else 0 for y in x])))
2.91 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [54]: %timeit df["colors_map"] = df[['colors','colors_new']].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)
26.1 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
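
The timings above come from IPython's %timeit. As a rough sketch, a similar comparison can be run as a plain script with the standard timeit module; the `candidates` dictionary and the repeat counts are arbitrary choices for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

test_list = ['purple', 'red', 'yellow']
base = pd.DataFrame({'numbers': [1, 2, 3],
                     'colors': [['red', 'blue'], ['white'], ['blue', 'yellow']]})
df = pd.concat([base] * 1000, ignore_index=True)  # 3k rows, as above
s = set(test_list)

# Each candidate returns a boolean Series without modifying df
candidates = {
    'isdisjoint': lambda: ~df['colors'].map(s.isdisjoint),
    'intersection': lambda: df['colors'].map(s.intersection).astype(bool),
    'explode': lambda: df['colors'].explode().isin(test_list).groupby(level=0).max(),
}

# Check that all candidates agree before comparing their speed
results = {name: fn() for name, fn in candidates.items()}

runs = 5
timings = {name: min(timeit.repeat(fn, number=runs, repeat=3)) / runs
           for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```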

【Discussion】:

【Solution 2】:

Using explode

df['found'] = df['colors_new'].explode().isin(test_list).groupby(level=0).max()

Output:

   numbers          colors      colors_new  found
0        1     [red, blue]     (red, blue)   True
1        2         [white]        (white,)  False
2        3  [blue, yellow]  (blue, yellow)   True
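
To make the one-liner easier to follow, here is a sketch of its intermediate steps on a standalone Series; the variable names `exploded`, `hits` and `found` are illustrative only.

```python
import pandas as pd

test_list = ['purple', 'red', 'yellow']
colors_new = pd.Series([('red', 'blue'), ('white',), ('blue', 'yellow')])

# 1. One row per tuple element; the original index label is repeated
exploded = colors_new.explode()

# 2. Element-wise membership test against test_list
hits = exploded.isin(test_list)

# 3. Collapse back to one value per original row: True if any element matched
found = hits.groupby(level=0).max()
```

Because the grouping is done on the index level, this assumes the index labels are unique per row.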

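Using Python sets

You can use a set and set.intersection: if the intersection is not empty, the two have values in common.

Set operations are generally faster than a classic loop.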

test_list = set(test_list)
df['found'] = df['colors_new'].apply(lambda x: len(test_list.intersection(x))>0)

Output:

   numbers          colors      colors_new  found
0        1     [red, blue]     (red, blue)   True
1        2         [white]        (white,)  False
2        3  [blue, yellow]  (blue, yellow)   True

NB. As a bonus, you can use the same approach to get the elements that were found:

df['found elements'] = df['colors_new'].apply(test_list.intersection)

Output:

   numbers          colors      colors_new  found found elements
0        1     [red, blue]     (red, blue)   True          {red}
1        2         [white]        (white,)  False             {}
2        3  [blue, yellow]  (blue, yellow)   True       {yellow}
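
The code above rebinds test_list to a set and computes the intersection twice. A possible variant, shown only as a sketch, keeps the list untouched under a separate name (`test_set`, an illustrative name) and derives the boolean column from the stored intersections:

```python
import pandas as pd

test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3],
                   'colors': [['red', 'blue'], ['white'], ['blue', 'yellow']]})
df['colors_new'] = df['colors'].map(tuple)

test_set = set(test_list)  # separate name, so test_list stays a list

# Compute the intersection once per row...
df['found elements'] = df['colors_new'].map(test_set.intersection)
# ...and reuse it: an empty set is falsy, a non-empty set is truthy
df['found'] = df['found elements'].map(bool)
```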

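【Discussion】:

Hi, is there a way to do this with apply? I have a different approach, but it is slow because I have thousands of rows and several columns computed in a similar way.
@Manza Sure, check the update.
Would it be correct to assume that explode() is the fastest?
@Manza - I added performance tests to my answer.

【Solution 3】:

You can use a lambda function to get what you want: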

import json
import numpy as np
import pandas as pd
test_list = ['purple', 'red', 'yellow']
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': [['red','blue'], ['white'], ['blue','yellow']]})
df['colors_new'] = df.colors.map(tuple)

df['found'] = df['colors_new'].apply(lambda x: bool(max([1 if y in test_list else 0 for y in x])))
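
One caveat with this approach: max() raises a ValueError on an empty list, so a row holding an empty tuple would fail. A sketch of an equivalent lambda using any(), which returns False for empty input and stops at the first match; the fourth row with the empty tuple is a hypothetical addition for illustration.

```python
import pandas as pd

test_list = ['purple', 'red', 'yellow']
# The last row (empty tuple) is hypothetical and not part of the question's data
df = pd.DataFrame({'numbers': [1, 2, 3, 4],
                   'colors_new': [('red', 'blue'), ('white',), ('blue', 'yellow'), ()]})

# any() over a generator short-circuits and handles empty tuples
df['found'] = df['colors_new'].apply(lambda x: any(y in test_list for y in x))
```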

【Discussion】:

【Solution 4】:

You can also use a list comprehension:

df["colors_map"] = df[['colors','colors_new']].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)

If you have many colors columns to check (rather than just 2):

df["colors_map"] = df[[x for x in df.columns if "colors" in x]].apply(lambda x:any([x2 in test_list for x1 in x for x2 in x1]), axis=1)

【Discussion】: