Set Union in pandas: Answers

【Question title】: Set Union in pandas
【Posted】: 2016-11-20 13:15:59
【Question】:

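I have two columns in my DataFrame that store sets.

I want to take the set union of the two columns using a fast, vectorized operation: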

df['union'] = df.set1 | df.set2

But the error TypeError: unsupported operand type(s) for |: 'set' and 'bool' prevents me from doing so, because both columns contain np.nan values.

Is there a good way to get around this?

【Comments】:

Tags: python python-3.x numpy pandas vectorization

【Solution 1】:

For operations like this, pure Python is usually more efficient.

%timeit pd.Series([set1.union(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 43.3 ms per loop

%timeit df.apply(lambda x: x.A.union(x.B), axis=1)
1 loop, best of 3: 2.6 s per loop

The DataFrame used for the timings:

import pandas as pd
import numpy as np
l1 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
l2 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]

df = pd.DataFrame({'A': l1, 'B': l2})

【Comments】:

Won't this run into the same problem when a value in A or B is np.nan?
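
A minimal sketch of one way to address the np.nan concern raised in this comment, assuming a missing value should be treated as an empty set. The helper name as_set and the sample data are hypothetical and not from the original answers.

```python
import numpy as np
import pandas as pd


def as_set(value):
    """Return value unchanged if it is a set; otherwise (e.g. np.nan) an empty set."""
    return value if isinstance(value, set) else set()


# Hypothetical sample data with np.nan in both columns.
df = pd.DataFrame({
    "A": [{1, 2}, np.nan, {3}],
    "B": [{2, 3}, {4}, np.nan],
})

# Same zip-based comprehension as in the answer, with each value normalised first.
df["union"] = pd.Series(
    [as_set(a) | as_set(b) for a, b in zip(df["A"], df["B"])],
    index=df.index,
)
```

If a missing value should instead propagate as np.nan in the result, the helper would need to be replaced by an explicit check before taking the union.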

【Solution 2】:

This is the best I could come up with:

# method 1
df.apply(lambda x: x.set1.union(x.set2), axis=1)

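# method 2: turn each set into a list, concatenate the lists per row, convert back to a set
# (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
df.applymap(list).sum(1).apply(set)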

Wow!

I expected method 2 to be faster. It isn't!

Example

df = pd.DataFrame([[{1, 2, 3}, {3, 4, 5}] for _ in range(3)],
                  columns=['set1', 'set2'])  # names must match x.set1 / x.set2 used below
df

df.apply(lambda x: x.set1.union(x.set2), axis=1)

0    {1, 2, 3, 4, 5}
1    {1, 2, 3, 4, 5}
2    {1, 2, 3, 4, 5}
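
With axis=1, apply passes each row to the function as a pd.Series, so attribute access such as row.set1 returns the actual Python set stored in that cell. The following is a small sketch that records what the function receives; the names row_union and seen_types and the sample values are illustrative assumptions only.

```python
import pandas as pd

df = pd.DataFrame({"set1": [{1, 2}, {3}], "set2": [{2, 5}, {4}]})

seen_types = []  # records the type of each object handed to the function


def row_union(row):
    # row is a pd.Series indexed by column name; row.set1 is a plain Python set
    seen_types.append(type(row))
    return row.set1.union(row.set2)


result = df.apply(row_union, axis=1)
```

Because the function runs once per row in Python, this is a loop in disguise rather than a vectorized operation.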

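【Comments】:

Pardon my limited pandas knowledge, but what does a union mean for DataFrame columns? Wouldn't it change the order of the output column and also produce more elements? If it's not too much trouble, could you add a sample?
Yes, give me a moment. But apply with axis=1 means this runs iteratively over the rows, so the lambda operates on a pd.Series object. x.set1 is therefore the actual set in the set1 position of that row, and likewise for set2. The union is the actual set.union method.
@Divakar Also, I called this the best I could come up with because it's pretty bad. Why can't we do something with np.union1d?
On second thought, maybe method #1 isn't so bad.
Thanks for the update! It makes sense now. I was also thinking about using NumPy, but the union can produce a different number of elements per row, so I don't think it's worth it. I think the apply approach is probably the best.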
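
A short sketch of the point about NumPy in the last comment: np.union1d works on one pair of arrays at a time, and the results can differ in length from row to row, so they do not fit into a single rectangular array. The input values are made up for illustration.

```python
import numpy as np

# Union for two hypothetical rows; np.union1d returns the sorted unique values.
row_a = np.union1d([1, 2, 3], [3, 4, 5])
row_b = np.union1d([1, 2], [2])

# The two results have different lengths, so stacking them into a regular
# 2-D array is not possible without padding or an object array.
lengths = (len(row_a), len(row_b))
```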