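I suspect that for a data structure like this, you may get better performance by processing the data outside Pandas and then bringing the result back in (this only matters if you care about performance, of course; there is no need to optimize prematurely). As always, testing is the only way to be sure this holds: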
from collections import defaultdict
d = defaultdict(int)
for words, number in zip(df.words, df.category):
    for word in words:
        d[(word, number)] += 1
d
defaultdict(int,
{('cat', 1): 3,
('dog', 1): 2,
('mouse', 1): 1,
('mouse', 2): 1,
('cat', 2): 1,
('dog', 2): 1,
('elephant', 2): 1,
('elephant', 3): 2})
Build the DataFrame:
(pd.DataFrame(d.values(), index = d)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
          1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0
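For anyone who wants to run the steps above end to end, here is a minimal sketch. The sample `df` is an assumption: the original frame is not shown here, so the rows below were chosen only so that the counts match the output above. `count_pairs` is a hypothetical helper name. The sketch also builds the index explicitly with `pd.MultiIndex.from_tuples` instead of passing the dict as `index`, which makes the intent a little more obvious.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data (an assumption): chosen so the counts
# match the output shown above.
df = pd.DataFrame({
    "words": [
        ["cat", "dog", "cat"],
        ["cat", "dog", "mouse"],
        ["mouse", "cat", "dog", "elephant"],
        ["elephant", "elephant"],
    ],
    "category": [1, 1, 2, 3],
})


def count_pairs(frame):
    """Count (word, category) pairs with plain Python, outside Pandas."""
    d = defaultdict(int)
    for words, number in zip(frame["words"], frame["category"]):
        for word in words:
            d[(word, number)] += 1
    return d


d = count_pairs(df)

# Turn the tuple keys into a two-level index, then pivot the category
# level into columns; missing combinations are filled with 0.
result = (
    pd.DataFrame(list(d.values()), index=pd.MultiIndex.from_tuples(list(d)))
    .unstack(fill_value=0)
    .droplevel(0, axis=1)
)
print(result)
```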
Borrowing from @HenryEcker, you can also use the Counter class:
from itertools import product, chain
from collections import Counter
# integers are put into a list as `product` works on iterables
pairing = (product(left, [right])
           for left, right
           in zip(df.words, df.category))
outcome = Counter(chain.from_iterable(pairing))
outcome
Counter({('cat', 1): 3,
('dog', 1): 2,
('mouse', 1): 1,
('mouse', 2): 1,
('cat', 2): 1,
('dog', 2): 1,
('elephant', 2): 1,
('elephant', 3): 2})
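If the `product` step is not obvious, this small sketch shows what it produces for a single row before everything is chained together and counted. The two-row input (`words_rows`, `categories`) is made up for illustration and is not the data from the question.

```python
from collections import Counter
from itertools import chain, product

# Hypothetical two-row input, for illustration only.
words_rows = [["cat", "dog"], ["cat"]]
categories = [1, 2]

# Pairing one row's words with its (single) category:
pairs_first_row = list(product(words_rows[0], [categories[0]]))
print(pairs_first_row)  # [('cat', 1), ('dog', 1)]

# Doing this lazily for every row, then flattening and counting:
pairing = (product(left, [right])
           for left, right in zip(words_rows, categories))
outcome = Counter(chain.from_iterable(pairing))
print(outcome)
```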
Build the DataFrame as before:
(pd.DataFrame(outcome.values(), index = outcome)
.unstack(fill_value = 0)
.droplevel(0, axis = 1)
)
          1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0
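Since testing is the only way to know whether leaving Pandas actually pays off, here is a rough timing sketch. Everything in it is an assumption rather than part of the answer above: the sample `df` is made up, `with_counter` and `with_pandas` are hypothetical helper names, and the Pandas-only baseline (`explode` followed by `pd.crosstab`) is just one plausible way to do the same job inside Pandas. Timings on a four-row frame mean very little, so swap in your real data before drawing conclusions.

```python
import timeit
from collections import Counter
from itertools import chain, product

import pandas as pd

# Hypothetical sample data (an assumption); replace with your own frame.
df = pd.DataFrame({
    "words": [
        ["cat", "dog", "cat"],
        ["cat", "dog", "mouse"],
        ["mouse", "cat", "dog", "elephant"],
        ["elephant", "elephant"],
    ],
    "category": [1, 1, 2, 3],
})


def with_counter(frame):
    """Count outside Pandas with Counter, then build the frame."""
    pairing = (product(left, [right])
               for left, right in zip(frame["words"], frame["category"]))
    outcome = Counter(chain.from_iterable(pairing))
    return (
        pd.DataFrame(list(outcome.values()),
                     index=pd.MultiIndex.from_tuples(list(outcome)))
        .unstack(fill_value=0)
        .droplevel(0, axis=1)
    )


def with_pandas(frame):
    """Pandas-only baseline (an assumption): explode, then cross-tabulate."""
    exploded = frame.explode("words")
    return pd.crosstab(exploded["words"], exploded["category"])


a = with_counter(df)
b = with_pandas(df)

# Check that both approaches agree before comparing their speed.
same = bool((a.to_numpy() == b.to_numpy()).all()) and list(a.index) == list(b.index)
print("results match:", same)

for fn in (with_counter, with_pandas):
    seconds = timeit.timeit(lambda: fn(df), number=50)
    print(f"{fn.__name__}: {seconds:.4f} s for 50 runs")
```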