Pandas: Find data frame string entries that contain a specific word more than once (answers)

【Question title】: Pandas: Find data frame string entries that contain a specific word more than once
【Posted】: 2021-09-17 19:50:13
【Question description】:

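Situation:

I have a pandas data frame and want to find all entries of a particular column whose string contains a specific word more than once, and then create a separate data frame from those results.

What have I done?

So far I have managed to collect all entries that contain the specified word at least once.

Code: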


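    import pandas as pd

    # Sample data: only the second title contains "energy" more than once.
    selection_df = pd.DataFrame({'Year': ['2020', '2021', '2021'],
                                 'Title': ['Energy calculation', 'Energy calculation with energy', 'Other calculation']})
    terms = ['energy']
    # Keeps every row whose title contains any of the terms at least once.
    list_df = selection_df[selection_df['Title'].str.contains('|'.join(terms), na=False, case=False)]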

Output:

0 2020 Energy calculation
1 2021 Energy calculation with energy

Question:

I would like help collecting only the second entry:

1 2021 Energy calculation with energy

which contains the word "energy" more than once. How can I do this?

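【Question discussion】:

Tags: python pandas dataframe

【Solution 1】:

You need to test each value of the list separately with Series.str.count to get a list of masks, and then combine them with np.logical_or.reduce: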

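    import re
    import numpy as np

    terms = ['energy']
    # One boolean mask per term: True where the term occurs more than once.
    masks = [selection_df['Title'].str.count(re.escape(x), flags=re.I).gt(1) for x in terms]
    # Keep a row if at least one of the masks is True for it.
    list_df = selection_df[np.logical_or.reduce(masks)]
    print(list_df)
       Year                           Title
    1  2021  Energy calculation with energy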

Alternative solution:

    terms = ['energy']
    masks = [selection_df['Title'].str.count(re.escape(x), flags=re.I).gt(1) for x in terms]
    # Put the masks side by side as columns and test each row with any().
    list_df = selection_df[pd.concat(masks, axis=1).any(axis=1)]
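For reference, here is a self-contained sketch of the count-based approach that can be run as is. The frame is the sample from the question; the second term `"other"` and the name `repeated_df` are assumptions added only to show that the masks are evaluated per term.

```python
import re

import numpy as np
import pandas as pd

# Sample frame from the question.
selection_df = pd.DataFrame({
    "Year": ["2020", "2021", "2021"],
    "Title": ["Energy calculation",
              "Energy calculation with energy",
              "Other calculation"],
})

# "other" is a hypothetical extra term, not part of the original question.
terms = ["energy", "other"]

# One boolean mask per term: True where that term occurs more than once.
masks = [
    selection_df["Title"].str.count(re.escape(term), flags=re.I).gt(1)
    for term in terms
]

# Element-wise OR across the masks: keep a row if any single term repeats.
repeated_df = selection_df[np.logical_or.reduce(masks)]
print(repeated_df)
```

Because each term is counted on its own, a title that contains two different terms once each is not selected; only a repeated term counts.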

【Discussion】:

【Solution 2】:

You can use a regular expression with a capture group and a backreference:

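    import re

    # terms is the list of words from the question, e.g. ['energy']
    reg = r'.*(%s).*\1' % '|'.join(terms)
    # the line above constructs reg = '.*(energy|other|terms).*\\1'

    # \1 must repeat whatever the group matched, so the same term has to appear twice
    selection_df[selection_df['Title'].str.match(reg, flags=re.I)]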

Output:

   Year                           Title
1  2021  Energy calculation with energy
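The backreference idea can be sketched end to end as follows. Two things here are assumptions that go beyond the answer: the terms are passed through `re.escape` so that regex metacharacters in a term are treated literally, and a hypothetical fourth row is added to show that two different terms appearing once each do not count as a repeat.

```python
import re

import pandas as pd

# Question's sample data plus one hypothetical row that contains
# two different terms exactly once each.
selection_df = pd.DataFrame({
    "Year": ["2020", "2021", "2021", "2022"],
    "Title": ["Energy calculation",
              "Energy calculation with energy",
              "Other calculation",
              "Other energy calculation"],
})

terms = ["energy", "other"]

# Group 1 captures one of the terms; \1 then requires the same text again
# later in the string (compared case-insensitively because of re.I).
alternatives = "|".join(re.escape(term) for term in terms)
reg = r".*(%s).*\1" % alternatives

mask = selection_df["Title"].str.match(reg, flags=re.I, na=False)
repeated_df = selection_df[mask]
print(repeated_df)
```

`str.match` anchors at the start of the string, which is why the pattern begins with `.*`.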

【Discussion】:

【Solution 3】:

You can use .str.extractall together with collections.Counter:

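    import re
    from collections import Counter

    terms = ["energy", "calculation"]

    # df is the question's data frame (called selection_df in the other answers)
    x = (
        df["Title"]
        # one output row per match of any term, indexed by (original row, match number)
        .str.extractall("(" + "|".join(map(re.escape, terms)) + ")", flags=re.I)
        .groupby(level=0)
        # per original row: how often the most frequent term occurs, ignoring case
        .agg(lambda x: Counter(map(str.lower, x)).most_common(1)[0][1])
    )
    print(df[x[0] > 1])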

Prints:

   Year                           Title
1  2021  Energy calculation with energy
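One detail to keep in mind with this approach: `extractall` only returns rows that have at least one match, so the per-row counts may not cover every row of the frame, and indexing the frame with a shorter boolean Series can fail to align. A minimal sketch of one way to handle that, assuming a hypothetical extra row without any of the terms and using `reindex` to fill in a count of 0 (the names `matches` and `max_counts` are illustrative):

```python
import re
from collections import Counter

import pandas as pd

# Question's sample data plus one hypothetical row with no matching term.
df = pd.DataFrame({
    "Year": ["2020", "2021", "2021", "2022"],
    "Title": ["Energy calculation",
              "Energy calculation with energy",
              "Other calculation",
              "Unrelated title"],
})

terms = ["energy", "calculation"]
pattern = "(" + "|".join(map(re.escape, terms)) + ")"

# One entry per match, lower-cased so "Energy" and "energy" count together.
matches = df["Title"].str.extractall(pattern, flags=re.I)[0].str.lower()

# Per original row: occurrences of the most frequent term.
# Rows without any match are missing here, so reindex fills them with 0.
max_counts = (
    matches.groupby(level=0)
    .agg(lambda s: Counter(s).most_common(1)[0][1])
    .reindex(df.index, fill_value=0)
)

result = df[max_counts > 1]
print(result)
```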

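【Discussion】:

This fails if the list contains more than one value.
@jezrael I have edited it, but your solution is more elegant :)
Yes, performance would be the problem here.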