Finding a substring (from list of strings) in a string column and add as a new column in Dataframe

【Question title】: Finding a substring (from list of strings) in a string column and add as a new column in Dataframe
【Posted】: 2023-02-02 20:56:43
【Question】:

I have the following DataFrame (df), which contains strings in the "text" column:

text	sth
abdcdtext1wrew	...
qwerqdtext2cvufu	...
iuotext3tvbv	...
iuotvbvewre	...

I also have a Series (df_look_for) containing the strings I want to look for:

look_for
text1
text2
text3

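My goal is to check whether the "text" column contains one of the strings from the "look_for" column. If it does, I want to add the string that was found as a new column in df. For example: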

text	sth	found_str
abdcdtext1wrew	...	text1
qwerqdtext2cvufu	...	text2
iuotext3tvbv	...	text3
iuotvbvewre	...	`NaN`

So far I have been trying to use str.contains(), but without success.

Any help would be greatly appreciated!

【Comments】:

Your expected output does not match your df. One row is missing.
Sorry, I fixed it.

Tags: python string dataframe substring

【Solution 1】:

One option is a list comprehension that uses next, which avoids producing nested lists.

lookfor = df_look_for["look_for"]

df["found_str"] = [next((a for a in lookfor if a in b), None) for b in df["text"]]

Output:

print(df)

               text  sth found_str
0    abdcdtext1wrew  ...     text1
1  qwerqdtext2cvufu  ...     text2
2      iuotext3tvbv  ...     text3
3       iuotvbvewre  ...      None

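【Comments】:

Thanks for the solution. Could you explain what "avoids nested lists" means? The next() function is not clear to me. Thanks.
Try removing next and rewriting the list comprehension: you will see that the values in the found_str column are wrapped in square brackets. For example, in the first row you would get [text1] (a list) instead of text1 (a string).
I see. Thanks for the explanation!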
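
To make the difference discussed in these comments concrete, here is a minimal sketch that runs both variants side by side. The data is the sample from the question; the names `without_next` and `with_next` are illustrative and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({"text": ["abdcdtext1wrew", "qwerqdtext2cvufu",
                            "iuotext3tvbv", "iuotvbvewre"]})
lookfor = pd.Series(["text1", "text2", "text3"])

# Without next(): the inner comprehension builds one list per row,
# so every cell would hold a list (possibly empty).
without_next = [[a for a in lookfor if a in b] for b in df["text"]]

# With next(): take the first match from the generator,
# or fall back to None when nothing matches.
with_next = [next((a for a in lookfor if a in b), None) for b in df["text"]]

print(without_next)  # [['text1'], ['text2'], ['text3'], []]
print(with_next)     # ['text1', 'text2', 'text3', None]
```

Because next() stops at the first hit, it also avoids scanning the rest of the lookup values once a match has been found.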

【Solution 2】:

Another approach. This gives a list of all the strings that were found:

import pandas as pd
d = {'text': ['asdtext1', 'sdkjfhtext2sdf', 'dsfds']}
l = {'look_for': ['text1', 'text2']}

look_for_df = pd.DataFrame(data=l)
df = pd.DataFrame(data=d)

df["found_str"] = df['text'].apply(lambda text: [search_word for search_word in look_for_df['look_for'] if search_word in text])
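
The answer does not show its output, so the following sketch illustrates what this approach is expected to produce. It uses the same sample data plus one extra row (`'text1text2'`, an assumption added here for illustration) to show a row with more than one match; the `n_found` column is likewise hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"text": ["asdtext1", "sdkjfhtext2sdf", "dsfds", "text1text2"]})
look_for = ["text1", "text2"]

# Collect every search word that occurs in the row's text.
df["found_str"] = df["text"].apply(
    lambda text: [word for word in look_for if word in text]
)

# Rows without any match hold an empty list rather than NaN,
# so counting the matches is one way to tell them apart.
df["n_found"] = df["found_str"].apply(len)

print(df["found_str"].tolist())
# [['text1'], ['text2'], [], ['text1', 'text2']]
print(df["n_found"].tolist())
# [1, 1, 0, 2]
```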

【Comments】:

【Solution 3】:

Here is an alternative that uses map() and next()

df_look = pd.Series(['text1', 'text2', 'text3'])
df['found_str'] = list(map(lambda x: next((y for y in df_look if y in x), 'NaN'), df['text']))
print(df)

               text  sth found_str
0    abdcdtext1wrew  ...     text1
1  qwerqdtext2cvufu  ...     text2
2      iuotext3tvbv  ...     text3
3       iuotvbvewre  ...       NaN
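
One thing worth noting about the default value: `'NaN'` here is an ordinary string, not a missing value, so pandas will not treat it as missing. The sketch below shows the difference on a shortened version of the sample data; the names `as_string` and `as_missing` are illustrative only.

```python
import pandas as pd

text = pd.Series(["abdcdtext1wrew", "iuotvbvewre"])
look = ["text1", "text2", "text3"]

# Default is the string 'NaN': it prints like a missing value but is not one.
as_string = pd.Series([next((y for y in look if y in x), "NaN") for x in text])

# Default is None: pandas recognises it as missing.
as_missing = pd.Series([next((y for y in look if y in x), None) for x in text])

print(as_string.isna().tolist())   # [False, False]
print(as_missing.isna().tolist())  # [False, True]
```

If later steps rely on isna(), dropna() or fillna(), using None (or numpy.nan) as the default is probably the safer choice.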

【Comments】:

【Solution 4】:

Solution:

import pandas as pd

df_look_for = pd.Series(['text1', 'text2', 'text3'])

pattern = '|'.join(df_look_for)
df['found_str'] = df['text'].str.extract('(' + pattern + ')', expand=False)
df.fillna(value='NaN', inplace=True)

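Explanation:

You can do this with the str.extract method and a regular expression. The idea is to extract the first string that matches one of the patterns in df_look_for and add it as a new column in the DataFrame.

The df_look_for Series is turned into a regular expression pattern by joining all of its elements with | (the alternation, or logical OR, operator). The str.extract method then extracts the first match of that pattern from the text column, and the result is stored in the found_str column. Finally, fillna replaces any missing values (i.e. rows where no match was found) with the string 'NaN'.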
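
A slightly more defensive variant of the same idea is sketched below. It assumes the search strings should be matched literally, so each one is passed through `re.escape` before joining; this step is an addition and not part of the original answer. Unescaped strings containing parentheses would introduce extra capture groups, in which case str.extract returns a DataFrame instead of a Series.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["abdcdtext1wrew", "qwerqdtext2cvufu",
                            "iuotext3tvbv", "iuotvbvewre"]})
df_look_for = pd.Series(["text1", "text2", "text3"])

# Escape each search string so regex metacharacters are matched literally,
# then wrap the whole alternation in a single capture group.
pattern = "(" + "|".join(re.escape(s) for s in df_look_for) + ")"

# With exactly one capture group and expand=False, str.extract returns a Series
# that holds NaN for rows without a match.
df["found_str"] = df["text"].str.extract(pattern, expand=False)

print(df)
```

Note that the regex returns the leftmost match in the text, whereas the next()-based answers return the first entry of the lookup list that occurs anywhere in the text. The two can differ when a row contains several of the search strings.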

【Comments】:

This raises an error: ValueError: Wrong number of items passed, placement implies 1. Possibly because the number of elements in df is not the same as in df_look_for.