Retrieving word from pandas column that matches words in list - Answers

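【Question title】: Retrieving word from pandas column that matches words in list
【Posted】: 2020-03-12 22:32:13
【Question】:

I have a column containing text. This text can contain names of countries. I would like to list all the countries mentioned in the same row as the text. I already have a Series with the countries that I want to extract.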

    SomeText                          | ... | .... | CountryInText
    Something Canada                  |     |      |   
    RUSSIAAreACountry                 |     |      |   
    Mexicoand Brazil is South of USA  |     |      |   

Desired result:

    SomeText                          | ... | .... | CountryInText
    Something Canada                  |     |      |  Canada 
    RUSSIAAreACountry                 |     |      |  Russia
    Mexicoand Brazil is South of USA  |     |      |  Mexico, Brazil, USA

I tried:

pd.Series(df['SomeText'].str.findall(f"({'|'.join(countryname['CommonName'])})"))

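However, this gives me a list of objects that I can't match back to the original DataFrame. countryname['CommonName'] is a Series of country names.

Can anyone help me?

Thanks in advance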

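【Question comments】:

Is this what you are looking for?
Why are you using findall? What should happen if SomeText contains two country names?
It looks like what you actually want may differ from how you phrased it. Going by your example, you seem to want the rightmost column of a given row to consist of all the countries that appear in that row's leftmost column. Is that right?
@Accccumulation Yes, that is correct, sorry - I am updating the question now.

Tags: python pandas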

【Solution 1】:

A solution using the re package (with a small test example), for more flexibility:

import pandas as pd
import re

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia"]})
df["CountryInText"] = df["SomeText"].str.title().map(lambda x: 
                                         re.findall('|'.join(countryname['CommonName']), x, re.I))

Update (based on Erfan's feedback in the comments):

import pandas as pd
import re

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia"]})
df["CountryInText"] = df["SomeText"].str.title().str.findall('|'.join(countryname['CommonName']), re.I)
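
If the country names may contain regex metacharacters (a dot or parentheses, for example), it is probably safer to escape them before joining. A minimal sketch of the same str.findall idea; the sample rows and the names `names` and `pattern` are made up for illustration:

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry", "Nothing here"]})
names = ["Canada", "Russia"]

# re.escape guards against names that contain regex metacharacters
pattern = "|".join(re.escape(n) for n in names)

# str.findall returns one list per row, aligned with df's index,
# so the result can be assigned straight back as a new column
df["CountryInText"] = df["SomeText"].str.title().str.findall(pattern, flags=re.I)
print(df)
```

Because the result keeps the index of df, rows without a match simply get an empty list.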

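Update 2 (based on a useful additional test case posted by the OP):

The approaches above return Usa instead of USA, because .title() changes the case before matching. The one below takes care of that: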

import pandas as pd

df = pd.DataFrame({"SomeText": ["Something Canada",
                                "RUSSIAAreACountry", 
                                "Mexicoand Brazil is South of USA"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia", "Mexico", "Brazil", "USA"]})
df["CountryInText"] = df["SomeText"].map(lambda x: [c for c in countryname['CommonName'] 
                                                    if c.lower() in x.lower()])
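
The list comprehension above yields a list per row, ordered as in the country list. If a comma-separated string in order of appearance is wanted (as in the desired output of the question), one possible variant is sketched below. The helper `countries_in` and the `canonical` lookup are hypothetical names, and the country list is a plain Python list here for simplicity:

```python
import re
import pandas as pd

df = pd.DataFrame({"SomeText": ["Something Canada",
                                "RUSSIAAreACountry",
                                "Mexicoand Brazil is South of USA"]})
names = ["Canada", "Russia", "Mexico", "Brazil", "USA"]

# Map the lower-cased spelling back to the spelling used in the country list
canonical = {n.lower(): n for n in names}
pattern = re.compile("|".join(re.escape(n) for n in names), flags=re.I)

def countries_in(text):
    # Hypothetical helper: matches in order of appearance, duplicates dropped
    found = [canonical[m.lower()] for m in pattern.findall(text)]
    return ", ".join(dict.fromkeys(found))

df["CountryInText"] = df["SomeText"].map(countries_in)
print(df)
```

Like the answer above, this is plain case-insensitive substring matching, so a short name such as USA would also be found inside a word like "Jerusalem".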

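【Comments】:

It is better to use the native pandas method: Series.str.findall
Yes, that aside, but this is a general comment: there is no point in using re.findall when we have a native pandas method @QuangHoang
Why .title() and .lower()?
@AMC title() is used to return country names with only the first letter capitalized (e.g. the OP's example RUSSIA -> Russia). However, that approach does not correctly handle the scenario added later (USA -> Usa). The last approach also handles that test case; it uses lower() to make the matching case-insensitive.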
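
To illustrate the difference described in the last comment, here is a small sketch using only built-in string methods on the OP's test strings (the variable names are illustrative):

```python
text = "Mexicoand Brazil is South of USA"

# str.title() upper-cases the first letter of each word and lower-cases the rest,
# so "USA" turns into "Usa" before any matching happens
titled = text.title()
print(titled)  # Mexicoand Brazil Is South Of Usa

# Lower-casing both sides compares case-insensitively without touching the
# spelling stored in the country list
print("USA".lower() in text.lower())  # True
print("USA" in titled)                # False

print("RUSSIAAreACountry".title())    # Russiaareacountry
```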

【Solution 2】:

A bit too late and a bit silly, but I wrote the code, so I might as well post it :)

import pandas as pd
import re
countryname = pd.DataFrame(
    data={
        "Name": ["Rep. of Congo", "Russia Long", "Canada Long"],
        "CommonName": ["Congo", "Russia", "Canada"]})
df = pd.DataFrame(
    data={
        "SomeText": ["Something Canada", "RUSSIAAreACountry", "Rep ofIreland", "Unrelated"],
        "CountryInText": ["","","",""]})
names = "|".join(list(countryname["CommonName"]))

This will give you:

countryname:

            Name CommonName
0  Rep. of Congo      Congo
1    Russia Long     Russia
2    Canada Long     Canada

df:

            SomeText CountryInText
0   Something Canada              
1  RUSSIAAreACountry              
2      Rep ofIreland              
3          Unrelated

names:

Congo|Russia|Canada

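Then, using findall and a simple function, you can find all occurrences of the common names in the string; if anything is found, pick the first one and convert it to title case, and if nothing is found, return an empty string. This method ignores the original capitalization and converts everything to title case. I also only saw the rightmost names after I had written the answer, so that is not handled either.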

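# re.I makes the matching case-insensitive
df["CountryInText"] = df["SomeText"].str.findall(names, flags=re.I)

def cleanup(country_list):
    # Keep only the first match; return an empty string when nothing matched
    if len(country_list) > 0:
        return str(country_list[0])
    return ""

# Normalise whatever case was matched to title case (e.g. RUSSIA -> Russia)
df["CountryInText"] = df["CountryInText"].apply(cleanup).apply(str.title)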

df is now:

            SomeText CountryInText
0   Something Canada        Canada
1  RUSSIAAreACountry        Russia
2      Rep ofIreland              
3          Unrelated
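
When only the first match per row is needed, the findall plus cleanup steps can probably be collapsed into a single Series.str.extract call. This is a sketch of an alternative, not the answer's own method; the intermediate name `first` is illustrative:

```python
import re
import pandas as pd

countryname = pd.DataFrame({"CommonName": ["Congo", "Russia", "Canada"]})
df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry",
                                "Rep ofIreland", "Unrelated"]})
names = "|".join(re.escape(n) for n in countryname["CommonName"])

# With a single capture group and expand=False, str.extract returns a Series
# holding the first match per row (missing where nothing matched)
first = df["SomeText"].str.extract(f"({names})", flags=re.I, expand=False)

# Replace missing values with "" and normalise the case, as cleanup + str.title did
df["CountryInText"] = first.fillna("").str.title()
print(df)
```

As with the original, title-casing the result means a name such as USA would come back as Usa.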

【Comments】: