Get country name from dataframe column by comparing with a list

【Question Title】: Get country name from dataframe column by comparing with a list
【Posted】: 2022-01-13 03:06:48
【Question】:

How can I get the country name from a dataframe column by comparing it against a list of strings containing country names?

For example:

list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# Create a dataframe with a job_location column holding each employee's location
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })
df 
job_location
0   birmingham, england, united kingdom
1   new jersey, united states
2   gilgit-baltistan, pakistan
3   uae
4   united states
5   pakistan
6   31-c2, gulberg 3, lahore, pakistan

I need to add a new column named country to the dataframe, containing the country name taken from the job_location column.

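【Question Discussion】:

1. Don't name your list list; that shadows the Python built-in. 2. What is the expected output?
The new column in df should contain only the country name from the job_location column, like: job_location 0 united kingdom 1 united states 2 pakistan 3 uae 4 united states 5 pakistan 6 pakistan

Tags: python pandas dataframe substring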

【Solution 1】:

Using clist as the list name, you can build a regex from the list and use str.extract:

reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

But honestly, if job_location always ends with the country, it is probably easier to split on the commas and keep the last field.
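The split-and-keep-last idea can be sketched with the pandas string accessor. This is a minimal sketch that assumes every job_location really does end with the country; the sample rows are a subset of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'job_location': [
        'birmingham, england, united kingdom',
        'new jersey, united states',
        'uae',
        '31-c2, gulberg 3, lahore, pakistan',
    ],
})

# Split on commas, take the last field, and trim the space left after the comma
df['country'] = df['job_location'].str.split(',').str[-1].str.strip()

print(df)
```

Note that this approach never consults the country list, so any trailing field is accepted as a country.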

【Discussion】:

【Solution 2】:

Without assuming that the country always comes last, this should work:

import pandas as pd

country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# Create a dataframe with a job_location column holding each employee's location
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })

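matching_countries = []

# Check each location against the list and keep the first country found
for text in df['job_location']:
    for country in country_list:
        if country in text:
            matching_countries.append(country)
            break
    else:
        # No country matched: append None so the list stays aligned with the rows
        matching_countries.append(None)

df['country'] = matching_countries

print(df)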

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

【Discussion】:

Using loops like this in pandas is inefficient.
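One way to avoid the explicit loops is a single vectorized str.extract call. The sketch below is not from the thread: it additionally escapes the names with re.escape and wraps the pattern in word boundaries so that a short entry such as "uk" does not match inside a longer word. The 'ukraine' row is a hypothetical example added to show that case.

```python
import re
import pandas as pd

country_list = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]

df = pd.DataFrame({
    'job_location': [
        'birmingham, england, united kingdom',
        'new jersey, united states',
        'uae',
        'lahore, pakistan',
        'ukraine',  # hypothetical row: contains "uk" but is not in the list
    ],
})

# Put longer names first so they take priority over shorter alternatives
names = sorted(country_list, key=len, reverse=True)

# Escape each name and require word boundaries around the match
pattern = r'\b(' + '|'.join(re.escape(name) for name in names) + r')\b'

# expand=False returns a Series; rows without a match become NaN
df['country'] = df['job_location'].str.extract(pattern, expand=False)

print(df)
```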

【Solution 3】:

First, rename your list. I did this with a list comprehension:

# Take the last comma-separated field; strip() removes the space left after the comma
df['country'] = [x.split(",")[-1].strip() for x in df['job_location']]

Output:

	job_location	country
0	birmingham, england, united kingdom	united kingdom
1	new jersey, united states	united states
2	gilgit-baltistan, pakistan	pakistan
3	uae	uae
4	united states	united states
5	pakistan	pakistan
6	31-c2, gulberg 3, lahore, pakistan	pakistan
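Since the question asks for a comparison with the list, the last field can also be checked against it. The following is a rough sketch under the same "country comes last" assumption; the 'paris, france' row is hypothetical and only shows what happens when the last field is not in the list.

```python
import pandas as pd

country_list = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]

df = pd.DataFrame({
    'job_location': [
        'birmingham, england, united kingdom',
        'gilgit-baltistan, pakistan',
        'uae',
        'paris, france',  # hypothetical row, not in the original data
    ],
})

# Last comma-separated field, with surrounding whitespace removed
last_field = df['job_location'].str.split(',').str[-1].str.strip()

# Keep the value only where it appears in the list; other rows become NaN
df['country'] = last_field.where(last_field.isin(country_list))

print(df)
```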

【Discussion】: