【Question title】: Get country name from dataframe column by comparing with a list
【Posted】: 2022-01-13 03:06:48
【Question description】:

How can I get the country name from a dataframe column by comparing it against a list of strings containing country names?

For example:

list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create a dataframe with a single column, job_location, holding each employee's job location
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })
df 
job_location
0   birmingham, england, united kingdom
1   new jersey, united states
2   gilgit-baltistan, pakistan
3   uae
4   united states
5   pakistan
6   31-c2, gulberg 3, lahore, pakistan

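I need to add a new column named country to the dataframe, containing the country name taken from the job_location column.

【Question comments】:

  • 1. Don't name the list list; that shadows the Python built-in. 2. What is the expected output?
  • A new column in df containing only the country name from the job_location column, like: job_location 0 united kingdom 1 united states 2 pakistan 3 uae 4 united states 5 pakistan 6 pakistan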

Tags: python pandas dataframe substring


【Solution 1】:

With clist as the list name, you can build a regular expression and use str.extract:

reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
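One caveat with a plain alternation is that a short entry such as uk would also match inside a longer word (for example "ukraine"). A slightly more defensive sketch, assuming the same clist and df as above, escapes each name with re.escape and wraps the group in word boundaries. The longest-first ordering is an extra precaution, not part of the original answer:

```python
import re
import pandas as pd

clist = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]
df = pd.DataFrame({
    "job_location": [
        "birmingham, england, united kingdom",
        "new jersey, united states",
        "gilgit-baltistan, pakistan",
        "uae",
        "united states",
        "pakistan",
        "31-c2, gulberg 3, lahore, pakistan",
    ],
})

# Try longer names first so a longer alternative is preferred over a
# shorter one that could start at the same position.
alternatives = sorted(clist, key=len, reverse=True)

# re.escape guards against regex metacharacters in a name; \b keeps
# short entries like "uk" from matching inside other words.
pattern = r"\b(" + "|".join(re.escape(c) for c in alternatives) + r")\b"

# expand=False returns a Series; rows without a match become NaN.
df["country"] = df["job_location"].str.extract(pattern, expand=False)
print(df)
```

Rows where no listed country appears end up as NaN rather than raising an error.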

But honestly, if job_location always ends with the country, it may be easier to split on commas and keep the last field.
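A minimal sketch of that split-based approach, assuming every job_location really does end with the country. It uses the pandas string accessor instead of the regex, and the strip removes the space left after the comma:

```python
import pandas as pd

df = pd.DataFrame({
    "job_location": [
        "birmingham, england, united kingdom",
        "new jersey, united states",
        "gilgit-baltistan, pakistan",
        "uae",
        "united states",
        "pakistan",
        "31-c2, gulberg 3, lahore, pakistan",
    ],
})

# Split once from the right on the last comma, keep the final piece,
# and strip the whitespace that followed the comma.
df["country"] = df["job_location"].str.rsplit(",", n=1).str[-1].str.strip()
print(df)
```

Unlike the list-based match, this returns whatever text follows the last comma, so it does not check that the value is actually one of the known countries.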

【Comments】:

    【Solution 2】:

    Without assuming the country always comes last, this should work:

    import pandas as pd
    
    country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]
    
    # create a dataframe with a single column, job_location, holding each employee's job location
    df = pd.DataFrame({
            'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
        })
    
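    matching_countries = []
    
    # Record the first listed country found in each location (None if no match),
    # so the result always has exactly one entry per row.
    for text in df['job_location']:
        match = next((country for country in country_list if country in text), None)
        matching_countries.append(match)
    
    df['country'] = matching_countries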
    
    print (df)
    

    Output:

                              job_location         country
    0  birmingham, england, united kingdom  united kingdom
    1            new jersey, united states   united states
    2           gilgit-baltistan, pakistan        pakistan
    3                                  uae             uae
    4                        united states   united states
    5                             pakistan        pakistan
    6   31-c2, gulberg 3, lahore, pakistan        pakistan
    

    【Comments】:

    • Using loops like this in pandas is inefficient.
    【Solution 3】:

    First, rename your list. I did this with a list comprehension:

    df['country'] = [x.split(",")[-1].strip() for x in df['job_location']]
    

    Output:

                              job_location         country
    0  birmingham, england, united kingdom  united kingdom
    1            new jersey, united states   united states
    2           gilgit-baltistan, pakistan        pakistan
    3                                  uae             uae
    4                        united states   united states
    5                             pakistan        pakistan
    6   31-c2, gulberg 3, lahore, pakistan        pakistan

    【Comments】:
