【Question title】: Search a column for strings and classify them with dictionary keys
【Posted】: 2020-06-19 06:12:37
【Question description】:

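I have imported a spreadsheet of my contacts, exported from LinkedIn, and I want to classify people by the level of their position.

To do that, I created a dictionary containing the terms used to identify each position level.

The first version of the dictionary is: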

dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'], 
                '2 - Manager': ['Manager', 'Administrador'], 
                '3 - Engenheiro': ['Engenheiro', 'Engineering'], 
                '4 - Consultor': ['Consultor', 'Consultant'], 
                '5 - Estagiário': ['Estagiário', 'Intern'], 
                '6 - Desempregado': ['Self-Employed', 'Autônomo'], 
                '7 - Professor': ['Professor', 'Researcher'] }

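I need code that reads each position in the spreadsheet, checks whether it contains any of these terms, and returns the corresponding key in a separate column.

Sample data from the DataFrame I am reading: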

sample = pd.Series(data = (['(blank)'], ['Estagiário'], ['Professor', 'Adjunto'], 
                           ['CEO', 'and', 'Founder'], ['Engenheiro', 'de', 'Produção'], 
                           ['Consultant'], ['Founder', 'and', 'CTO'], 
                           ['Intern'], ['Manager', 'Specialist'], 
                           ['Administrador', 'de', 'Novos', 'Negócios'], 
                           ['Administrador', 'de', 'Serviços']))

which returns:

0                                [(blank)]
1                             [Estagiário]
2                     [Professor, Adjunto]
3                      [CEO, and, Founder]
4               [Engenheiro, de, Produção]
5                             [Consultant]
6                      [Founder, and, CTO]
7                                 [Intern]
8                    [Manager, Specialist]
9     [Administrador, de, Novos, Negócios]
10           [Administrador, de, Serviços]
dtype: object

Here is the code I have so far:

import pandas as pd
plan = pd.read_excel('SpreadSheet Name.xlsx', sheet_name = 'Positions')

list0 = ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner']
list1 = ['Director', 'Head']
list2 = ['Manager', 'Administrador']   
listgeral = [list0, list1, list2]

def in_list(list_to_search,terms_to_search):     
    results = [item for item in list_to_search if item in terms_to_search]
    if len(results) > 0:
        return '0 - CEO, Founder'        
    else:
        pass
plan['PositionLevel'] = plan['Position'].str.split().apply(lambda x: in_list(x, listgeral[0]))

Actual output:

                                          Position           PositionLevel
0                                        '(blank)'                None
1                                     'Estagiário'                None
2                              'Professor Adjunto'                None
3                                'CEO and Founder'         '0 - CEO, Founder'
4                         'Engenheiro de produção'                None
5                                     'Consultant'                None
6                                'Founder and CTO'         '0 - CEO, Founder'
7                                         'Intern'                None
8                             'Manager Specialist'                None
9                'Administrador de Novos Negócios'                None

Expected output:

                                            Position         PositionLevel
0                                          '(blank)'              None
1                                       'Estagiário'       '5 - Estagiário'
2                                'Professor Adjunto'       '7 - Professor'
3                                  'CEO and Founder'      '0 - CEO, Founder'
4                           'Engenheiro de produção'       '3 - Engenheiro'
5                                       'Consultant'       '4 - Consultor'
6                                  'Founder and CTO'      '0 - CEO, Founder'
7                                           'Intern'       '5 - Estagiário'
8                               'Manager Specialist'        '2 - Manager'
9                  'Administrador de Novos Negócios'        '2 - Manager'

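At first I planned to run that code for each list in listgeral, but I did not manage to. Then I came to think it would be better to apply this to one big dictionary, like dicpositions at the start of the question, and return the key of the matching term.

I tried to use the following code for this: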

dictest = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'], 
           '1 - Director of': ['Director', 'Head'], 
           '2 - Manager': ['Manager', 'Administrador']}

def in_dic (x, dictest):
    for key in dictest:
        for elem in dictest[key]:
            if elem == x:
                return key
    return False

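The output of in_dic('CEO', dictest) is '0 - CEO, Founder'.

The output of in_dic('Banana', dictest), for example, is False.

But I could not get any further and apply this in_dic() function to my problem.

I would really appreciate any help.

Thank you very much.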

【Question discussion】:

    Tags: python pandas dataframe dictionary series


    【Solution 1】:

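    I took the liberty of restructuring your input a bit, but here is what I came up with (it may be slightly over-engineered). In short, we use a library called jellyfish (pip3 install jellyfish; the code is taken from this answer) to do fuzzy string matching between the positions in your Excel sheet and the terms in your dicpositions, and then map them to the categories in that same dictionary. Here are the imports and the matching function: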

    import pandas as pd
    import numpy as np
    import jellyfish
    
    
    # Function for fuzzy-matching strings
    def get_closest_match(x, list_strings):
        best_match = None
        highest_jw = 0
    
        # Keep an eye out for "blank" values: they can be strings, e.g. "(blank)", or missing values (NaN/None).
        # pd.isna() is used because `x in [np.nan]` only matches the very same NaN object.
        if pd.isna(x) or x == "(blank)":
            return "(blank)"
    
        # Find which string most closely matches our input and return it
        for current_string in list_strings:
            current_score = jellyfish.jaro_winkler_similarity(x, current_string)  # named jaro_winkler in older jellyfish releases
    
            if current_score > highest_jw:
                highest_jw = current_score
                best_match = current_string
    
        return best_match
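The best-match loop above can also be tried without installing jellyfish. The following is a minimal sketch, not the method used in this answer: it swaps Jaro-Winkler for `difflib.SequenceMatcher` from the standard library, so scores will differ. The function name `get_closest_match_difflib` and the `min_score` cutoff of 0.5 are assumptions made for illustration. The cutoff addresses one property of the loop: it always returns the highest-scoring alias, even for input that matches nothing well.

```python
from difflib import SequenceMatcher

# Hypothetical variant of get_closest_match: difflib's ratio() stands in for
# Jaro-Winkler, and min_score is an arbitrary cutoff chosen for illustration.
def get_closest_match_difflib(x, list_strings, min_score=0.5):
    best_match = None
    highest = 0.0
    for current_string in list_strings:
        score = SequenceMatcher(None, x, current_string).ratio()
        if score > highest:
            highest = score
            best_match = current_string
    # Reject weak matches instead of returning an unrelated alias
    return best_match if highest >= min_score else None


aliases = ['CEO', 'Founder', 'Manager', 'Administrador', 'Consultant', 'Professor']

print(get_closest_match_difflib('Consultant', aliases))         # Consultant
print(get_closest_match_difflib('Professor Adjunto', aliases))  # Professor
print(get_closest_match_difflib('Banana', aliases))             # None
```

A suitable cutoff depends on the data: long job titles score lower against short aliases, so the value would need tuning on a real spreadsheet.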
    

    OK, here is your dicpositions, which I converted to a long-format DataFrame for convenience:

    # Translations between keywords and their category, as dict, as provided in question
    dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                    '1 - Director of': ['Director', 'Head'],
                    '2 - Manager': ['Manager', 'Administrador'],
                    '3 - Engenheiro': ['Engenheiro', 'Engineering'],
                    '4 - Consultor': ['Consultor', 'Consultant'],
                    '5 - Estagiário': ['Estagiário', 'Intern'],
                    '6 - Desempregado': ['Self-Employed', 'Autônomo'],
                    '7 - Professor': ['Professor', 'Researcher'],
                    'Not found': ["(blank)"]  # <-- I added this to deal with blank values
    }
    
    # Let's expand the dict above to a DF, which makes for easier merging later
    positions = []
    aliases = []
    for key, val in dicpositions.items():
        for v in val:
            positions.append(key)
            aliases.append(v)
    # This will serve as our mapping table
    lookup_table = pd.DataFrame({
        "position": positions,
        "alias": aliases
    })
    print(lookup_table)
    

    Instead of a dictionary, we now have a long-format DataFrame. This format makes it easy to match categories with the various keywords later on:

                position          alias
    0   0 - CEO, Founder            CEO
    1   0 - CEO, Founder        Founder
    2   0 - CEO, Founder     Co-Founder
    3   0 - CEO, Founder      Cofounder
    4   0 - CEO, Founder          Owner
    5    1 - Director of       Director
    6    1 - Director of           Head
    7        2 - Manager        Manager
    8        2 - Manager  Administrador
    9     3 - Engenheiro     Engenheiro
    10    3 - Engenheiro    Engineering
    11     4 - Consultor      Consultor
    12     4 - Consultor     Consultant
    13    5 - Estagiário     Estagiário
    14    5 - Estagiário         Intern
    15  6 - Desempregado  Self-Employed
    16  6 - Desempregado       Autônomo
    17     7 - Professor      Professor
    18     7 - Professor     Researcher
    19         Not found        (blank)
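The same long-format rows can be produced more compactly with a nested comprehension. This is only a sketch of the flattening step, using an abbreviated copy of dicpositions and a hypothetical variable name `rows`; it uses plain Python so it runs without pandas.

```python
# Abbreviated version of dicpositions, for illustration only
dicpositions = {
    '0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
    '1 - Director of': ['Director', 'Head'],
    '2 - Manager': ['Manager', 'Administrador'],
}

# One (position, alias) pair per keyword, in the dictionary's insertion order
rows = [(position, alias)
        for position, aliases in dicpositions.items()
        for alias in aliases]

for position, alias in rows:
    print(f'{position:<18}{alias}')
```

Passing these pairs to `pd.DataFrame(rows, columns=['position', 'alias'])` should give a table with the same two columns as lookup_table.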
    

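    Let's test some input to see how the matching works. We compare each string in your input against the strings in the alias column and return the alias value that most closely matches the input (we will use it again later to look up the category, i.e. position):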

    # Test input, as a list, you might have to wrangle it from your format to a list, though
    test_df = pd.DataFrame({"test_position": ["(blank)", 'Estagiário', 'Professor Adjunto', 'CEO and Founder', 'Engenheiro de produção', 'Consultant', 'Founder and CTO', 'Intern', 'Manager Specialist', 'Administrador de Novos Negócios']})
    
    # Match our test input with our mapping table, create a new column 'best_match' representing the value in our mapping table that most closely matches our input
    test_df["best_match"] = test_df.test_position.map(lambda x: get_closest_match(x, lookup_table.alias))
    print(test_df)
    

    A new column has been added to test_df, indicating which alias in our lookup table is most similar to each test_position input:

                        test_position     best_match
    0                          (blank)        (blank)
    1                       Estagiário     Estagiário
    2                Professor Adjunto      Professor
    3                  CEO and Founder            CEO
    4           Engenheiro de produção     Engenheiro
    5                       Consultant     Consultant
    6                  Founder and CTO        Founder
    7                           Intern         Intern
    8               Manager Specialist        Manager
    9  Administrador de Novos Negócios  Administrador
    

    To finalize the categories, we simply merge the best_match column of our test data with the alias column of the lookup table:

    result = test_df.merge(lookup_table, left_on="best_match", right_on="alias", how="left")
    

    This results in:

                        test_position     best_match          alias          position
    0                          (blank)        (blank)        (blank)         Not found
    1                       Estagiário     Estagiário     Estagiário    5 - Estagiário
    2                Professor Adjunto      Professor      Professor     7 - Professor
    3                  CEO and Founder            CEO            CEO  0 - CEO, Founder
    4           Engenheiro de produção     Engenheiro     Engenheiro    3 - Engenheiro
    5                       Consultant     Consultant     Consultant     4 - Consultor
    6                  Founder and CTO        Founder        Founder  0 - CEO, Founder
    7                           Intern         Intern         Intern    5 - Estagiário
    8               Manager Specialist        Manager        Manager       2 - Manager
    9  Administrador de Novos Negócios  Administrador  Administrador       2 - Manager
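For comparison, here is a sketch of an exact-match variant that stays closer to the in_dic() function from the question. It is not part of the fuzzy-matching approach above: the dictionary is inverted once into an alias-to-level mapping, each position is split into words, and the level of the first word that is a known alias is returned. The names `alias_to_level` and `classify_position` are hypothetical. Matching is case-sensitive and exact, so typos or variants such as 'Eng.' would not be caught.

```python
dicpositions = {'0 - CEO, Founder': ['CEO', 'Founder', 'Co-Founder', 'Cofounder', 'Owner'],
                '1 - Director of': ['Director', 'Head'],
                '2 - Manager': ['Manager', 'Administrador'],
                '3 - Engenheiro': ['Engenheiro', 'Engineering'],
                '4 - Consultor': ['Consultor', 'Consultant'],
                '5 - Estagiário': ['Estagiário', 'Intern'],
                '6 - Desempregado': ['Self-Employed', 'Autônomo'],
                '7 - Professor': ['Professor', 'Researcher']}

# Invert the dictionary once: each alias points to its level key
alias_to_level = {alias: level
                  for level, aliases in dicpositions.items()
                  for alias in aliases}


def classify_position(position):
    """Return the level of the first word that is a known alias, else None."""
    if not isinstance(position, str):
        return None  # missing cells (NaN/None) have no level
    for token in position.split():
        level = alias_to_level.get(token)
        if level is not None:
            return level
    return None


for title in ['(blank)', 'CEO and Founder', 'Engenheiro de produção', 'Manager Specialist']:
    print(title, '->', classify_position(title))

# With the question's DataFrame this could be applied as (untested here):
# plan['PositionLevel'] = plan['Position'].map(classify_position)
```

Because the first matching word wins, a title such as 'Founder and CTO' is classified by 'Founder'; titles containing aliases from several levels would need an explicit priority rule.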
    

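    【Discussion】:

    • Plasma, thank you very much! It worked really well on my spreadsheet! I still need to finish a better dicpositions and then run more tests. But for now it works well, and it is very scalable! Thank you so much for your help!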