Create columns for every row in multiple dataframes that matches a (sub-)string & merge to another dataframe - Answers

【Question title】: create columns for every row in multiple dataframes that matches a (sub-)string & merge to another dataframe
【Posted】: 2021-10-21 05:01:12
【Question description】:

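I am using pandas to read data from an xlsx file and am creating one dataframe per sheet in that file. The sheets contain data about people's jobs, their education and their previous work experience, so I end up with about 13 dataframes. All dataframes share a common 'talent_id' column, on which they can be merged (joined) at a later point. The problem I am facing is that 'talent_id' is unique in df1 but not in df2, because people may have had several educations in the past (so each education is one observation). The same holds for df3, which gives me all previous work experience for each 'talent_id'.

What I ultimately want is a single df that holds all the information from df1, df2 and df3 without duplicated 'talent_id' rows: one row per 'talent_id', with all educational institutions and previous employers as columns (features).

Here is the code that generates the dfs. I have been messing around with melt(), join() and merge(), but none of them gave me what I want.

Needless to say, not every talent_id has the same number of educational institutions: some attended 2 schools, others 2 schools and 3 universities, and so on. The number of features therefore varies, and the same goes for the number of previous work experiences.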

import pandas as pd

# df1: one row per talent_id
data1 = [['001', '1975-01-01', 'mr', 'de', 'at', 40000],
         ['002', '1980-01-01', 'mrs', 'en', 'uk', 50000],
         ['003', '1985-01-01', 'mr', 'es', 'es', 45000]]
df1 = pd.DataFrame(data1, columns=['talent_id', 'birthdate', 'salutation', 'nationality', 'country', 'salary'])

# df2: education, several rows per talent_id
data2 = [['001', 'groundschool_a', 'NaN', 'basic', 'none', 'yes'],
         ['001', 'high_school', 'math', 'higher', 'none', 'no'],
         ['002', 'groundschool_b', 'NaN', 'basic', 'none', 'yes'],
         ['002', 'highschool', 'science', 'higher', 'yes', 'yes'],
         ['002', 'college', 'medicine', 'degree', 'MA', 'yes'],
         ['003', 'NA', 'none', 'dont know', 'none', 'NaN']]
df2 = pd.DataFrame(data2, columns=['talent_id', 'schoolname', 'subject', 'type_of_education', 'degree', 'completed'])

# df3: previous work experience, several rows per talent_id
data3 = [['001', 'company_a', 'supervisor', 'manufacturing'],
         ['001', 'company_b', 'editor', 'educational'],
         ['002', 'company_c', 'clerk', 'pos'],
         ['002', 'company_d', 'cleaning', 'steel'],
         ['002', 'company_e', 'ceo', 'sales'],
         ['003', 'company_f', 'it', 'retail']]
df3 = pd.DataFrame(data3, columns=['talent_id', 'company', 'position', 'industry'])

The desired result looks like this:

# 'NA' fills the slots of education/job records a talent does not have
data4 = [['001', '1975-01-01', 'mr', 'de', 'at', 40000,
          'groundschool_a', 'NaN', 'basic', 'none', 'yes', 'high_school', 'math', 'higher', 'none', 'no', 'NA', 'NA', 'NA', 'NA', 'NA',
          'company_a', 'supervisor', 'manufacturing', 'company_b', 'editor', 'educational', 'NA', 'NA', 'NA'],
         ['002', '1980-01-01', 'mrs', 'en', 'uk', 50000,
          'groundschool_b', 'NaN', 'basic', 'none', 'yes', 'highschool', 'science', 'higher', 'yes', 'yes', 'college', 'medicine', 'degree', 'MA', 'yes',
          'company_c', 'clerk', 'pos', 'company_d', 'cleaning', 'steel', 'company_e', 'ceo', 'sales'],
         ['003', '1985-01-01', 'mr', 'es', 'es', 45000,
          'NA', 'none', 'dont know', 'none', 'NaN', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA',
          'company_f', 'it', 'retail', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']]


df4 = pd.DataFrame(data4, columns=['talent_id', 'birthdate', 'salutation', 'nationality', 'country', 'salary',
                                   'schoolname_1', 'subject_1', 'type_of_education_1', 'degree_1', 'completed_1',
                                   'schoolname_2', 'subject_2', 'type_of_education_2', 'degree_2', 'completed_2',
                                   'schoolname_3', 'subject_3', 'type_of_education_3', 'degree_3', 'completed_3',
                                   'company_1', 'position_1', 'industry_1',
                                   'company_2', 'position_2', 'industry_2',
                                   'company_3', 'position_3', 'industry_3'])

My idea was to parse each df for a specific 'talent_id', write the values to a list, and finally build a df from that list. Is there a smarter, more efficient way?

【Question comments】:

Tags: pandas dataframe join merge

【Solution 1】:

First write a small helper function that reshapes dataframes 2 and 3:

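def group_pivot(d):
    # Number each talent's rows 0, 1, 2, ... and pivot so that every
    # non-key column gets one output column per row number.
    # Assumes 'talent_id' is the first column of d.
    d = (d.assign(group=d.groupby('talent_id').cumcount())
          .pivot(index='talent_id', columns='group', values=d.columns[1:])
        )
    # Flatten the (column, group) MultiIndex into names such as 'company_0'.
    d.columns = ['_'.join(map(str, c)) for c in d.columns]
    return d.reset_index()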

Example on df3:

>>> group_pivot(df3)
  talent_id  company_0  company_1  company_2  position_0 position_1 position_2     industry_0   industry_1 industry_2
0       001  company_a  company_b        NaN  supervisor     editor        NaN  manufacturing  educational        NaN
1       002  company_c  company_d  company_e       clerk   cleaning        ceo            pos        steel      sales
2       003  company_f        NaN        NaN          it        NaN        NaN         retail          NaN        NaN
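To see why the helper works, it may help to look at the `group` counter on its own. The sketch below uses a small hypothetical frame (`toy`, shaped like df3 but with a single value column) and is only an illustration of `cumcount()` followed by `pivot()`, not part of the answer's code.

```python
import pandas as pd

# Toy data (hypothetical; mirrors the shape of df3 with one value column)
toy = pd.DataFrame({
    'talent_id': ['001', '001', '002', '002', '002', '003'],
    'company': ['company_a', 'company_b', 'company_c',
                'company_d', 'company_e', 'company_f'],
})

# cumcount() numbers the rows within each talent_id group: 0, 1, 2, ...
toy['group'] = toy.groupby('talent_id').cumcount()
print(toy['group'].tolist())  # [0, 1, 0, 1, 2, 0]

# pivot() then uses that counter as the column key, so each talent_id
# ends up on one row; missing combinations become NaN
wide = toy.pivot(index='talent_id', columns='group', values='company')
print(wide.shape)  # (3, 3)
```

Because the counter restarts for every talent_id, the widest group (here '002' with three rows) determines how many columns are created.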

Then merge all the transformed dataframes on 'talent_id':

df1.merge(group_pivot(df2), on='talent_id').merge(group_pivot(df3), on='talent_id')

Output:

  talent_id   birthdate salutation nationality country  salary    schoolname_0 schoolname_1 schoolname_2 subject_0 subject_1 subject_2 type_of_education_0 type_of_education_1 type_of_education_2 degree_0 degree_1 degree_2 completed_0 completed_1 completed_2  company_0  company_1  company_2  position_0 position_1 position_2     industry_0   industry_1 industry_2
0       001  1975-01-01         mr          de      at   40000  groundschool_a  high_school          NaN       NaN      math       NaN               basic              higher                 NaN     none     none      NaN         yes          no         NaN  company_a  company_b        NaN  supervisor     editor        NaN  manufacturing  educational        NaN
1       002  1980-01-01        mrs          en      uk   50000  groundschool_b   highschool      college       NaN   science  medicine               basic              higher              degree     none      yes       MA         yes         yes         yes  company_c  company_d  company_e       clerk   cleaning        ceo            pos        steel      sales
2       003  1985-01-01         mr          es      es   45000              NA          NaN          NaN      none       NaN       NaN           dont know                 NaN                 NaN     none      NaN      NaN         NaN         NaN         NaN  company_f        NaN        NaN          it        NaN        NaN         retail          NaN        NaN
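The output above differs from the layout asked for in the question in two ways: the suffixes start at 0, and the columns are grouped by field rather than by record. A possible variant is sketched below. The name `group_pivot_ordered`, its `key` parameter and the shortened sample data are assumptions made for illustration. It also uses `how='left'`, so that a talent with no education or job rows (here '003') is kept instead of being dropped by the default inner join.

```python
import pandas as pd

def group_pivot_ordered(d, key='talent_id'):
    """Hypothetical variant of group_pivot: 1-based suffixes,
    columns grouped per record (schoolname_1, degree_1, schoolname_2, ...)."""
    value_cols = [c for c in d.columns if c != key]
    n = d.groupby(key).cumcount() + 1            # 1, 2, 3, ... within each key
    wide = d.assign(group=n).pivot(index=key, columns='group', values=value_cols)
    # Flatten the (field, number) MultiIndex into names such as 'company_1'
    wide.columns = [f'{name}_{i}' for name, i in wide.columns]
    # Reorder: all fields of record 1 first, then record 2, and so on
    ordered = [f'{name}_{i}' for i in range(1, int(n.max()) + 1) for name in value_cols]
    return wide[ordered].reset_index()

# Shortened, hypothetical sample data; '003' has no rows in df2 or df3
df1 = pd.DataFrame({'talent_id': ['001', '002', '003'],
                    'salary': [40000, 50000, 45000]})
df2 = pd.DataFrame({'talent_id': ['001', '001', '002'],
                    'schoolname': ['groundschool_a', 'high_school', 'college'],
                    'degree': ['none', 'none', 'MA']})
df3 = pd.DataFrame({'talent_id': ['001', '002', '002'],
                    'company': ['company_a', 'company_c', 'company_d'],
                    'position': ['supervisor', 'clerk', 'cleaning']})

# Left joins keep every row of df1, filling missing records with NaN
result = (df1
          .merge(group_pivot_ordered(df2), on='talent_id', how='left')
          .merge(group_pivot_ordered(df3), on='talent_id', how='left'))
print(result.columns.tolist())
```

Whether to prefer the left join depends on the data: if every talent_id is guaranteed to appear in all dataframes, the default inner join gives the same result.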

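【Comments】:

Thank you very much. I ran some tests with the helper function you came up with and it seems to work very well. At the moment I am running it on a sample dataset with 500 rows and it is very fast. The production dataset contains about 40k rows; looking forward to that... Thanks for taking the time and effort to help me...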

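【Solution 2】:

You can call agg after groupby to collect a variable (for example an education column such as schoolname in your example) into a list. For example:

df2.groupby('talent_id').agg({
    'schoolname': list   # one list of school names per talent_id
})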

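This makes talent_id unique and stores everything as lists that you can join. Later you can use a suitable tool to expand the lists into separate columns, as your analysis requires.

Note: groupby makes talent_id the index, so use the corresponding option of merge (for example right_index=True), or call reset_index() first, when merging the dataframes.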
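A minimal sketch of this approach end to end, under stated assumptions: the sample data is shortened and hypothetical, and building a frame from the lists with `pd.DataFrame(... .tolist())` is just one way to do the later expansion, not necessarily the tool the answer had in mind.

```python
import pandas as pd

# Shortened, hypothetical sample data
df1 = pd.DataFrame({'talent_id': ['001', '002'], 'salary': [40000, 50000]})
df2 = pd.DataFrame({'talent_id': ['001', '001', '002'],
                    'schoolname': ['groundschool_a', 'high_school', 'college']})

# One row per talent_id; talent_id becomes the index of the result
schools = df2.groupby('talent_id').agg({'schoolname': list})

# Join the talent_id column of df1 against the index of `schools`
merged = df1.merge(schools, left_on='talent_id', right_index=True)

# Expand the list column into schoolname_1, schoolname_2, ...
# (shorter lists are padded with missing values)
expanded = pd.DataFrame(merged['schoolname'].tolist(), index=merged.index)
expanded.columns = [f'schoolname_{i + 1}' for i in expanded.columns]
result = merged.drop(columns='schoolname').join(expanded)
print(result)
```

This sketch assumes every talent_id in df1 has at least one row in df2. With a left join, talents without records would carry NaN instead of a list and would need to be handled before the expansion step.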

【Comments】: