[Question title]: Merging 2 dfs using string contains and multiple columns
[Posted]: 2022-09-27 21:57:23
[Question description]:

I have 2 DFs that I want to merge. But I need to merge them based on a string-contains match, using multiple columns.

df_1

    IN          Start_Time          Description                                                                     Per_Extr
0   IN7305517   2022-07-24 00:06:59 ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB_STREAM_NAME P26_BRAI_RS2...      FROM : 2022/01/08 TO : 2022/12/09
1   IN7305465   2022-07-24 00:09:49 ABEND JOB PP_AAAR_4898_POUP_MOV_TDCH_D and JOB_STREAM_NAME P26_AAAR_006_TSA...  FROM : 2022/01/08 TO : 2022/12/09
2   IN7305466   2022-07-24 00:10:16 ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREAM_NAME P26_AAAR_006_TSA...        FROM : 2022/01/08 TO : 2022/12/09
3   IN7305493   2022-07-24 00:20:27 ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB_STREAM_NAME P26_BGDTDCHF_PUM...  FROM : 2022/01/08 TO : 2022/12/09

df_2

    JOB_STREAM_NAME     JOB_NAME
NaN P26_BRAI_RS2        PP_BRAI_VAR_CARTAO_IND_IBI_D
NaN P26_BRAI_VAR_TOD    PP_BRAI_VAR_CARTAO_IND_IBI_D
NaN P26_AAAR_006_TSA    PP_AAAR_4898_POUP_MOV_TDCH_D
NaN P26_AAAR_006_TSA    PP_AAAR_4898_POUPMOV_D
NaN P26_BGDTDCHF_PUM    PP_BGDTPRODHBACMS102020_01_M

The Description column contains both the JOB_NAME and the JOB_STREAM_NAME.

My goal is a df like this: merged_df

    IN          JOB_STREAM_NAME     JOB_NAME                        Start_Time          Description                                                                     Per_Extr
0   IN7305517   P26_BRAI_RS2        PP_BRAI_VAR_CARTAO_IND_IBI_D    2022-07-24 00:06:59 ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB_STREAM_NAME P26_BRAI_RS2...      FROM : 2022/01/08 TO : 2022/12/09
1   NaN         P26_BRAI_VAR_TOD    PP_BRAI_VAR_CARTAO_IND_IBI_D    NaN                 NaN                                                                             NaN
2   IN7305465   P26_AAAR_006_TSA    PP_AAAR_4898_POUP_MOV_TDCH_D    2022-07-24 00:09:49 ABEND JOB PP_AAAR_4898_POUP_MOV_TDCH_D and JOB_STREAM_NAME P26_AAAR_006_TSA...  FROM : 2022/01/08 TO : 2022/12/09
3   IN7305466   P26_AAAR_006_TSA    PP_AAAR_4898_POUPMOV_D          2022-07-24 00:10:16 ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREAM_NAME P26_AAAR_006_TSA...        FROM : 2022/01/08 TO : 2022/12/09
4   IN7305493   P26_BGDTDCHF_PUM    PP_BGDTPRODHBACMS102020_01_M    2022-07-24 00:20:27 ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB_STREAM_NAME P26_BGDTDCHF_PUM...  FROM : 2022/01/08 TO : 2022/12/09

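Note that the job PP_BRAI_VAR_CARTAO_IND_IBI_D belongs to 2 JOB_STREAM_NAMEs, and one of them has no IN. That is why, in merged_df, the job under JOB_STREAM_NAME = P26_BRAI_VAR_TOD has no IN (NaN).

I was shown how to do this with one column, but I can't manage to do the same with multiple columns.

For one column, I'm using this approach: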

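# Build one alternation pattern from all known job names
jobs_list = "|".join(map(str, df_2['JOB_NAME']))
# Pull the matching job name out of Description into a temporary key column
df_1.insert(0, 'merge_key', df_1['Description'].str.extract("(" + jobs_list + ")", expand=False))
# Keep every row of df_2 and attach the matching incidents, then drop the temporary key
df_merged = df_1.merge(df_2, how='right', left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1)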

Can you help me?

    Tags: python pandas


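    [Solution 1]:

    You need a key to merge the two frames, so we extract both keys (the job name and the job stream name) from Description and merge on them.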

    # Here df is the question's df_1 and df2 is the question's df_2.
    # Extract the keys from Description into two additional columns
    # (they can be dropped again afterwards).
    df[['JOB_NAME', 'JOB_STREAM_NAME']] = df['Description'].str.extract(r'JOB\s\b(\w+)\b.*?JOB_STREAM_NAME\s\b(\w+)\b')
    
    # Merge on stream name and job name; since the key column names are common
    # to both frames, they are not repeated in the result.
    df3 = df2.merge(df, on=['JOB_STREAM_NAME', 'JOB_NAME'], how='left')
    
    # Drop the additional columns from the original frame.
    df = df.drop(columns=['JOB_STREAM_NAME', 'JOB_NAME'])
    
    df3
    
        JOB_STREAM_NAME   JOB_NAME                      IN         Start_Time           Description                                         Per_Extr
    0   P26_BRAI_RS2      PP_BRAI_VAR_CARTAO_IND_IBI_D  IN7305517  2022-07-24 00:06:59  ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB...   FROM : 2022/01/08 TO : 2022/12/09
    1   P26_BRAI_VAR_TOD  PP_BRAI_VAR_CARTAO_IND_IBI_D  NaN        NaN                  NaN                                                 NaN
    2   P26_AAAR_006_TSA  PP_AAAR_4898_POUP_MOV_TDCH_D  IN7305465  2022-07-24 00:09:49  ABEND JOB PP_AAAR_4898_POUP_MOV_TDCH_D and JOB...   FROM : 2022/01/08 TO : 2022/12/09
    3   P26_AAAR_006_TSA  PP_AAAR_4898_POUPMOV_D        IN7305466  2022-07-24 00:10:16  ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREA...   FROM : 2022/01/08 TO : 2022/12/09
    4   P26_BGDTDCHF_PUM  PP_BGDTPRODHBACMS102020_01_M  IN7305493  2022-07-24 00:20:27  ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB...   FROM : 2022/01/08 TO : 2022/12/09
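
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The sample rows are copied from the question (the truncated "..." in Description is kept as-is). The names `pattern`, `keys` and `merged` are only illustrative and are not part of the original answer. It assumes every Description follows the "JOB <name> ... JOB_STREAM_NAME <name>" layout shown in the question. Instead of adding columns to the incident frame and dropping them later, this variant keeps the extracted keys in a separate frame.

```python
import pandas as pd

# Sample data copied from the question (Description values are truncated there too).
df_1 = pd.DataFrame({
    "IN": ["IN7305517", "IN7305465", "IN7305466", "IN7305493"],
    "Start_Time": pd.to_datetime([
        "2022-07-24 00:06:59", "2022-07-24 00:09:49",
        "2022-07-24 00:10:16", "2022-07-24 00:20:27",
    ]),
    "Description": [
        "ABEND JOB PP_BRAI_VAR_CARTAO_IND_IBI_D and JOB_STREAM_NAME P26_BRAI_RS2...",
        "ABEND JOB PP_AAAR_4898_POUP_MOV_TDCH_D and JOB_STREAM_NAME P26_AAAR_006_TSA...",
        "ABEND JOB PP_AAAR_4898_POUPMOV_D and JOB_STREAM_NAME P26_AAAR_006_TSA...",
        "ABEND JOB PP_BGDTPRODHBACMS102020_01_M and JOB_STREAM_NAME P26_BGDTDCHF_PUM...",
    ],
    "Per_Extr": ["FROM : 2022/01/08 TO : 2022/12/09"] * 4,
})

df_2 = pd.DataFrame({
    "JOB_STREAM_NAME": ["P26_BRAI_RS2", "P26_BRAI_VAR_TOD", "P26_AAAR_006_TSA",
                        "P26_AAAR_006_TSA", "P26_BGDTDCHF_PUM"],
    "JOB_NAME": ["PP_BRAI_VAR_CARTAO_IND_IBI_D", "PP_BRAI_VAR_CARTAO_IND_IBI_D",
                 "PP_AAAR_4898_POUP_MOV_TDCH_D", "PP_AAAR_4898_POUPMOV_D",
                 "PP_BGDTPRODHBACMS102020_01_M"],
})

# Same regex as in the answer: group 1 is the job name, group 2 the stream name.
pattern = r"JOB\s\b(\w+)\b.*?JOB_STREAM_NAME\s\b(\w+)\b"

# str.extract with two groups returns a two-column frame aligned with df_1's index.
keys = df_1["Description"].str.extract(pattern)
keys.columns = ["JOB_NAME", "JOB_STREAM_NAME"]

# Left-merge from df_2 so every (stream, job) pair is kept, even without an incident.
merged = df_2.merge(
    pd.concat([keys, df_1], axis=1),
    on=["JOB_STREAM_NAME", "JOB_NAME"],
    how="left",
)

print(merged[["JOB_STREAM_NAME", "JOB_NAME", "IN"]])
```

Because the merge starts from `df_2` with `how="left"`, the P26_BRAI_VAR_TOD row stays in the result with NaN in the incident columns, which matches what the question asks for. `df_1` itself is left unmodified, so there is nothing to drop afterwards.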
    
