【Question title】: Python Pandas - Merge based on substring in string
【Posted】: 2018-07-22 10:58:04
【Question】:

I have 2 dataframes in the following format:

df_search

SEARCH
part1
anotherpart
onemorepart


df_all

FILE             EXTENSION    PATH
part1_1         .prt    //server/folder1/part1_1
part1_2         .prt    //server/folder2/part1_2
part1_2         .pdf    //server/folder3/part1_2
part1_3         .prt    //server/folder2/part1_3
anotherpart_1   .prt    //server/folder1/anotherpart_1
anotherpart_2   .prt    //server/folder3/anotherpart_2
anotherpart_3   .prt    //server/folder2/anotherpart_3
anotherpart_3   .cgm    //server/folder1/anotherpart_3
anotherpart_4   .prt    //server/folder3/anotherpart_4
onemorepart_1   .prt    //server/folder2/onemorepart_1
onemorepart_2   .prt    //server/folder1/onemorepart_2
onemorepart_2   .dwg    //server/folder2/onemorepart_2
onemorepart_3   .prt    //server/folder1/onemorepart_3
onemorepart_4   .prt    //server/folder1/onemorepart_4

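The full df_search has 15,000 items and df_all has 550,000 items. I am trying to merge the two dataframes based on whether the SEARCH string occurs inside the FILE string. My desired output looks like this: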

SEARCH       FILE            EXTENSION  PATH    
part1        part1_1        .prt    //server/folder1/part1_1    
part1        part1_2        .prt    //server/folder2/part1_2    
part1        part1_2        .pdf    //server/folder3/part1_2    
part1        part1_3        .prt    //server/folder2/part1_3    
anotherpart anotherpart_1   .prt    //server/folder1/anotherpart_1  
anotherpart anotherpart_2   .prt    //server/folder3/anotherpart_2  
anotherpart anotherpart_3   .prt    //server/folder2/anotherpart_3  
anotherpart anotherpart_3   .cgm    //server/folder1/anotherpart_3  
anotherpart anotherpart_4   .prt    //server/folder3/anotherpart_4  
onemorepart onemorepart_1   .prt    //server/folder2/onemorepart_1  
onemorepart onemorepart_2   .prt    //server/folder1/onemorepart_2  
onemorepart onemorepart_2   .dwg    //server/folder2/onemorepart_2  
onemorepart onemorepart_3   .prt    //server/folder1/onemorepart_3  
onemorepart onemorepart_4   .prt    //server/folder1/onemorepart_4  

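A simple dataframe merge does not work because the strings never match exactly (the search term is always a substring). Based on other questions on Stack Overflow, I also tried the following: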

df_all[df_all['FILE'].str.contains('|'.join(df_search['SEARCH']))]

This gives me the complete list of all items found in df_all, but I can't tell which search string returned which result.

I managed to get it working with a for loop, but it is slow on my dataset (67 minutes):

import pandas as pd

super_df = []
for search_item in df_search['SEARCH']:
    # keep the rows whose FILE contains the current search term
    temp_df = df_all[df_all['FILE'].str.contains(search_item)].copy()
    temp_df.insert(0, 'SEARCH', search_item)  # record which term matched
    super_df.append(temp_df)
super_df = pd.concat(super_df, axis=0, ignore_index=True)

Is it possible to improve performance through vectorization?

Thanks

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    I would do it like this:

    df_all['SEARCH'] = ''
    for val in df_search.SEARCH:
        df_all.loc[df_all['FILE'].str.match(val), 'SEARCH'] = val
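A minimal, self-contained sketch of the loop above, run on made-up sample frames (the row `xpart1_9` and the use of `re.escape` are additions for illustration, not part of the original answer). It shows one thing worth knowing: `Series.str.match` only matches at the start of each string, so a file that merely contains the term somewhere in the middle is left unlabeled. If a match anywhere in the name is wanted, `str.contains` would be the call to use instead.

```python
import re
import pandas as pd

# Hypothetical sample data, shaped like the frames in the question
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart"]})
df_all = pd.DataFrame({"FILE": ["part1_1", "anotherpart_2", "xpart1_9"]})

df_all["SEARCH"] = ""
for val in df_search["SEARCH"]:
    # re.escape treats the search term as literal text, not as a regex
    mask = df_all["FILE"].str.match(re.escape(val))
    df_all.loc[mask, "SEARCH"] = val

print(df_all)
```

This still loops once per search term, so with 15,000 terms it performs 15,000 passes over the FILE column.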
    

    【Comments】:

      【Solution 2】:

      Use str.extract + insert:

      pat = "|".join(df_search.SEARCH)
      df_all.insert(0, 'SEARCH', df_all['FILE'].str.extract("(" + pat + ')', expand=False))
      print (df_all)
               SEARCH           FILE EXTENSION                            PATH
      0         part1        part1_1      .prt        //server/folder1/part1_1
      1         part1        part1_2      .prt        //server/folder2/part1_2
      2         part1        part1_2      .pdf        //server/folder3/part1_2
      3         part1        part1_3      .prt        //server/folder2/part1_3
      4   anotherpart  anotherpart_1      .prt  //server/folder1/anotherpart_1
      5   anotherpart  anotherpart_2      .prt  //server/folder3/anotherpart_2
      6   anotherpart  anotherpart_3      .prt  //server/folder2/anotherpart_3
      7   anotherpart  anotherpart_3      .cgm  //server/folder1/anotherpart_3
      8   anotherpart  anotherpart_4      .prt  //server/folder3/anotherpart_4
      9   onemorepart  onemorepart_1      .prt  //server/folder2/onemorepart_1
      10  onemorepart  onemorepart_2      .prt  //server/folder1/onemorepart_2
      11  onemorepart  onemorepart_2      .dwg  //server/folder2/onemorepart_2
      12  onemorepart  onemorepart_3      .prt  //server/folder1/onemorepart_3
      13  onemorepart  onemorepart_4      .prt  //server/folder1/onemorepart_4
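Since the question is phrased as a merge, here is a rough sketch of how the extracted column could serve as an ordinary join key. It assumes df_search carries extra columns worth bringing along; the `OWNER` column, the `orphan_1` row, and the sample values are hypothetical. It also assumes the search terms contain no regex metacharacters.

```python
import pandas as pd

# Hypothetical sample data; OWNER is an invented extra column
df_search = pd.DataFrame({
    "SEARCH": ["part1", "anotherpart"],
    "OWNER": ["alice", "bob"],
})
df_all = pd.DataFrame({
    "FILE": ["part1_1", "part1_2", "anotherpart_1", "orphan_1"],
    "EXTENSION": [".prt", ".pdf", ".prt", ".prt"],
})

# One capture group holding every search term as an alternative
pat = "(" + "|".join(df_search["SEARCH"]) + ")"

# Add the matched term as a key column; unmatched files get NaN
keyed = df_all.assign(SEARCH=df_all["FILE"].str.extract(pat, expand=False))

# Regular equality merge on the extracted key; the inner join drops unmatched files
merged = df_search.merge(keyed, on="SEARCH", how="inner")
print(merged)
```

The substring search runs once over FILE, and the join itself is an exact-match merge.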
      

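      【Comments】:

      • This is exactly what I needed. I had trouble with the insert and realized it was because one of my search terms contained "()" characters, which gave me two groups in the regex. After this I also filtered out all the NaN values, since some files had no match. Thanks a lot.
      • Sorry, I get an error with my data: ValueError: Wrong number of items passed 12, placement implies 1. Do you know how to handle it?
      • @ahbon - Hard to say. One idea is to use pat = "|".join([re.escape(x) for x in df_search.SEARCH]), which escapes the values in case the column contains special characters.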
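The points raised in these comments can be combined into one sketch, under a few assumptions: the sample rows are invented, and sorting the terms longest-first is an extra precaution not mentioned in the thread. Escaping each term with `re.escape` keeps parentheses in a search string from creating additional capture groups, which is a likely cause of the "Wrong number of items passed" error, since `str.extract` returns one column per group. Sorting longest-first matters because regex alternation takes the first alternative that matches, so `part1` listed before `part10` would win on a file named `part10_1`.

```python
import re
import pandas as pd

# Hypothetical sample data: one term contains parentheses,
# and one term is a prefix of another
df_search = pd.DataFrame({"SEARCH": ["part1", "part(1)", "part10"]})
df_all = pd.DataFrame({
    "FILE": ["part1_1", "part(1)_2", "part10_1", "unrelated_1"],
    "EXTENSION": [".prt", ".pdf", ".prt", ".prt"],
})

# Longest terms first, so "part10" is tried before "part1"
terms = sorted(df_search["SEARCH"].astype(str), key=len, reverse=True)

# Exactly one capture group; every term is escaped to literal text
pat = "(" + "|".join(re.escape(t) for t in terms) + ")"

matched = df_all["FILE"].str.extract(pat, expand=False)

# Drop files that matched no search term (NaN), as the first comment describes
result = df_all.assign(SEARCH=matched).dropna(subset=["SEARCH"])
result = result[["SEARCH", "FILE", "EXTENSION"]].reset_index(drop=True)
print(result)
```

Note that `str.extract` reports only the first match per file, so a file name containing two different search terms would be assigned to just one of them.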