【Question title】: python - "merge based on a partial match" - Improving performance of function
【Posted】: 2022-01-19 06:35:03
【Question】:

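I have the following script, which implements a "merge based on a partial match", because as far as I can tell the normal .merge() function cannot do this.

The code below works and returns the desired result, but unfortunately it is so slow that it is almost unusable where I need it.

I have been looking through other Stack Overflow posts about similar problems, but have not found a faster solution yet.

Any thoughts on how to achieve this would be greatly appreciated!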

import pandas as pd 

df1 = pd.DataFrame([  'https://wwww.example.com/hi', 'https://wwww.example.com/tri', 'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi' ]
    ,columns = ['pages']
)

df2 = pd.DataFrame(['hi','bi','geo']
    ,columns = ['ngrams']
)

def join_on_partial_match(full_values=None, matching_criteria=None):
    # Renaming the first column of each frame; rename() returns a copy, so the caller's frames are not modified
    full_values = full_values.rename(columns={full_values.columns[0]: "full"})
    matching_criteria = matching_criteria.rename(columns={matching_criteria.columns[0]: "ngram_match"})

    # Creating a constant key so that every row matches every row on join (a cross join)
    full_values['join'] = 1
    matching_criteria['join'] = 1
    dfFull = full_values.merge(matching_criteria, on='join').drop('join', axis=1)

    # Identifying matches: True where the ngram occurs anywhere in the full value
    dfFull['match'] = dfFull.apply(lambda x: x.full.find(x.ngram_match), axis=1).ge(0)

    # Filtering the dataset to only the matching rows, then dropping the helper column
    final = dfFull[dfFull['match']].drop('match', axis=1)

    return final

join = join_on_partial_match(full_values=df1,matching_criteria=df2)
print(join)
>>                 full ngram_match
0       https://wwww.example.com/hi          hi
7       https://wwww.example.com/bi          bi
9   https://wwww.example.com/hihibi          hi
10  https://wwww.example.com/hihibi          bi

【Comments】:

  • I would suggest switching to numpy, doing the work there, and then going back to pandas
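
One way the numpy suggestion above might look, as a rough sketch rather than a tested drop-in replacement: broadcast `np.char.find` over a (pages x ngrams) grid and keep the pairs where the substring is found. The function name `partial_match_numpy` is made up for illustration, and the sketch assumes both inputs contain only strings (a missing value would be converted to the text 'nan').

```python
import numpy as np
import pandas as pd

def partial_match_numpy(pages, ngrams):
    # Hypothetical helper: convert both inputs to fixed-width unicode arrays
    pages_arr = np.asarray(pages, dtype=str)
    ngrams_arr = np.asarray(ngrams, dtype=str)

    # Broadcasting a (n_pages, 1) array against a (1, n_ngrams) array gives a
    # (n_pages, n_ngrams) grid; find() returns -1 where the substring is absent
    hits = np.char.find(pages_arr[:, None], ngrams_arr[None, :]) >= 0

    # Row/column positions of every matching (page, ngram) pair
    page_idx, ngram_idx = np.nonzero(hits)

    return pd.DataFrame({
        "full": pages_arr[page_idx],
        "ngram_match": ngrams_arr[ngram_idx],
    })

df1 = pd.DataFrame(
    ['https://wwww.example.com/hi', 'https://wwww.example.com/tri',
     'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi'],
    columns=['pages'],
)
df2 = pd.DataFrame(['hi', 'bi', 'geo'], columns=['ngrams'])

result = partial_match_numpy(df1['pages'], df2['ngrams'])
print(result)
```

The grid holds one cell per (page, ngram) pair, so memory use grows with the product of the two lengths; for very large inputs it may be necessary to process the pages in chunks.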

Tags: python pandas performance


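【Solution 1】:

For anyone interested: I ended up coming up with 2 ways to do this.

  1. The first returns all matches (i.e., it duplicates the input value once for every partial match).
  2. The second returns only the first match.

Both are very fast. They just use a very simple masking script: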
import pandas as pd

def partial_match_join_all_matches_returned(full_values=None, matching_criteria=None):
    """The partial_match_join_all_matches_returned() function takes two series objects and returns a dataframe with all matching values (duplicating the full value).
    Args:
        full_values = None: This is the series that contains the full values for matching pair.
        matching_criteria = None: This is the series that contains the partial values for matching pair.
    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full")
    full_values = full_values.drop_duplicates()

    output = []

    for n in matching_criteria['match']:
        # Boolean mask: True where the full value contains n as a literal, case-insensitive substring
        mask = full_values['full'].str.contains(n, case=False, na=False, regex=False)
        # Copy the matching rows and record which partial value matched them
        df_copy = full_values[mask].copy()
        df_copy['match'] = n
        output.append(df_copy)

    final = pd.concat(output)

    return final

def partial_match_join_first_match_returned(full_values=None, matching_criteria=None):
    """The partial_match_join_first_match_returned() function takes two series objects and returns a dataframe with the first matching value.
    Args:
        full_values = None: This is the series that contains the full values for matching pair.
        matching_criteria = None: This is the series that contains the partial values for matching pair.
    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full").drop_duplicates()

    output = []

    for n in matching_criteria['match']:
        # Boolean mask: True where the full value contains n as a literal, case-insensitive substring
        mask = full_values['full'].str.contains(n, case=False, na=False, regex=False)
        # Copy the matching rows and record which partial value matched them
        df_copy = full_values[mask].copy()
        df_copy['match'] = n
        output.append(df_copy)

    final = pd.concat(output)

    # Leaves us with only the 1st match for each URL. This has to happen after the
    # concat: within a single iteration every 'full' value is already unique.
    final = final.drop_duplicates(subset=['full'], keep='first')

    return final
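
For a quick check of the two behaviours described above, here is a condensed sketch that folds both variants into one function and runs it on the sample data from the question. The name `partial_match_join` and the `first_only` flag are made up for this illustration, and `regex=False` assumes the partial values are meant as literal substrings rather than regular expressions.

```python
import pandas as pd

def partial_match_join(full_values, matching_criteria, first_only=False):
    # Hypothetical condensed variant of the two functions above
    full = full_values.drop_duplicates()
    pieces = []
    for n in matching_criteria:
        # Rows whose full value contains n (literal, case-insensitive)
        mask = full.str.contains(n, case=False, na=False, regex=False)
        pieces.append(full[mask].to_frame("full").assign(match=n))
    final = pd.concat(pieces)
    if first_only:
        # Keep only the first partial value that matched each full value
        final = final.drop_duplicates(subset=["full"], keep="first")
    return final

df1 = pd.DataFrame(
    ['https://wwww.example.com/hi', 'https://wwww.example.com/tri',
     'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi'],
    columns=['pages'],
)
df2 = pd.DataFrame(['hi', 'bi', 'geo'], columns=['ngrams'])

all_matches = partial_match_join(df1['pages'], df2['ngrams'])
first_matches = partial_match_join(df1['pages'], df2['ngrams'], first_only=True)
print(all_matches)
print(first_matches)
```

On this data the all-matches result should list the 'hihibi' URL twice (once for 'hi', once for 'bi'), while the first-match result keeps it only once, paired with 'hi' because 'hi' comes first in the list of partial values.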

【Comments】:
