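【Question title】:Duplicate removal functions keeping the first occurrence
【Posted】:2021-08-14 19:00:30
【Question description】:

I use the following functions to remove duplicates while keeping the first occurrence and without changing the order.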

import itertools
import re

def uniqueList(row):
    words = str(row).split(" ")
    unique = words[0]
    for w in words:
        if w.lower() not in unique.lower():
            unique = unique + " " + w
    return unique

df["value_corrected"] = df["value_corrected"].apply(uniqueList)

"""   1   """
sentences = df["value_corrected"].to_list()
for s in sentences:
    s_split = s.split(' ')  # keep original sentence split by ' '
    s_split_without_comma = [i.strip(',') for i in s_split]
    # method 1: re
    compare_words = re.split(' |-', s)
    # method 2: itertools
    compare_words = list(itertools.chain.from_iterable([i.split('-') for i in s_split]))
    # method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')

    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]

    # start to compare
    need_removed_index = []
    for word in compare_words_without_comma:
        matched_indexes = []
        for idx, w in enumerate(s_split_without_comma):
            if word.lower() in w.lower().split('-'):
                matched_indexes.append(idx)
        if len(matched_indexes) > 1:  # has_duplicates
            need_removed_index += matched_indexes[1:]
    need_removed_index = list(set(need_removed_index))

    # keep remain and join with ' '
    print(" ".join([i for idx, i in enumerate(s_split) if idx not in need_removed_index]))
    # print(sentences)

print(sentences)

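It works in most cases, except that:

  1. It also removes prepositions, because it is applied to the entire content of a row; I think a condition is needed so that the function only applies to words with len > 3
  2. Sometimes it removes the "'"
  3. It does not remove duplicates when the words differ only in case, for example 'apple' vs 'APPLE'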
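
The first issue appears to come from the substring test `w.lower() not in unique.lower()` in `uniqueList`: a short word such as "per" is dropped as soon as an earlier word such as "Period" contains it. A minimal sketch of the "only words with len > 3" idea, comparing whole tokens instead of substrings, might look like the following. The name `dedup_words` and the `min_len` parameter are hypothetical, not part of the original code, and this sketch does not handle the `L´Occitane` / `L'OCCITANE` or `Laessig` / `LÄSSIG` pairs, nor does it keep a comma attached to a dropped word.

```python
import string

def dedup_words(text, min_len=4):
    """Drop repeated words (case-insensitive), keeping the first occurrence.

    Hypothetical helper: words shorter than min_len (e.g. 'da', 'per')
    are never treated as duplicates and are always kept.
    """
    seen = set()
    kept = []
    for token in text.split():
        # Compare on a key without surrounding ASCII punctuation, ignoring case
        key = token.strip(string.punctuation).casefold()
        if len(key) >= min_len:
            if key in seen:
                continue  # a later occurrence of a long word: skip it
            seen.add(key)
        kept.append(token)  # keep the token exactly as it was written
    return " ".join(kept)

name = ("LOVABLE Lovable Period Panties Slip da Ciclo Mestruale "
        "Flusso Medio (Pacco da 2) Donna")
print(dedup_words(name))
# LOVABLE Period Panties Slip da Ciclo Mestruale Flusso Medio (Pacco da 2) Donna
```

Because the original tokens are appended unchanged, parentheses and slashes survive; only the comparison key is normalized.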

Sample data:

import pandas as pd

data = {'Name': ["LOVABLE Lovable Period Panties Slip da Ciclo Mestruale Flusso Medio (Pacco da 2) Donna",
                 "Laessig LÄSSIG Set di Cucchiaio per bambini 4 pezzi Uni menta/mirtillo",
                 "Béaba BÉABA, Set di 6 Contenitori per la Pappa per Svezzamento Bebè in Silicone",
                 "L´Occitane L'OCCITANE - CREMA MANI NUTRIENTE AL BURRO DI KARITÈ PER PELLI SECCHE 150ML"]}
df = pd.DataFrame(data)

Expected output:

LOVABLE Period Panties Slip da Ciclo Mestruale Flusso Medio (Pacco da 2) Donna
Laessig Set di Cucchiaio per bambini 4 pezzi Uni menta/mirtillo
Béaba, Set di 6 Contenitori per la Pappa per Svezzamento Bebè in Silicone
L´Occitane - CREMA MANI NUTRIENTE AL BURRO DI KARITÈ PER PELLI SECCHE 150ML

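Is there a way I can modify the functions above to cover these cases?

Thanks a lot.

【Comments】:

  • In pandas, the method you are interested in is DataFrame.duplicated(), which returns a boolean Series marking duplicated rows except for the first occurrence (this behavior can be changed with the keep parameter). See more here: pandas.pydata.org/docs/reference/api/…
  • @Erlinska, thanks, but if I understand your answer correctly: I am trying to remove the duplicates within each row, not the duplicated rows
  • Can you give an example of an input dataframe and the expected output?
  • As above. I think it would be better to add to the question something like: given these 10-15 sentences/strings (covering lots of different scenarios), I need them output as …
  • @MDR, updated, thanks

Tags: python pandas duplicates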


【Solution 1】:

Based on the strings provided...

Try:

import pandas as pd
import re
# import unidecode

data = {'Name': ["LOVABLE Lovable Period Panties Slip da Ciclo Mestruale Flusso Medio (Pacco da 2) Donna", 
                 "Laessig LÄSSIG Set di Cucchiaio per bambini 4 pezzi Uni menta/mirtillo",
             "Béaba BÉABA, Set di 6 Contenitori per la Pappa per Svezzamento Bebè in Silicone",
             "L´Occitane L'OCCITANE - CREMA MANI NUTRIENTE AL BURRO DI KARITÈ PER PELLI SECCHE 150ML"]}

df = pd.DataFrame(data)

def dedupString(s):
    '''
    Given a string 's' it processes the string and returns a string with duplicated words removed.
    - replaces acute accent with single quote
    - split string inc. punctuation to list
    - sets 'ALL CAPS' words to 'All Caps' words (only during processing)
    - loops through list and removes duplicates
    - if word has an uppercase third char (like L'Oréal) reinstates that
    - deduplicates the list and returns the list joined with a " "
    '''

    #replace acute accent (´) with a single quote (')
    s = s.replace("´", "'")
    #split the string inc. punctuation.  If ticks and dashes etc. go missing from the output
    #add them to the end of the second square brackets below.  Example -> [.,!?;-HERE]
    l = re.findall(r"[\w']+|[.,!?;-]", s)
    output = []
    seen = set()
    #loop through the words
    for word in l:
        wordAllCaps = False
        #if word is all caps record it
        if word.isupper():
            wordAllCaps = True
        #change, for example 'THE' to 'The' (and 'The' to 'The' but hey)
        if word[0].isupper():
            word = word.capitalize()
        #if the word is more than 3 chars
        if len(word) > 3:
            #and if the word has a single quote as the second char
            if word[1] == "'":
                #capitalize the third char in the word so "L'oréal" becomes "L'Oréal"
                word = ''.join([word[:2], word[2].upper(), word[2 + 1:]])
        #if the current word hasn't been seen before
        if word not in seen:
            #add it to seen
            seen.add(word)
            #if the word was originally all caps (like 'FOOBAR' but currently 'Foobar') change it back
            if wordAllCaps:
                word = word.upper()
            #add word to the output string
            output.append(word)     
        
    #return the list of words joined with spaces
    return ' '.join(output)

df['Name2'] = df['Name']
# df['Name2'] = df['Name2'].apply(unidecode.unidecode)
df['Name2'] = df.apply(lambda x: dedupString(x['Name2']), axis=1)
df['Name2'] = df['Name2'].str.replace(' , ', ', ', regex=False)

print(df)

Output:

                                                Name  \
0  LOVABLE Lovable Period Panties Slip da Ciclo M...   
1  Laessig LÄSSIG Set di Cucchiaio per bambini 4 ...   
2  Béaba BÉABA, Set di 6 Contenitori per la Pappa...   
3  L´Occitane L'OCCITANE - CREMA MANI NUTRIENTE A...   

                                               Name2  
0  LOVABLE Period Panties Slip da Ciclo Mestruale...  
1  Laessig LÄSSIG Set di Cucchiaio per bambini 4 ...  
2  Béaba, Set di 6 Contenitori per la Pappa Svezz...  
3  L'Occitane - CREMA MANI NUTRIENTE AL BURRO DI ... 

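Notes:

  • LOVABLE Lovable becomes LOVABLE because the first word is kept. Likewise, Béaba BÉABA, becomes Béaba, because the punctuation is moved up to the first original word.
  • If you are happy to overwrite the existing column, change df['Name2'] = to df['Name'] = in the code above. I suggest checking/sampling the output before dropping the original string column.
  • I have commented out a couple of lines (3 and 59) that could strip the Unicode characters (untested). I have left them out for now, but they are there if needed. When checking a larger dataset you can see whether Unicode characters cause problems (for example, with a string like façade Facade the question is whether the two words match as duplicates). Either swap out the Unicode before removing duplicates (uncomment lines 3 and 59 and try it) or leave it as is.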
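
As a rough illustration of the façade / Facade point, the sketch below builds a comparison key with the standard-library `unicodedata` module instead of `unidecode`. This is a swapped-in technique, not what the answer's code does, and `fold_key` is a hypothetical helper name. It only strips combining accents and case, so it treats façade and Facade as equal but does not map Ä to Ae.

```python
import unicodedata

def fold_key(word):
    """Comparison key that ignores case and combining accents (hypothetical helper)."""
    # NFKD splits 'ç' into 'c' plus a combining cedilla, 'É' into 'E' plus an accent
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks, keep the base characters
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.casefold()

print(fold_key("façade") == fold_key("Facade"))   # True
print(fold_key("Béaba") == fold_key("BÉABA"))     # True
print(fold_key("LÄSSIG") == fold_key("Laessig"))  # False: 'Ä' folds to 'a', not 'ae'
```

Using such a key only for the membership test, while appending the original word to the output, would keep the accents in the result.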

This works for the given strings. If characters go missing, note the comment in the code (you may need to change the regex as the dataset grows)...

#split the strings inc. punctuation.  If ticks and dashes etc. go missing from the output
#add them to the end of the second square brackets below.  Example -> [.,!?;-HERE]
l = re.findall(r"[\w']+|[.,!?;-]", s)
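
To see what that pattern keeps and drops, here is a small sketch on a fragment of the sample data. The extended character class is only an assumption about which characters one might want to add; it is not part of the answer's code.

```python
import re

sample = "(Pacco da 2) menta/mirtillo"

# Pattern from the answer: parentheses and the slash match neither branch
original = re.findall(r"[\w']+|[.,!?;-]", sample)
print(original)  # ['Pacco', 'da', '2', 'menta', 'mirtillo']

# Same pattern with '(', ')' and '/' added before the trailing '-'
extended = re.findall(r"[\w']+|[.,!?;()/-]", sample)
print(extended)  # ['(', 'Pacco', 'da', '2', ')', 'menta', '/', 'mirtillo']
```

Two caveats follow from how dedupString is written: the tokens are joined with spaces, so the extended pattern would yield "( Pacco da 2 )" rather than "(Pacco da 2)", and punctuation tokens go through the same seen set as words, so a second comma or parenthesis in the same string would be dropped as a duplicate.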

Addendum:

If your expected output is for Laessig LÄSSIG to become Laessig, try:

import pandas as pd
import re
import unidecode

data = {'Name': ["LOVABLE Lovable Period Panties Slip da Ciclo Mestruale Flusso Medio (Pacco da 2) Donna", 
                 "Laessig LÄSSIG Set di Cucchiaio per bambini 4 pezzi Uni menta/mirtillo",
             "Béaba BÉABA, Set di 6 Contenitori per la Pappa per Svezzamento Bebè in Silicone",
             "L´Occitane L'OCCITANE - CREMA MANI NUTRIENTE AL BURRO DI KARITÈ PER PELLI SECCHE 150ML"]}

df = pd.DataFrame(data)

swaps = {"ä":"ae", 
         #"ö":"oe", 
         "ü":"ue", 
         "Ä":"Ae", 
         #"Ö":"Oe", 
         "Ü":"Ue", 
         "ß":"ss"}

def toASCII(s):
    '''
    Input is a string; 
    - if the string contains any char in the keys of 'swaps' replace that char
    - sets words that are ALL CAPS to All Caps for consistent output
    '''
    #if the string has a char that is in the keys of 'swaps'
    if any(e in swaps for e in s):
        #for each word
        for w in s.split():
            #if the word is ALL CAPS
            if w.isupper():
                #make it All Caps
                s = s.replace(w, w.capitalize())

        #only after the caps pass, replace, for example, 'ä' with 'ae'
        #(swapping first turns 'LÄSSIG' into 'LAeSSIG', which the caps replace no longer finds)
        for old, new in swaps.items():
            s = s.replace(old, new)
    return s

def dedupString(s):
    '''
    Given a string 's' it processes the string and returns a string with duplicated words removed.
    - replaces acute accent with single quote
    - split string inc. punctuation to list
    - sets 'ALL CAPS' words to 'All Caps' words (only during processing)
    - loops through list and removes duplicates
    - if word has an uppercase third char (like L'Oréal) reinstates that
    - deduplicates the list and returns the list joined with a " "
    '''

    #replace acute accent (´) with a single quote (')
    s = s.replace("´", "'")
    #split the string inc. punctuation.  If ticks and dashes etc. go missing from the output
    #add them to the end of the second square brackets below.  Example -> [.,!?;-HERE]
    l = re.findall(r"[\w']+|[.,!?;-]", s)
    output = []
    seen = set()
    #loop through the words
    for word in l:
        wordAllCaps = False
        #if word is all caps record it
        if word.isupper():
            wordAllCaps = True
        #change, for example 'THE' to 'The' (and 'The' to 'The' but hey)
        if word[0].isupper():
            word = word.capitalize()
        #if the word is more than 3 chars
        if len(word) > 3:
            #and if the word has a single quote as the second char
            if word[1] == "'":
                #capitalize the third char in the word so "L'oréal" becomes "L'Oréal"
                word = ''.join([word[:2], word[2].upper(), word[2 + 1:]])
        #if the current word hasn't been seen before
        if word not in seen:
            #add it to seen
            seen.add(word)
            #if the word was originally all caps (like 'FOOBAR' but currently 'Foobar') change it back
            if wordAllCaps:
                word = word.upper()
            #add word to the output string
            output.append(word)     
        
    #return the list of words joined with spaces
    return ' '.join(output)

df['Name2'] = df['Name']
df['Name2'] = df.apply(lambda x: toASCII(x['Name2']), axis=1)
df['Name2'] = df['Name2'].apply(unidecode.unidecode)
df['Name2'] = df.apply(lambda x: dedupString(x['Name2']), axis=1)
df['Name2'] = df['Name2'].str.replace(' , ', ', ', regex=False)

print(df)

Output:

                                            Name  \
0  LOVABLE Lovable Period Panties Slip da Ciclo M...   
1  Laessig LÄSSIG Set di Cucchiaio per bambini 4 ...   
2  Béaba BÉABA, Set di 6 Contenitori per la Pappa...   
3  L´Occitane L'OCCITANE - CREMA MANI NUTRIENTE A...   

                                               Name2  
0  LOVABLE Period Panties Slip da Ciclo Mestruale...  
1  Laessig Set di Cucchiaio per bambini 4 pezzi U...  
2  Beaba, Set di 6 Contenitori per la Pappa Svezz...  
3  L'Occitane - CREMA MANI NUTRIENTE AL BURRO DI ...

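Obviously, for a larger dataset you will have to check that you are happy with the swaps dictionary. I have commented a few entries out because, for example, you may not want a word like Björn (if present in the larger set) to be converted.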
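
One possible way to apply such a dictionary in a single pass is `str.maketrans` with `str.translate`, sketched below. This is an alternative to the loop in toASCII, not the answer's own method, and `apply_swaps` is a hypothetical name. Characters missing from the table, such as the commented-out ö, are left untouched.

```python
# Same active entries as the swaps dictionary above ('ö'/'Ö' deliberately left out)
swaps = {"ä": "ae", "ü": "ue", "Ä": "Ae", "Ü": "Ue", "ß": "ss"}

# maketrans accepts a dict of single characters mapped to replacement strings
table = str.maketrans(swaps)

def apply_swaps(text):
    """Replace every character found in 'swaps'; leave all others as they are."""
    return text.translate(table)

print(apply_swaps("Lässig"))  # Laessig
print(apply_swaps("Björn"))   # Björn
print(apply_swaps("LÄSSIG"))  # LAeSSIG
```

The last line shows why the ALL CAPS words need to be capitalized before the swap: replacing Ä with Ae inside an upper-case word produces mixed case.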
