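【Question title】: Building a new column in a pandas dataframe by matching string values in a list
【Posted】: 2016-03-21 23:26:32
【Question】:

I am trying to build a new column, NewCol4, in a pandas dataframe based on another column, SearchCol3, that is already in the dataframe. Each value of SearchCol3 is tested to see whether it contains any of the substrings in the list strings. If a value in SearchCol3 contains one of the substrings, the corresponding value from the list replacement is inserted into NewCol4 on the same row where the substring was found. If no substring is found in the SearchCol3 value, the value from Col2 is inserted into NewCol4 instead.

Desired result: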

    Col1  Col2    SearchCol3   NewCol4
0   20    'May'   'abc(feb)'   'February'
1   30    'March' 'def | mar'  'March'
2   40    'June'  'ghi | feb'  'February'
3   50    'July'  'jkl(apr)'   'April'
4   60    'May'   'mno(mar)'   'March'
5   70    'March' 'abc'        'March'

Currently I am using this code to do the job.

strings = ['jan',
           'feb',
           'mar',
           'apr',
           'may']

replacement = ['January',
               'February',
               'March',
               'April',
               'May']


import pandas

data = pandas.read_csv('data.csv')

data['NewCol4'] = ''

for j in range(len(strings)):
    for i in range(len(data)):
        if strings[j] in data.SearchCol3[i]:
            data.NewCol4[i] = replacement[j]


for i in range(len(data)):
    if data.NewCol4[i] == '':
        data.NewCol4[i] = data.Col2[i]

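My data, search and replacement dataframes and lists are much longer than in this example. I am looking for a more efficient approach than the one I currently use, to save time. Any suggestions?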

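【Comments】:

    Tags: python python-2.7 pandas replace nested-loops


    【Solution 1】:

    Since there are many equivalent ways to accomplish this task, and it looks like a useful test case for getting a better feel for pandas performance in general, I benchmarked all of the submitted answers (I am the author of the code in cells #3, #4, #7 and #10).

    I increased the input size a thousandfold to make the comparison more realistic and fair; otherwise the pure-Python solutions outperform the pandas-based approaches because of constant overhead that becomes irrelevant for large datasets. The solutions are sorted from best to worst.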
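
The `%%timeit` cells rely on IPython. As a rough sketch of how one of these measurements could be reproduced in a plain Python 3 script, the standard `timeit` module can be used instead. The names `base`, `big` and `zip_loop` are made up for this illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Same toy data as in the question, repeated 1000 times.
base = pd.DataFrame({
    'Col1': [20, 30, 40, 50, 60, 70],
    'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
    'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                   'jkl(apr)', 'mno(mar)', 'abc'],
})
big = pd.concat([base] * 1000, ignore_index=True)

mapping = {'jan': 'January', 'feb': 'February', 'mar': 'March',
           'apr': 'April', 'may': 'May'}


def zip_loop(frame):
    """Hypothetical helper: plain-Python loop over two columns."""
    result = []
    for searchcol, default in zip(frame['SearchCol3'], frame['Col2']):
        for s in mapping:
            if s in searchcol:
                result.append(mapping[s])
                break
        else:
            # No abbreviation matched: fall back to Col2.
            result.append(default)
    return result


# Best of 3 repeats, 10 calls each, reported per call in milliseconds.
best = min(timeit.repeat(lambda: zip_loop(big), number=10, repeat=3)) / 10
print(f'{best * 1e3:.2f} ms per loop')
```

Taking the minimum over the repeats mirrors the "best of 3" figure that `%%timeit` prints.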

    In [1]:
    import pandas as pd
    df = pd.DataFrame({'Col1': [20, 30, 40, 50, 60, 70],
                       'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
                       'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb', 'jkl(apr)', 'mno(mar)', 'abc']})
    
    strings = ['jan', 'feb', 'mar', 'apr', 'may']
    replacement = ['January', 'February', 'March', 'April', 'May']
    
    mapping = dict(zip(strings, replacement))
    
    a_regex = '(jan|feb|mar|apr|may)'
    month_replacements = {'jan': 'January','feb': 'February',
                'mar': 'March','apr': 'April','may': 'May'}
    
    
    In [2]:
    # Use a more realistic input size
    df['NewCol4'] = ''
    df = pd.concat([df]*1000).reset_index()
    
    
    In [3]:
    %%timeit -n 100
    result = []
    for searchcol, default in zip(df["SearchCol3"], df["Col2"]):
        for s in mapping:
            if s in searchcol:
                result.append(mapping[s])
                break
        else:
            result.append(default)
    df['NewCol4'] = result
    100 loops, best of 3: 2.69 ms per loop
    
    
    In [4]:
    %%timeit -n 100
    result = []
    for index, searchcol, default in df[["SearchCol3", "Col2"]].itertuples():
        for s in mapping:
            if s in searchcol:
                result.append(mapping[s])
                break
        else:
            result.append(default)
    df['NewCol4'] = result
    100 loops, best of 3: 8.64 ms per loop
    
    
    In [5]:
    %%timeit -n 100
    df['NewCol4'] = df.Col2
    for i, s in enumerate(strings):
        df.loc[df.SearchCol3.str.contains(s), 'NewCol4'] = replacement[i]
    100 loops, best of 3: 23.1 ms per loop
    
    
    In [6]:
    %%timeit -n 100
    df['NewCol4'] = None
    # Use month name if abbreviation in `SearchCol3`.
    for month_code, month in zip(strings, replacement):
        df.loc[df.SearchCol3.str.contains(month_code), 'NewCol4'] = month
    # Create a mask of null values and apply Col2 if null.
    mask = df.NewCol4.isnull()
    df.loc[mask, 'NewCol4'] = df.loc[mask, 'Col2']
    100 loops, best of 3: 24.4 ms per loop
    
    
    In [7]:
    %%timeit -n 100
    def match_string(searchcol, default):
        for s in mapping:
            if s in searchcol:
                return mapping[s]
        return default
    
    df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
    100 loops, best of 3: 135 ms per loop
    
    
    In [8]:
    %%timeit -n 100
    def match_string(col3, col2):
        k = ([s for s in strings if s in col3])
        if k:  # if found in col3, return that result
            return replacement[strings.index(k[0])]
        l = ([s for s in replacement if s in col2])
        if l:  # else if found in col2, return second best option
            return l[0]
        return ''  # if neither, return empty string
    df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
    100 loops, best of 3: 144 ms per loop
    
    
    In [9]:
    %%timeit -n 100
    #Extract Using Regex
    df['NewCol4'] = df['SearchCol3'].str.extract(a_regex).fillna('')
    #Look up values from dictionary
    df['NewCol4'] = df['NewCol4'].apply(lambda x: month_replacements.get(x,''))
    #Use default value from other column if no other value
    df['NewCol4'] = df.apply(lambda row: row['Col2'] if row['NewCol4'] == '' else row['NewCol4'], axis=1)
    100 loops, best of 3: 147 ms per loop
    
    
    In [10]:
    %%timeit -n 10
    df['NewCol4'] = ''
    for index, row in df.iterrows():
        searchcol = row["SearchCol3"]
        for s in mapping:
            if s in searchcol:
                df.loc[index, "NewCol4"] = mapping[s]
                break
        else:
            df.loc[index, "NewCol4"] = row["Col2"]
    10 loops, best of 3: 2.82 s per loop
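
One detail the timings do not show: the pure-Python loops (cells 3, 4, 7, 10) stop at the first entry of `strings` that matches, while the `.loc` loops (cells 5 and 6) keep overwriting, so the last matching entry wins. The two agree on this data because no value contains two abbreviations. As a hedged sketch that was not part of the benchmark, `numpy.select` gives a vectorized version that keeps the first-match behaviour:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': [20, 30, 40, 50, 60, 70],
    'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
    'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                   'jkl(apr)', 'mno(mar)', 'abc'],
})

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']

# One boolean array per abbreviation; regex=False does a plain substring test.
conditions = [
    df['SearchCol3'].str.contains(s, regex=False).to_numpy(dtype=bool)
    for s in strings
]

# np.select takes the choice of the first condition that is True on each row
# and uses the default (here Col2) where none is True.
df['NewCol4'] = np.select(conditions, replacement,
                          default=df['Col2'].to_numpy(dtype=object))
```

This still scans the column once per abbreviation, so it is an alternative formulation rather than a guaranteed speed-up.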
    

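    【Comments】:

    • +1 for the effort. Also, does this mean I wrote something that is both fast and readable? Wow. Today is a good day :)
    【Solution 2】:

    List comprehensions are usually the fastest way to work with object-dtype DataFrame columns. Here is a one-line list comprehension, formatted for readability: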

    import pandas as pd
    
    df = pd.DataFrame(
        {'Col1': [20, 30, 40, 50, 60, 70],
        'Col2': ['May','March','June','July','May','March'],
        'SearchCol3': ['abc(feb)','def | mar','ghi | feb','jkl(apr)','mno(mar)','abc']})
    
    df['NewCol4'] = ['January'  if 'jan' in x else
                     'February' if 'feb' in x else
                     'March'    if 'mar' in x else
                     'April'    if 'apr' in x else
                     'May'      if 'may' in x else
                     default  # no abbreviation found: fall back to Col2
                     for x, default in zip(df['SearchCol3'], df['Col2'])]
    

    Output:

        Col1    Col2    SearchCol3  NewCol4
     0  20      May     abc(feb)    February
     1  30      March   def | mar   March
     2  40      June    ghi | feb   February
     3  50      July    jkl(apr)    April
     4  60      May     mno(mar)    March
     5  70      March   abc         March
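
The chained conditional spells out every month by hand, which does not scale to the longer lists mentioned in the question. A minimal sketch of the same idea driven by a dictionary, assuming the `strings` and `replacement` lists from the question (the name `mapping` is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [20, 30, 40, 50, 60, 70],
    'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
    'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                   'jkl(apr)', 'mno(mar)', 'abc'],
})

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']
mapping = dict(zip(strings, replacement))

# next() returns the replacement for the first abbreviation found in the
# text, or the row's Col2 value when nothing matches.
df['NewCol4'] = [
    next((full for abbr, full in mapping.items() if abbr in text), default)
    for text, default in zip(df['SearchCol3'], df['Col2'])
]
```

The generator inside `next()` stops at the first hit, so the order of `strings` decides which abbreviation wins when a value contains more than one.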
    

    【Comments】:

      【Solution 3】:

      This worked for me, and on the plus side, it is very readable!

      strings = ['jan', 'feb', 'mar', 'apr', 'may']
      replacement = ['January', 'February', 'March', 'April', 'May']
      
      def match_string(col3, col2):
          # if in col3, return that result. Else, lazy eval for col2. If neither, return empty string.
          k = ([replacement[strings.index(s)] for s in strings if s in col3]) or ([s for s in replacement if s in col2])
          return k[0] if k else ''
      
      df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
      

      Output:

          Col2 SearchCol3   NewCol4
      0    May   abc(feb)  February
      1  March  def | mar     March
      2   June  ghi | feb  February
      3   July   jkl(apr)     April
      4    May   mno(mar)     March
      5  March        abc     March
      

      【Comments】:

        【Solution 4】:

        Another approach:

        data['NewCol4'] = data.Col2
        for i,s in enumerate(strings):
            data.loc[data.SearchCol3.str.contains(s),'NewCol4']=replacement[i]
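
One caveat with this approach: `str.contains` interprets its argument as a regular expression by default. That is harmless for plain abbreviations such as 'feb', but if the real search list contains characters like `|` or `(`, passing `regex=False` makes it a literal substring test. A small illustration (the Series `s` is made up for this example):

```python
import pandas as pd

s = pd.Series(['abc(feb)', 'def | mar', 'abc'])

# As a regex, '|' is an alternation of two empty patterns and matches anything.
as_regex = s.str.contains('|')

# As a literal, '|' only matches values that actually contain the character.
literal = s.str.contains('|', regex=False)

print(as_regex.tolist())  # [True, True, True]
print(literal.tolist())   # [False, True, False]
```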
        

        【Comments】:

          【Solution 5】:
          import pandas as pd

          # Initial data.
          df = pd.DataFrame({'Col1': [20, 30, 40, 50, 60, 70],
                             'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
                             'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb', 'jkl(apr)', 'mno(mar)', 'abc']})
          
          strings = ['jan',
                     'feb',
                     'mar',
                     'apr',
                     'may']
          
          replacement = ['January',
                         'February',
                         'March',
                         'April',
                         'May']
          
          # Set new column values to None.
          df['NewCol4'] = None
          
          # Use month name if abbreviation in `SearchCol3`.
          for month_code, month in zip(strings, replacement):
              df.loc[df.SearchCol3.str.contains(month_code), 'NewCol4'] = month
          
          # Create a mask of null values and apply Col2 if null.
          mask = df.NewCol4.isnull()
          df.loc[mask, 'NewCol4'] = df.loc[mask, 'Col2']
          
          # Voila!
          >>> df
             Col1   Col2 SearchCol3   NewCol4
          0    20    May   abc(feb)  February
          1    30  March  def | mar     March
          2    40   June  ghi | feb  February
          3    50   July   jkl(apr)     April
          4    60    May   mno(mar)     March
          5    70  March        abc     March
          

          【Comments】:

            【Solution 6】:

            .str.extract accepts a regular expression.

            http://pandas.pydata.org/pandas-docs/version/0.15.2/generated/pandas.core.strings.StringMethods.extract.html#pandas.core.strings.StringMethods.extract

            import pandas as pd
            df = pd.DataFrame(
                    {'Col1': [20, 30, 40, 50, 60, 70],
                    'Col2': ['May','March','June','July','May','March'],
                    'SearchCol3': ['abc(feb)','def | mar','ghi | feb','jkl(apr)','mno(mar)','abc']})
            
            
            a_regex = '(jan|feb|mar|apr|may)'
            month_replacements = {'jan': 'January','feb': 'February',
                        'mar': 'March','apr': 'April','may': 'May'}
            
            # Extract using regex (expand=False returns a Series rather than a one-column DataFrame)
            df['NewCol4'] = df['SearchCol3'].str.extract(a_regex, expand=False).fillna('')
            # Look up values from dictionary
            df['NewCol4'] = df['NewCol4'].apply(lambda x: month_replacements.get(x,''))
            # Use default value from other column if no other value
            df['NewCol4'] = df.apply(lambda row: row['Col2'] if row['NewCol4'] == '' else row['NewCol4'], axis=1)
            

            Output:

               Col1   Col2 SearchCol3   NewCol4
            0    20    May   abc(feb)  February
            1    30  March  def | mar     March
            2    40   June  ghi | feb  February
            3    50   July   jkl(apr)     April
            4    60    May   mno(mar)     March
            5    70  March        abc     March
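
For longer lists, the pattern and the lookup table can be built from the question's `strings` and `replacement` lists instead of being typed by hand, and the two `apply` calls can be swapped for `Series.map` and `Series.fillna`. This is a sketch of that variant, not the code that was benchmarked above; `pattern`, `mapping` and `codes` are names chosen for this example.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Col1': [20, 30, 40, 50, 60, 70],
    'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
    'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                   'jkl(apr)', 'mno(mar)', 'abc'],
})

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']
mapping = dict(zip(strings, replacement))

# Build '(jan|feb|...)' from the list; re.escape protects any regex
# metacharacters that might appear in a real search list.
pattern = '(' + '|'.join(re.escape(s) for s in strings) + ')'

# Series of the matched abbreviation per row, NaN where nothing matched.
codes = df['SearchCol3'].str.extract(pattern, expand=False)

# Translate abbreviations to month names, then fill the gaps from Col2.
df['NewCol4'] = codes.map(mapping).fillna(df['Col2'])
```

Note that the regex picks whichever abbreviation appears first in the text, whereas the loop-based answers pick whichever comes first in `strings`; the results only differ when a value contains more than one abbreviation.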
            

            【Comments】:

            • Much better than my approach. +1
            • Thanks, this is much faster than what I was using.