Building a new column in a pandas dataframe by matching string values in a list: answers

【Question title】: Building a new column in a pandas dataframe by matching string values in a list
【Posted】: 2016-03-21 23:26:32
【Question description】:

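I am trying to build a new column, NewCol4, in a pandas dataframe based on another column, SearchCol3, that is already in the dataframe. Each value of SearchCol3 is tested to see whether it contains any of the substrings in the list strings. If a value in SearchCol3 contains one of those substrings, the corresponding value from the list replacement is inserted into NewCol4 on the same row where the substring was found. If no substring is found in the SearchCol3 value, the value from Col2 is inserted into NewCol4 instead.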

Desired result:

    Col1  Col2    SearchCol3   NewCol4
0   20    'May'   'abc(feb)'   'February'
1   30    'March' 'def | mar'  'March'
2   40    'June'  'ghi | feb'  'February'
3   50    'July'  'jkl(apr)'   'April'
4   60    'May'   'mno(mar)'   'March'
5   70    'March' 'abc'        'March'

Currently I am using this code to get the job done.

import pandas

strings = ['jan',
           'feb',
           'mar',
           'apr',
           'may']

replacement = ['January',
               'February',
               'March',
               'April',
               'May']


data = pandas.read_csv('data.csv')

data['NewCol4'] = ''

for j in range(len(strings)):
    for i in range(len(data)):
        if strings[j] in data.SearchCol3[i]:
            data.NewCol4[i] = replacement[j]


for i in range(len(data)):
    if data.NewCol4[i] == '':
        data.NewCol4[i] = data.Col2[i]

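My data, search and replacement dataframes and lists are much longer than in this example. I am looking for a more efficient approach than the one I am currently using, to save time. Any suggestions?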

【Question comments】:

Tags: python python-2.7 pandas replace nested-loops

【Solution 1】:

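Since there are many equivalent ways to accomplish this task, and it looked like a useful test case for getting a better idea of Pandas' performance in general, I benchmarked all of the submitted answers (I am the author of the code in cells #3, #4, #7 and #10).

I increased the input data size a thousandfold to allow a more realistic and fair comparison; otherwise the pure Python solutions outperform the Pandas-based approaches because of constant overhead that becomes irrelevant for large datasets. The solutions are ordered from best to worst.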

In [1]:
import pandas as pd
df = pd.DataFrame({'Col1': [20, 30, 40, 50, 60, 70],
                   'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
                   'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb', 'jkl(apr)', 'mno(mar)', 'abc']})

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']

mapping = dict(zip(strings, replacement))

a_regex = '(jan|feb|mar|apr|may)'
month_replacements = {'jan': 'January','feb': 'February',
            'mar': 'March','apr': 'April','may': 'May'}


In [2]:
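# Use a more realistic input size (6 rows repeated 1000 times)
df['NewCol4'] = ''
# Older pandas also exposed .consolidate() here; it is omitted because
# current releases do not provide it.
df = pd.concat([df]*1000).reset_index()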


In [3]:
%%timeit -n 100
result = []
for searchcol, default in zip(df["SearchCol3"], df["Col2"]):
    for s in mapping:
        if s in searchcol:
            result.append(mapping[s])
            break
    else:
        result.append(default)
df['NewCol4'] = result
100 loops, best of 3: 2.69 ms per loop


In [4]:
%%timeit -n 100
result = []
for index, searchcol, default in df[["SearchCol3", "Col2"]].itertuples():
    for s in mapping:
        if s in searchcol:
            result.append(mapping[s])
            break
    else:
        result.append(default)
df['NewCol4'] = result
100 loops, best of 3: 8.64 ms per loop


In [5]:
%%timeit -n 100
df['NewCol4'] = df.Col2
for i, s in enumerate(strings):
    df.loc[df.SearchCol3.str.contains(s), 'NewCol4'] = replacement[i]
100 loops, best of 3: 23.1 ms per loop


In [6]:
%%timeit -n 100
df['NewCol4'] = None
# Use month name if abbreviation in `SearchCol3`.
for month_code, month in zip(strings, replacement):
    df.loc[df.SearchCol3.str.contains(month_code), 'NewCol4'] = month
# Create a mask of null values and apply Col2 if null.
mask = df.NewCol4.isnull()
df.loc[mask, 'NewCol4'] = df.loc[mask, 'Col2']
100 loops, best of 3: 24.4 ms per loop


In [7]:
%%timeit -n 100
def match_string(searchcol, default):
    for s in mapping:
        if s in searchcol:
            return mapping[s]
    return default

df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
100 loops, best of 3: 135 ms per loop


In [8]:
%%timeit -n 100
def match_string(col3, col2):
    k = ([s for s in strings if s in col3])
    if k:  # if found in col3, return that result
        return replacement[strings.index(k[0])]
    l = ([s for s in replacement if s in col2])
    if l:  # else if found in col2, return second best option
        return l[0]
    return ''  # if neither, return empty string
df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)
100 loops, best of 3: 144 ms per loop


In [9]:
%%timeit -n 100
# Extract the abbreviation using the regex (expand=False returns a Series)
df['NewCol4'] = df['SearchCol3'].str.extract(a_regex, expand=False).fillna('')
# Look up values from the dictionary
df['NewCol4'] = df['NewCol4'].apply(lambda x: month_replacements.get(x,''))
# Use the default value from the other column if there is no other value
df['NewCol4'] = df.apply(lambda row: row['Col2'] if row['NewCol4'] == '' else row['NewCol4'], axis=1)
100 loops, best of 3: 147 ms per loop


In [10]:
%%timeit -n 10
df['NewCol4'] = ''
for index, row in df.iterrows():
    searchcol = row["SearchCol3"]
    for s in mapping:
        if s in searchcol:
            df.loc[index, "NewCol4"] = mapping[s]
            break
    else:
        df.loc[index, "NewCol4"] = row["Col2"]
10 loops, best of 3: 2.82 s per loop
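
Outside IPython, a comparison of this kind can be approximated with the standard `timeit` module. The following is only a minimal sketch, not the benchmark that produced the numbers above: the helper names `zip_loop` and `contains_loop` are made up for illustration, and absolute timings will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']
mapping = dict(zip(strings, replacement))

small = pd.DataFrame({'Col1': [20, 30, 40, 50, 60, 70],
                      'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
                      'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                                     'jkl(apr)', 'mno(mar)', 'abc']})
# Repeat the six rows 1000 times to get a less trivial input size.
df = pd.concat([small] * 1000, ignore_index=True)


def zip_loop(frame):
    """Plain Python loop over two columns (hypothetical helper name)."""
    result = []
    for searchcol, default in zip(frame['SearchCol3'], frame['Col2']):
        for s in mapping:
            if s in searchcol:
                result.append(mapping[s])
                break
        else:
            # No abbreviation matched: fall back to the Col2 value.
            result.append(default)
    return result


def contains_loop(frame):
    """One str.contains pass per abbreviation (hypothetical helper name)."""
    out = frame['Col2'].copy()
    for s, month in zip(strings, replacement):
        out[frame['SearchCol3'].str.contains(s)] = month
    return out.tolist()


# Both helpers should agree before their timings are worth comparing.
assert zip_loop(df) == contains_loop(df)

for func in (zip_loop, contains_loop):
    best = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
    print('{}: {:.2f} ms per call'.format(func.__name__, best * 1e3))
```

Taking the minimum over several repeats mirrors the "best of 3" figure that `%%timeit` reports.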

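【Comments】:

+1 for the effort. Also, does this mean I wrote something that is both fast and readable? Wow. Today is a good day :)

【Solution 2】:

List comprehensions are often the fastest way to work with object-dtype columns in a DataFrame. Here is a single list comprehension, formatted for readability: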

import pandas as pd

df = pd.DataFrame(
    {'Col1': [20, 30, 40, 50, 60, 70],
    'Col2': ['May','March','June','July','May','March'],
    'SearchCol3': ['abc(feb)','def | mar','ghi | feb','jkl(apr)','mno(mar)','abc']})

df['NewCol4'] = ['January'  if 'jan' in x else
                 'February' if 'feb' in x else
                 'March'    if 'mar' in x else
                 'April'    if 'apr' in x else
                 'May'      if 'may' in x else
                 default  # no abbreviation found: fall back to Col2
                 for x, default in zip(df['SearchCol3'], df['Col2'])]

Output:

    Col1    Col2    SearchCol3  NewCol4
 0  20      May     abc(feb)    February
 1  30      March   def | mar   March
 2  40      June    ghi | feb   February
 3  50      July    jkl(apr)    April
 4  60      May     mno(mar)    March
 5  70      March   abc         March
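
Since the question mentions that the real lists are much longer, the hard-coded chain of conditionals may be awkward to maintain. One possible generalization, sketched here under the assumption that the `strings` and `replacement` lists from the question are paired by position, uses `next()` with a default so that rows without a match take their value from `Col2`:

```python
import pandas as pd

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']

df = pd.DataFrame(
    {'Col1': [20, 30, 40, 50, 60, 70],
     'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
     'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                    'jkl(apr)', 'mno(mar)', 'abc']})

df['NewCol4'] = [
    # next() yields the first replacement whose abbreviation occurs in the
    # search value, or the Col2 value when no abbreviation occurs.
    next((month for s, month in zip(strings, replacement) if s in search),
         default)
    for search, default in zip(df['SearchCol3'], df['Col2'])
]

print(df)
```

As with the explicit chain, the first matching entry in `strings` wins, so the order of the list matters when a value contains more than one abbreviation.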

【Comments】:

【Solution 3】:

This worked for me, and on the plus side, it is very readable!

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']

def match_string(col3, col2):
    # if in col3, return that result. Else, lazy eval for col2. If neither, return empty string.
    k = ([replacement[strings.index(s)] for s in strings if s in col3]) or ([s for s in replacement if s in col2])
    return k[0] if k else ''

df['NewCol4'] = df.apply(lambda x: match_string(x['SearchCol3'], x['Col2']), axis=1)

Output:

    Col2 SearchCol3   NewCol4
0    May   abc(feb)  February
1  March  def | mar     March
2   June  ghi | feb  February
3   July   jkl(apr)     April
4    May   mno(mar)     March
5  March        abc     March

【Comments】:

【Solution 4】:

Another approach:

data['NewCol4'] = data.Col2
for i,s in enumerate(strings):
    data.loc[data.SearchCol3.str.contains(s),'NewCol4']=replacement[i]
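
Two details of this loop are worth keeping in mind. `str.contains` treats its argument as a regular expression by default, and a missing value in `SearchCol3` would produce a missing value in the mask. The sketch below is a self-contained variant that passes `regex=False` and `na=False` to sidestep both; the sample frame is the one used in the other answers, not the asker's real data.

```python
import pandas as pd

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']

data = pd.DataFrame(
    {'Col1': [20, 30, 40, 50, 60, 70],
     'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
     'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                    'jkl(apr)', 'mno(mar)', 'abc']})

# Start from the fallback value, then overwrite rows that contain a match.
data['NewCol4'] = data['Col2']
for s, month in zip(strings, replacement):
    # regex=False: treat s as a literal substring, so characters such as
    # '(' or '|' in a search string are not read as regex syntax.
    # na=False: rows with a missing SearchCol3 count as "no match".
    mask = data['SearchCol3'].str.contains(s, regex=False, na=False)
    data.loc[mask, 'NewCol4'] = month

print(data)
```

Because each pass overwrites earlier assignments, a value containing several abbreviations ends up with the replacement for the last matching entry in `strings`, whereas the loop-and-break solutions keep the first.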

【Comments】:

【Solution 5】:

# Initial data.
df = pd.DataFrame({'Col1': [20, 30, 40, 50, 60, 70],
                   'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
                   'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb', 'jkl(apr)', 'mno(mar)', 'abc']})

strings = ['jan',
           'feb',
           'mar',
           'apr',
           'may']

replacement = ['January',
               'February',
               'March',
               'April',
               'May']

# Set new column values to None.
df['NewCol4'] = None

# Use month name if abbreviation in `SearchCol3`.
for month_code, month in zip(strings, replacement):
    df.loc[df.SearchCol3.str.contains(month_code), 'NewCol4'] = month

# Create a mask of null values and apply Col2 if null.
mask = df.NewCol4.isnull()
df.loc[mask, 'NewCol4'] = df.loc[mask, 'Col2']

# Voila!
>>> df
   Col1   Col2 SearchCol3   NewCol4
0    20    May   abc(feb)  February
1    30  March  def | mar     March
2    40   June  ghi | feb  February
3    50   July   jkl(apr)     April
4    60    May   mno(mar)     March
5    70  March        abc     March

【Comments】:

【Solution 6】:

.str.extract accepts a regular expression.

http://pandas.pydata.org/pandas-docs/version/0.15.2/generated/pandas.core.strings.StringMethods.extract.html#pandas.core.strings.StringMethods.extract

import pandas as pd
df = pd.DataFrame(
        {'Col1': [20, 30, 40, 50, 60, 70],
        'Col2': ['May','March','June','July','May','March'],
        'SearchCol3': ['abc(feb)','def | mar','ghi | feb','jkl(apr)','mno(mar)','abc']})


a_regex = '(jan|feb|mar|apr|may)'
month_replacements = {'jan': 'January','feb': 'February',
            'mar': 'March','apr': 'April','may': 'May'}

# Extract the abbreviation using the regex (expand=False returns a Series)
df['NewCol4'] = df['SearchCol3'].str.extract(a_regex, expand=False).fillna('')
# Look up values from the dictionary
df['NewCol4'] = df['NewCol4'].apply(lambda x: month_replacements.get(x,''))
# Use the default value from the other column if there is no other value
df['NewCol4'] = df.apply(lambda row: row['Col2'] if row['NewCol4'] == '' else row['NewCol4'], axis=1)

Output:

   Col1   Col2 SearchCol3   NewCol4
0    20    May   abc(feb)  February
1    30  March  def | mar     March
2    40   June  ghi | feb  February
3    50   July   jkl(apr)     April
4    60    May   mno(mar)     March
5    70  March        abc     March
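
The two `apply` steps can probably be avoided. As a sketch of the same extract-then-look-up idea, and not something taken from the answer itself, the extracted abbreviation can be passed through `Series.map` with the dictionary and the gaps filled from `Col2` with `fillna`. Building the pattern from the `strings` list with `re.escape` is an added assumption for the case where the real search strings contain regex metacharacters.

```python
import re

import pandas as pd

strings = ['jan', 'feb', 'mar', 'apr', 'may']
replacement = ['January', 'February', 'March', 'April', 'May']
month_replacements = dict(zip(strings, replacement))

# One capture group with an alternative per (escaped) search string.
a_regex = '(' + '|'.join(re.escape(s) for s in strings) + ')'

df = pd.DataFrame(
    {'Col1': [20, 30, 40, 50, 60, 70],
     'Col2': ['May', 'March', 'June', 'July', 'May', 'March'],
     'SearchCol3': ['abc(feb)', 'def | mar', 'ghi | feb',
                    'jkl(apr)', 'mno(mar)', 'abc']})

# expand=False returns a Series; rows without a match hold NaN.
codes = df['SearchCol3'].str.extract(a_regex, expand=False)
# map() translates abbreviations to month names and leaves NaN as NaN;
# fillna() then takes the Col2 value for those rows (aligned on the index).
df['NewCol4'] = codes.map(month_replacements).fillna(df['Col2'])

print(df)
```

When a value contains several abbreviations, `str.extract` returns the leftmost match in the text, which can differ from the list-order precedence of the loop-based answers.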

【Comments】:

Much better than my approach. +1
Thanks, this is a lot faster than what I was using.