Python Pandas operate on row: answers

【Question title】: Python Pandas operate on row
【Posted】: 2014-05-30 08:56:44
【Question description】:

Hi, my dataframe looks like this:

Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264

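Simply put, I need to add another column called "id" that is the concatenation of Store, Dept and Date, like "1_1_2010-02-05". I assumed I could do this with df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turns out that does not work.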

Similarly, I also need to add a new column holding the log of Sales. I tried df['logSales'] = math.log(df['Sales']), and that failed as well.

【Question discussion】:

Tags: python pandas dataframe

【Solution 1】:

You can convert the integer columns to strings first and then concatenate with +:

In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']

In [26]: df
Out[26]: 
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1_1_2010-02-05
1      1     1  2010-02-12    449  1_1_2010-02-12
2      1     1  2010-02-19    455  1_1_2010-02-19
3      1     1  2010-02-26    154  1_1_2010-02-26
4      1     1  2010-03-05     29  1_1_2010-03-05
5      1     1  2010-03-12    239  1_1_2010-03-12
6      1     1  2010-03-19    264  1_1_2010-03-19
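
As a self-contained sketch of the conversion above: the frame is rebuilt from the question's sample data, and the failing expression is included only to illustrate why the cast is needed. The variable names csv_text and plain_concat_failed are made up for this example.

```python
import io

import pandas as pd

csv_text = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(csv_text))

# Store and Dept are parsed as integers, so adding a string to them fails.
try:
    df['Store'] + '_'
    plain_concat_failed = False
except TypeError:
    plain_concat_failed = True

# After casting the integer columns to str, + concatenates element-wise.
# Date was read as text (no parse_dates), so it needs no cast here.
df['id'] = df['Store'].astype(str) + '_' + df['Dept'].astype(str) + '_' + df['Date']
print(df['id'].tolist())
```

If Date had been parsed as a datetime column, it would presumably need its own conversion to string (for example with astype(str)) before being concatenated.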

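For the log, you are better off using the numpy function. It is vectorized, so it works on the whole column at once (math.log can only handle a single scalar value):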

In [34]: df['logSales'] = np.log(df['Sales'])

In [35]: df
Out[35]: 
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1_1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1_1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1_1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1_1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1_1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1_1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1_1_2010-03-19  5.575949
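
A minimal sketch of the difference between the two log functions, using the Sales values from the question. The names sales, math_log_failed and log_sales are chosen for this example only.

```python
import math

import numpy as np
import pandas as pd

sales = pd.Series([245, 449, 455, 154, 29, 239, 264], name='Sales')

# math.log expects a single number; a multi-element Series cannot be
# converted to one float, so the call raises TypeError.
try:
    math.log(sales)
    math_log_failed = False
except TypeError:
    math_log_failed = True

# np.log is applied to every element and returns a new Series.
log_sales = np.log(sales)
print(log_sales.round(6).tolist())
```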

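To summarize the comments: for a dataframe of this size, using apply makes little difference in performance compared with vectorized functions (which work on whole columns), but the gap grows as your real dataframe gets larger.
Apart from that, I think the syntax of the solution above is also simpler.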
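
One way to check this on your own machine is a small timing sketch like the one below. The 7000-row Series is a hypothetical stand-in (the question's Sales values repeated 1000 times), and the absolute timings will vary with hardware and library versions, so no expected numbers are given.

```python
import math
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger input: the 7 sample values repeated 1000 times.
sales = pd.Series([245, 449, 455, 154, 29, 239, 264] * 1000, name='Sales')


def log_with_apply():
    # Calls math.log once per element from Python.
    return sales.apply(math.log)


def log_vectorized():
    # Computes the log of the whole column in one numpy call.
    return np.log(sales)


# Both approaches produce the same values; only the speed differs.
same_result = np.allclose(log_with_apply(), log_vectorized())

runs = 20
t_apply = timeit.timeit(log_with_apply, number=runs)
t_vec = timeit.timeit(log_vectorized, number=runs)
print(f"apply(math.log): {t_apply / runs * 1e3:.3f} ms per call")
print(f"np.log:          {t_vec / runs * 1e3:.3f} ms per call")
```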

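【Discussion】:

I get 164 us using math versus 151 us using the numpy log. I assume that for large dataframes numpy will eat math.log for breakfast?
Indeed, I get 201 us (np) vs 208 us (math), so for this dataframe they are almost identical, but for a larger dataframe (this one repeated 100 times) numpy is clearly faster than using math.log.
For a dataframe of 7000 rows, math.log takes 2.17 ms whereas np.log takes 240 us, so that is a significant performance gain.
The same goes for the concatenation: for this dataframe, using apply is not slower (500 vs 700 us, so even a bit faster), but for a larger dataframe (7000 rows) it is again clearly slower (200 vs 80 ms).
Yes, I hoped as much. Good to know that vectorized operations scale well; I still have a lot to learn about pandas and numpy ;)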

【Solution 2】:

In [153]:
import pandas as pd
import io

temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
   Store  Dept        Date  Sales
0      1     1  2010-02-05    245
1      1     1  2010-02-12    449
2      1     1  2010-02-19    455
3      1     1  2010-02-26    154
4      1     1  2010-03-05     29
5      1     1  2010-03-12    239
6      1     1  2010-03-19    264

[7 rows x 4 columns]
In [154]:
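# Apply a lambda row-wise (axis=1); Store and Dept must be converted to strings to build the new string.
# Note: Store and Dept are joined with a space here, as the output below shows; use '_' there to get ids like "1_1_2010-02-05".
df['id'] = df.apply(lambda x: str(x['Store']) + ' ' + str(x['Dept']) + '_' + x['Date'], axis=1)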
df
Out[154]:
   Store  Dept        Date  Sales              id
0      1     1  2010-02-05    245  1 1_2010-02-05
1      1     1  2010-02-12    449  1 1_2010-02-12
2      1     1  2010-02-19    455  1 1_2010-02-19
3      1     1  2010-02-26    154  1 1_2010-02-26
4      1     1  2010-03-05     29  1 1_2010-03-05
5      1     1  2010-03-12    239  1 1_2010-03-12
6      1     1  2010-03-19    264  1 1_2010-03-19

[7 rows x 5 columns]
In [155]:

import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
   Store  Dept        Date  Sales              id  logSales
0      1     1  2010-02-05    245  1 1_2010-02-05  5.501258
1      1     1  2010-02-12    449  1 1_2010-02-12  6.107023
2      1     1  2010-02-19    455  1 1_2010-02-19  6.120297
3      1     1  2010-02-26    154  1 1_2010-02-26  5.036953
4      1     1  2010-03-05     29  1 1_2010-03-05  3.367296
5      1     1  2010-03-12    239  1 1_2010-03-12  5.476464
6      1     1  2010-03-19    264  1 1_2010-03-19  5.575949

[7 rows x 6 columns]
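
A hedged sketch of the same row-wise approach with '_' between every part, so that the ids match the "1_1_2010-02-05" format requested in the question. The two-row frame and the names row and vectorized are chosen for this example only, and the result is compared against the column-wise expression with astype(str).

```python
import pandas as pd

df = pd.DataFrame({
    'Store': [1, 1],
    'Dept': [1, 1],
    'Date': ['2010-02-05', '2010-02-12'],
    'Sales': [245, 449],
})

# Row-wise: each row is passed to the lambda as a Series (axis=1).
df['id'] = df.apply(
    lambda row: f"{row['Store']}_{row['Dept']}_{row['Date']}", axis=1
)

# Column-wise equivalent, for comparison.
vectorized = df['Store'].astype(str) + '_' + df['Dept'].astype(str) + '_' + df['Date']
ids_match = bool((df['id'] == vectorized).all())
print(df['id'].tolist())
```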

【Discussion】: