Optimizing dataframe manipulation: new column based on conditional logic and multiple columns

【Question Title】: Optimizing dataframe manipulation: new column based on conditional logic and multiple columns
【Posted】: 2017-08-17 22:34:29
【Question】:

Currently, this works:

df['new'] = df.apply(
    lambda x: address[int(x['c1'][:5], 2)] + '_' + str(int(x['c1'][6:11], 2))
    if x['c1'][5] == '1'
    else address[int(x['c2'][:5], 2)] + '_' + str(int(x['c2'][6:11], 2)),
    axis=1)

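address is a dictionary.

But it is really slow. Specifically, applying a function to the whole dataframe is much slower than applying it to a single selected column. However, the new column depends on multiple columns, and I'm not sure how to handle that.

Also, is there a way to vectorize these kinds of logical/conditional statements?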

Sample dataframe:

<bound method DataFrame.head of
                     c1                c2
0      0000100111000111  0010110011000111
1      0001000111000111  0010110011000111
2      0101010001001010  0000000000000000
3      0101010010001110  0000000000000000
4      0101010011101010  0000000000000000
5      0111111100000100  0000000000000000
6      0111110010010110  0000000000000000
7      1000000001001100  0000000000000000
8      1110011110001000  0000000000000000
9      0000100001010000  0000000000000000
10     0001000001001010  0000000000000000
11     0101101100100100  0000000000000000
12     1110001100100100  0000000000000000
13     0010100101101001  0101010101101001
14     0000100101100000  0000000000000000
15     0000100110100000  0000000000000000
16     0001000101101011  0000000000000000
17     1001110000100001  0000000000000000
18     0111111000100000  0000000000000000
19     1000000100010110  0000000000000000
20     1110001111000010  0000000000000000
21     1011010001000010  0000000000000000
22     0110010001001111  0000000000000000
23     0111110000110101  0000000000000000
24     0111110001001100  0000000000000000
25     1000000000111101  0000000000000000
26     0000110001100010  0000000000000000
27     0001010001100010  0000000000000000
28     1100100100100101  1001011000000101
29     0101000010101010  0111110001001010
...                 ...               ...
95714  0101111100011000  0000000000000000
95715  0010101011001011  0000000000000000
95716  0010100111100110  0101010110100110
95717  0010101000100100  0101011011100100
95718  0101000110000101  0000000000000000

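【Comments】:

df.apply combined with a lambda is always slow. Please show a sample of your dataframe.
You can't ask a question, get an answer, and then say you really meant to ask a different question.

Tags: python performance pandas optimization dataframe

【Solution 1】:

You need a vectorized if-then-else, also known as np.where (np stands for numpy, just in case).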

import numpy as np
df['new'] = np.where(df['c1'].str[5] == '1',
                     df['c1'].str[:5], 
                     df['c2'].str[:5])
#                 c1                c2    new
#0  0000100111000111  0010110011000111  00101
#1  0001000111000111  0010110011000111  00101
#2  0101010001001010  0000000000000000  01010
#....
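
The same idea can be stretched to the edited question, which also needs the int() conversion, the dictionary lookup, and the string concatenation. The following is only a sketch under assumptions: the contents of `address` are not shown in the question, so a hypothetical dictionary with one label per 5-bit value is used here, and the names `prefix_lookup`, `suffix_lookup`, `use_c1`, and `source` are made up for illustration. Because a 5-character bit string has only 32 possible values, both conversions can be precomputed once and applied with `Series.map`.

```python
import pandas as pd

# Hypothetical stand-in for the question's `address` dict (its real contents are unknown).
address = {i: f"addr{i}" for i in range(32)}

# A few rows taken from the sample dataframe in the question.
df = pd.DataFrame({
    "c1": ["0000100111000111", "0001000111000111", "0101010001001010", "0010100101101001"],
    "c2": ["0010110011000111", "0010110011000111", "0000000000000000", "0101010101101001"],
})

# Precompute the int(..., 2) conversions for every possible 5-bit string.
prefix_lookup = {format(k, "05b"): v for k, v in address.items()}
suffix_lookup = {format(k, "05b"): str(k) for k in range(32)}

# Pick the source column per row: c1 where its character at position 5 is '1', else c2.
use_c1 = df["c1"].str[5] == "1"
source = df["c1"].where(use_c1, df["c2"])

# Slice the chosen string and translate both pieces through the lookup tables.
df["new"] = source.str[:5].map(prefix_lookup) + "_" + source.str[6:11].map(suffix_lookup)
```

The column is chosen once per row with `Series.where`, so the slicing and lookups run on a single Series instead of inside a per-row lambda. If `address` does not cover every 5-bit value, the missing keys would come back as NaN from `map` and would need separate handling.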

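【Discussion】:

Thanks a lot! I'll try this and PaSTE's answer and see which one is faster.
OK, I've run into some problems because I may have oversimplified what my code actually looks like. I've edited my original code.
I don't see any recent edits. My answer matches your code. Did you copy the answer correctly? What are the "problems"?
Just changed it. It now has the added complexity of str(), int(), a dictionary, and string concatenation.
@MattTakao Changing the nature of the question is bad form. It completely wastes the time of the people trying to help you. It also makes any answers posted before your edit irrelevant. It would be much better to accept the best answer and ask a new question.

【Solution 2】:

It looks like you are trying to operate on the character values of the string column c1. Row-by-row string manipulation like this is slow, but pandas can help you with its .str functions: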

# begin by setting all of the values to what you want from c1
df['new'] = df['c1'].str.slice(stop=5)

# replace those that meet your criteria with what you want from 'c2'
df.loc[df['c1'].str.get(5) == '1', 'new'] = df['c2'].str.slice(stop=5)
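
To check the speed claim on your own machine, a rough timing harness along these lines may help. It is a sketch on synthetic data, not a benchmark of the asker's dataframe: the row count, the random seed, and the helper names `with_apply` and `with_loc` are all assumptions, and both helpers use the same selection rule as the snippet above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic 16-character bit strings, similar in shape to the question's columns.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "c1": [format(int(v), "016b") for v in rng.integers(0, 2**16, size=n)],
    "c2": [format(int(v), "016b") for v in rng.integers(0, 2**16, size=n)],
})

def with_apply(frame):
    # Row-by-row version: a Python lambda is called once per row.
    return frame.apply(
        lambda x: x["c2"][:5] if x["c1"][5] == "1" else x["c1"][:5], axis=1
    )

def with_loc(frame):
    # Column-wise version: start from c1, then overwrite the masked rows from c2.
    out = frame["c1"].str.slice(stop=5)
    mask = frame["c1"].str.get(5) == "1"
    out.loc[mask] = frame["c2"].str.slice(stop=5)
    return out

t_apply = timeit.timeit(lambda: with_apply(df), number=3)
t_loc = timeit.timeit(lambda: with_loc(df), number=3)
print(f"apply: {t_apply:.3f}s  .str + .loc: {t_loc:.3f}s")
```

The absolute numbers depend on the pandas version and hardware, so treat the printout as a relative comparison only.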

【Discussion】:

【Solution 3】:

Using boolean indexing:

df['New']=df.c1.str[:5]
df.loc[df.c1.str[5]=='1','New']=(df.c2.str[:5])[df.c1.astype(str).str[5]=='1']
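
A small sketch of why the right-hand side can be written with or without the boolean filter: assignment through .loc aligns a Series on its index, so only the rows selected on the left are taken from the right. The three rows are copied from the question's sample; the names `mask`, `filtered`, and `unfiltered` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "c1": ["0000100111000111", "0101010001001010", "0010100101101001"],
    "c2": ["0010110011000111", "0000000000000000", "0101010101101001"],
})
mask = df["c1"].str[5] == "1"

# Right-hand side filtered with the same mask, as in the snippet above.
filtered = df["c1"].str[:5]
filtered.loc[mask] = df["c2"].str[:5][mask]

# Right-hand side left whole; index alignment picks out the matching rows.
unfiltered = df["c1"].str[:5]
unfiltered.loc[mask] = df["c2"].str[:5]

print(filtered.equals(unfiltered))
```

Either spelling should give the same column, so the extra filter is mostly a matter of making the intent explicit.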

【Discussion】:

@jezrael Happy holidays~~ :-)