Add multiple calculated columns to a pandas dataframe at once: answers

【Question title】: Add multiple calculated columns to a pandas dataframe at once
【Posted】: 2017-04-27 21:08:56
【Question】:

I have a pandas dataframe that looks like this:

 ID1    ID2  Len1   Date1   Type1   Len2    Date2   Type2   Len_Diff    Date_Diff   Score
 123    456         1-Apr    M              6-Apr    L          
 234    567         20-Apr   S              19-Apr   S          
 345    678         10-Apr   M              1-Jan    M

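I want to populate the Len1, Len2, Len_Diff and Date_Diff columns by calculating them from the dataset. Each ID corresponds to a text file whose text can be retrieved with the get_text function, and the length of that text can then be computed.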

My code so far does this for each column individually:

def len_text(key):
   text = get_text(key)
   return len(text)

df['Len1'] = df['ID1'].map(len_text)
df['Len2'] = df['ID2'].map(len_text)
df['Len_Diff'] = (abs(df['Len1'] - df['Len2']))
df['Date_Diff'] = (abs(df['Date1'] - df['Date2']))
df['Same_Type'] = np.where(df['Type1']==df['Type2'],1,0)

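How can I add all of these columns to the dataframe in one step? I want them added in one step because I want to wrap the code in a try/except block to handle the ValueError raised when the text cannot be decoded.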

try: 
    <code to add all five columns at once>
except ValueError: 
    print("Failed to decode")

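Adding a try/except block around each of the lines above would make the code ugly.
There are other questions, such as "Changing certain values in multiple columns of a pandas DataFrame at once", that deal with multiple columns, but they all concern a single calculation or change affecting several columns. What I want is different calculations that add different columns.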
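
One way to get the all-or-nothing behaviour asked for here is to build every new column inside a single function and wrap only that one call in try/except. The following is a minimal sketch, not the asker's code: `get_text` is replaced by a hypothetical stub backed by a lookup table, `add_calculated_columns` is an invented helper name, and the dates are assumed to be parsed as datetimes already.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's get_text(); the real one reads a text file.
_FAKE_TEXTS = {123: "alpha", 456: "be", 234: "gamma", 567: "delta"}

def get_text(key):
    if key not in _FAKE_TEXTS:
        raise ValueError("cannot decode text for %r" % key)
    return _FAKE_TEXTS[key]

def len_text(key):
    return len(get_text(key))

def add_calculated_columns(df):
    """Return a new frame with all calculated columns; df itself is left untouched."""
    out = df.assign(
        Len1=df["ID1"].map(len_text),
        Len2=df["ID2"].map(len_text),
    )
    # A second assign() so that Len_Diff can read the freshly created Len1/Len2.
    return out.assign(
        Len_Diff=lambda d: (d["Len1"] - d["Len2"]).abs(),
        Date_Diff=lambda d: (d["Date1"] - d["Date2"]).abs(),
        Same_Type=lambda d: np.where(d["Type1"] == d["Type2"], 1, 0),
    )

df = pd.DataFrame({
    "ID1": [123, 234],
    "ID2": [456, 567],
    "Date1": pd.to_datetime(["2017-04-01", "2017-04-20"]),
    "Date2": pd.to_datetime(["2017-04-06", "2017-04-19"]),
    "Type1": ["M", "S"],
    "Type2": ["L", "S"],
})

try:
    df = add_calculated_columns(df)
except ValueError:
    print("Failed to decode")
```

Because `assign()` returns a new frame, a failure anywhere leaves the original `df` without any partially filled columns.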

Update: based on the answers below, I tried two different approaches to this problem, with only partial success so far. Here is what I did:
Approach 1:

# Add calculated columns Len1, Len2, Len_Diff, Date_Diff and Same_Type
def len_text(key):
    try:
        text = get_text(key)
        return len(text)
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError, requests.exceptions.Timeout, ValueError) as e:
        return 0

df.loc[:, ['Len1','Len2','Len_Diff','Date_Diff','Same_Type']] = pd.DataFrame([
        df['ID1'].map(len_text),
        df['ID2'].map(len_text),
        np.abs(df['ID1'].map(len_text) - df['ID2'].map(len_text)),
        np.abs(df['Date1'] - df['Date2']),
        np.where(df['Type1']==df['Type2'],1,0)
    ])

print(df.info())

Result 1:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 10 columns):
ID1                  570 non-null int64
Date1                570 non-null int64
Type1                566 non-null object     
Len1                 0 non-null float64
ID2                  570 non-null int64
Date2                570 non-null int64
Type2                570 non-null object     
Len2                 0 non-null float64     
Date_Diff            0 non-null float64   
Len_Diff             0 non-null float64
dtypes: float64(4), int64(4), object(2)
memory usage: 58.0+ KB
None

Approach 2:

def len_text(col):
    try:
        return col.map(get_text).str.len()
    except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError, requests.exceptions.Timeout, ValueError) as e:
        return 0

formulas = """
     Len1 = @len_text(ID1)
     Len2 = @len_text(ID2)
     Len_Diff = Len1 - Len2
     Len_Diff = Len_Diff.abs()
     Same_Type = (Type1 == Type2) * 1
     """
try:
    df.eval(formulas, inplace=True, engine='python')
except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError, requests.exceptions.Timeout, ValueError) as e:
    print(e)

print(df.info())

Result 2:

"__pd_eval_local_len_text" is not a supported function
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 7 columns):
ID1             570 non-null int64
Date1           570 non-null int64
Type1           566 non-null object
ID2             570 non-null int64
Date2           570 non-null int64
Type2           570 non-null object
Len1            570 non-null int64
dtypes: int64(5), object(2)
memory usage: 31.2+ KB
None
/Users/.../anaconda2/lib/python2.7/site-packages/pandas/computation/eval.py:289:
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  target[parsed_expr.assigner] = ret

【Comments on the question】:

Why not put all five lines inside the same try/except?
@Jerome, I tried that. When I do, only the first column gets populated.

Tags: python pandas dataframe httprequest text-analysis

【Solution 1】:

Something like this should do the job.

Edit 2: A really nasty way to do it in a single assignment is to evaluate Len1 and Len2 multiple times.

df.loc[:, ['Len1', 'Len2', 'Len_Diff', 'Date_Diff', 'Same_Type']] = \
    pd.DataFrame([
        df['ID1'].map(len_text),
        df['ID2'].map(len_text),
        np.abs(df['ID1'].map(len_text) - df['ID2'].map(len_text)),
        np.abs(df['Date1'] - df['Date2']),
        np.where(df['Type1']==df['Type2'],1,0)
    ])

However, it is much less readable than the original version.

Edit: here is a better way that needs just 2 statements.

df.loc[:, ['Len1', 'Len2']] = \
    pd.DataFrame([
        df['ID1'].map(len_text),
        df['ID2'].map(len_text)
    ])

df.loc[:, ['Len_Diff', 'Date_Diff', 'Same_Type']] = \
    pd.DataFrame([
        np.abs(df['Len1'] - df['Len2']),
        np.abs(df['Date1'] - df['Date2']),
        np.where(df['Type1']==df['Type2'],1,0)
    ])
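
A possible reason assignments like these leave the target columns empty: `pd.DataFrame([...])` built from a list of Series treats each Series as a row, so the result is transposed relative to `df` and its column labels are the original row labels rather than `Len1`, `Len2` and so on. The sketch below uses made-up values to show the shape difference, with `pd.concat(..., axis=1)` as one column-wise alternative. It illustrates the alignment issue only and is not a confirmed diagnosis of the asker's run.

```python
import pandas as pd

# Made-up lengths standing in for df['ID1'].map(len_text) and df['ID2'].map(len_text).
len1 = pd.Series([3, 5, 4], name="Len1")
len2 = pd.Series([6, 5, 1], name="Len2")

# A list of Series becomes one row per Series: 2 rows x 3 columns.
as_rows = pd.DataFrame([len1, len2])
print(as_rows.shape, list(as_rows.columns))

# Concatenating along axis=1 keeps one column per Series: 3 rows x 2 columns.
as_cols = pd.concat([len1, len2], axis=1)
print(as_cols.shape, list(as_cols.columns))

# The column-wise frame lines up with df on the row index, so join() can attach it.
df = pd.DataFrame({"ID1": [123, 234, 345], "ID2": [456, 567, 678]})
df = df.join(as_cols)
```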

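【Comments】:

I'm not sure this will work... the RHS is evaluated before the assignment takes place, so, for example, abs(df['Len1'] - df['Len2']) would use the wrong data.
Oh, I see. In that case it can't be done in one line.
Well, maybe not impossible, but certainly not clean ;)
I like your EDIT2 solution. One small thing: I think we should use np.abs() rather than abs().
@matusko, Edit2 worked like a charm. One problem, though: Len_Diff is not being populated. Not sure why.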

【Solution 2】:

You can use the DataFrame.eval() method:

In [254]: x
Out[254]:
   ID1  ID2   Date1 Type1   Date2 Type2
0  123  456   1-Apr     M   6-Apr     L
1  234  567  20-Apr     S  19-Apr     S
2  345  678  10-Apr     M   1-Jan     M

In [255]: formulas = """
     ...: Len1 = @len_text(ID1)
     ...: Len2 = @len_text(ID2)
     ...: Len_Diff = Len1 - Len2
     ...: Len_Diff = Len_Diff.abs()
     ...: Same_Type = (Type1 == Type2) * 1
     ...: """
     ...:

In [256]: x.eval(formulas, inplace=False, engine='python')
Out[256]:
   ID1  ID2   Date1 Type1   Date2 Type2  Len1  Len2  Len_Diff  Same_Type
0  123  456   1-Apr     M   6-Apr     L     3     3         0          0
1  234  567  20-Apr     S  19-Apr     S     3     3         0          1
2  345  678  10-Apr     M   1-Jan     M     3     3         0          1

PS: this solution assumes that the len_text() function can accept a column (a pandas.Series). For example:

def len_text(col):
    return col.map(get_text).str.len()
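
With the column-wise `len_text` above, a single failure inside `get_text` aborts the whole column. A hedged variant, assuming `get_text` raises `ValueError` for text it cannot decode, records a missing value for the failing rows instead. Here `get_text` is a hypothetical stub and `safe_get_text` is an invented helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for get_text(); key 345 is deliberately missing.
_FAKE_TEXTS = {123: "abc", 234: "abcde"}

def get_text(key):
    if key not in _FAKE_TEXTS:
        raise ValueError("cannot decode text for %r" % key)
    return _FAKE_TEXTS[key]

def safe_get_text(key):
    # Turn a per-row failure into a missing value instead of an exception.
    try:
        return get_text(key)
    except ValueError:
        return np.nan

def len_text(col):
    # .str.len() propagates missing values, so failed rows stay NaN.
    return col.map(safe_get_text).str.len()

ids = pd.Series([123, 234, 345], name="ID1")
lengths = len_text(ids)
```

The resulting column is floating point because of the NaN entries; rows that failed can be found afterwards with `lengths.isna()`.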

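【Comments】:

But Len1 is not the length of ID1... the value of ID1 is passed to get_text, which returns a string, and it is the length of that string that is of interest.
@juanpa.arrivillaga, thanks for your comment. I have fixed it.
@MaxU, this is great, and it looks clean too. Except that I still have the same problem: Len_Diff is not populated, just as with the previous answer.
@MaxU, in fact only Len1 is populated. None of the other columns are.
@Mnu, make sure each new column in formulas is on its own line; this is important. Apart from that, what happens with the other new columns: is there an error, are they empty, or something else?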

【Solution 3】:

Here is an example of how you can do this:

>>> df
      a  b     c
0  None  1  None
1  None  2  None
2  None  3  None
3  None  4  None
>>> def f(val):
...     return random.randint(1,10)
...
>>> df.loc[:,['a','c']] = df[['a','c']].applymap(f)
>>> df
    a  b   c
0   3  1   7
1  10  2  10
2   6  3   4
3   4  4   8

So, in your case:

df.loc[:,['Len1', 'Len2']] = df[['ID1','ID2']].applymap(len_text)

But, frankly, you are probably better off using the ugly version, because then you will know which column gave you the error.
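
If knowing which column failed matters, a loop over (new column, source column) pairs with one try/except in the loop body keeps the code compact while still reporting the culprit. This is only a sketch under assumptions: `get_text` is a hypothetical stub that raises `ValueError` for unknown keys, and the `failed` list is an invented bookkeeping name.

```python
import pandas as pd

# Hypothetical stand-in for get_text(); key 678 is deliberately missing.
_FAKE_TEXTS = {123: "abc", 234: "abcd", 345: "a", 456: "xy", 567: "xyz"}

def get_text(key):
    if key not in _FAKE_TEXTS:
        raise ValueError("cannot decode text for %r" % key)
    return _FAKE_TEXTS[key]

def len_text(key):
    return len(get_text(key))

df = pd.DataFrame({"ID1": [123, 234, 345], "ID2": [456, 567, 678]})

failed = []  # source columns whose texts could not all be decoded
for new_col, id_col in [("Len1", "ID1"), ("Len2", "ID2")]:
    try:
        df[new_col] = df[id_col].map(len_text)
    except ValueError as exc:
        failed.append(id_col)
        print("Failed to decode %s: %s" % (id_col, exc))
```

A column is added only when every row of its source column decodes, so in this sample `Len1` is created and `Len2` is skipped.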

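【Comments】:

I tried that too, but I get the error KeyError: ('ID1', 'ID2')
@Minus, that is odd. Are you sure those are exactly the column names? E.g. no stray whitespace?
This does not generalize well if you want to apply different functions (which is exactly the case in this question), right?
@matusko Ah, good point, yes, but I believe the OP just wants to avoid wrapping every line in a try/except, and it is these columns that actually raise the error!
The column names are correct. No whitespace. @matusko, I agree this does not generalize, but I'm just curious why it doesn't work.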