Apply function to every two columns in dataframe and replace original columns with output: answers

【Question title】: Apply function to every two columns in dataframe and replace original columns with output
【Posted】: 2021-02-17 03:01:54
【Question】:

I have a dataframe that contains X and Y data in columns like this:

import numpy as np
import pandas as pd

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']

np.random.seed(365)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)), columns=df_cols)

   x1  y1  x2  y2  x3  y3
0   2   4   1   5   2   2
1   9   8   4   0   3   3
2   7   7   7   0   8   4
3   3   2   6   2   6   8
4   9   6   1   6   5   7
5   7   6   5   9   3   8
6   7   9   9   0   1   4
7   0   9   6   5   6   9
8   5   3   2   7   9   2
9   6   6   3   7   7   1

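I need to call a function that takes one X and Y pair at a time and returns an updated X and Y pair (same length). The result should then either be saved to a new dataframe with the original column names, or replace the old X and Y data while keeping the original column names.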

For example, take this function:

def samplefunc(x, y):
    x = x*y
    y = x/10
    return x, y

# Apply function to each x & y pair 
x1, y1 = samplefunc(df.x1, df.y1)
x2, y2 = samplefunc(df.x2, df.y2)
x3, y3 = samplefunc(df.x3, df.y3)

# Save new/updated x & y pairs into new dataframe, preserving the original column names
df_updated = pd.DataFrame({'x1': x1, 'y1': y1, 'x2': x2, 'y2': y2, 'x3': x3, 'y3': y3})

# Desired result:
In [36]: df_updated
Out[36]: 
   x1   y1  x2   y2  x3   y3
0   8  0.8   5  0.5   4  0.4
1  72  7.2   0  0.0   9  0.9
2  49  4.9   0  0.0  32  3.2
3   6  0.6  12  1.2  48  4.8
4  54  5.4   6  0.6  35  3.5
5  42  4.2  45  4.5  24  2.4
6  63  6.3   0  0.0   4  0.4
7   0  0.0  30  3.0  54  5.4
8  15  1.5  14  1.4  18  1.8
9  36  3.6  21  2.1   7  0.7

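But doing this by hand for a huge dataset is obviously tedious and impractical. The similar/related questions I found either perform simple transformations on the data rather than calling a function, or add new columns to the dataframe instead of replacing the original columns.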

I tried applying @PaulH's answer to my dataset, but neither method works for me, because it is not clear how to actually call a function in either one.

# Method 1
array = np.array(my_actual_df)
df_cols = my_actual_df.columns
dist = 0.04 # a parameter I need for my function 
df = (
    pandas.DataFrame(array, columns=df_cols)
        .rename_axis(index='idx', columns='label')
        .stack()
        .to_frame('value')
        .reset_index()
        .assign(value=lambda df: numpy.select(
            [df['label'].str.startswith('x'), df['label'].str.startswith('y')],

            # Call the function (not working): 
            [df['value'], df['value']] = samplefunc(df['value'], df['value']),
        ))
        .pivot(index='idx', columns='label', values='value')
        .loc[:, df_cols]
)



# Method 2
df = (
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .stack(level='group')
         
        # Call the function (not working)
        .assign(df['x'], df['y'] = samplefunc(df['x'], df['y']))
        .unstack(level='group')
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)

The actual function I need to call comes from Arty's answer to this question: Resample trajectory to have equal euclidean distance in each sample

【Comments】:

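Does this answer your question? How to apply a function to two columns of Pandas dataframe
If the function performs separate operations on the x and y columns, you can add a condition that checks the column name and picks a different function for the x and y columns. That makes the whole process much easier.
@VirtualScooter Thanks, but it doesn't answer my question, because it creates a new column in the original dataframe instead of replacing the original data with the output. It also doesn't preserve the column names when adding the new data.
@AmirMaleki The actual function I am using needs both the x and y values as input, and returns the updated x and y.
Please add a seed to your randomization so that the data stays the same.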

Tags: python pandas dataframe numpy

【Solution 1】:

Use slicing and apply the operation to those slices.

def samplefunc(x, y):
    x = x**2
    y = y/10
    return x, y

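# Work on an object-dtype copy so integer and float results can share one array
arr = df.to_numpy().astype(object)
e_col = arr[:, ::2]   # x columns (positions 0, 2, 4, ...)
o_col = arr[:, 1::2]  # y columns (positions 1, 3, 5, ...)
e_col, o_col = samplefunc(e_col, o_col)
arr[:, ::2] = e_col
arr[:, 1::2] = o_col
out = pd.DataFrame(arr, columns=df.columns)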

   x1   y1  x2   y2  x3   y3
0   4  0.4   1  0.5   4  0.2
1  81  0.8  16  0.0   9  0.3
2  49  0.7  49  0.0  64  0.4
3   9  0.2  36  0.2  36  0.8
4  81  0.6   1  0.6  25  0.7
5  49  0.6  25  0.9   9  0.8
6  49  0.9  81  0.0   1  0.4
7   0  0.9  36  0.5  36  0.9
8  25  0.3   4  0.7  81  0.2
9  36  0.6   9  0.7  49  0.1
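
If the function has to receive exactly one x/y pair per call, the same even/odd slicing can be applied to the column labels rather than to the array. The following is a minimal sketch, assuming the columns strictly alternate x, y, x, y and that the function returns two sequences of the same length as its inputs. It uses the `samplefunc` from the question unchanged.

```python
import numpy as np
import pandas as pd

def samplefunc(x, y):
    # The example function from the question, unchanged.
    x = x * y
    y = x / 10
    return x, y

np.random.seed(365)
df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)), columns=df_cols)

updated = {}
# Labels at even positions are x columns, labels at odd positions are y columns.
for xcol, ycol in zip(df.columns[::2], df.columns[1::2]):
    updated[xcol], updated[ycol] = samplefunc(df[xcol], df[ycol])

# Dicts keep insertion order, so the columns come out in their original order.
df_updated = pd.DataFrame(updated)
```

Each call sees only one pair, so a function that works on a whole trajectory at a time can be dropped in where `samplefunc` is called.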

【Comments】:

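Could you edit the answer to call the sample function I provided (samplefunc)?
@CentauriAurelius There is actually no need to reshape; I edited the answer. ;)
Thanks, this is easier for me to understand, but the one problem is that it passes all the even columns and all the odd columns in one go. The function I am using requires one X and Y pair to be passed at a time.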

【Solution 2】:

A new approach here:

Split the columns into a multi-level index
Do a horizontal groupby
Modify your samplefunc to take a dataframe:

def samplefunc(df, xcol='x', ycol='y'):
    x = df[xcol].to_numpy()
    y = df[ycol].to_numpy()

    # Compute the new x first; the new y is derived from it, as in the question
    new_x = x * y
    df[xcol] = new_x
    df[ycol] = new_x / 10
    return df

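# `pandas`, `array` and `df_cols` are defined as in the setup code of Solution 3.
# Note: pandas 2.1 deprecated groupby(..., axis='columns').
df = (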
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .groupby(level='group', axis='columns')
        .apply(samplefunc)
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)

I get:

   x1   y1  x2   y2  x3   y3
0   8  0.8   5  0.5   4  0.4
1  72  7.2   0  0.0   9  0.9
2  49  4.9   0  0.0  32  3.2
3   6  0.6  12  1.2  48  4.8
4  54  5.4   6  0.6  35  3.5
5  42  4.2  45  4.5  24  2.4
6  63  6.3   0  0.0   4  0.4
7   0  0.0  30  3.0  54  5.4
8  15  1.5  14  1.4  18  1.8
9  36  3.6  21  2.1   7  0.7

【Comments】:

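This works great for the simple example function, but I was really hoping for an answer that doesn't require changing the function (i.e., just call the function on each pair of 2 columns and get two columns back). I'm worried that if I try to change the resample_euclid_equidist function I will break it, or that it will take forever to debug because it is so large and complex (and I am a mediocre programmer at best).
@CentauriAurelius The only way I changed the function was to unpack/pack the dataframe columns into numpy arrays: two lines at the start and two lines at the end.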
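
The unpacking and packing described in the last comment can also live outside the function, so the original stays untouched. Below is a rough sketch of such a wrapper. `frame_adapter` is a hypothetical helper name, not part of pandas, and the sketch assumes the wrapped function takes two arrays and returns two arrays of the same length.

```python
import pandas as pd

def samplefunc(x, y):
    # Stand-in for the real function; it is not modified in any way.
    x = x * y
    y = x / 10
    return x, y

def frame_adapter(func, xcol='x', ycol='y'):
    """Hypothetical helper: turn func(x, y) -> (x, y) into frame -> frame."""
    def wrapper(frame):
        # Unpack the two columns into NumPy arrays and call the original function.
        new_x, new_y = func(frame[xcol].to_numpy(), frame[ycol].to_numpy())
        # Pack the results into a copy so the caller's frame is left alone.
        out = frame.copy()
        out[xcol] = new_x
        out[ycol] = new_y
        return out
    return wrapper

pair = pd.DataFrame({'x': [2, 9, 7], 'y': [4, 8, 7]})
framed_samplefunc = frame_adapter(samplefunc)
result = framed_samplefunc(pair)
```

Extra parameters of the real function (such as the `dist` value mentioned in the question) could presumably be bound beforehand with `functools.partial`, though that depends on a signature not shown here.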

【Solution 3】:

There are several ways to do this, depending on how the actual dataframe is constructed.

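The first thing that comes to mind is to stack the dataframe completely and use numpy.select to compute the new values based on the label. You can then pivot the dataframe back to its original form: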

import numpy
import pandas

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']


numpy.random.seed(365)
array = numpy.random.randint(0, 10, size=(10, 6))
df = (
    pandas.DataFrame(array, columns=df_cols)
        .rename_axis(index='idx', columns='label')
        .stack()
        .to_frame('value')
        .reset_index()
        .assign(value=lambda df: numpy.select(
            [df['label'].str.startswith('x'), df['label'].str.startswith('y')],
            [df['value'] ** 2, df['value'] / 10],
        ))
        .pivot(index='idx', columns='label', values='value')
        .loc[:, df_cols]
)

label    x1   y1    x2   y2    x3   y3
idx                                   
0       4.0  0.4   1.0  0.5   4.0  0.2
1      81.0  0.8  16.0  0.0   9.0  0.3
2      49.0  0.7  49.0  0.0  64.0  0.4
3       9.0  0.2  36.0  0.2  36.0  0.8
4      81.0  0.6   1.0  0.6  25.0  0.7
5      49.0  0.6  25.0  0.9   9.0  0.8
6      49.0  0.9  81.0  0.0   1.0  0.4
7       0.0  0.9  36.0  0.5  36.0  0.9
8      25.0  0.3   4.0  0.7  81.0  0.2
9      36.0  0.6   9.0  0.7  49.0  0.1
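
To make the role of numpy.select clearer, here is a small standalone sketch on four made-up values (the labels and numbers are illustrative only). Each element is taken from the first choice array whose condition is true at that position, which is why this method suits independent per-label transforms but gives no obvious place for a function that needs x and y together.

```python
import numpy as np

labels = np.array(['x1', 'y1', 'x2', 'y2'])
values = np.array([2, 4, 1, 5])

# One boolean mask per kind of column.
is_x = np.array([label.startswith('x') for label in labels])
is_y = np.array([label.startswith('y') for label in labels])

# Where is_x holds, take values ** 2; where is_y holds, take values / 10.
result = np.select([is_x, is_y], [values ** 2, values / 10])
# result is [4.0, 0.4, 1.0, 0.5]
```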

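Alternatively, you can treat the column names as a hierarchy, convert them to a multi-level index, and then stack only the second level of that index. That way you end up with separate x and y columns that you can operate on directly and unambiguously: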

df = (
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .stack(level='group')
        .assign(x=lambda df: df['x'] ** 2, y=lambda df: df['y'] / 10)
        .unstack(level='group')
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
        # unstack orders the columns as x1, x2, x3, y1, ...; restore the original order
        .loc[:, df_cols]
)
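
One possible way to call a two-argument function in this stacked layout is to group the stacked rows by the `group` level and call the function once per group. This is only a sketch under assumptions: it uses a small hand-written frame instead of the random data, takes everything after the first character of a label as the group, and assumes the function returns pandas Series aligned with its inputs (as the question's `samplefunc` does).

```python
import pandas as pd

def samplefunc(x, y):
    # The example function from the question, unchanged.
    x = x * y
    y = x / 10
    return x, y

df_cols = ['x1', 'y1', 'x2', 'y2']
df = pd.DataFrame([[2, 4, 1, 5], [9, 8, 4, 0]], columns=df_cols)

# Split 'x1' into ('x', '1') so the pair label becomes its own column level.
wide = df.copy()
wide.columns = pd.MultiIndex.from_tuples(
    [(c[0], c[1:]) for c in df_cols], names=['which', 'group']
)

# After stacking, each row is (original row, group) with plain x and y columns.
stacked = wide.stack(level='group')

parts = []
for _, pair in stacked.groupby(level='group'):
    # One call per x/y pair.
    new_x, new_y = samplefunc(pair['x'], pair['y'])
    parts.append(pd.DataFrame({'x': new_x, 'y': new_y}))

# Back to wide form, flatten the labels, and restore the original column order.
updated = pd.concat(parts).unstack(level='group')
updated.columns = [''.join(c) for c in updated.columns]
updated = updated[df_cols]
```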

【Comments】:

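Thanks, I'm just not sure how to actually include the function call in the code you provided. I added my attempt to my question. The function requires x and y to be passed in together, and then returns the updated x and y arrays together.
@CentauriAurelius I think the second method is what you want.
It's also not clear how to call my function in the second method. I also added my attempt using the second method to my question.
@CentauriAurelius Where does resample_euclid_equidist come from?
It's the actual function I need to apply to my df. I just edited the code to call "samplefunc" instead.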