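Dataframe manipulation using iterrows with a subset: answers

【Question title】: dataframe manipulation using iterrows with a subset
【Posted】: 2019-01-24 15:02:38
【Question】:

I'm trying to manipulate this dataframe based on ID, initial amount, and balance. This is the dataframe I want; desired_output is a column I filled in by hand: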

import numpy as np
import pandas as pd
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 3, 3],
    "Initial amount": [7650, 25500, 56395, 13000, 10700, 12000, 27000],
    "Balance": [43388, 43388, 43388, 2617, 19250, 19250, 19250],
    "desired_output": [7650, 25500, 10238, 2617, 10720, 8530, 0]})

This is my current code:

unique_ids = list(df["ID"].unique())
new_output = []
for i,row in df.iterrows():
    this_adv = row["ID"]
    subset = df.loc[df["ID"] == this_adv,:]
    if len(subset) == 1:
        this_output = np.where(row["Balance"] >= row["Initial amount"], row["Initial amount"], row["Balance"])
        new_output.append(this_output)
    else:
        if len(subset) >= 1:
            if len(subset) == 1:
                this_output = np.where(row["Balance"] >= row["Initial amount"], row["Initial amount"], row["Balance"])
                new_output.append(this_output)
            elif row["Balance"] - sum(new_output) >= row["Initial amount"]:
                this_output = row["Initial amount"]
                new_output.append(this_output)
            else:
                this_output = row["Balance"] - sum(new_output)
                new_output.append(this_output)

new_df = pd.DataFrame({"new_output" : new_output})
final_df = pd.concat([df,new_df], axis = 1)

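Basically, what I'm trying to do is this: if an ID appears only once (len(subset) == 1), use the first if statement. Anything with more than one row for the ID (len(subset) >= 1) uses the other if statements. I'm not getting my desired output. How would you go about this?

Thanks! Any suggestions are appreciated.

【Comments】:

Tags: python python-3.x pandas

【Solution 1】:

It looks like your algorithm is trying to compute a rolling sum of Initial amount for each ID, and then compute each row's new_output value based partly on how the ID's Balance for the current period compares with the rolling balance of the previous period for the same ID.

If we start with your example dataframe: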

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 3, 3],
    "Initial amount": [7650, 25500, 56395, 13000, 10700, 12000, 27000],
    "Balance": [43388, 43388, 43388, 2617, 19250, 19250, 19250],
    "desired_output": [7650, 25500, 10238, 2617, 10720, 8530, 0]})

We first need to create temporary columns to store the ID count (the len(subset) you mentioned above) and then the rolling balance for each ID.

# Number of rows per ID; unlike value_counts().reset_index(), the resulting
# column name does not depend on the pandas version
df['ID Count'] = df.groupby('ID')['ID'].transform('size')
# Running total of 'Initial amount' within each ID
df['rolling_balance'] = df.groupby(['ID'])['Initial amount'].cumsum()
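
The "previous period" rolling balance used in the comparison could also be stored as its own column, which avoids looking back one row by position. This is only a minimal sketch: it assumes the rows of each ID are already in the intended order, and the column name prev_rolling_balance is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 3, 3],
    "Initial amount": [7650, 25500, 56395, 13000, 10700, 12000, 27000],
    "Balance": [43388, 43388, 43388, 2617, 19250, 19250, 19250],
})

# Running total of 'Initial amount' within each ID
df["rolling_balance"] = df.groupby("ID")["Initial amount"].cumsum()

# Rolling balance of the previous row of the same ID.
# The first row of every ID has no predecessor, so it gets NaN.
df["prev_rolling_balance"] = df.groupby("ID")["rolling_balance"].shift(1)

print(df)
```

Because the shift happens inside each group, a value from one ID never leaks into the first row of the next ID.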

We will also create a column to hold new_output:

df['new_output'] = 0

At this point df looks like this:

   ID  Initial amount  Balance  desired_output  ID Count  rolling_balance  new_output
0   1            7650    43388            7650         3             7650           0
1   1           25500    43388           25500         3            33150           0
2   1           56395    43388           10238         3            89545           0
3   2           13000     2617            2617         1            13000           0
4   3           10700    19250           10720         3            10700           0
5   3           12000    19250            8530         3            22700           0
6   3           27000    19250               0         3            49700           0

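Now for the meat of it: I wrote a function that I believe encapsulates the algorithm you were trying to implement with your if statements: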

def calc_output(count, init_amt, bal, cur_roll_bal, prev_roll_bal):
    if count == 1:
        return init_amt if bal > init_amt else bal
    else:
        if bal > init_amt:
            return init_amt if bal > cur_roll_bal else bal - prev_roll_bal
        else:
            return bal-prev_roll_bal if bal-prev_roll_bal > 0 else 0

Apply the algorithm above to each row:

for i, row in df.iterrows():
    # Make sure we are not at the first row belonging to an 'ID'
    if i > 0 and df.iloc[i-1]['ID'] == row['ID']:
        prev_idx = i-1
    else:
        prev_idx = i
    # Write back with df.at: assigning to `row` may only change a copy, not df
    df.at[i, 'new_output'] = calc_output(row['ID Count'], row['Initial amount'], row['Balance'], row['rolling_balance'], df.iloc[prev_idx]['rolling_balance'])
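
As a possible alternative to writing the result row by row, the outputs can be collected in a list and assigned as a column in one step. The sketch below is self-contained, so it repeats calc_output (condensed, but with the same branches) and rebuilds the helper columns. The name prev_roll is invented here, and an ID's first row falls back to its own rolling balance to mirror prev_idx = i in the loop above.

```python
import pandas as pd

def calc_output(count, init_amt, bal, cur_roll_bal, prev_roll_bal):
    # Same branches as the function above, repeated so this sketch runs alone
    if count == 1:
        return init_amt if bal > init_amt else bal
    if bal > init_amt:
        return init_amt if bal > cur_roll_bal else bal - prev_roll_bal
    return bal - prev_roll_bal if bal - prev_roll_bal > 0 else 0

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 3, 3],
    "Initial amount": [7650, 25500, 56395, 13000, 10700, 12000, 27000],
    "Balance": [43388, 43388, 43388, 2617, 19250, 19250, 19250],
})
df["ID Count"] = df.groupby("ID")["ID"].transform("size")
df["rolling_balance"] = df.groupby("ID")["Initial amount"].cumsum()

# Previous rolling balance within the same ID; the first row of an ID
# has none, so it reuses its own rolling balance (as the loop does).
prev_roll = (
    df.groupby("ID")["rolling_balance"]
    .shift(1)
    .fillna(df["rolling_balance"])
    .astype("int64")
)

# Build all outputs first, then assign the whole column at once
df["new_output"] = [
    calc_output(count, amt, bal, cur, prev)
    for count, amt, bal, cur, prev in zip(
        df["ID Count"], df["Initial amount"], df["Balance"],
        df["rolling_balance"], prev_roll,
    )
]

print(df[["ID", "Initial amount", "Balance", "new_output"]])
```

Nothing is written to the dataframe while it is being iterated, so it does not matter whether the rows handed out by iterrows are views or copies.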

Then drop the columns we used for the calculation:

df = df.drop(['ID Count', 'rolling_balance'], axis=1)

The dataframe then looks like this:

    ID  Initial amount  Balance  desired_output  new_output
0   1             7650    43388            7650        7650
1   1            25500    43388           25500       25500
2   1            56395    43388           10238       10238
3   2            13000     2617            2617        2617
4   3            10700    19250           10720       10700
5   3            12000    19250            8530        8550
6   3            27000    19250               0           0

My new_output value in row 4 is 20 less than its desired_output value, and the one in row 5 is 20 more, but I expect that is because those values were mistyped in your example dataframe above.
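
For what it's worth, on this sample the same numbers also fall out of a loop-free formula: take Balance minus whatever earlier rows of the same ID have already used, then clamp the result between 0 and the row's Initial amount. This is a sketch rather than a drop-in replacement for calc_output, and it may differ from calc_output on other inputs. It assumes Balance is constant within an ID and that row order is the allocation order; allocated_before is a name invented here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 3, 3],
    "Initial amount": [7650, 25500, 56395, 13000, 10700, 12000, 27000],
    "Balance": [43388, 43388, 43388, 2617, 19250, 19250, 19250],
})

# Amount already taken by earlier rows of the same ID:
# the per-ID cumulative sum minus the current row's own amount.
allocated_before = (
    df.groupby("ID")["Initial amount"].cumsum() - df["Initial amount"]
)

# Whatever balance is left, but never below 0 and never more than
# the amount this row asks for.
df["new_output"] = (df["Balance"] - allocated_before).clip(
    lower=0, upper=df["Initial amount"]
)

print(df)
```

On the sample data this gives 10700 and 8550 for rows 4 and 5, the same values as the loop.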

【Comments】: