如何修改定义的函数以计算所需的输出（Pandas）答案

【问题标题】：How to amend defined function to calculate wanted output (Pandas)如何修改定义的函数以计算所需的输出（Pandas）
【发布时间】：2023-02-01 15:08:08
【问题描述】：

我正在尝试通过使用以下条件三次循环遍历“名称”、“val_id”和“fac_id”列来计算以下“new_field”列。

1.在每个“val_id”循环中，如果“product”==“CL”，则“val_against”和“our_val_amt”的最小值，例如min( val_against (134), our_val_amt (424)) 因此'NEW FIELD' = 134。此外，如果 new_field 的总和超过“our_val_amt”，则从“our_val_amt”中减去它。例如对于 val_id“xx4”，(200 + 300 + 50) = 550 超过了 our_val_amt = 510，因此新文件 = 510 - 500（即此总和超过 our_val_amt 后的 200 + 300）= 10。

2.如果 product != 'CL' 并且在同一个 'val_id' 组中。从“our_val_amt”中减去的余数将插入到“new_field”中。例如 'our_val_amt' (424) - 来自步骤 1 (134) = 290。这插入在 'NEW FIELD' 上方。

如果 [product] 没有“CL”，它只需要在每个 [val_id] 之间分配 [our_val_amt]。例如 val_id = 'xx7' our_val_amt =700 这在插入的第一行 (650) 中展开，然后剩下 700 - 650 = 50 在下一行中插入，根据示例，以下为 0。

3. 对 val_id xx2 重复步骤。 CL = 104 和 XL = 472 - 104 = 368 的新字段计算。

目前，'name' - compx（第 0 - 9 行）的输出工作正常，并且开始无法正确计算。我也不确定这段代码是如何工作的，因为我是 Pandas 的新手，如果有人能解释定义的函数程序是如何思考的，我将不胜感激。

df = pd.DataFrame(data=[["compx","xx1","yy1",424,418,"XL"],["compx","xx1","yy2",424,134,"CL"],["compx","xx2","yy3",472,60,"DL"],["compx","xx2","yy4",472,104,"CL"], ["compx", "xx3", "yy5", 490, 50, "XL"], ["compx", "xx3", "yy6", 490, 500, "CL"], ["compx", "xx3", "yy7", 490, 200, "DL"], ["compx", "xx4", "yy8", 510, 200, "CL"], ["compx", "xx4", "yy9", 510, 300, "CL"], ["compx", "xx4", "yy10", 510, 50, "CL"], ["compy", "xx5", "yy11", 510, 200, "CL"], ["compy", "xx5", "yy12", 510, 300, "CL"], ["compy", "xx5", "yy12", 510, 50, "CL"], ["compy", "xx5", "yy13", 510, 30, "DL"], ["compz", "xx6", "yy14", 350, 200, "CL"], ["compz", "xx6", "yy15", 350, 100, "CL"], ["compz", "xx6", "yy16", 350, 50, "XL"], ["compz", "xx6", "yy17", 350, 50, "DL"], ["compz", "xx7", "yy18", 700, 650, "DL"], ["compz", "xx7", "yy19", 700, 200, "DL"], ["compz", "xx7", "yy20", 700, 400, "XL"] ], columns=["name","val_id","fac_id","our_val_amt","val_against","product"])
df


# Compute tuple of "our_val_amt", "val_against" and "product" for easy processing as one column. It is hard to process multiple columns with "transform()".
df["the_tuple"] = df[["our_val_amt", "val_against", "product"]].apply(tuple, axis=1)

def compute_new_field_for_cl(g):
  # df_g is a tuple ("our_val_amt", "val_against", "product") indexed as (0, 1, 2).
  df_g = g.apply(pd.Series)
  df_g["new_field"] = df_g.apply(lambda row: min(row[0], row[1]) if row[2] == "CL" else 0, axis=1)
  df_g["cumsum"] = df_g["new_field"].cumsum()
  df_g["new_field"] = df_g.apply(lambda row: 0 if row["cumsum"] > row[0] else row["new_field"], axis=1)
  df_g["max_cumsum"] = df_g["new_field"].cumsum()
  df_g["new_field"] = df_g.apply(lambda row: row[0] - row["max_cumsum"] if row["cumsum"] > row[0] else row["new_field"], axis=1)
  return df_g["new_field"]

# Apply above function and compute new field values for "CL".
df["new_field"] = df.groupby("val_id")[["the_tuple"]].transform(compute_new_field_for_cl)

# Re-compute tuple of "our_val_amt", "new_field" and "product".
df["the_tuple"] = df[["our_val_amt", "new_field", "product"]].apply(tuple, axis=1)

def compute_new_field_for_not_cl(g):
  # df_g is a tuple ("our_val_amt", "new_field", "product") indexed as (0, 1, 2).
  df_g = g.apply(pd.Series)
  result_sr = df_g.where(df_g[2] != "CL")[0] - df_g[df_g[2] == "CL"][1].sum()
  result_sr = result_sr.fillna(0) + df_g[1]
  return result_sr

# Apply above function and compute new field values for "CL".
df["new_field"] = df.groupby("val_id")[["the_tuple"]].transform(compute_new_field_for_not_cl)

df = df.drop("the_tuple", axis=1)
df

Dataset和new_field的输出试图实现。

name    |val_id |fac_id     |   our_val_amt |   val_against |   product |   new_field
compx   |   xx1 |   yy1     |   424         |   418         |   XL      |   290
compx   |   xx1 |   yy2     |   424         |   134         |   CL      |   134
compx   |   xx2 |   yy3     |   472         |   60          |   DL      |   368
compx   |   xx2 |   yy4     |   472         |   104         |   CL      |   104
compx   |   xx3 |   yy5     |   490         |   50          |   XL      |   0
compx   |   xx3 |   yy6     |   490         |   500         |   CL      |   490
compx   |   xx3 |   yy7     |   490         |   200         |   DL      |   0
compx   |   xx4 |   yy8     |   510         |   200         |   CL      |   200
compx   |   xx4 |   yy9     |   510         |   300         |   CL      |   300
compx   |   xx4 |   yy10    |   510         |   50          |   CL      |   10
compy   |   xx5 |   yy11    |   510         |   200         |   CL      |   200
compy   |   xx5 |   yy12    |   510         |   300         |   CL      |   300
compy   |   xx5 |   yy12    |   510         |   50          |   CL      |   10
compy   |   xx5 |   yy13    |   510         |   30          |   DL      |   0
compz   |   xx6 |   yy14    |   350         |   200         |   CL      |   200
compz   |   xx6 |   yy15    |   350         |   100         |   CL      |   100
compz   |   xx6 |   yy16    |   350         |   50          |   XL      |   50
compz   |   xx6 |   yy17    |   350         |   50          |   DL      |   0
compz   |   xx7 |   yy18    |   700         |   650         |   DL      |   650
compz   |   xx7 |   yy19    |   700         |   200         |   DL      |   50
compz   |   xx7 |   yy20    |   700         |   400         |   XL      |   0

我当前获得的数据集和 new_field 输出

name    |val_id |fac_id     |   our_val_amt |   val_against |   product |   new_field
compx   |   xx1 |   yy1     |   424         |   418         |   XL      |   290
compx   |   xx1 |   yy2     |   424         |   134         |   CL      |   134
compx   |   xx2 |   yy3     |   472         |   60          |   DL      |   368
compx   |   xx2 |   yy4     |   472         |   104         |   CL      |   104
compx   |   xx3 |   yy5     |   490         |   50          |   XL      |   0
compx   |   xx3 |   yy6     |   490         |   500         |   CL      |   490
compx   |   xx3 |   yy7     |   490         |   200         |   DL      |   0
compx   |   xx4 |   yy8     |   510         |   200         |   CL      |   200
compx   |   xx4 |   yy9     |   510         |   300         |   CL      |   300
compx   |   xx4 |   yy10    |   510         |   50          |   CL      |   10
compy   |   xx5 |   yy11    |   510         |   200         |   CL      |   200
compy   |   xx5 |   yy12    |   510         |   300         |   CL      |   300
compy   |   xx5 |   yy12    |   510         |   50          |   CL      |   10
compy   |   xx5 |   yy13    |   510         |   30          |   DL      |   10
compz   |   xx6 |   yy14    |   350         |   200         |   CL      |   200
compz   |   xx6 |   yy15    |   350         |   100         |   CL      |   100
compz   |   xx6 |   yy16    |   350         |   50          |   XL      |   50
compz   |   xx6 |   yy17    |   350         |   50          |   DL      |   50
compz   |   xx7 |   yy18    |   700         |   650         |   DL      |   700
compz   |   xx7 |   yy19    |   700         |   200         |   DL      |   700
compz   |   xx7 |   yy20    |   700         |   400         |   XL      |   700

【问题讨论】：

您的解释与 val_id="xx7" 的预期值 (650、50、0) 冲突。在描述中，如果 product !="CL"，您希望从 our_val_amt 中减去 new_field 值；但是在预期的输出中你没有从 700 中减去任何东西；而是复制了val_against。这还不清楚。你如何计算 xx7 的值？
嗨，Azhar，很抱歉造成您的困惑。如果产品“CL”在 [val_id] 内，我预计会发生这种情况。 val_id = 'xx7' 的示例没有 [product] = 'CL'。如果 [product] 没有“CL”，它只需要在每个 [val_id] 之间分配 [our_val_amt]。例如 val_id = 'xx7' our_val_amt =700 这在插入的第一行 (650) 中展开，然后剩下 700 - 650 = 50 在下一行中插入，根据示例，以下为 0。
很抱歉您正在查看代码输出的内容。请看《Dataset和new_field输出的尝试实现》。

标签： python pandas dataframe analytics data-analysis

【解决方案1】：

该代码在存在多个非 CL 产品的极端情况下失败，并且必须传播 our_val_amt，以便最后几个产品可能获得 0 值。我在最后一个/前面的问题中询问了这个用例；但它没有得到答复。你可能有一些像这样的极端情况，需要进行全面的测试。

以下是更新后的逻辑。在每个处理行之前添加 cmets 以解释它的作用。

df = pd.DataFrame(data=[["compx","xx1","yy2",424,134,"CL",134],["compx","xx1","yy1",424,418,"XL",290],["compx","xx2","yy4",472,104,"CL",104],["compx","xx2","yy3",472,60,"DL",368],["compx","xx3","yy6",490,500,"CL",490],["compx","xx3","yy5",490,50,"XL",0],["compx","xx3","yy7",490,200,"DL",0],["compx","xx4","yy8",510,200,"CL",200],["compx","xx4","yy9",510,300,"CL",300],["compx","xx4","yy10",510,50,"CL",10],["compy","xx5","yy11",510,200,"CL",200],["compy","xx5","yy12",510,300,"CL",300],["compy","xx5","yy12",510,50,"CL",10],["compy","xx5","yy13",510,30,"DL",0],["compz","xx6","yy14",350,200,"CL",200],["compz","xx6","yy15",350,100,"CL",100],["compz","xx6","yy16",350,50,"XL",50],["compz","xx6","yy17",350,50,"DL",0],["compz","xx7","yy18",700,650,"DL",650],["compz","xx7","yy19",700,200,"DL",50],["compz","xx7","yy20",700,400,"XL",0]], columns=["name","val_id","fac_id","our_val_amt","val_against","product","new_field_expected"])


# Compute tuple of "our_val_amt", "val_against" and "product" for easy processing as one column. It is hard to process multiple columns with "transform()".
df["the_tuple"] = df[["our_val_amt", "val_against", "product"]].apply(tuple, axis=1)

def compute_new_field_for_cl(g):
  # df_g is a tuple ("our_val_amt", "val_against", "product") indexed as (0, 1, 2).
  df_g = g.apply(pd.Series)
  df_g["new_field"] = df_g.apply(lambda row: min(row[0], row[1]) if row[2] == "CL" else 0, axis=1)
  # Cumulative sum for comparison
  df_g["cumsum"] = df_g["new_field"].cumsum()
  # Previous row's sum for comparison
  df_g["cumsum_prev"] = df_g["cumsum"].shift(periods=1)
  # if our_val_amt >= sum then use min(our_val_amt, val_against)
  # else if our_val_amt < sum then take partial of first record such that our_val_amt == sum else take `0` for the rest records
  df_g["new_field"] = df_g.apply(lambda row: 0 if row["cumsum_prev"] > row[0] else row[0] - row["cumsum_prev"] if row["cumsum"] > row[0] else row["new_field"], axis=1)
  return df_g["new_field"]

# Apply above function and compute new field values for "CL".
df["new_field"] = df.groupby("val_id")[["the_tuple"]].transform(compute_new_field_for_cl)


# Re-compute tuple of "our_val_amt", "val_against", "new_field" and "product".
df["the_tuple"] = df[["our_val_amt", "val_against", "new_field", "product"]].apply(tuple, axis=1)

def compute_new_field_for_not_cl(g):
  # df_g is a tuple ("our_val_amt", "val_against", "new_field", "product") indexed as (0, 1, 2, 3).
  df_g = g.apply(pd.Series)
  # print(df_g)
  cl_sum = df_g[df_g[3] == "CL"][2].sum()
  if cl_sum > 0:
    df_g["new_field"] = df_g.where(df_g[3] != "CL")[0] - df_g[df_g[3] == "CL"][2].sum()
    df_g["new_field"] = df_g["new_field"].fillna(df_g[2])
    # Cumulative sum for comparison
    df_g["cumsum"] = df_g["new_field"].cumsum()
    # if our_val_amt < sum then take diff (our_val_amt - sum) else take `0` for the rest records
    df_g["new_field"] = df_g.apply(lambda row: 0 if row["cumsum"] > row[0] else row["new_field"], axis=1)
  else:
    df_g["new_field"] = df_g.apply(lambda row: min(row[0], row[1]) if row[3] != "CL" else row[2], axis=1)
    # Cumulative sum for comparison
    df_g["cumsum"] = df_g["new_field"].cumsum()
    # Previous row's sum for comparison
    df_g["cumsum_prev"] = df_g["cumsum"].shift(periods=1)
    # if our_val_amt >= sum then use min(our_val_amt, val_against)
    # else if our_val_amt < sum then take partial of first record such that our_val_amt == sum else take `0` for the rest records
    df_g["new_field"] = df_g.apply(lambda row: 0 if row["cumsum_prev"] > row[0] else row[0] - row["cumsum_prev"] if row["cumsum"] > row[0] else row["new_field"], axis=1)

  return df_g["new_field"]

# Apply above function and compute new field values for "CL".
df["new_field"] = df.groupby("val_id")[["the_tuple"]].transform(compute_new_field_for_not_cl)


df = df.drop("the_tuple", axis=1)

print(df)

输出：

     name val_id fac_id  our_val_amt  val_against product  new_field_expected  new_field
0   compx    xx1    yy2          424          134      CL                 134     134.00
1   compx    xx1    yy1          424          418      XL                 290     290.00
2   compx    xx2    yy4          472          104      CL                 104     104.00
3   compx    xx2    yy3          472           60      DL                 368     368.00
4   compx    xx3    yy6          490          500      CL                 490     490.00
5   compx    xx3    yy5          490           50      XL                   0       0.00
6   compx    xx3    yy7          490          200      DL                   0       0.00
7   compx    xx4    yy8          510          200      CL                 200     200.00
8   compx    xx4    yy9          510          300      CL                 300     300.00
9   compx    xx4   yy10          510           50      CL                  10      10.00
10  compy    xx5   yy11          510          200      CL                 200     200.00
11  compy    xx5   yy12          510          300      CL                 300     300.00
12  compy    xx5   yy12          510           50      CL                  10      10.00
13  compy    xx5   yy13          510           30      DL                   0       0.00
14  compz    xx6   yy14          350          200      CL                 200     200.00
15  compz    xx6   yy15          350          100      CL                 100     100.00
16  compz    xx6   yy16          350           50      XL                  50      50.00
17  compz    xx6   yy17          350           50      DL                   0       0.00
18  compz    xx7   yy18          700          650      DL                 650     650.00
19  compz    xx7   yy19          700          200      DL                  50      50.00
20  compz    xx7   yy20          700          400      XL                   0       0.00

【讨论】：