【Posted】:2023-02-01 15:08:08
【Question】:
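I am trying to compute the "new_field" column shown below by looping three times over the "name", "val_id" and "fac_id" columns, using the following conditions.
1. Within each "val_id" loop, if "product" == "CL", take the minimum of "val_against" and "our_val_amt", e.g. min(val_against (134), our_val_amt (424)), so "new_field" = 134. In addition, if the running sum of new_field exceeds "our_val_amt", subtract the sum from "our_val_amt". For example, for val_id "xx4", (200 + 300 + 50) = 550 exceeds our_val_amt = 510, so new_field = 510 - 500 (i.e. 200 + 300, the sum reached before our_val_amt is exceeded) = 10.
2. If product != "CL" and the row is in the same "val_id" group, the remainder left after subtracting from "our_val_amt" is inserted into "new_field". For example, "our_val_amt" (424) - the value from step 1 (134) = 290. This is inserted into "new_field" in the row above.
If [product] has no "CL", [our_val_amt] only needs to be distributed across the rows of each [val_id]. For example, for val_id = "xx7", our_val_amt = 700: this is spread into the first row (650), then the remaining 700 - 650 = 50 is inserted into the next row, and the row after that gets 0, as in the example.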
3. Repeat the steps for val_id xx2. The new_field calculation gives CL = 104 and DL = 472 - 104 = 368.
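Rule 1 reads like a capped running allocation: each "CL" row takes the smaller of its own val_against and whatever is still left of our_val_amt. A minimal sketch of that arithmetic in plain Python follows. The function name `allocate_in_order` and its parameters are hypothetical, chosen only for illustration.

```python
def allocate_in_order(total, caps):
    """Walk the rows in order; each row takes min(its cap, what is left)."""
    remaining = total
    allocated = []
    for cap in caps:
        take = min(cap, remaining)
        allocated.append(take)
        remaining -= take
    return allocated


# val_id "xx4": our_val_amt = 510, val_against = 200, 300, 50
print(allocate_in_order(510, [200, 300, 50]))
# val_id "xx3": the single CL row has val_against = 500 but our_val_amt = 490
print(allocate_in_order(490, [500]))
```

For xx4 the third row is left with 510 - (200 + 300) = 10, which matches the worked example in rule 1.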
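Currently the output is correct for "name" == compx (rows 0 - 9), and after that it starts to compute incorrectly. I am also not sure how this code works, since I am new to Pandas, so I would appreciate it if someone could explain how the defined functions work.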
import pandas as pd

df = pd.DataFrame(data=[
    ["compx", "xx1", "yy1", 424, 418, "XL"], ["compx", "xx1", "yy2", 424, 134, "CL"],
    ["compx", "xx2", "yy3", 472, 60, "DL"], ["compx", "xx2", "yy4", 472, 104, "CL"],
    ["compx", "xx3", "yy5", 490, 50, "XL"], ["compx", "xx3", "yy6", 490, 500, "CL"], ["compx", "xx3", "yy7", 490, 200, "DL"],
    ["compx", "xx4", "yy8", 510, 200, "CL"], ["compx", "xx4", "yy9", 510, 300, "CL"], ["compx", "xx4", "yy10", 510, 50, "CL"],
    ["compy", "xx5", "yy11", 510, 200, "CL"], ["compy", "xx5", "yy12", 510, 300, "CL"], ["compy", "xx5", "yy12", 510, 50, "CL"], ["compy", "xx5", "yy13", 510, 30, "DL"],
    ["compz", "xx6", "yy14", 350, 200, "CL"], ["compz", "xx6", "yy15", 350, 100, "CL"], ["compz", "xx6", "yy16", 350, 50, "XL"], ["compz", "xx6", "yy17", 350, 50, "DL"],
    ["compz", "xx7", "yy18", 700, 650, "DL"], ["compz", "xx7", "yy19", 700, 200, "DL"], ["compz", "xx7", "yy20", 700, 400, "XL"],
], columns=["name", "val_id", "fac_id", "our_val_amt", "val_against", "product"])
df
# Compute tuple of "our_val_amt", "val_against" and "product" for easy processing as one column. It is hard to process multiple columns with "transform()".
df["the_tuple"] = df[["our_val_amt", "val_against", "product"]].apply(tuple, axis=1)

def compute_new_field_for_cl(g):
    # g holds tuples ("our_val_amt", "val_against", "product"); df_g expands them into columns (0, 1, 2).
    df_g = g.apply(pd.Series)
    df_g["new_field"] = df_g.apply(lambda row: min(row[0], row[1]) if row[2] == "CL" else 0, axis=1)
    df_g["cumsum"] = df_g["new_field"].cumsum()
    df_g["new_field"] = df_g.apply(lambda row: 0 if row["cumsum"] > row[0] else row["new_field"], axis=1)
    df_g["max_cumsum"] = df_g["new_field"].cumsum()
    df_g["new_field"] = df_g.apply(lambda row: row[0] - row["max_cumsum"] if row["cumsum"] > row[0] else row["new_field"], axis=1)
    return df_g["new_field"]

# Apply above function and compute new field values for "CL".
df["new_field"] = df.groupby("val_id")[["the_tuple"]].transform(compute_new_field_for_cl)

# Re-compute tuple of "our_val_amt", "new_field" and "product".
df["the_tuple"] = df[["our_val_amt", "new_field", "product"]].apply(tuple, axis=1)

def compute_new_field_for_not_cl(g):
    # g holds tuples ("our_val_amt", "new_field", "product"); df_g expands them into columns (0, 1, 2).
    df_g = g.apply(pd.Series)
    result_sr = df_g.where(df_g[2] != "CL")[0] - df_g[df_g[2] == "CL"][1].sum()
    result_sr = result_sr.fillna(0) + df_g[1]
    return result_sr

# Apply above function and compute new field values for rows that are not "CL".
df["new_field"] = df.groupby("val_id")[["the_tuple"]].transform(compute_new_field_for_not_cl)
df = df.drop("the_tuple", axis=1)
df
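Since the question asks how the functions above "think", it may help to look at the one step they both start with. Each function receives the "the_tuple" values of one val_id group, and `g.apply(pd.Series)` expands every tuple into its own row of a small DataFrame whose columns are labelled 0, 1 and 2. The sketch below illustrates this on a three-row sample taken from the question's data; the names `sample`, `expanded` and `shapes` are made up for the illustration, and the loop only approximates what `transform` passes in.

```python
import pandas as pd

sample = pd.DataFrame({
    "val_id": ["xx1", "xx1", "xx2"],
    "the_tuple": [(424, 418, "XL"), (424, 134, "CL"), (472, 60, "DL")],
})

# Expanding a Series of tuples: one column per tuple position.
expanded = sample["the_tuple"].apply(pd.Series)
print(expanded)

# Iterating the groupby shows the per-group pieces that a function
# passed to transform() would work on, one val_id at a time.
shapes = {}
for val_id, g in sample.groupby("val_id")["the_tuple"]:
    shapes[val_id] = g.apply(pd.Series).shape
print(shapes)
```

Inside the functions, `row[0]`, `row[1]` and `row[2]` therefore refer to the three tuple positions, and whatever Series the function returns is written back against the rows of that group.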
Dataset and the new_field output I am trying to achieve:
name | val_id | fac_id | our_val_amt | val_against | product | new_field
compx | xx1 | yy1 | 424 | 418 | XL | 290
compx | xx1 | yy2 | 424 | 134 | CL | 134
compx | xx2 | yy3 | 472 | 60 | DL | 368
compx | xx2 | yy4 | 472 | 104 | CL | 104
compx | xx3 | yy5 | 490 | 50 | XL | 0
compx | xx3 | yy6 | 490 | 500 | CL | 490
compx | xx3 | yy7 | 490 | 200 | DL | 0
compx | xx4 | yy8 | 510 | 200 | CL | 200
compx | xx4 | yy9 | 510 | 300 | CL | 300
compx | xx4 | yy10 | 510 | 50 | CL | 10
compy | xx5 | yy11 | 510 | 200 | CL | 200
compy | xx5 | yy12 | 510 | 300 | CL | 300
compy | xx5 | yy12 | 510 | 50 | CL | 10
compy | xx5 | yy13 | 510 | 30 | DL | 0
compz | xx6 | yy14 | 350 | 200 | CL | 200
compz | xx6 | yy15 | 350 | 100 | CL | 100
compz | xx6 | yy16 | 350 | 50 | XL | 50
compz | xx6 | yy17 | 350 | 50 | DL | 0
compz | xx7 | yy18 | 700 | 650 | DL | 650
compz | xx7 | yy19 | 700 | 200 | DL | 50
compz | xx7 | yy20 | 700 | 400 | XL | 0
Dataset and the new_field output I currently get:
name | val_id | fac_id | our_val_amt | val_against | product | new_field
compx | xx1 | yy1 | 424 | 418 | XL | 290
compx | xx1 | yy2 | 424 | 134 | CL | 134
compx | xx2 | yy3 | 472 | 60 | DL | 368
compx | xx2 | yy4 | 472 | 104 | CL | 104
compx | xx3 | yy5 | 490 | 50 | XL | 0
compx | xx3 | yy6 | 490 | 500 | CL | 490
compx | xx3 | yy7 | 490 | 200 | DL | 0
compx | xx4 | yy8 | 510 | 200 | CL | 200
compx | xx4 | yy9 | 510 | 300 | CL | 300
compx | xx4 | yy10 | 510 | 50 | CL | 10
compy | xx5 | yy11 | 510 | 200 | CL | 200
compy | xx5 | yy12 | 510 | 300 | CL | 300
compy | xx5 | yy12 | 510 | 50 | CL | 10
compy | xx5 | yy13 | 510 | 30 | DL | 10
compz | xx6 | yy14 | 350 | 200 | CL | 200
compz | xx6 | yy15 | 350 | 100 | CL | 100
compz | xx6 | yy16 | 350 | 50 | XL | 50
compz | xx6 | yy17 | 350 | 50 | DL | 50
compz | xx7 | yy18 | 700 | 650 | DL | 700
compz | xx7 | yy19 | 700 | 200 | DL | 700
compz | xx7 | yy20 | 700 | 400 | XL | 700
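The rows that differ from the expected table (xx5, xx6, xx7) seem to trace back to two spots in the posted functions. The sketch below re-creates both calculations with named columns instead of tuples, so it approximates the posted code rather than copying it. The frames `xx5` and `xx6` hold values taken from the question's data, and the variable names are hypothetical.

```python
import pandas as pd

# First function, group xx5: the overflow test uses the running sum,
# which stays above our_val_amt for every row after the overflow,
# including the trailing DL row.
xx5 = pd.DataFrame({
    "our_val_amt": [510, 510, 510, 510],
    "val_against": [200, 300, 50, 30],
    "product": ["CL", "CL", "CL", "DL"],
})
is_cl = xx5["product"] == "CL"
first_pass = xx5[["our_val_amt", "val_against"]].min(axis=1).where(is_cl, 0)
overflow = first_pass.cumsum() > xx5["our_val_amt"]
print(overflow.tolist())  # the DL row is flagged as well

# Second function, group xx6: the same remainder is written to every
# non-CL row rather than being handed out once.
xx6 = pd.DataFrame({
    "our_val_amt": [350, 350, 350, 350],
    "new_field": [200, 100, 0, 0],
    "product": ["CL", "CL", "XL", "DL"],
})
not_cl = xx6["product"] != "CL"
cl_total = xx6.loc[~not_cl, "new_field"].sum()
second_pass = (xx6["our_val_amt"].where(not_cl) - cl_total).fillna(0) + xx6["new_field"]
print(second_pass.tolist())
```

The same broadcast would explain xx7: with no "CL" rows the CL total is 0, so each row receives the full 700.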
【Comments】:
-
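Your explanation conflicts with the expected values (650, 50, 0) for val_id="xx7". In the description, when product != "CL" you want to subtract the new_field values from our_val_amt; but in the expected output you do not subtract anything from 700; instead you copy val_against. This is unclear. How do you compute the values for xx7?
-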
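Hi Azhar, sorry for the confusion. That is what I expect to happen when a "CL" product is present within the [val_id]. The val_id = "xx7" example has no [product] = "CL". If [product] has no "CL", [our_val_amt] only needs to be distributed across the rows of each [val_id]. For example, for val_id = "xx7", our_val_amt = 700: this is spread into the first row (650), then the remaining 700 - 650 = 50 is inserted into the next row, and the row after that gets 0, as in the example.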
-
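Sorry, you are looking at what the code currently outputs. Please look at the expected table instead: the dataset and new_field output I am trying to achieve.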
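One reading of the rules that is consistent with the expected table can be sketched with a cumulative sum instead of tuples and row-wise lambdas. This is only a sketch under stated assumptions, not a confirmed solution. The helper names `capped_fill` and `allocate` are hypothetical. The rule that the whole remainder goes to the first non-CL row of a group containing "CL" is inferred from the expected table (xx2 and xx6), not stated in the question. The "name" and "fac_id" columns are left out because the rules do not use them.

```python
import numpy as np
import pandas as pd

# (val_id, our_val_amt, val_against, product), copied from the question.
rows = [
    ("xx1", 424, 418, "XL"), ("xx1", 424, 134, "CL"),
    ("xx2", 472, 60, "DL"), ("xx2", 472, 104, "CL"),
    ("xx3", 490, 50, "XL"), ("xx3", 490, 500, "CL"), ("xx3", 490, 200, "DL"),
    ("xx4", 510, 200, "CL"), ("xx4", 510, 300, "CL"), ("xx4", 510, 50, "CL"),
    ("xx5", 510, 200, "CL"), ("xx5", 510, 300, "CL"), ("xx5", 510, 50, "CL"),
    ("xx5", 510, 30, "DL"),
    ("xx6", 350, 200, "CL"), ("xx6", 350, 100, "CL"), ("xx6", 350, 50, "XL"),
    ("xx6", 350, 50, "DL"),
    ("xx7", 700, 650, "DL"), ("xx7", 700, 200, "DL"), ("xx7", 700, 400, "XL"),
]
df = pd.DataFrame(rows, columns=["val_id", "our_val_amt", "val_against", "product"])


def capped_fill(caps, total):
    """Give each row min(its cap, what is left of total), in row order."""
    offered_before = caps.cumsum() - caps
    left = (total - offered_before).clip(lower=0)
    return np.minimum(caps, left)


def allocate(group):
    total = group["our_val_amt"].iloc[0]
    is_cl = group["product"] == "CL"
    if not is_cl.any():
        # No CL rows: spread the total over all rows, capped by val_against.
        return capped_fill(group["val_against"], total)
    out = pd.Series(0, index=group.index)
    cl_caps = group.loc[is_cl, "val_against"]
    out.loc[cl_caps.index] = capped_fill(cl_caps, total)
    non_cl = group.index[(~is_cl).to_numpy()]
    if len(non_cl) > 0:
        # ASSUMPTION: the whole remainder goes to the first non-CL row.
        out.loc[non_cl[0]] = total - out.sum()
    return out


# One Series per val_id group, stitched back together on the row index.
parts = [allocate(g) for _, g in df.groupby("val_id", sort=False)]
df["new_field"] = pd.concat(parts)
print(df)
```

On the sample data this gives the new_field values listed in the expected table. If the remainder should instead be capped by each non-CL row's val_against, xx2 would come out as 60 rather than 368, so that rule is worth confirming before relying on the sketch.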
Tags: python pandas dataframe analytics data-analysis