【发布时间】:2021-04-21 16:53:54
【问题描述】:
我有一个包含模型信息(df_model,约 700 行,约 50 种类型)的数据框和一个包含组件信息(df_components,约 400.000 行)的数据框。
df_model:模型类型列表,每个模型类型有几行带有特定组件:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
1 E2 W W IT NaN NaN AC-8
2 E2 P 13.3 NaN NaN NaN NaN
3 P1 C/C C IT NaN TU NaN
4 P1 GR G NAI XSM NaN NaN
df_component:可用组件列表+每个组件的交付时间:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
2 W W IT NaN NaN AC-8 CM3 3
3 P 13.3 NaN NaN NaN NaN CM4 13
4 P 15.3 NaN NaN NaN NaN CM5 20
5 B BR NaN KRL NaN NaN CM6 5
我想编写一个函数,该函数返回 - 对于特定模型类型 - model_components 中每个匹配组件的交付时间 + 最短交付时间。
例如:对于模型类型“E2”,“C/C C IT NaN GR NaN”有 2 个匹配组件。一个的交货时间是 8,其中一个是 9。8 9 被删除,因为我们想要每个模型组件的最佳交货时间。
问题:如果模型组件的列中有“NaN”,则它是通配符 = 相应组件列中的任何内容都应该匹配(如果这样,我可以将通配符的 np.nan 更改为其他任何内容帮助)。
例如df_model的第一行:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
列 "param2" = "NaN" 所以从 df_component 这 2 行匹配,因为列 "param2" 无关紧要:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
所以对于模型类型“E2”,最终结果应该是一个字典/JSON:
{
"leadtime_min": 13,
"CM1": 8,
"CM3": 3,
"CM4": 13,
}
到目前为止我的函数方法:
- 选择给定的模型类型
- 遍历每个模型组件,搜索匹配的组件
- 获取每个组件的最短交付时间(如果有多个匹配的组件。如果多个匹配的组件具有相同的交付时间,则它们都需要添加到结果 JSON 中)
我的函数中的查询代码返回所有行,而不仅仅是匹配的行,我需要解决这个问题。
到目前为止我的代码:
df_model = pd.DataFrame(
{
"type": ["E2", "E2", "E2", "P1", "P1"],
"mainclass": ["C/C", "W", "P", "C/C", "GR"],
"subclass": ["C", "W", "13.3", "C", "G"],
"param1": ["IT", "IT", np.nan, "IT", "NAI"],
"param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
"param3": ["GR", np.nan, np.nan, "TU", np.nan],
"param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
}
)
df_components = pd.DataFrame(
{
"mainclass": [
"C/C",
"C/C",
"W",
"P",
"P",
"B",
],
"subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
"param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
"param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
"param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
"param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
"component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
"delivery_time": [8, 9, 3, 13, 20, 5],
}
)
def get_delivery_time_best(model_type: str, model: pd.DataFrame, component: pd.DataFrame)
model = model.query("type == @model_type")
fitting_components = {"delivery_time_min": sys.maxsize}
delivery_times_all = []
for model_component in df_model.itertuples():
model_mainclass = model_component.mainclass
model_subclass = model_component.subclass
model_param1 = model_component.param1
model_param2 = model_component.param2
model_param3 = model_component.param3
model_param4 = model_component.param4
checked_components = components.query(
"""
mainclass == @model_mainclass and \
subclass == @model_subclass \
and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
"""
)
component_descriptions = checked_components"component_name" ].tolist()
delivery_times = checked_components["delivery_time"].tolist()
delivery_time_min = min(delivery_times )
delivery_times_all.append(delivery_time_min )
for component_description, delivery_time in zip(component_descriptions, delivery_times ):
if delivery_time == delivery_time_min :
fitting_components[component_description] = delivery_time
fitting_components["delivery_time_min "] = max(delivery_times_all )
print(fitting_components)
get_delivery_time_best(model_type="E2", model=df_model, components=df_components)
但这会返回 df_components 中的所有行,而不是 2 个匹配的行?
【问题讨论】: