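【Posted】: 2021-07-22 00:01:40
【Question】:
I need to create columns that hold the natural log of other columns in my dataset. There are too many columns (features) to do this by hand, so I want to automate it, but the for loop I tried did not work. This is the list of columns, which I call "features":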
features=['price_seat',
'days_length_of_stay',
'days_to_departure',
'distance',
'unit_cost_brute',
'unit_cost_clip',
'unit_cost_mean',
'unit_cost',
'org_country_gdp_per_capita',
'dst_country_gdp_per_capita',
'competing_airline',
#'yield',
'price_seat_cluster',
'yield_cluster',
'low_cost',
#'PAX',
#'REVENUE',
'LOCAL_PAX',
'BEHIND_PAX',
'BEYOND_PAX',
'BRIDGE_PAX',
'LOCAL_REVENUE',
'BEHIND_REVENUE',
'BEYOND_REVENUE',
'BRIDGE_REVENUE',
'REVENUE_WITH_TAXES',
'LOCAL_REVENUE_WITH_TAXES',
'BRIDGE_REVENUE_WITH_TAXES',
'BEHIND_REVENUE_WITH_TAXES',
'BEYOND_REVENUE_WITH_TAXES',
'PERIOD',
'n_flights_month',
'avg_flights_month',
'flights_month',
#'pax_flight',
'revenue_flight',
#'revenue_pax',
'WTI',
'Brent',
'Jet_fuel',
'OilPrice_USD_bbl',
'FuelPrice_USD_USgal',
'Density',
'Cf_USD_kg',
'd_fr24',
'distance_fr']
This is the code I am using, and it works:
df=df9.withColumn('ln_price_seat', F.log('price_seat'))\
.withColumn('ln_days_length_of_stay',F.log('days_length_of_stay'))\
.withColumn('ln_days_to_departure',F.log('days_to_departure'))\
.withColumn('ln_distance',F.log('distance'))\
.withColumn('ln_unit_cost_brute',F.log('unit_cost_brute'))\
.withColumn('ln_unit_cost_clip',F.log('unit_cost_clip'))\
.withColumn('ln_unit_cost_mean',F.log('unit_cost_mean'))
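But this is too "manual" for so many features, and I may change the features in the future, so I need something that can handle that. On top of that, my DataFrame is very large, around 50M or more. Before this, I managed to get the following procedure to run: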
def get_log_features(self,df):
    features=['price_seat',
        'days_length_of_stay',
        'days_to_departure',
        'distance',
        'unit_cost_brute',
        'unit_cost_clip',
        'unit_cost_mean',
        'unit_cost',
        'org_country_gdp_per_capita',
        'dst_country_gdp_per_capita',
        'competing_airline',
        'price_seat_cluster',
        'yield_cluster',
        'low_cost',
        'LOCAL_PAX',
        'BEHIND_PAX',
        'BEYOND_PAX',
        'BRIDGE_PAX',
        'LOCAL_REVENUE',
        'BEHIND_REVENUE',
        'BEYOND_REVENUE',
        'BRIDGE_REVENUE',
        'REVENUE_WITH_TAXES',
        'LOCAL_REVENUE_WITH_TAXES',
        'BRIDGE_REVENUE_WITH_TAXES',
        'BEHIND_REVENUE_WITH_TAXES',
        'BEYOND_REVENUE_WITH_TAXES',
        'PERIOD',
        'n_flights_month',
        'avg_flights_month',
        'flights_month',
        'revenue_flight',
        'WTI',
        'Brent',
        'Jet_fuel',
        'OilPrice_USD_bbl',
        'FuelPrice_USD_USgal',
        'Density',
        'Cf_USD_kg',
        'd_fr24',
        'distance_fr']
    features_for_log=features
    df_log= (df.select(*features_for_log,'org_airport','dst_airport','d_year','d_month'))
    for new_col in features_for_log:
        df_log = df_log.withColumn('ln_'+ new_col, F.log(F.col(new_col)))
    df_log= (df_log.drop(*features_for_log))
    df=(df.join(df_log,['org_airport','dst_airport','d_year','d_month'],how='outer'))
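But when I call this function it takes hours; it is too computationally expensive. That is why I would like to "append" the natural-log columns defined by the feature list directly to the original DataFrame, which would probably be cheaper.
Do you have any suggestions?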
【Discussion】: