【Posted】:2021-01-11 10:13:42
【Question】:
# Pandas code
temp = df_merge[['subscription_id', 'cancelleddate', 'subscriptionstartdate', 'termenddate']].drop_duplicates()
df_merge['mean_cancelled_sub_duration'] = (temp['cancelleddate']-temp['subscriptionstartdate']).dt.days.dropna().mean()/ 365
df_merge['mean_sub_duration'] = (temp['termenddate']-temp['subscriptionstartdate']).dt.days.dropna().mean()/365
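How can I implement the same logic as the pandas code above in PySpark? I tried the conversion below, but it does not give the right result: rows get dropped and the computed means are wrong.
The columns with "date" in their names are of date type.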
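To make the intended behaviour concrete, here is a minimal plain-Python sketch of what the pandas snippet computes: one global mean per duration column, where each column skips only its own missing values. The sample rows and the helper `mean_years` are hypothetical, made up for illustration. The last lines show how the result changes if rows missing either duration are dropped first, which is what a joint null drop does.

```python
from datetime import date
from statistics import mean

# Hypothetical rows: (subscription_id, cancelleddate, subscriptionstartdate, termenddate)
rows = [
    (1, date(2020, 7, 1), date(2020, 1, 1), None),
    (2, None, date(2020, 1, 1), date(2021, 1, 1)),
    (3, date(2020, 12, 31), date(2020, 1, 1), date(2021, 1, 1)),
]

def mean_years(pairs):
    """Mean of (end - start) in years, skipping pairs with a missing date."""
    days = [(end - start).days for end, start in pairs
            if end is not None and start is not None]
    return mean(days) / 365

# pandas-style: each column drops only its own nulls, then one global mean.
mean_cancelled = mean_years((c, s) for _, c, s, _ in rows)
mean_term = mean_years((t, s) for _, _, s, t in rows)

# Joint drop: keep only rows where both durations can be computed.
complete = [r for r in rows if r[1] is not None and r[3] is not None]
mean_cancelled_joint = mean_years((c, s) for _, c, s, _ in complete)

print(mean_cancelled, mean_term, mean_cancelled_joint)
```

With this sample data the per-column mean uses subscriptions 1 and 3 for the cancelled duration, while the joint drop keeps only subscription 3, so the two results differ.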
# Failed PySpark conversion
from pyspark.sql import Window
from pyspark.sql.functions import broadcast, col, datediff, mean
temp = df_merge.select('subscription_id', 'cancelleddate', 'subscriptionstartdate', 'termenddate').dropDuplicates()
temp = temp.withColumn("cancelled_sub_duration", datediff(temp.cancelleddate,temp.subscriptionstartdate)).withColumn("sub_duration", datediff(temp.termenddate,temp.subscriptionstartdate))
temp = temp.na.drop(subset=['cancelled_sub_duration','sub_duration'])
spec3 = Window.partitionBy("subscription_id")
temp = temp.withColumn('mean_cancelled_sub_duration',(mean("cancelled_sub_duration").over(spec3))/365).withColumn('mean_sub_duration',(mean("sub_duration").over(spec3))/365)
temp = temp.select(col('subscription_id').alias('subsid'), col('mean_cancelled_sub_duration'), col('mean_sub_duration'))
df_merge = df_merge.join(broadcast(temp), df_merge.subscription_id==temp.subsid,"left").drop(col('subsid'))
【Discussion】:
Tags: python pandas apache-spark pyspark apache-spark-sql