【Posted】:2021-12-16 06:00:11
【Question】:
Here is my code:
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
spark_df1.count()  # This command took around 1.40 min to execute
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
test_data = spark_df1.sample(fraction=0.001)
spark_df2 = spark_df1.subtract(test_data)
spark_df2.count()  # This command takes more than 20 min to execute. Can anyone explain why
# the same count command takes so much longer here?
Why does count() take so much longer after the subtract command than before it?
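A likely explanation, offered as a sketch rather than a confirmed diagnosis: subtract() behaves like SQL EXCEPT DISTINCT, so the second count() has to scan the source for both sides of the operation, shuffle the data, and compare whole rows, whereas the first count() is a single scan. A gzip-compressed CSV is also not splittable, so each scan of the file runs as a single task. If the goal is simply to hold out about 0.1% of the rows, randomSplit() can produce both pieces without a set difference. The helper name split_holdout and the seed value below are assumptions for illustration, not part of the original code or of PySpark.

```python
def split_holdout(df, fraction=0.001, seed=42):
    """Split df into (remainder, holdout) without a set difference.

    df is assumed to be a pyspark.sql.DataFrame. split_holdout is a
    hypothetical helper name; the seed default is arbitrary.
    """
    # randomSplit assigns every row to exactly one of the outputs using
    # the given weights, so the two results are disjoint by construction
    # and no subtract() (EXCEPT DISTINCT) is needed afterwards.
    remainder, holdout = df.randomSplit([1.0 - fraction, fraction], seed=seed)
    return remainder, holdout
```

With the DataFrame from the question, this would be called as spark_df2, test_data = split_holdout(spark_df1). Note that the result is not identical to the subtract() version: subtract() also removes duplicate rows and every row equal to a sampled row, while randomSplit() keeps duplicates. Calling spark_df2.explain() on either version shows the physical plan and should make the extra shuffle visible.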
【Comments】:
- Since this is a performance-related question, please follow this guide to structure the question better.
Tags: pyspark databricks