【Posted】: 2021-12-21 08:46:18
【Question】:
I have a df that looks like one of the following two cases:
- The mobile numbers are different:
+------------+-------------------+----------+-------------------+------------+
|applicantkey|     first_reg_date|utmcontent| latest_signin_date|mobilenumber|
+------------+-------------------+----------+-------------------+------------+
|        1234|2021-01-03 06:05:43|   Android|2021-01-03 06:05:43|         987|
|        1234|2021-04-03 07:05:43|   Android|2021-10-03 06:05:43|         986|
+------------+-------------------+----------+-------------------+------------+
- The mobile numbers are the same:
+------------+-------------------+----------+-------------------+------------+
|applicantkey|     first_reg_date|utmcontent| latest_signin_date|mobilenumber|
+------------+-------------------+----------+-------------------+------------+
|        1234|2021-01-03 06:05:43|   Android|2021-01-03 06:05:43|         987|
|        1234|2021-04-03 07:05:43|   Android|2021-10-03 06:05:43|         987|
+------------+-------------------+----------+-------------------+------------+
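Now I want to take the min of first_reg_date and the max of latest_signin_date and replace the values of those two columns in the dataset with them. So my expected output should look like this: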
+------------+-------------------+----------+-------------------+------------+
|applicantkey|first_reg_date |utmcontent|latest_signin_date |mobilenumber|
+------------+-------------------+----------+-------------------+------------+
|1234 |2021-01-03 06:05:43|Android |2021-10-03 06:05:43|987 |
|1234 |2021-01-03 06:05:43|Android |2021-10-03 06:05:43|986 |
+------------+-------------------+----------+-------------------+------------+
I tried the following query, but it gives the outputs shown below:
spark.sql(
"select applicantkey,min(first_reg_date) first_reg_date,utmcontent,max(latest_signin_date) latest_signin_date,mobilenumber from df group by applicantkey,utmcontent,mobilenumber").show(truncate=False)
+------------+-------------------+----------+-------------------+------------+
|applicantkey|first_reg_date |utmcontent|latest_signin_date |mobilenumber|
+------------+-------------------+----------+-------------------+------------+
|1234 |2021-01-03 06:05:43|Android |2021-01-03 06:05:43|987 |
|1234 |2021-04-03 07:05:43|Android |2021-10-03 06:05:43|986 |
+------------+-------------------+----------+-------------------+------------+
and, when the mobile numbers are the same:
+------------+-------------------+----------+-------------------+------------+
|applicantkey|first_reg_date |utmcontent|latest_signin_date |mobilenumber|
+------------+-------------------+----------+-------------------+------------+
|1234 |2021-01-03 06:05:43|Android |2021-10-03 06:05:43|987 |
+------------+-------------------+----------+-------------------+------------+
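The second output is correct, but the first one is wrong.
So I tried the following approach instead. It gives me the correct min and max values, but when the mobile numbers are the same I get duplicate rows: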
df1 = spark.sql(
"select applicantkey,min(first_reg_date) first_reg_date, max(latest_signin_date) latest_signin_date from df group by applicantkey")
df2 = spark.sql("select applicantkey,utmcontent,mobilenumber from df")
df3 = df1.join(df2, "applicantkey", "left_outer")
df3.show(truncate=False)
+------------+-------------------+-------------------+----------+------------+
|applicantkey|first_reg_date |latest_signin_date |utmcontent|mobilenumber|
+------------+-------------------+-------------------+----------+------------+
|1234 |2021-01-03 06:05:43|2021-10-03 06:05:43|Android |987 |
|1234 |2021-01-03 06:05:43|2021-10-03 06:05:43|Android |987 |
+------------+-------------------+-------------------+----------+------------+
I don't want to use DISTINCT() at the end. So what exactly am I doing wrong?
【Comments】:
- For records with the same applicantkey, I want min(first_reg_date) and max(latest_signin_date); the remaining values should be picked up as-is.
Tags: sql apache-spark pyspark apache-spark-sql