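【Posted】: 2020-06-13 06:57:59
【Question】:
This is a follow-up to the question in "pyspark sql Add different Qtr start_date, End_date for exploded rows". Thanks.
I have the following DataFrame, in which one column (cf_values) holds an array.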
+---------------+------------+----------+----------+---+---------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+---------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-10-01|2020-12-31|
+---------------+------------+----------+----------+---+---------+----------+----------+
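I need a new column that holds one element of cf_values per row, added to the existing records with withColumn. If I use explode, every record is duplicated once per array element, so I end up with 16 records.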
+---------------+------------+----------+----------+---+---------+------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|cf_new|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+---------+------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-10-01|2020-12-30|
+---------------+------------+----------+----------+---+---------+------+----------+----------+
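The row count of 16 follows from explode pairing every quarter row with every array element (4 rows x 4 elements). A minimal plain-Python sketch of that difference, with no Spark involved; the names quarters, exploded and paired are illustrative only:

```python
from itertools import product

# The four quarter rows that already exist for the customer.
quarters = [
    ("2020-01-01", "2020-03-31"),
    ("2020-04-01", "2020-06-30"),
    ("2020-07-01", "2020-09-30"),
    ("2020-10-01", "2020-12-31"),
]
cf_values = [4, 4, 4, 3]

# explode-like behaviour: each row is repeated once per array element.
exploded = [(s, e, v) for (s, e), v in product(quarters, cf_values)]

# Wanted behaviour: the i-th row takes only the i-th array element.
paired = [(s, e, v) for (s, e), v in zip(quarters, cf_values)]

print(len(exploded), len(paired))
```

So the fix is to select an element by row position rather than to explode the array a second time.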
Expected result: 4 records with 4 different cf_values elements (cf_new), each with its own new start date (new_sdt) and new end date (new_edate).
+---------------+------------+----------+----------+---+------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_new|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |3     |2020-10-01|2020-12-31|
+---------------+------------+----------+----------+---+------+----------+----------+
【Comments】:
-
What is the condition for picking a single value out of the array?
-
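I'm not sure I understood your question correctly, but for a given customer number I used the sequence and explode functions to create the separate rows from the start_date and end_date column values. The length of the cf_new array will equal the number of rows (quarters) for that customer number. Each array element should go into one row, in array order. Thanks. There is a link at the top of the question that you can refer to for more information.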
-
[See mvasyliv's answer: build two DataFrames, then join them](stackoverflow.com/questions/62344243/…)
Tags: python-3.x pyspark apache-spark-sql databricks pyspark-dataframes