【Posted】:2021-05-03 22:19:29
【Question】:
I have a DataFrame in PySpark, as shown below:
df = spark.createDataFrame(
    [('123', '2021-01-01', 1815, 9876),
     ('123', '2021-01-01', 1820, 9877),
     ('123', '2021-01-01', 1828, 9878),
     ('123', '2021-02-01', 1815, 9876),
     ('123', '2021-02-01', 1820, 9877),
     ('123', '2021-02-01', 1828, 9878),
     ('223', '2021-01-01', 1815, 9876),
     ('223', '2021-01-01', 1820, 9877),
     ('223', '2021-01-01', 1828, 9878),
     ('223', '2021-02-01', 1815, 9876),
     ('223', '2021-02-01', 1820, 9877),
     ('223', '2021-02-01', 1828, 9878)],
    ['number', 'date', 'sorter', 'key'])
df.show()
+------+----------+------+----+
|number|      date|sorter| key|
+------+----------+------+----+
|   123|2021-01-01|  1815|9876|
|   123|2021-01-01|  1820|9877|
|   123|2021-01-01|  1828|9878|
|   123|2021-02-01|  1815|9876|
|   123|2021-02-01|  1820|9877|
|   123|2021-02-01|  1828|9878|
|   223|2021-01-01|  1815|9876|
|   223|2021-01-01|  1820|9877|
|   223|2021-01-01|  1828|9878|
|   223|2021-02-01|  1815|9876|
|   223|2021-02-01|  1820|9877|
|   223|2021-02-01|  1828|9878|
+------+----------+------+----+
This DataFrame is sorted by the `sorter` column.
Using the DataFrame above, I want to create a new DataFrame based on the following rules:
1) For each group where `number` and `date` are the same, I want to concatenate the `key` values.
2) In each group, the first record will have its own `key` as `joined_key`.
3) From the second record onwards, each record should have its own `key` followed by the `joined_key` of the previous record.
Expected result:
df1.show()
+------+----------+------+----+---------------+
|number|      date|sorter| key|     Joined_key|
+------+----------+------+----+---------------+
|   123|2021-01-01|  1815|9876|           9876|
|   123|2021-01-01|  1820|9877|      9877~9876|
|   123|2021-01-01|  1828|9878| 9878~9877~9876|
|   123|2021-02-01|  1815|9876|           9876|
|   123|2021-02-01|  1820|9877|      9877~9876|
|   123|2021-02-01|  1828|9878| 9878~9877~9876|
|   223|2021-01-01|  1815|9876|           9876|
|   223|2021-01-01|  1820|9877|      9877~9876|
|   223|2021-01-01|  1828|9878| 9878~9877~9876|
|   223|2021-02-01|  1815|9876|           9876|
|   223|2021-02-01|  1820|9877|      9877~9876|
|   223|2021-02-01|  1828|9878| 9878~9877~9876|
+------+----------+------+----+---------------+
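To pin down what the three rules mean, here is a minimal plain-Python restatement of them (not Spark code). The function name `add_joined_key`, the tuple layout, and the shortened sample data are assumptions made only for this illustration.

```python
from itertools import groupby

# Shortened sample in the same (number, date, sorter, key) layout as the question.
rows = [
    ("123", "2021-01-01", 1815, 9876),
    ("123", "2021-01-01", 1820, 9877),
    ("123", "2021-01-01", 1828, 9878),
    ("223", "2021-02-01", 1815, 9876),
    ("223", "2021-02-01", 1820, 9877),
]


def add_joined_key(rows, sep="~"):
    """Append a running, newest-first concatenation of `key` per (number, date) group."""
    # Sort by group columns, then by sorter, so groupby sees contiguous groups.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    result = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        previous = None  # joined key of the previous record in this group
        for number, date, sorter, key in group:
            # Rule 2: the first record keeps its own key.
            # Rule 3: later records prepend their key to the previous joined key.
            joined = str(key) if previous is None else f"{key}{sep}{previous}"
            result.append((number, date, sorter, key, joined))
            previous = joined
    return result


for row in add_joined_key(rows):
    print(row)
```

Each joined key is therefore the list of keys seen so far in the group, newest first, separated by `~`.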
I have tried the following, but could not get any further:
from pyspark.sql.functions import collect_list
df1 = df.groupby("number", "date").agg(collect_list('key').alias('joined_key'))
df1.show()
+------+----------+------------------+
|number|      date|        joined_key|
+------+----------+------------------+
|   223|2021-02-01|[9878, 9876, 9877]|
|   123|2021-01-01|[9878, 9876, 9877]|
|   223|2021-01-01|[9878, 9876, 9877]|
|   123|2021-02-01|[9876, 9877, 9878]|
+------+----------+------------------+
How can I achieve what I want?
【Comments】:
Tags: apache-spark pyspark