【Posted】: 2018-10-07 10:03:42
【Question】:
I have a DataFrame whose data I have pasted below:
+---------------+--------------+----------+------------+----------+
|name | DateTime| Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
| abc| 1521572913344| 17| 5| 1|
| xyz| 1521572916109| 17| 5| 2|
| rafa| 1521572916118| 17| 5| 3|
| {}| 1521572916129| 17| 5| 4|
| experience| 1521572917816| 17| 5| 5|
+---------------+--------------+----------+------------+----------+
The 'name' column is of string type. I want a new column "effective_name" that holds the running (cumulative) value of "name", as shown below:
+---------------+--------------+----------+------------+----------+-------------------------+
|name | DateTime |sessionSeq|sessionCount|row_number |effective_name|
+---------------+--------------+----------+------------+----------+-------------------------+
|abc |1521572913344 |17 |5 |1 |abc |
|xyz |1521572916109 |17 |5 |2 |abcxyz |
|rafa |1521572916118 |17 |5 |3 |abcxyzrafa |
|{} |1521572916129 |17 |5 |4 |abcxyzrafa{} |
|experience |1521572917816 |17 |5 |5 |abcxyzrafa{}experience |
+---------------+--------------+----------+------------+----------+-------------------------+
For each row, the new column contains the concatenation of that row's name with the name values of all previous rows.
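To make the expected result concrete, here is a minimal plain-Python sketch of the running concatenation. It is not a Spark solution: it only pins down the semantics, assuming the rows are ordered by row_number. The helper name `running_concat` and the `rows` list are made up for illustration. In PySpark the same idea is commonly expressed by collecting the names over a window ordered by row_number (for example `collect_list` over a `Window`, joined with `concat_ws`), but that variant is not shown or tested here.

```python
from itertools import accumulate

# Sample rows from the table above as (name, row_number).
# Assumption: row_number defines the order of concatenation.
rows = [("abc", 1), ("xyz", 2), ("rafa", 3), ("{}", 4), ("experience", 5)]


def running_concat(names):
    """Return the cumulative concatenation of names, one result per input."""
    # accumulate() defaults to addition, which for strings is concatenation.
    return list(accumulate(names))


ordered = sorted(rows, key=lambda r: r[1])
effective_names = running_concat(name for name, _ in ordered)

for (name, row_number), effective in zip(ordered, effective_names):
    print(row_number, name, effective)
```

The last row ends up with every earlier name joined in order, which matches the effective_name column in the desired output.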
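【Comments】:
-
Are you ordering by clientDateTime or row_number? Are there any groupBy()s?
-
@Chaitanya I rolled back your edit. Don't post pictures of code or data.
-
What have you done so far?
-
@pault The data is dummy data.
-
@AshishAcharya I am trying to do the concatenation using the lag function.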
Tags: python apache-spark dataframe pyspark