PySpark - Append previous and next row to current row

【Question title】: PySpark - Append previous and next row to current row
【Posted】: 2018-07-10 15:03:06
【Question】:

Say I have a PySpark DataFrame like this:

1 0 1 0
0 0 1 1
0 1 0 1

How can I append the previous row and the next row to the current row, like this:

1 0 1 0 0 0 0 0 0 0 1 1
0 0 1 1 1 0 1 0 0 1 0 1
0 1 0 1 0 0 1 1 0 0 0 0

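I'm familiar with the .withColumn() method for adding columns, but I'm not sure what I would pass to it here.

"0 0 0 0" is a placeholder value, because the first row has no previous row and the last row has no next row.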

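【Comments】:

Can you give a more realistic example? In general, putting a, 1 and ! in a single column is not a good idea, and the same goes for the other columns. That said, withColumn, lead and lag should give you what you need.
I'm sure you can imagine more realistic placeholders in my example. I only put them together to make them easy to tell apart.
Your example is confusing. Are a, b, c, d column names? Where does 0 0 0 0 come from? See how to create good reproducible apache spark dataframe examples. @Chris
@pault OK, I'll change the example. "0 0 0 0" is a placeholder value, because the first row has no previous row and the last row has no next row.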

Tags: python apache-spark dataframe pyspark apache-spark-sql

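【Solution 1】:

You can use pyspark.sql.functions.lead() and pyspark.sql.functions.lag(), but first you need a way to order your rows. If you don't already have a column that determines the order, you can create one with pyspark.sql.functions.monotonically_increasing_id().

Then use that column together with a Window function.

For example, if you have the following DataFrame df: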

df.show()
#+---+---+---+---+
#|  a|  b|  c|  d|
#+---+---+---+---+
#|  1|  0|  1|  0|
#|  0|  0|  1|  1|
#|  0|  1|  0|  1|
#+---+---+---+---+

You can do this:

from pyspark.sql import Window
import pyspark.sql.functions as f

cols = df.columns
# Temporary ordering column: the ids are increasing, but not necessarily consecutive.
df = df.withColumn("id", f.monotonically_increasing_id())
# One window over the whole DataFrame, ordered by the temporary id.
w = Window.orderBy("id")
df.select(
    "*",
    *([f.lag(f.col(c), default=0).over(w).alias("prev_" + c) for c in cols] +
      [f.lead(f.col(c), default=0).over(w).alias("next_" + c) for c in cols])
).drop("id").show()
#+---+---+---+---+------+------+------+------+------+------+------+------+
#|  a|  b|  c|  d|prev_a|prev_b|prev_c|prev_d|next_a|next_b|next_c|next_d|
#+---+---+---+---+------+------+------+------+------+------+------+------+
#|  1|  0|  1|  0|     0|     0|     0|     0|     0|     0|     1|     1|
#|  0|  0|  1|  1|     1|     0|     1|     0|     0|     1|     0|     1|
#|  0|  1|  0|  1|     0|     0|     1|     1|     0|     0|     0|     0|
#+---+---+---+---+------+------+------+------+------+------+------+------+
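
To make the lag/lead semantics concrete without a Spark session, here is a minimal plain-Python sketch of the same transformation on a list of rows. It is only an illustration of what the window expressions compute, not Spark code; the helper name append_prev_next is hypothetical, and the default of 0 mirrors the placeholder used in the question.

```python
def append_prev_next(rows, default=0):
    """Return each row followed by its previous row and its next row.

    Rows without a previous or next neighbour are padded with `default`,
    which mirrors lag(..., default=0) and lead(..., default=0).
    """
    if not rows:
        return []
    filler = [default] * len(rows[0])
    result = []
    for i, row in enumerate(rows):
        prev_row = rows[i - 1] if i > 0 else filler              # like lag()
        next_row = rows[i + 1] if i + 1 < len(rows) else filler  # like lead()
        result.append(list(row) + list(prev_row) + list(next_row))
    return result


rows = [[1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
for combined in append_prev_next(rows):
    print(*combined)
# 1 0 1 0 0 0 0 0 0 0 1 1
# 0 0 1 1 1 0 1 0 0 1 0 1
# 0 1 0 1 0 0 1 1 0 0 0 0
```

The printed rows match the desired output in the question: the original four values, then the previous row, then the next row.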
