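I think the key to the solution is the lag function. Try this (for simplicity, I assume the data in every column is an integer):
First, shift the status column by one day, so that each row also carries the previous day's status: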
import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Reuse the active session or start a new one
spark = SparkSession.builder.getOrCreate()

# Small integer-only sample: one row per day
columns = ['date', 'petp', 'status']
data = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5)]
pd_data = pd.DataFrame.from_records(data=data, columns=columns)
spark_data = spark.createDataFrame(pd_data)

# lag("status", 1, 0) reads the previous row's status in date order;
# the first row has no predecessor, so it gets the default value 0.
# A window without partitionBy pulls all rows into one partition,
# which is fine for a small example like this one.
window = Window.orderBy("date")
spark_data_with_lag = spark_data.withColumn("status_last_day", F.lag("status", 1, 0).over(window))
spark_data_with_lag.show()
+----+----+------+---------------+
|date|petp|status|status_last_day|
+----+----+------+---------------+
|   0|   0|     0|              0|
|   1|   1|     1|              0|
|   2|   2|     2|              1|
|   3|   3|     3|              2|
|   4|   4|     4|              3|
|   5|   5|     5|              4|
+----+----+------+---------------+
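To make the shift concrete, here is a minimal plain-Python sketch of what lag does to a column that is already sorted by date. The helper name `lag` is hypothetical: it only mimics the behaviour of F.lag(col, offset, default) on a list and is not part of PySpark.

```python
def lag(values, offset=1, default=None):
    """Return a list whose i-th element is the value `offset` positions earlier.

    Positions that have no earlier value receive `default`.
    Hypothetical helper: it only imitates F.lag on an already-sorted list.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative in this sketch")
    return [values[i - offset] if i >= offset else default
            for i in range(len(values))]


# Same values as the status column above, in date order
status = [0, 1, 2, 3, 4, 5]
status_last_day = lag(status, 1, 0)
print(status_last_day)  # [0, 0, 1, 2, 3, 4]
```

The first entry is the default because day 0 has no previous day, which is why the third argument of F.lag matters for the first row.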
Then use that column in the condition:
status2 = spark_data_with_lag.withColumn("status2", F.when(F.col("date") > 0, F.col("petp") + F.col("status_last_day")).otherwise(0))
status2.show()
+----+----+------+---------------+-------+
|date|petp|status|status_last_day|status2|
+----+----+------+---------------+-------+
|   0|   0|     0|              0|      0|
|   1|   1|     1|              0|      1|
|   2|   2|     2|              1|      3|
|   3|   3|     3|              2|      5|
|   4|   4|     4|              3|      7|
|   5|   5|     5|              4|      9|
+----+----+------+---------------+-------+
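If you want to sanity-check the logic without a Spark session, the same two steps can be sketched in pandas. This is only an illustration under the same all-integer assumption: `shift` plays the role of lag, and `where` plays the role of when/otherwise. The variable name `pdf` is made up for this example.

```python
import pandas as pd

# Same sample data as above: one row per day, all integers
pdf = pd.DataFrame({"date": range(6), "petp": range(6), "status": range(6)})
pdf = pdf.sort_values("date")

# Step 1: previous day's status, with 0 for the first day (like F.lag(..., 1, 0))
pdf["status_last_day"] = pdf["status"].shift(1, fill_value=0)

# Step 2: petp + status_last_day where date > 0, otherwise 0
pdf["status2"] = (pdf["petp"] + pdf["status_last_day"]).where(pdf["date"] > 0, 0)

print(pdf["status2"].tolist())  # [0, 1, 3, 5, 7, 9]
```

For the same input this should give the same status2 values as the Spark version, which makes it a quick way to test the condition on a small sample before running it on the full data.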
I hope this is what you are looking for.