【问题标题】:Adding previous row with current row using Window function使用 Window 函数将前一行与当前行相加
【发布时间】:2018-05-02 22:13:37
【问题描述】:

我有一个 spark 数据框,我想根据当前行 Amount 值和基于 groupid 和 id 的 Amount 值的上一行总和来计算运行总计。让我把df放出来

import findspark
findspark.init()
import pyspark 
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd


 sc = spark.sparkContext
data1 = {'date': {0: '2018-04-03', 1: '2018-04-04', 2: '2018-04-05', 3: '2018-04-06', 4: '2018-04-07'},
         'id': {0: 'id1', 1: 'id2', 2: 'id1', 3: 'id3', 4: 'id2'},
         'group': {0: '1', 1: '1', 2: '1', 3: '2', 4: '1'},
         'amount': {0: 50, 1: 40, 2: 50, 3: 55, 4: 20}}
df1_pd = pd.DataFrame(data1, columns=data1.keys())

df1 = spark.createDataFrame(df1_pd)
df1.show()


+----------+---+-----+------+
|      date| id|group|amount|
+----------+---+-----+------+
|2018-04-03|id1|    1|    50|
|2018-04-04|id2|    1|    40|
|2018-04-05|id1|    1|    50|
|2018-04-06|id3|    2|    55|
|2018-04-07|id2|    1|    20|
+----------+---+-----+------+

我正在寻找的输出

+----------+---+-----+------+---+
|      date| id|group|amount|sum|
+----------+---+-----+------+---+
|2018-04-03|id1|    1|    50|50 |
|2018-04-04|id2|    1|    40|90 |
|2018-04-05|id1|    1|    50|140|
|2018-04-06|id3|    2|    55|55 |
|2018-04-07|id2|    1|    20|160|
+----------+---+-----+------+---+

【问题讨论】:

    标签: apache-spark pyspark spark-dataframe


    【解决方案1】:

    窗口定义:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import sum
    
    w = Window.partitionBy("group").orderBy("date").rowsBetween(
        Window.unboundedPreceding,  # Take all rows from the beginning of frame
        Window.currentRow           # To current row
    )
    

    总和:

    (df1.withColumn("sum", sum("amount").over(w))
        .orderBy("date")   # Sort for easy inspection. Not necessary
        .show())
    

    结果:

    +----------+---+-----+------+---+      
    |      date| id|group|amount|sum|
    +----------+---+-----+------+---+
    |2018-04-03|id1|    1|    50| 50|
    |2018-04-04|id2|    1|    40| 90|
    |2018-04-05|id1|    1|    50|140|
    |2018-04-06|id3|    2|    55| 55|
    |2018-04-07|id2|    1|    20|160|
    +----------+---+-----+------+---+
    

    【讨论】:

    • scala 中是否有“unboundedPreceding”的等效函数?
    • "unboundedPreceding" 仅适用于 spark 2.1,我使用的是 spark 2.0.1
    猜你喜欢
    • 2021-04-10
    • 2016-01-17
    • 1970-01-01
    • 1970-01-01
    • 2020-05-10
    • 2015-06-04
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多