【Question title】: pyspark window function from current row to a row with specific value
【Posted】: 2021-06-05 16:44:24
【Question】:

I have the data below, sorted by claim start date (the Claim_Start column).

arrayData = [
  ('abc','PN1','SN1','2021-02-03 10:20:11','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:20:15','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:20:19','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:21:11','2021-02-03 10:21:19','Success'),
  ('abc','PN1','SN1','2021-02-03 10:22:19','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:29','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:39','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:49','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:59','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:31:11','2021-02-03 10:31:19','Success'),
  ('abc','PN1','SN1','2021-02-03 10:31:21','2021-02-03 10:32:19','Success'),
  ('abc','PN1','SN1','2021-02-03 11:32:49','','Fail'),
  ('abc','PN1','SN1','2021-02-03 11:34:59','','Fail'),
  ('abc','PN1','SN2','2021-02-03 10:22:49','','Fail'),
  ('abc','PN1','SN2','2021-02-03 10:22:59','','Fail')
]
root
 |-- event: string (nullable = true)
 |-- PN: string (nullable = true)
 |-- SN: string (nullable = true)
 |-- Claim_Start: string (nullable = true)
 |-- Claim_End: string (nullable = true)
 |-- Status: string (nullable = true)

+-----+---+---+-------------------+-------------------+-------+
|event| PN| SN|        Claim_Start|          Claim_End| Status|
+-----+---+---+-------------------+-------------------+-------+
|  abc|PN1|SN1|2021-02-03 10:20:11|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:20:15|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:20:19|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:21:11|2021-02-03 10:21:19|Success|
|  abc|PN1|SN1|2021-02-03 10:22:19|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:29|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:39|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:49|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:59|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:31:11|2021-02-03 10:31:19|Success|
|  abc|PN1|SN1|2021-02-03 10:31:21|2021-02-03 10:32:19|Success|
|  abc|PN1|SN1|2021-02-03 11:32:49|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 11:34:59|                   |   Fail|
|  abc|PN1|SN2|2021-02-03 10:22:49|                   |   Fail|
|  abc|PN1|SN2|2021-02-03 10:22:59|                   |   Fail|
+-----+---+---+-------------------+-------------------+-------+

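From the current row, I want to look back to the previous successful row (the one where Status is Success), so that I can count how many retries it took to succeed.

Is there a way to do this?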

【Comments】:

  • Is this a duplicate of this?

Tags: apache-spark pyspark apache-spark-sql


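【Solution 1】:

If you want to count the number of attempts behind each successful claim, you can add a column holding the time of the next success and group by that column, for example: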

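from pyspark.sql import functions as F, Window

# df is the DataFrame shown in the question.
# grp: for each row, the first non-empty Claim_End at or after it,
# i.e. the time of the next success within the same event/PN/SN.
df2 = df.withColumn(
    'grp',
    F.first(
        F.when(F.col('Claim_End') != '', F.col('Claim_End')),
        ignorenulls=True
    ).over(
        Window.partitionBy('event', 'PN', 'SN')
              .orderBy('Claim_Start')
              .rowsBetween(Window.currentRow, Window.unboundedFollowing)
    )
).withColumn(
    # cnt: number of rows (failures plus the success) that share the same grp
    'cnt',
    F.count('*').over(Window.partitionBy('event', 'PN', 'SN', 'grp'))
)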

df2.show()
+-----+---+---+-------------------+-------------------+-------+-------------------+---+
|event| PN| SN|        Claim_Start|          Claim_End| Status|                grp|cnt|
+-----+---+---+-------------------+-------------------+-------+-------------------+---+
|  abc|PN1|SN1|2021-02-03 10:20:11|                   |   Fail|2021-02-03 10:21:19|  4|
|  abc|PN1|SN1|2021-02-03 10:20:15|                   |   Fail|2021-02-03 10:21:19|  4|
|  abc|PN1|SN1|2021-02-03 10:20:19|                   |   Fail|2021-02-03 10:21:19|  4|
|  abc|PN1|SN1|2021-02-03 10:21:11|2021-02-03 10:21:19|Success|2021-02-03 10:21:19|  4|
|  abc|PN1|SN1|2021-02-03 10:22:19|                   |   Fail|2021-02-03 10:31:19|  6|
|  abc|PN1|SN1|2021-02-03 10:22:29|                   |   Fail|2021-02-03 10:31:19|  6|
|  abc|PN1|SN1|2021-02-03 10:22:39|                   |   Fail|2021-02-03 10:31:19|  6|
|  abc|PN1|SN1|2021-02-03 10:22:49|                   |   Fail|2021-02-03 10:31:19|  6|
|  abc|PN1|SN1|2021-02-03 10:22:59|                   |   Fail|2021-02-03 10:31:19|  6|
|  abc|PN1|SN1|2021-02-03 10:31:11|2021-02-03 10:31:19|Success|2021-02-03 10:31:19|  6|
|  abc|PN1|SN1|2021-02-03 10:31:21|2021-02-03 10:32:19|Success|2021-02-03 10:32:19|  1|
|  abc|PN1|SN1|2021-02-03 11:32:49|                   |   Fail|               null|  2|
|  abc|PN1|SN1|2021-02-03 11:34:59|                   |   Fail|               null|  2|
|  abc|PN1|SN2|2021-02-03 10:22:49|                   |   Fail|               null|  2|
|  abc|PN1|SN2|2021-02-03 10:22:59|                   |   Fail|               null|  2|
+-----+---+---+-------------------+-------------------+-------+-------------------+---+
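
To make the grouping logic easier to follow, here is a minimal plain-Python sketch (not Spark code) of what the two window expressions compute for the SN1 partition. The helper name next_success_labels and the retries dictionary are illustrative assumptions, not part of the original answer. The sketch assumes the rows are already sorted by Claim_Start and that an empty Claim_End marks a failed attempt.

```python
from collections import Counter

# (Claim_Start, Claim_End) for event=abc, PN=PN1, SN=SN1, sorted by Claim_Start.
rows = [
    ('2021-02-03 10:20:11', ''),
    ('2021-02-03 10:20:15', ''),
    ('2021-02-03 10:20:19', ''),
    ('2021-02-03 10:21:11', '2021-02-03 10:21:19'),
    ('2021-02-03 10:22:19', ''),
    ('2021-02-03 10:22:29', ''),
    ('2021-02-03 10:22:39', ''),
    ('2021-02-03 10:22:49', ''),
    ('2021-02-03 10:22:59', ''),
    ('2021-02-03 10:31:11', '2021-02-03 10:31:19'),
    ('2021-02-03 10:31:21', '2021-02-03 10:32:19'),
    ('2021-02-03 11:32:49', ''),
    ('2021-02-03 11:34:59', ''),
]


def next_success_labels(rows):
    """Return, for each row, the first non-empty Claim_End at or after it.

    Hypothetical helper mirroring F.first(..., ignorenulls=True) over a
    current-row-to-unbounded-following frame. Rows with no later success
    get None.
    """
    labels = [None] * len(rows)
    upcoming = None
    # Scan backwards so the next success is already known at each row.
    for i in range(len(rows) - 1, -1, -1):
        claim_end = rows[i][1]
        if claim_end != '':
            upcoming = claim_end
        labels[i] = upcoming
    return labels


grp = next_success_labels(rows)

# Equivalent of count('*') over a partition that includes grp.
group_sizes = Counter(grp)
cnt = [group_sizes[g] for g in grp]

# Retries before each success: group size minus the successful attempt itself.
# Rows labelled None never reached a success, so they are left out.
retries = {g: n - 1 for g, n in group_sizes.items() if g is not None}

print(cnt)
print(retries)
```

Note that cnt includes the successful attempt, so the number of failed retries is cnt - 1. The trailing failures with no later success all share a null grp, so their cnt is only the size of that leftover group and does not correspond to any success.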
