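【Posted】: 2022-01-23 04:28:07
【Question】:
Scenario:
- A ticket has a StartDate and an EndDate. If both a StartDate and an EndDate exist, create a new DataFrame as shown in the desired output below.
The PySpark dataset looks like this: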
# Base schema for testing purposes
from pyspark.sql.types import StructType, StructField, StringType

# Create a user-defined custom schema using StructType
schema = StructType([
    StructField('CaseNumber', StringType(), True),
    StructField('StartTime', StringType(), True),
    StructField('EndTime', StringType(), True),
])

# Dataset
data = [
{"CaseNumber": 'Ticket1', "StartTime": '1/22/19 10:00', "EndTime": ''},
{"CaseNumber": 'Ticket1', "StartTime": '', "EndTime": '1/23/19 11:00'},
{"CaseNumber": 'Ticket1', "StartTime": '1/25/19 7:00', "EndTime": ''},
{"CaseNumber": 'Ticket1', "StartTime": '1/27/19 3:00', "EndTime": ''},
{"CaseNumber": 'Ticket2', "StartTime": '1/29/19 10:00', "EndTime": ''},
{"CaseNumber": 'Ticket2', "StartTime": '', "EndTime": '2/23/19 2:00'},
{"CaseNumber": 'Ticket2', "StartTime": '3/25/19 7:00', "EndTime": ''},
{"CaseNumber": 'Ticket2', "StartTime": '', "EndTime": '3/27/19 8:00'},
{"CaseNumber": 'Ticket2', "StartTime": '', "EndTime": '3/27/19 10:00'},
{"CaseNumber": 'Ticket3', "StartTime": '4/25/19 1:00', "EndTime": ''}
]
from pyspark.sql import SparkSession

# Create the PySpark SparkSession
spark = SparkSession.builder \
    .master('local[1]') \
    .appName('SparkByExamples.com') \
    .getOrCreate()

# Create a dummy DataFrame
df1 = spark.createDataFrame(data, schema=schema)
df1.show()
The created dataset:
+----------+-------------+-------------+
|CaseNumber|    StartTime|      EndTime|
+----------+-------------+-------------+
|   Ticket1|1/22/19 10:00|          NaN|
|   Ticket1|          NaN|1/23/19 11:00|
|   Ticket1| 1/25/19 7:00|          NaN|
|   Ticket1| 1/27/19 3:00|          NaN|
|   Ticket2|1/29/19 10:00|          NaN|
|   Ticket2|          NaN| 2/23/19 2:00|
|   Ticket2| 3/25/19 7:00|          NaN|
|   Ticket2|          NaN| 3/27/19 8:00|
|   Ticket2|          NaN|3/27/19 10:00|
|   Ticket3| 4/25/19 1:00|          NaN|
+----------+-------------+-------------+
The desired output is:
+----------+-------------+-------------+
|CaseNumber|    StartTime|      EndTime|
+----------+-------------+-------------+
|   Ticket1|1/22/19 10:00|1/23/19 11:00|
|   Ticket2|1/29/19 10:00| 2/23/19 2:00|
|   Ticket2| 3/25/19 7:00| 3/27/19 8:00|
+----------+-------------+-------------+
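The question does not spell out the pairing rule, so the following is only one reading of the desired output: a start row is kept only when the very next row of the same ticket carries an end time. A minimal plain-Python sketch of that rule, assuming rows are already in chronological order and grouped by ticket (the helper name `pair_start_end` is hypothetical):

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question as (CaseNumber, StartTime, EndTime).
# An empty string means "no value". Assumption: rows are already in
# event order and all rows of one ticket are adjacent.
rows = [
    ("Ticket1", "1/22/19 10:00", ""),
    ("Ticket1", "", "1/23/19 11:00"),
    ("Ticket1", "1/25/19 7:00", ""),
    ("Ticket1", "1/27/19 3:00", ""),
    ("Ticket2", "1/29/19 10:00", ""),
    ("Ticket2", "", "2/23/19 2:00"),
    ("Ticket2", "3/25/19 7:00", ""),
    ("Ticket2", "", "3/27/19 8:00"),
    ("Ticket2", "", "3/27/19 10:00"),
    ("Ticket3", "4/25/19 1:00", ""),
]


def pair_start_end(rows):
    """Hypothetical helper: pair each start row with the next row of the
    same ticket, but only if that next row has an end time."""
    paired = []
    for case, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        # Walk over consecutive (current, following) rows within one ticket.
        for current, following in zip(group, group[1:]):
            if current[1] and following[2]:
                paired.append((case, current[1], following[2]))
    return paired


result = pair_start_end(rows)
for row in result:
    print(row)
```

Under this reading, a start followed by another start (Ticket1 on 1/25/19) and an end that follows another end (Ticket2 on 3/27/19 10:00) are both dropped, which matches the three rows shown in the desired output.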
Applying the lead function to check whether a ticket has an EndTime:
from pyspark.sql.window import Window
import pyspark.sql.functions as psf
import pandas as pd

windowSpec = Window.partitionBy("CaseNumber").orderBy("CaseNumber")
# lead is referenced through the psf alias, since it is not imported by name
df = df1.withColumn("lead", psf.lead("EndTime", 1).over(windowSpec))
df.show()

pysparkdf = df.toPandas()
tickets = pysparkdf.groupby('CaseNumber')

def isLeadnull(e):
    # Despite the name, this returns True when the lead value is present (not None)
    return e['lead'] is not None

my_list = []
for i, ticket in tickets:
    for j, e in ticket.iterrows():
        if isLeadnull(e):
            my_list.append({'CaseNumber': e['CaseNumber'], 'Start': e['StartTime'], 'EndTime': e['lead']})
        else:
            print(e['lead'], 'Do nothing as condition not met')
The output after running this code is:
[{'CaseNumber': 'Ticket1',
'Start': '1/22/19 10:00',
'EndTime': '1/23/19 11:00'},
{'CaseNumber': 'Ticket1', 'Start': 'NaN', 'EndTime': 'NaN'},
{'CaseNumber': 'Ticket1', 'Start': '1/25/19 7:00', 'EndTime': 'NaN'},
{'CaseNumber': 'Ticket2',
'Start': '1/29/19 10:00',
'EndTime': '2/23/19 2:00'},
{'CaseNumber': 'Ticket2', 'Start': 'NaN', 'EndTime': 'NaN'},
{'CaseNumber': 'Ticket2', 'Start': '3/25/19 7:00', 'EndTime': '3/27/19 8:00'},
{'CaseNumber': 'Ticket2', 'Start': 'NaN', 'EndTime': '3/27/19 10:00'}]
【Comments】:
Tags: python dataframe apache-spark pyspark apache-spark-sql