【发布时间】:2020-10-06 02:39:41
【问题描述】:
我正在尝试使用 AWS Glue 将 csv 数据从 AWS S3 移动到 AWS Redshift。我正在移动的数据使用非标准格式记录每个条目的时间戳(例如 01-JAN-2020 01.02.03),因此我的胶水爬虫将此列作为字符串拾取。
在我的作业脚本中,我正在使用 pyspark 中的“to_timestamp”函数将此列转换为时间戳,这似乎工作正常。但是,因此,数据类型为“long”的列不会转移到 redshift,并且这些列的值都是空的。
当我在不转换时间戳列(即仅生成的脚本)的情况下运行脚本时,数据类型为“long”的列没有此问题,并且它们正确显示在 redshift 中。
这是我的代码:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_timestamp, col
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "telenors3csvdata", table_name = "gprs_reports", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "telenors3csvdata", table_name = "gprs_reports", transformation_ctx = "datasource0")
# Convert to data frame and perform ETL
dataFrame = datasource0.toDF().withColumn("rec_open_ts", to_timestamp(col("rec_open_ts"),"dd-MMM-yyyy HH.mm.ss"))
# Convert back to a dynamic frame
editedData = DynamicFrame.fromDF(dataFrame, glueContext, "editedData")
## @type: ApplyMapping
## @args: [mapping = [("rec_open_ts", "timestamp", "rec_open_ts", "timestamp"), ("chg_id", "long", "chg_id", "long"), ("rec_seq_num", "long", "rec_seq_num", "long"), ("imsi", "long", "imsi", "long"), ("msisdn", "long", "msisdn", "long"), ("terminal_ip_address", "string", "terminal_ip_address", "string"), ("pdp_type", "long", "pdp_type", "long"), ("ggsn_ip_address", "string", "ggsn_ip_address", "string"), ("sgsn_ip_address", "string", "sgsn_ip_address", "string"), ("country", "string", "country", "string"), ("operator", "string", "operator", "string"), ("apn", "string", "apn", "string"), ("duration", "long", "duration", "long"), ("record_close_cause_code", "long", "record_close_cause_code", "long"), ("uploaded_data(b)", "long", "uploaded_data(b)", "long"), ("downloaded_data(b)", "long", "downloaded_data(b)", "long")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = editedData, mappings = [("rec_open_ts", "timestamp", "rec_open_ts", "timestamp"), ("chg_id", "long", "chg_id", "long"), ("rec_seq_num", "long", "rec_seq_num", "long"), ("imsi", "long", "imsi", "long"), ("msisdn", "long", "msisdn", "long"), ("terminal_ip_address", "string", "terminal_ip_address", "string"), ("pdp_type", "long", "pdp_type", "long"), ("ggsn_ip_address", "string", "ggsn_ip_address", "string"), ("sgsn_ip_address", "string", "sgsn_ip_address", "string"), ("country", "string", "country", "string"), ("operator", "string", "operator", "string"), ("apn", "string", "apn", "string"), ("duration", "long", "duration", "long"), ("record_close_cause_code", "long", "record_close_cause_code", "long"), ("uploaded_data(b)", "long", "uploaded_data(b)", "long"), ("downloaded_data(b)", "long", "downloaded_data(b)", "long")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "RedshiftCluster", connection_options = {"dbtable": "gprs_reports", "database": "telenordatasync"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "RedshiftCluster", connection_options = {"dbtable": "gprs_reports", "database": "telenordatasync"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()
我在这里有什么明显的遗漏吗?非常感谢!
编辑:
运行editedData.PrintSchema() 后显示的架构是:
|-- rec_open_ts: timestamp |-- chg_id: struct | |-- long: long | |-- string: string |-- rec_seq_num: struct | |-- long: long | |-- string: string |-- imsi: struct | |-- long: long | |-- string: string |-- msisdn: struct | |-- long: long | |-- string: string |-- terminal_ip_address: string |-- pdp_type: struct | |-- long: long | |-- string: string |-- ggsn_ip_address: string |-- sgsn_ip_address: string |-- country: string |-- operator: string |-- apn: string |-- duration: struct | |-- long: long | |-- string: string |-- record_close_cause_code: struct | |-- long: long | |-- string: string |-- uploaded_data(b): struct | |-- long: long | |-- string: string |-- downloaded_data(b): struct | |-- long: long | |-- string: string
(long 是结构的一部分?)
运行editedData.Show(10) 后,应显示在redshift 中的数据。长列之一的示例:
"chg_id": {"long": 123456789, "string": null}
编辑 2:
在没有 ETL(时间戳保留为字符串)的情况下运行 datasource0.printSchema() 后,架构为:
|-- rec_open_ts: string |-- chg_id: choice | |-- long | |-- string |-- rec_seq_num: choice | |-- long | |-- string |-- imsi: choice | |-- long | |-- string |-- msisdn: choice | |-- long | |-- string |-- terminal_ip_address: string |-- pdp_type: choice | |-- long | |-- string |-- ggsn_ip_address: string |-- sgsn_ip_address: string |-- country: string |-- operator: string |-- apn: string |-- duration: choice | |-- long | |-- string |-- record_close_cause_code: choice | |-- long | |-- string |-- uploaded_data(b): choice | |-- long | |-- string |-- downloaded_data(b): choice | |-- long | |-- string
似乎当我转换时间戳列时,长列变成了结构。这是为什么呢?
【问题讨论】:
-
你能用editedData.printSchena()和editedData.show(10)验证editedData的数据类型和实际数据吗?
-
是的 - 我会添加结果
标签: amazon-web-services apache-spark pyspark aws-glue