[Posted]: 2021-11-10 14:35:48
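[Question]:
I'm new to AWS Glue and am trying to run some transformations with PySpark. My ETL job runs successfully, but I'm looking for another way to convert a DataFrame to a DynamicFrame.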
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# load data from crawler
students = glueContext.create_dynamic_frame.from_catalog(database="example_db", table_name="samp_csv")
# move data into a new variable for transformation
students_trans = students
# convert the DynamicFrame (students_trans) to a Spark DataFrame
students_ = students_trans.toDF()
# run transformations: rename columns / drop columns
students_1 = students_.withColumnRenamed("state", "County").withColumnRenamed("capital", "cap").drop("municipal", "metropolitan")
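# students_1.printSchema()
# convert the DataFrame back to a DynamicFrame
from awsglue.dynamicframe import DynamicFrame
# fromDF is a class method, so call it on DynamicFrame itself rather than on an instance
students_trans = DynamicFrame.fromDF(students_1, glueContext, "students_trans")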
# load into the S3 bucket
glueContext.write_dynamic_frame.from_options(
    frame=students_trans,
    connection_type="s3",
    connection_options={"path": "s3://kingb/target/"},
    format="csv",
)
[Discussion]:
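One possible alternative, offered as a sketch rather than a definitive answer: for simple renames and drops like the ones in the question, the round trip through a Spark DataFrame can be skipped by using the DynamicFrame's own `rename_field` and `drop_fields` methods. The helper name `rename_and_drop_fields` and its parameters are hypothetical, introduced only for illustration, and this has not been run against the asker's catalog table.

```python
def rename_and_drop_fields(dyf, renames, drops):
    """Rename and drop columns directly on a Glue DynamicFrame.

    dyf     -- assumed to be an awsglue DynamicFrame, e.g. the result of
               glueContext.create_dynamic_frame.from_catalog(...)
    renames -- dict mapping old field name -> new field name
    drops   -- list of field names to remove
    """
    for old_name, new_name in renames.items():
        # rename_field returns a new DynamicFrame with one field renamed
        dyf = dyf.rename_field(old_name, new_name)
    # drop_fields takes a list of field paths and returns a new DynamicFrame
    return dyf.drop_fields(drops)


# Hypothetical usage with the names from the question:
# students_trans = rename_and_drop_fields(
#     students,
#     {"state": "County", "capital": "cap"},
#     ["municipal", "metropolitan"],
# )
```

The result is already a DynamicFrame, so it could be passed straight to `glueContext.write_dynamic_frame.from_options`. For transformations that need the full Spark API, converting with `toDF()` and back with `DynamicFrame.fromDF` remains the usual route.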