Writing a List of PySpark DataFrames to an S3 Bucket: Answers

【Question title】: Writing a List of PySpark DataFrames to S3 Bucket
【Posted】: 2021-06-29 03:14:12
【Question】:

In this post, instructions are given on how to store a list in an S3 bucket:

import boto3
import pickle

s3 = boto3.client('s3')
myList = [1, 2, 3, 4, 5]

# Serialize the object
serializedListObject = pickle.dumps(myList)

# Write to the bucket named 'mytestbucket' and
# store the list under the key 'myList001'
s3.put_object(Bucket='mytestbucket', Key='myList001', Body=serializedListObject)

Now suppose we want to store a list of PySpark DataFrames in an S3 bucket instead. When I try this, I get the following error: Py4JError: An error occurred while calling o19570.__getstate__. Trace:

What am I missing?

【Comments】:

Tags: python pyspark pickle boto

【Solution 1】:

You should use df.rdd.saveAsPickleFile(filename) together with io.BytesIO to pickle the DataFrame. For documentation, see here.
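As a minimal sketch of the saveAsPickleFile part of this suggestion, applied to a whole list of DataFrames: each DataFrame's underlying RDD is written to its own directory, and the directories are read back with SparkContext.pickleFile. The helper names, the `s3a://` base path, and the `df_000`-style naming are assumptions made for illustration and are not part of the original answer. The sketch also assumes the cluster can already write to S3 through the Hadoop S3A connector, and it leaves out the io.BytesIO part.

```python
def save_dataframes_as_pickle(dfs, base_path):
    """Write each DataFrame's RDD of Rows to its own pickle-file directory.

    Hypothetical helper: `base_path` is assumed to be a location Spark can
    write to, for example 's3a://mytestbucket/my_dfs'.
    """
    paths = []
    for i, df in enumerate(dfs):
        path = f"{base_path.rstrip('/')}/df_{i:03d}"
        # saveAsPickleFile writes a directory of part files and
        # raises an error if the target path already exists.
        df.rdd.saveAsPickleFile(path)
        paths.append(path)
    return paths


def load_dataframes_from_pickle(spark, paths):
    """Rebuild DataFrames from directories written by the helper above."""
    # pickleFile returns an RDD of Row objects; createDataFrame infers
    # the schema from them (this fails on an empty RDD).
    return [
        spark.createDataFrame(spark.sparkContext.pickleFile(path))
        for path in paths
    ]
```

Here `spark` is an existing SparkSession. The work is done by the executors, so the data never has to fit in the driver's memory.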

【Comments】:

【Solution 2】:

The s3fs module can also be used to do the same thing:

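import s3fs
import pickle

# bucket_name was not defined in the original snippet;
# 'mytestbucket' is the bucket used in the question
bucket_name = 'mytestbucket'
file = 'abc.pkl'
myList = [1, 2, 3, 4, 5]

s3 = s3fs.S3FileSystem()
with s3.open(f's3://{bucket_name}/{file}', 'wb') as f:
    pickle.dump(myList, f)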
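
The snippet above pickles a plain Python list. A PySpark DataFrame wraps a JVM object, so passing it to pickle directly fails with the `__getstate__` error from the question. One possible workaround, sketched here, is to convert each DataFrame to pandas first and pickle the resulting list. The helper name `dump_dataframes` is hypothetical, and the approach assumes every DataFrame is small enough to be collected on the driver.

```python
import pickle


def dump_dataframes(dfs, fileobj):
    """Pickle a list of PySpark DataFrames as pandas DataFrames.

    Assumption: each DataFrame fits in driver memory, because
    toPandas() collects all of its rows locally.
    """
    pandas_dfs = [df.toPandas() for df in dfs]
    pickle.dump(pandas_dfs, fileobj)
    return len(pandas_dfs)
```

With s3fs, `fileobj` would be the handle returned by `s3.open(..., 'wb')`. After loading the list back with `pickle.load`, `spark.createDataFrame(pdf)` turns each pandas DataFrame into a PySpark DataFrame again.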

【Comments】: