【Posted】: 2021-05-19 15:44:20
【Question】:
I have a list of dictionaries representing a CSV file, and I want to write them to S3, but I get a memory error. Here is my code:
import csv
import io
dicts = [] # populated with about 1,000,000 dictionaries representing a CSV
f = io.StringIO()
writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
writer.writeheader()
for k in dicts:
    writer.writerow(k)
print("Writing to S3...")
response = s3.upload_fileobj(Bucket='mybucket', Key=f"key.csv", Fileobj=f.getvalue())
f.close()
However, when I run it, I get the following error:
[ERROR] MemoryError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 85, in lambda_handler
    response = s3.upload_fileobj(Bucket='mybucket', Key=f"key.csv", Fileobj=f.getvalue())
How can I write this to S3 in a more memory-efficient way? The CSV file is about 400 MB, with roughly 1,000,000 rows.
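One possible direction, sketched here as an illustration rather than as a confirmed fix: boto3's `upload_fileobj` expects a binary file-like object with a `read` method, while `f.getvalue()` returns a `str`. Writing the rows to a temporary file one at a time and then uploading that file avoids building the whole CSV as a string. The function name `write_rows_to_temp_csv` and the sample field names are hypothetical.

```python
import csv
import os
import tempfile


def write_rows_to_temp_csv(rows, fieldnames):
    """Write an iterable of dicts to a temporary CSV file and return its path.

    Rows are written one at a time, so this function never holds the
    serialized CSV in memory (the caller may still hold the row dicts).
    """
    # delete=False keeps the file on disk after the `with` block closes it,
    # so it can be uploaded afterwards. The caller must remove it.
    with tempfile.NamedTemporaryFile(
        mode="w", newline="", encoding="utf-8", suffix=".csv", delete=False
    ) as tmp:
        writer = csv.DictWriter(tmp, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
        return tmp.name


# Assumed usage with a boto3 S3 client named `s3`:
#   path = write_rows_to_temp_csv(dicts, list(dicts[0].keys()))
#   s3.upload_file(Filename=path, Bucket="mybucket", Key="key.csv")
#   os.remove(path)
```

On Lambda the temporary file lands in `/tmp`, whose default size is 512 MB, so a file of about 400 MB is close to that limit.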
Edit:
I am already at the maximum available memory. This is the report from Lambda:
REPORT RequestId: c8f651cf-9869-4217-921f-52edcf577234
Duration: 123484.03 ms
Billed Duration: 123485 ms
Memory Size: 10240 MB
Max Memory Used: 10043 MB
Init Duration: 453.23 ms
I have run a memory profiler, and unsurprisingly the vast majority of the memory is used by writing to f and by f.getvalue().
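Since `f.getvalue()` returns the entire contents as a new `str`, the CSV text can end up in memory more than once. A minimal sketch of one way to avoid that extra copy, assuming the data still fits in memory once: have `csv` write through an `io.TextIOWrapper` into an `io.BytesIO`, then hand the rewound buffer itself to the upload call. The helper name `rows_to_binary_buffer` is hypothetical.

```python
import csv
import io


def rows_to_binary_buffer(rows, fieldnames):
    """Serialize dict rows as UTF-8 CSV into a BytesIO rewound to the start."""
    buf = io.BytesIO()
    # TextIOWrapper lets csv write text while the encoded bytes land in `buf`.
    text = io.TextIOWrapper(buf, encoding="utf-8", newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    text.flush()
    # detach() releases `buf` so it is not closed together with the wrapper.
    text.detach()
    buf.seek(0)
    return buf


# Assumed usage with a boto3 S3 client named `s3`:
#   buf = rows_to_binary_buffer(dicts, list(dicts[0].keys()))
#   s3.upload_fileobj(Fileobj=buf, Bucket="mybucket", Key="key.csv")
```

This still keeps one full copy of the CSV in memory as bytes; it only removes the intermediate string produced by `getvalue()`.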
Edit:
Here is the full Lambda function code:
for i in event['files']:
    try:
        file = s3.get_object(Bucket="incomingbucket", Key=i)
        print(file)
    except Exception as e:
        print(e, i)
    file_id = str(uuid.uuid4())
    jsonRootLs = i.split(".")
    if len(jsonRootLs) > 1:
        jsonRoot = '.'.join(j for j in jsonRootLs[0:len(jsonRootLs)-1])
        jsonFileName = f"{jsonRoot}.json"
    else:
        jsonRoot = jsonRootLs[0]
        jsonFileName = f"{jsonRoot}.json"
    mapper = s3.get_object(Key=jsonFileName, Bucket='slm-addressfile-incoming')
    mapperJSON = json.loads(mapper['Body'].read().decode('utf-8'))
    dicts = modelerFile(file, mapperJSON)
    for j in dicts:
        j['mail_filename'] = i
        j['file_id'] = file_id
    dictsToSend.extend(dicts)
    print("Records added to list")
    f = io.StringIO()
    writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
    writer.writeheader()
    for k in dicts:
        writer.writerow(k)
    print("Writing to S3...")
    response = s3.upload_fileobj(Bucket='slm-test-bucket-transactional', Key=f"{jsonRoot}.csv", Fileobj=f.getvalue())
    f.close()
# Function to remap columns
def customFile(file, mapperjson):
    NCOAFields = mapperjson['mappings']
    lines1 = []
    for line in file['Body'].iter_lines():
        lines1.append(line.decode('utf-8', errors='ignore'))
    fieldnames = lines1[0].replace('"','').split(',')
    jlist1 = (dict(row) for row in csv.DictReader(lines1[1:], fieldnames))
    dicts = []
    for i in jlist1:
        d = {}
        metadata = {}
        for k, v in i.items():
            if k in NCOAFields:
                d[NCOAFields[k]] = v
            else:
                metadata[k] = v
        if len(metadata) > 0:
            d['metadata'] = metadata
        d['individual_id'] = str(uuid.uuid4())
        dicts.append(d)
    del jlist1
    return dicts
Basically, it reads a CSV from S3, along with a JSON mapping file that is used to rename the columns to our target schema.
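The remapping step described above can also be written as a generator, so that rows are produced one at a time instead of being collected into `lines1` and `dicts`. This is a sketch under assumptions, not the poster's code: `remap_rows` is a hypothetical name, `byte_lines` stands in for `file['Body'].iter_lines()`, and `mappings` stands in for `mapperjson['mappings']`. It lets `csv.DictReader` parse the header line itself rather than stripping quotes and splitting on commas by hand.

```python
import csv
import uuid


def remap_rows(byte_lines, mappings):
    """Yield remapped row dicts one at a time from an iterable of byte lines."""
    # Decode lazily; csv.DictReader pulls lines only as it needs them and
    # takes the field names from the first line.
    text_lines = (line.decode("utf-8", errors="ignore") for line in byte_lines)
    for row in csv.DictReader(text_lines):
        remapped = {}
        metadata = {}
        for key, value in row.items():
            if key in mappings:
                remapped[mappings[key]] = value
            else:
                # Columns without a mapping are kept under "metadata".
                metadata[key] = value
        if metadata:
            remapped["metadata"] = metadata
        remapped["individual_id"] = str(uuid.uuid4())
        yield remapped
```

Each yielded row could be passed straight to `writer.writerow`, so only one row needs to be alive at a time. Adding `mail_filename` and `file_id` would then happen per row inside the consuming loop.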
【Discussion】:
-
What is the Lambda function's current memory setting? Have you tried simply increasing the available memory? aws.amazon.com/about-aws/whats-new/2020/12/…
-
Yes, I have it set to the maximum amount of memory. I will update the post.
-
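Uhhh, I doubt the file size is your problem. Your file is 400 MB and your Lambda has 10 GB of memory... that is a 25x difference. In other words, 9.6 GB of RAM is unaccounted for. That is a lot. This looks like a memory leak.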
-
@MyStackRunnethOver I will update the post with the full function code.
-
What is dictsToSend? It only appears once, and you never do anything with it.
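A rough way to sanity-check the gap raised in these comments is to compare the in-memory size of one row dict with the size of the same row as CSV text. The sketch below uses `sys.getsizeof` on a hypothetical row; the field names and values are made up for illustration, and the numbers vary by Python version. Because `dictsToSend` keeps every file's rows alive for the whole invocation, any per-row overhead is multiplied by the total row count.

```python
import sys


def approx_row_bytes(row):
    """Rough in-memory size of one flat dict of strings.

    Counts the dict container plus its values. Keys are skipped because
    the same key objects are usually shared between rows.
    """
    size = sys.getsizeof(row)
    for value in row.values():
        size += sys.getsizeof(value)
    return size


# Hypothetical row shaped like one CSV record.
row = {"first_name": "Ada", "last_name": "Lovelace", "city": "London", "zip": "00000"}
csv_bytes = len(",".join(row.values()).encode("utf-8")) + 1  # +1 for the newline
print(approx_row_bytes(row), "bytes in memory vs", csv_bytes, "bytes as CSV text")
```

This estimate ignores nested objects such as the `metadata` dict, so real rows from the function above would be larger still.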
Tags: python csv amazon-s3 aws-lambda