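【Question title】:How to yield multiple list objects by appending values?
【Posted】:2020-12-06 00:08:50
【Question】:

I have the following code, which fetches resource information from an AWS service called resourcegroupstaggingapi across all AWS supported regions. I iterate over multiple regions and fetch the records successfully, but processing takes a very long time and uses a lot of CPU and memory; I am processing around 40 million records. Can someone tell me the best way to optimize this code? I have read that generators improve performance, but I don't know how to append and yield multiple values. I am also new to Python; can anyone guide me on how to improve the following code: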

import boto3, os, json
from credentials import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

AWS_SUPPORTED_REGIONS = ["ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2",
                         "ca-central-1", "eu-central-1", "eu-north-1", "eu-west-1", "eu-west-2", "eu-west-3",
                         "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"]


def services_info():
    services_info = []
    services_info_no_owner = []
    for region in AWS_SUPPORTED_REGIONS:
        client = boto3.client('resourcegroupstaggingapi', region_name=region,
                              aws_access_key_id=AWS_ACCESS_KEY_ID,
                              aws_secret_access_key=AWS_SECRET_ACCESS_KEY
                              )
        paginator = client.get_paginator('get_resources')
        resources = []
        for page in paginator.paginate():
            resources.extend(page["ResourceTagMappingList"])
        for resource in resources:
            resource_arn = resource.get("ResourceARN")
            arn_split = resource_arn.split(':')
            service_name = arn_split[2]
            resource_owner_info = arn_split[3]
            services_info.append({
                "resource_arn": resource_arn,
                "service_name": service_name,
                "region": region,
                "owner_info": resource_owner_info
            })
            if services_info_no_owner.isspace():
                services_info_no_owner.append({
                    "resource_arn": resource_arn,
                    "service_name": service_name,
                    "region": region,
                    "owner_info": resource_owner_info
                })
    return services_info, services_info_no_owner


services_info, services_info_no_owner = services_info()

try:
    with open("services_info.json", 'w') as output:
        json.dump(services_info, output, sort_keys=True, indent=4)
except Exception as e:
    print("Exception occurred while writing to file")

try:
    with open("services_info_no_owner.json", 'w') as output:
        json.dump(services_info_no_owner, output, sort_keys=True, indent=4)
except Exception as e:
    print("Exception occurred while writing to file")

【Comments on the question】:

    Tags: python python-3.x python-requests generator boto3


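    【Solution 1】:
    1. I removed the lines where you kept defining a new variable only to derive another one from it, and wrote each expression inline instead, which may free some memory.
    2. I tried to convert your code to use a generator, since generators avoid building intermediate lists.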
    import boto3, os, json
    from credentials import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
    
    AWS_SUPPORTED_REGIONS = ["ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2",
                             "ca-central-1", "eu-central-1", "eu-north-1", "eu-west-1", "eu-west-2", "eu-west-3",
                             "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"]
    
    def services_info():
        services_info = []
        services_info_no_owner = []

        def go(region):
            # Generator: yields one record per resource, page by page, so a
            # region's resources are never collected into an intermediate list.
            pages = boto3.client('resourcegroupstaggingapi', region_name=region,
                                  aws_access_key_id=AWS_ACCESS_KEY_ID,
                                  aws_secret_access_key=AWS_SECRET_ACCESS_KEY
                                  ).get_paginator('get_resources').paginate()
            for page in pages:
                for resource in page["ResourceTagMappingList"]:
                    yield {
                        "resource_arn": resource.get("ResourceARN"),
                        "service_name": resource.get("ResourceARN").split(':')[2],
                        "region": region,
                        "owner_info": resource.get("ResourceARN").split(':')[3]
                    }

        for region in AWS_SUPPORTED_REGIONS:
            for record in go(region):
                services_info.append(record)
                # Assumption: the question's `services_info_no_owner.isspace()` (invalid
                # on a list) was meant to test whether the owner field is blank.
                if not record["owner_info"].strip():
                    services_info_no_owner.append(record)

        return services_info, services_info_no_owner
    
    
    services_info, services_info_no_owner = services_info()
    
    try:
        with open("services_info.json", 'w') as output:
            json.dump(services_info, output, sort_keys=True, indent=4)
    except Exception as e:
        print("Exception occurred while writing to file")
    
    try:
        with open("services_info_no_owner.json", 'w') as output:
            json.dump(services_info_no_owner, output, sort_keys=True, indent=4)
    except Exception as e:
        print("Exception occurred while writing to file")
    

    【Comments】:

    • Thanks for your time, let me check and update the answer.
    • No problem. I think a few more variables could be removed.
    【Solution 2】:

    First, the code as posted does not appear to run: the isspace() call will fail on the list services_info_no_owner with AttributeError: 'list' object has no attribute 'isspace'.
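    The failure is easy to reproduce in isolation. The second half of this sketch is only a guess at what the check was meant to do (test the owner string for being blank); `has_no_owner` is a hypothetical helper, not something from the question.

```python
services_info_no_owner = []

# Lists have no isspace() method (only str does), so this raises AttributeError.
try:
    services_info_no_owner.isspace()
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)


# Assumption: the intent was to test the owner string itself for being blank.
def has_no_owner(owner_info):
    return not owner_info.strip()
```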

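    One of the main reasons the code is slow is that it creates a dictionary per record, which is considerably more expensive than creating, say, a list or tuple.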
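    A rough way to see the per-record overhead is to compare container sizes. This is a minimal sketch with a made-up ARN; `sys.getsizeof` measures only the container itself, not the strings it references, so treat the numbers as indicative rather than exact.

```python
import sys

# Hypothetical record, shaped like the ones built in the question.
record_as_dict = {
    "resource_arn": "arn:aws:ec2:us-east-1:123456789012:instance/i-0example",
    "service_name": "ec2",
    "region": "us-east-1",
    "owner_info": "us-east-1",
}
# Same values, positional instead of keyed.
record_as_tuple = tuple(record_as_dict.values())

dict_size = sys.getsizeof(record_as_dict)
tuple_size = sys.getsizeof(record_as_tuple)
print(f"dict: {dict_size} bytes, tuple: {tuple_size} bytes")
```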

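    You are also writing the (column) headers "resource_arn", "service_name", "region" and "owner_info" into the file 40 million times. Consider the time and space spent on that: 40 million * roughly 40 bytes = 1.6 billion bytes. So JSON is not the right format here. One suggestion is to use a pandas DataFrame and write it to a CSV file with to_csv(), or simply use lists and write the CSV by hand. The main benefit is that you no longer have to append dictionaries to a list.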
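    As a minimal sketch of the "write the CSV by hand" option, the standard-library `csv` module can write the header once and then stream rows from any iterable, including a generator. `write_records_csv` and the sample rows are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

COLUMNS = ["resource_arn", "service_name", "region", "owner_info"]


def write_records_csv(records, output):
    """Write the header once, then one row per record; returns the row count."""
    writer = csv.writer(output)
    writer.writerow(COLUMNS)
    count = 0
    for row in records:  # records may be a generator, so rows are never all in memory
        writer.writerow(row)
        count += 1
    return count


# Made-up rows for illustration only.
sample = [
    ("arn:aws:ec2:us-east-1:123456789012:instance/i-0example", "ec2", "us-east-1", "us-east-1"),
    ("arn:aws:s3:::example-bucket", "s3", "us-east-1", ""),
]
buffer = io.StringIO()
written = write_records_csv(iter(sample), buffer)
```

    With a real file, open it with `newline=""` as the `csv` documentation recommends, for example `open("services_info.csv", "w", newline="")`.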

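    Staying with the existing code, you can replace the first for loop (first snippet below) with a list comprehension (second snippet):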

    for page in paginator.paginate():
        resources.extend(page["ResourceTagMappingList"])
            
    

    resources = [resource for page in paginator.paginate() for resource in page["ResourceTagMappingList"]]
    

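    Replace the second for loop (first snippet below) with the one-liner that follows it, and use comments to make up for the lost readability. service_name and resource_owner_info are already contained in your resource_arn, so there is no need to store them separately. The region is also in resource_arn, so there is no need to store it either.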

    for resource in resources:
        resource_arn = resource.get("ResourceARN")
        arn_split = resource_arn.split(':')
        service_name = arn_split[2]
        resource_owner_info = arn_split[3]
        services_info.append({"resource_arn": resource_arn,"service_name": service_name,"region": region,"owner_info": resource_owner_info})
    

    services_info = [resource.get("ResourceARN") for resource in resources]
    

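    I understand that both suggestions above need to be coordinated with the consumers of your JSON files, but with 40 million records the improvement is worth the effort.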
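    To tie this back to the generator part of the question, here is a minimal sketch that yields ARNs lazily and derives the other fields only when a consumer asks for them. `iter_arns`, `describe` and the fake pages are hypothetical; in real use you would pass `paginator.paginate()` as `pages`. Note that in the standard ARN layout (arn:partition:service:region:account-id:resource) index 3 is the region and index 4 the account ID, so the question's `arn_split[3]` may not hold the owner it expects.

```python
def iter_arns(pages):
    """Yield each ResourceARN lazily, one page at a time."""
    for page in pages:
        for resource in page["ResourceTagMappingList"]:
            yield resource["ResourceARN"]


def describe(arn):
    """Derive fields from the ARN on demand instead of storing them per record."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    return {"service_name": parts[2], "region": parts[3], "account_id": parts[4]}


# Made-up pages shaped like get_resources responses, for illustration only.
fake_pages = [
    {"ResourceTagMappingList": [
        {"ResourceARN": "arn:aws:ec2:us-east-1:123456789012:instance/i-0example"},
    ]},
    {"ResourceTagMappingList": [
        {"ResourceARN": "arn:aws:s3:::example-bucket"},
    ]},
]

arns = list(iter_arns(fake_pages))
```

    Because `iter_arns` is a generator, a consumer can loop over it directly and write each ARN out as it arrives rather than building the full list first.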

    【Comments】:
