【Question title】: How to extract and count values from a nested JSON?
【Posted】: 2021-01-02 17:03:45
【Question description】:

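I'm trying to loop through a list of JSON links and extract some information from the dictionary each one returns. About 99% of the time, the third level of each JSON dictionary contains 5 'name' values, 2 of which are xml file names. However, the files don't appear in the same order every time, and in a few cases there is only one xml file.

Before the code enters the second loop, I built a loop that counts the xml files using a search string. This ensures that the xml_dict I create in each iteration has the correct number of values (2).

The "pre-counter" works, but it does slow execution down. Is there a better way to incorporate the xml counter to improve performance? Also, I don't know whether I need the 'else: continue'.

Sample json link: https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json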

json_list = [all_forms['Link'][x] for x in all_forms.index if all_forms['Form Type'][x] == '13F-HR']
link_list = []
lcounter = 0
for json in json_list:
    decode = requests.get(json).json()
    xml_dict = {}
    xml_count = 0
    for dic in decode['directory']['item'][0:]:
        for v in dic.values(): 
            if ".xml" in v.lower():
                xml_count += 1
            else:
                continue
    for dic in decode['directory']['item'][0:]:
        if "primary_doc.xml" in dic['name'] and xml_count > 1:
            xml_dict['doc_xml'] = json.replace('index.json', '') + dic['name']
        elif ".xml" in dic['name'].lower() and "primary_doc" not in dic['name']:
            xml_dict['hold_xml'] = json.replace('index.json', '') + dic['name']
        else:
            continue
    if xml_dict:
        link_list.append(xml_dict)
    lcounter += 1
    if lcounter % 100 == 0:
        print("Processed {} forms".format(lcounter))

【Question comments】:

    Tags: python loops json-normalize


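    【Solution 1】:
    • I think using pandas with vectorized functions will be easier and faster.
      • It takes 5 lines of code to get all the counts, and it's fast.
    • Once the xml file counts and the paths to all the .xml files are available, consider looking at How to convert an XML file to nice pandas dataframe? to automate processing those files.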
    import pandas as pd
    
    # list of index.json URLs for the Archives
    paths = ['https://www.sec.gov/Archives/edgar/data/1736260/000119312515118890/index.json',
             'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json',
             'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/index.json']
    
    # download each json and join them into a single dataframe
    # reset the index, so each row has a unique index number
    df = pd.concat([pd.read_json(path, orient='index') for path in paths]).reset_index()
    
    # item is a list of dictionaries that can be exploded to separate columns
    dfe = df.explode('item').reset_index(drop=True)
    
    # each dictionary now has a separate row
    # normalize the dicts, so each key is a column name and each value is in the row
    # rename 'name' to 'item_name', this is the column containing file names like .xml
    # join this back to the main dataframe and drop the item row
    dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])
    
    # find the rows with .xml in item_name
    # groupby name, which is the archive path with CIK and Accession Number
    # count the number of xml files
    # regex=False matches '.xml' literally; by default '.' is a regex wildcard
    dfg = dfj.item_name[dfj.item_name.str.contains('.xml', case=False, regex=False)].groupby(dfj.name).count().reset_index().rename(columns={'item_name': 'xml_count'})
    
    # display(dfg)
                                                  name  xml_count
    0  /Archives/edgar/data/1736260/000173626020000004          2
    1    /Archives/edgar/data/51143/000104746917001061          6
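The explode / json_normalize / groupby steps above can be tried offline. This is a minimal sketch, assuming two hand-written records that only mimic the shape of the `directory` entry in an index.json; the archive paths, file names, and the `records` and `xml_counts` names are hypothetical, not real SEC data.

```python
import pandas as pd

# Hypothetical stand-ins for the 'directory' entry of two index.json files
records = [
    {"name": "/Archives/edgar/data/1/0001",
     "parent-dir": "/Archives/edgar/data/1",
     "item": [{"name": "primary_doc.xml", "type": "text.gif", "size": "1931"},
              {"name": "holdings.XML", "type": "text.gif", "size": "72804"},
              {"name": "0001-index.html", "type": "text.gif", "size": ""}]},
    {"name": "/Archives/edgar/data/2/0002",
     "parent-dir": "/Archives/edgar/data/2",
     "item": [{"name": "report.htm", "type": "text.gif", "size": "100"},
              {"name": "summary.xml", "type": "text.gif", "size": "200"}]},
]

df = pd.DataFrame(records)

# one row per file: each list in 'item' is exploded into separate rows
dfe = df.explode("item").reset_index(drop=True)

# turn each item dict into columns; rename its 'name' to avoid clashing
# with the archive-level 'name' column
items = pd.json_normalize(dfe["item"].tolist()).rename(columns={"name": "item_name"})
dfj = dfe.drop(columns=["item"]).join(items)

# literal, case-insensitive match on '.xml', then count per archive path
is_xml = dfj["item_name"].str.contains(".xml", case=False, regex=False)
xml_counts = (dfj.loc[is_xml]
                 .groupby("name")["item_name"]
                 .count()
                 .reset_index(name="xml_count"))

print(xml_counts)
```

With real data only the construction of `df` changes; the remaining steps are the same as in the answer.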
    
    • Print a dataframe with all the xml file names and their corresponding index in the dataframe
    print(dfj[['name', 'item_name']][dfj.item_name.str.contains('.xml', case=False, regex=False)].reset_index())
    
    [out]:
       index                                             name                item_name
    0     43  /Archives/edgar/data/1736260/000173626020000004  cpia2ndqtr202013fhr.xml
    1     44  /Archives/edgar/data/1736260/000173626020000004          primary_doc.xml
    2     66    /Archives/edgar/data/51143/000104746917001061        FilingSummary.xml
    3     74    /Archives/edgar/data/51143/000104746917001061         ibm-20161231.xml
    4     76    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_cal.xml
    5     77    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_def.xml
    6     78    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_lab.xml
    7     79    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_pre.xml
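One detail worth checking when filtering by file name: `str.contains` treats its pattern as a regular expression by default, so the `.` in `'.xml'` matches any character. The sketch below, using made-up file names, compares the default behaviour with a literal match and with a suffix test. Which of the last two is preferable depends on whether `.xml` may legitimately appear in the middle of a name.

```python
import pandas as pd

# Made-up file names for illustration only
names = pd.Series(["primary_doc.xml", "notes_xml_draft.txt", "REPORT.XML"])

# default regex=True: '.' is a wildcard, so '_xml' also matches
as_regex = names.str.contains(".xml", case=False)

# regex=False: '.xml' must appear literally
as_literal = names.str.contains(".xml", case=False, regex=False)

# stricter alternative: the name must end with '.xml'
by_suffix = names.str.lower().str.endswith(".xml")

print(pd.DataFrame({"name": names, "regex": as_regex,
                    "literal": as_literal, "suffix": by_suffix}))
```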
    
    • Create a dataframe with only the xml files, and add a column with the full path to those files
    xml_files = dfj[dfj.item_name.str.contains('.xml', case=False, regex=False)].copy()
    
    # add a column that creates a full path to the xml files
    xml_files['file_path'] = 'https://www.sec.gov' + xml_files['name'] + '/' + xml_files['item_name']
    
    # display(xml_files)
            index                                             name                    parent-dir        last-modified                item_name      type      size                                                                                   file_path
    43  directory  /Archives/edgar/data/1736260/000173626020000004  /Archives/edgar/data/1736260  2020-07-24 09:38:30  cpia2ndqtr202013fhr.xml  text.gif     72804  https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml
    44  directory  /Archives/edgar/data/1736260/000173626020000004  /Archives/edgar/data/1736260  2020-07-24 09:38:30          primary_doc.xml  text.gif      1931          https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml
    66  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36        FilingSummary.xml  text.gif     91940          https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml
    74  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36         ibm-20161231.xml  text.gif  11684003           https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml
    76  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_cal.xml  text.gif    185502       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml
    77  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_def.xml  text.gif    801568       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml
    78  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_lab.xml  text.gif   1356108       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml
    79  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_pre.xml  text.gif   1314064       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml
    
    # create a list of just the file paths
    path_to_xml_files = xml_files.file_path.tolist()
    
    print(path_to_xml_files)
    [out]: 
    ['https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml',
     'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml',
     'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml',
     'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml',
     'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml',
     'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml',
     'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml',
     'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml']
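For comparison, the pre-counter from the question can also be folded into plain Python without pandas. This is a rough sketch, not the asker's code: `collect_xml_links` is a hypothetical helper, it assumes that only the 'name' value of each item can contain ".xml" (the question's counter scans every value), and the sample dictionary is invented. The xml names are collected once with a list comprehension, so no separate counting loop is needed, and the `else: continue` branches can simply be dropped because a `for` loop moves on to the next item by itself.

```python
def collect_xml_links(decoded, base_url):
    """Build the question's xml_dict from one decoded index.json payload."""
    # one pass over the items: keep only names that contain '.xml'
    xml_names = [item["name"] for item in decoded["directory"]["item"]
                 if ".xml" in item["name"].lower()]

    xml_dict = {}
    for name in xml_names:
        if "primary_doc.xml" in name:
            # same rule as the question: keep primary_doc.xml only when
            # it is not the sole xml file in the filing
            if len(xml_names) > 1:
                xml_dict["doc_xml"] = base_url + name
        elif "primary_doc" not in name:
            xml_dict["hold_xml"] = base_url + name
    return xml_dict


# Invented payloads that mimic the shape of an index.json
base = "https://example.invalid/filing/"
two_xml = {"directory": {"item": [{"name": "0001-index.html"},
                                  {"name": "holdings.xml"},
                                  {"name": "primary_doc.xml"}]}}
one_xml = {"directory": {"item": [{"name": "0001-index.html"},
                                  {"name": "primary_doc.xml"}]}}

result_two = collect_xml_links(two_xml, base)
result_one = collect_xml_links(one_xml, base)
print(result_two)
print(result_one)
```

In the question's loop, `base_url` would correspond to `json.replace('index.json', '')`, and a non-empty result would be appended to `link_list`.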
    

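    【Comments】:

    • Trenton, this is incredible, thank you so much. I'm still a novice, so I'll have to unpack all of this, but I think I can follow your point about using pandas.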