【Question title】: How to extract objects from nested lists in a JSON file with Python?
【Posted】: 2020-05-15 15:46:18
【Question description】:

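I received a response from Lobbyview in JSON format. I tried to put it into a DataFrame and access only some of the variables, but without success. How can I access only some variables, such as the id and the committees, in a format that can be exported to .dta? Here is the code I tried.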

import requests, json
query = {"naics": "424430"}
results = requests.post('https://www.lobbyview.org/public/api/reports',
data = json.dumps(query))
print(results.json())

import pandas as pd
b = pd.DataFrame(results.json())

_id = data["_id"]
committee = data["_source"]["specific_issues"][0]["bills_by_algo"][0]["committees"]

A single observation in the JSON looks like this:

"_score": 4.421936, 
"_type": "object", 
"_id": "5EZUMbQp3hGKH8Uq2Vxuke", 
"_source": 
    {
    "issue_codes": ["CPT"], 
    "received": 1214320148, 
    "client_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION", 
    "amount": 240000, 
    "client": 
        {
        "legal_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION", 
        "name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION", 
        "naics": null, 
        "gvkey": null, 
        "ticker": "Unlisted", 
        "id": null, 
        "bvdid": "US131283992L"}, 
    "specific_issues": [
        {
        "text": "H.R. 34, H.R. 1908, H.R. 2336, H.R. 3093  S. 522, S. 681, S. 1145, S. 1745", 
        "bills_by_algo": [
            {
            "titles": ["To amend title 35, United States Code, to provide for patent reform.", "Patent Reform Act of 2007", "Patent Reform Act of 2007", "Patent Reform Act of 2007"], 
            "top_terms": ["Commerce", "Administrative fees"], 
            "sponsor": 
                {
                "firstname": "Howard", 
                "district": 28, 
                "title": "rep", 
                "id": 400025
                }, 
            "committees": ["House Judiciary"], 
            "introduced": 1176868800, 
            "type": "HR", "id": "110_HR1908"}, 
            {
            "titles": ["To amend title 35, United States Code, relating to the funding of the United States Patent and Trademark Office."], 
            "top_terms": ["Commerce", "Administrative fees"], 
            "sponsor": 
                {
                "firstname": "Howard", 
                "district": 28, 
                "title": "rep", 
                "id": 400025
                }, 
            "committees": ["House Judiciary"], 
            "introduced": 1179288000, 
            "type": "HR", 
            "id": "110_HR2336"
        }],

        "gov_entities": ["U.S. House of Representatives", "Patent and Trademark Office (USPTO)", "U.S. Senate", "UNDETERMINED", "U.S. Trade Representative (USTR)"], 
        "lobbyists": ["Valente, Thomas Silvio", "Wamsley, Herbert C"], 
        "year": 2007, 
        "issue": "CPT", 
        "id": "S4nijtRn9Q5NACAmbqFjvZ"}], 
    "year": 2007, 
    "is_latest_amendment": true,
     "type": "MID-YEAR AMENDMENT", 
    "id": "1466CDCD-BA3D-41CE-B7A1-F9566573611A", 
    "alternate_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION"
    }, 
"_index": "collapsed"}```

【Question comments】:

    Tags: python json list api object


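    【Solution 1】:

    Since the data you want is nested very deeply in the JSON response, you have to loop through it and store it temporarily in a list. To understand the response data better, I recommend viewing the JSON structure with a tool such as an online JSON viewer. Not every entry in the JSON contains the necessary data, so I catch the resulting errors with try/except. To make sure each id is matched with the correct committees, I add them to the list as small dictionaries. This list can then be read into pandas easily. Saving to .dta requires converting the lists in the committees column to strings; alternatively, you may want to save as .csv for a more generally usable format.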

    import requests, json
    import pandas as pd
    
    
    query = {"naics": "424430"}
    results = requests.post(
        "https://www.lobbyview.org/public/api/reports", data=json.dumps(query)
    )
    
    
    json_response = results.json()["result"]
    
    # to save the JSON response
    # with open("data.json", "w") as outfile:
    #     json.dump(results.json()["result"], outfile)
    
    resulting_data = []
    
    # loop through the response
    for data in json_response:
        # try to find entries with specific issues, bills_by_algo and committees
        try:
            # loop through the specific issues
            for special_issue in data["specific_issues"]:
                _id = special_issue["id"]
                # loop through the bills_by_algo's
                for x in special_issue["bills_by_algo"]:
                    # append the id and committees in a dict
                    resulting_data.append({"id": _id, "committees": x["committees"]})
    
        except KeyError as e:
            print(e, "not found in entry.")
            continue
    
    
    # create a DataFrame
    df = pd.DataFrame(resulting_data)
    # export of list objects in the column is not supported by .dta, therefore we convert
    # to strings with ";" as delimiter
    df["committees"] = ["; ".join(map(str, items)) for items in df["committees"]]
    print(df)
    df.to_stata("result.dta")
    
    
    

    Result

                             id                                         committees
    0    D8BxG5664FFb8AVc6KTphJ                                    House Judiciary
    1    D8BxG5664FFb8AVc6KTphJ                                   Senate Judiciary
    2    8XQE5wu3mU7qvVPDpUWaGP                                  House Agriculture
    3    8XQE5wu3mU7qvVPDpUWaGP        Senate Agriculture, Nutrition, and Forestry
    4    kzZRLAHdMK4YCUQtQAdCPY                                  House Agriculture
    ..                      ...                                                ...
    406  ZxXooeLGVAKec9W2i32hL5                                  House Agriculture
    407  ZxXooeLGVAKec9W2i32hL5  Senate Agriculture, Nutrition, and Forestry; H...
    408  ZxXooeLGVAKec9W2i32hL5        House Appropriations; Senate Appropriations
    409  ahmmafKLfRP8wZay9o8GRf                                  House Agriculture
    410  ahmmafKLfRP8wZay9o8GRf        Senate Agriculture, Nutrition, and Forestry
    
    [411 rows x 2 columns]
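
For records shaped like the observation in the question, where `_id` sits at the top level and everything else is under `_source`, the same traversal can also carry the record-level `_id` along. The following is only a sketch under that assumption: the sample data is abbreviated from the question, the second record and the names `sample_records`, `extract_committees`, and `issue_id` are made up for illustration, and the live API response may be laid out differently. It uses `.get()` with defaults instead of try/except, so a missing key skips only that branch.

```python
import pandas as pd

# Abbreviated sample record, shaped like the observation in the question.
sample_records = [
    {
        "_id": "5EZUMbQp3hGKH8Uq2Vxuke",
        "_source": {
            "specific_issues": [
                {
                    "id": "S4nijtRn9Q5NACAmbqFjvZ",
                    "bills_by_algo": [
                        {"id": "110_HR1908", "committees": ["House Judiciary"]},
                        {"id": "110_HR2336", "committees": ["House Judiciary"]},
                    ],
                }
            ]
        },
    },
    # Hypothetical record without specific issues: skipped, no exception.
    {"_id": "hypothetical-empty-record", "_source": {}},
]


def extract_committees(records):
    """Return one row per bill, keyed by the record-level "_id"."""
    rows = []
    for record in records:
        source = record.get("_source", {})
        for issue in source.get("specific_issues", []):
            for bill in issue.get("bills_by_algo", []):
                rows.append(
                    {
                        "_id": record.get("_id"),
                        "issue_id": issue.get("id"),
                        # Join the list so the column holds plain strings.
                        "committees": "; ".join(bill.get("committees", [])),
                    }
                )
    return rows


df_sample = pd.DataFrame(extract_committees(sample_records))
print(df_sample)
```

If the real response has no `_source` wrapper, the `record.get("_source", {})` line would need to be replaced by `record` itself.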
    

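    【Discussion】:

    • Thank you very much, Markus, for the very precise answer. Looking at the data, I noticed that the "id" selected this way sits inside the path specific_issues > bills_by_algo. I am actually trying to extract the one outside that path, which identifies the observation ("_id" rather than "id"). Is it possible to extract that value?
    • I did not find any _id in the JSON file, so I have now changed the code to use the id from the specific issue. To do that, you have to store it in the for loop that comes before for x in special_issue["bills_by_algo"]:.
    • Thanks, Markus. When we convert the file to Stata, is there a way to put each committee value into a separate variable? For other requests, some committee values exceed 244 characters and are therefore not supported by Stata.
    • Yes, you can convert the list of strings into dummy columns; you can read about it in the following question: stackoverflow.com/questions/29034928/…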
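
As a rough illustration of the dummy-column idea from the last comment, the sketch below splits the joined `committees` strings into one 0/1 column per committee with pandas' `Series.str.get_dummies`. The two rows are copied from the result table above. The `committee_N` column names and the `labels` lookup are placeholders chosen here, not something the linked question prescribes.

```python
import pandas as pd

# Two rows taken from the result table in the answer.
df = pd.DataFrame(
    {
        "id": ["D8BxG5664FFb8AVc6KTphJ", "ZxXooeLGVAKec9W2i32hL5"],
        "committees": [
            "House Judiciary",
            "House Appropriations; Senate Appropriations",
        ],
    }
)

# One 0/1 column per distinct committee; sep must match the join delimiter.
dummies = df["committees"].str.get_dummies(sep="; ")

# Committee names contain spaces and commas, so give the columns short
# placeholder names and keep a lookup table for the original labels.
labels = {f"committee_{i}": name for i, name in enumerate(dummies.columns)}
dummies.columns = list(labels)

wide = pd.concat([df[["id"]], dummies], axis=1)
print(wide)
print(labels)
```

Because each committee becomes its own short numeric column, no single cell has to hold a long joined string, which sidesteps the string-length limit mentioned in the comment.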