【Question title】:How do I remove nested elements within json data to extract the subnested data?
【Posted】:2021-01-14 18:49:26
【Question description】:

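I have data that I pull from an API, which outputs json data in the format below. As you can see, there is a nested element called "user". When I export this nested element to another source system, it creates duplicate values. My goal is to extract the data from the "user" element (id, first name, etc.) and keep that data at the top level of each record instead of inside "user".

This is the raw json format produced by the API: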

[{
    "enrollment_id": 12,
    "content_type": "sample",
    "user": {
        "id": 1,
        "first_name": "Sarah",
        "last_name": "Kis",
        "email": "s_kis@aol.com"
    },
    "campaign_name": "camp1",
    "policy_acknowledged": false
}, {
    "enrollment_id": 13,
    "content_type": "samplee",
    "user": {
        "id": 2,
        "first_name": "Sarahe",
        "last_name": "Kiss",
        "email": "s_kiss@aol.com"
    },
    "campaign_name": "camp2",
    "policy_acknowledged": false
}]

This is the output I want, or something similar:

[{
    "enrollment_id": 12,
    "content_type": "sample",
    "id": 1,
    "first_name": "Sarah",
    "last_name": "Kis",
    "email": "s_kis@aol.com",
    "campaign_name": "camp1",
    "policy_acknowledged": false
}, {
    "enrollment_id": 13,
    "content_type": "samplee",
    "id": 2,
    "first_name": "Sarahe",
    "last_name": "Kiss",
    "email": "s_kiss@aol.com",
    "campaign_name": "camp2",
    "policy_acknowledged": false
}]

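**Note how the data from the "user" element is now lifted into the top level of each record. I know this is probably a quick, simple fix, but I have spent hours trying to solve it without success.**

This is the code I currently have (see below). Note that it removes the "user" element from the json file entirely. I want to keep the data that was inside the element, though.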

path1 = '/Users/t1_{0}.json'
path2 = '/Users/t2_{0}.json'

with open(path1, 'r') as the_list:
    data = json.load(the_list)

for element in data:
    element.pop('user', None)

with open(path2, 'w') as the_list:
    data = json.dump(data, the_list)

Here is my full code for reference:

def load_pst_rec_data(proxy=my_proxy, api_header=api_header,
                      url=rec_url, path=my_path):

    all_psts = ['2011676', '2345729']  # List of items I am filtering in the subsequent data
    the_list = []
    s = requests.Session()  # Create API session
    s.proxies = my_proxy

    for obj in all_psts:  # Loop through the items inside the all_pst variable
        for i in range(1, 10000000):  # Due to pagination of the API, we have to loop through each page to collect data
            try:
                response = requests_retry_session(session=s). \
                    get(url + '{0}/recipients?page={1}&per_page=500'.format(obj, i), headers=api_header,
                        verify=False)  # Connect to the API
                resp = response.json()
            except Exception as e:
                print('It failed :(', e.__class__.__name__)
            else:
                print('It eventually worked', response.status_code)
                if resp:  # Consider using while resp: ______
                    the_list.extend(resp)  # Loop through results and add it to a list
                elif not resp:
                    last_page = str(i)  # Get the last page
                    print("Should stop and go to next object")
                    break
            finally:
                print('process done!')

    # This section attempts to load the data collected to a json file
    try:
        print('Beginning Json process')
    except Exception as e:
        print(e)
    else:
        path1 = '/Users/t1_{0}.json'
        path2 = '/Users/t2_{0}.json'

        with open(path1, 'r') as the_list:
            data = json.load(the_list)

        for element in data:
            element.pop('user', None)

        with open(path2, 'w') as the_list:
            data = json.dump(data, the_list)

【Question comments】:

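  • Rather than trying to edit the existing data structure, why not create a new (flattened) structure containing only the data you want to pass along?
  • What would that look like in code?
  • flat_dict = {k: old_dict[k] for k in list_of_keys_you_want}; result = {**flat_dict, **old_dict['user']}; return json.dumps(result). Actually, kirk strauser's answer does the same thing, but better.
  • Where in my script should I put that code?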
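
The idea from the comments, building a new flattened record instead of editing the existing one, could look roughly like the sketch below. The function name `flatten_record` and its `nested_key` parameter are made up for illustration, and the sketch assumes the nested value is either a dict or missing. The sample records are the ones from the question.

```python
def flatten_record(record, nested_key="user"):
    """Return a new dict with the contents of record[nested_key] merged into the top level."""
    # Copy every top-level key except the nested one.
    flat = {k: v for k, v in record.items() if k != nested_key}
    # Merge the nested dict's keys; if a name clashes, the nested value wins.
    flat.update(record.get(nested_key) or {})
    return flat


data = [
    {
        "enrollment_id": 12,
        "content_type": "sample",
        "user": {"id": 1, "first_name": "Sarah", "last_name": "Kis", "email": "s_kis@aol.com"},
        "campaign_name": "camp1",
        "policy_acknowledged": False,
    },
    {
        "enrollment_id": 13,
        "content_type": "samplee",
        "user": {"id": 2, "first_name": "Sarahe", "last_name": "Kiss", "email": "s_kiss@aol.com"},
        "campaign_name": "camp2",
        "policy_acknowledged": False,
    },
]

# A new list is built; the original records are left untouched.
flattened = [flatten_record(record) for record in data]
print(flattened)
```

In the question's script, a call like this would presumably go between `json.load` and `json.dump`, in place of the loop that pops "user".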

Tags: python json python-3.x python-requests


【Answer 1】:

Is that data structure fixed? That is, are you trying to solve this one specific problem, with no need for a more flexible solution?

data = {
    "enrollment_id": 12,
    "content_type": "sample",
    "user": {
        "id": 1,
        "first_name": "Sarah",
        "last_name": "Kis",
        "email": "s_kis@aol.com"
    },
    "campaign_name": "camp1",
    "policy_acknowledged": False
}

user_info = data.pop("user")
data.update(user_info)

【Comments】:

  • The solution can be flexible, as long as there are no duplicate or unflattened values when I load it into Splunk.
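
If a more flexible solution is wanted, one possible generalization is a recursive flatten that handles any depth of nesting. This is a sketch rather than part of the answer above: `flatten`, `parent_key` and `sep` are hypothetical names, the doubly nested "name" field is invented sample data, and prefixing nested keys with the parent key (for example `user_id`) is an assumption made here so that a top-level `id` could never clash with the user's `id`.

```python
def flatten(record, parent_key="", sep="_"):
    """Recursively merge nested dicts into one level, prefixing keys with their parent key."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into the nested dict, carrying the prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are copied as they are.
            flat[full_key] = value
    return flat


record = {
    "enrollment_id": 12,
    "user": {"id": 1, "name": {"first": "Sarah", "last": "Kis"}},
    "campaign_name": "camp1",
}

print(flatten(record))
# {'enrollment_id': 12, 'user_id': 1, 'user_name_first': 'Sarah', 'user_name_last': 'Kis', 'campaign_name': 'camp1'}
```

Note that an empty nested dict simply disappears from the result, and lists are not expanded.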
【Answer 2】:
# an example list
data = [
    {"a": 1, "x": { "b": 2, "c": 3 }},
    {"a": 4, "x": { "b": 5, "c": 6 }},
]

# if you want to modify it in-place (without creating a new list)
for element in data:
    # pop removes the item and returns it to you
    # if the key doesn't exist and no default is given, pop raises KeyError,
    # so here I've asked it to return an empty dictionary instead
    x = element.pop("x", {})
    # update the parent dictionary with all the contents of x
    element.update(x)

print(data)

Output:

[{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6}]

In your case, replace "x" with "user".

Have a look at dictionary.pop and dictionary.update.
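
Applied to the question's script, the same pop/update loop would sit between loading and dumping the file. The sketch below writes to a temporary directory with made-up file names (`t1.json`, `t2.json`) so that it can run on its own; the real paths from the question would replace them.

```python
import json
import os
import tempfile

records = [
    {"enrollment_id": 12, "user": {"id": 1, "first_name": "Sarah"}, "campaign_name": "camp1"},
    {"enrollment_id": 13, "user": {"id": 2, "first_name": "Sarahe"}, "campaign_name": "camp2"},
]

with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "t1.json")  # stand-in for the API dump
    path2 = os.path.join(tmp, "t2.json")  # stand-in for the flattened output

    # Create the input file so the example is self-contained.
    with open(path1, "w") as f:
        json.dump(records, f)

    with open(path1) as f:
        data = json.load(f)

    # Move the contents of "user" up one level, in place.
    for element in data:
        element.update(element.pop("user", {}))

    with open(path2, "w") as f:
        json.dump(data, f, indent=2)

    # Read the result back to check what was written.
    with open(path2) as f:
        result = json.load(f)

print(result)
```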

【Comments】:

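    【Answer 3】:
    def remove_nested(d):
        """Return a new dict with any nested dict values merged into the top level."""
        dd = {}
        for key, value in d.items():
            if isinstance(value, dict):
                # merge the nested dict's items into the result
                dd.update(value)
            else:
                dd[key] = value
        return dd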
    

    【Comments】:

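    • Could you relate this to the code I provided, please?
    • First put that definition into your code. Then, inside the loop and before the file is dumped to path2, call the function as element = remove_nested(element) instead of element.pop('user', None).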
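
One thing to watch for when following the comment above: inside a `for element in data:` loop, `element = remove_nested(element)` only rebinds the loop variable and leaves `data` unchanged, because `remove_nested` returns a new dict instead of modifying its argument. A sketch of one way around this is to rebuild the list with a comprehension. The sample data is shortened from the question, and `remove_nested` is reproduced here so the block runs on its own.

```python
def remove_nested(d):
    """Return a new dict with one level of nested dicts merged into the top level."""
    flat = {}
    for key, value in d.items():
        if isinstance(value, dict):
            flat.update(value)
        else:
            flat[key] = value
    return flat


data = [
    {"enrollment_id": 12, "user": {"id": 1, "first_name": "Sarah"}},
    {"enrollment_id": 13, "user": {"id": 2, "first_name": "Sarahe"}},
]

# Rebinding the loop variable does not touch the list itself.
for element in data:
    element = remove_nested(element)
still_nested = "user" in data[0]  # True: data is unchanged

# Building a new list keeps the flattened dicts.
data = [remove_nested(element) for element in data]
print(data)
# [{'enrollment_id': 12, 'id': 1, 'first_name': 'Sarah'}, {'enrollment_id': 13, 'id': 2, 'first_name': 'Sarahe'}]
```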
    【Answer 4】:

    Using pandas for this may not be the most efficient option, but it can make the code easier to read.

    import pandas as pd
    import json
    
    data = '''
    [{
        "enrollment_id": 12,
        "content_type": "sample",
        "user": {
            "id": 1,
            "first_name": "Sarah",
            "last_name": "Kis",
            "email": "s_kis@aol.com"
        },
        "campaign_name": "camp1",
        "policy_acknowledged": false
    }, {
        "enrollment_id": 13,
        "content_type": "samplee",
        "user": {
            "id": 2,
            "first_name": "Sarahe",
            "last_name": "Kiss",
            "email": "s_kiss@aol.com"
        },
        "campaign_name": "camp2",
        "policy_acknowledged": false
    }]
    '''
    data = json.loads(data)
    
    # assuming the json structure is fixed
    df = pd.json_normalize(data)
    # keeps the dotted column names (user.id, user.first_name, ...) as the keys
    data_new = df.to_json(orient='records')
    
    
    # strip 'user.'
    df.columns = df.columns.str.split(r'user\.').str[-1]
    # to get data2 as a list of dicts rather than a str, use df.to_dict instead
    data2 = df.to_dict(orient='records')
    

    Output:

    data_new (type: str)
    
    [
      {
        "enrollment_id": 12,
        "content_type": "sample",
        "campaign_name": "camp1",
        "policy_acknowledged": false,
        "user.id": 1,
        "user.first_name": "Sarah",
        "user.last_name": "Kis",
        "user.email": "s_kis@aol.com"
      },
      {
        "enrollment_id": 13,
        "content_type": "samplee",
        "campaign_name": "camp2",
        "policy_acknowledged": false,
        "user.id": 2,
        "user.first_name": "Sarahe",
        "user.last_name": "Kiss",
        "user.email": "s_kiss@aol.com"
      }
    ]
    
    
    data2 (type: list)
    
    [{'enrollment_id': 12,
      'content_type': 'sample',
      'campaign_name': 'camp1',
      'policy_acknowledged': False,
      'id': 1,
      'first_name': 'Sarah',
      'last_name': 'Kis',
      'email': 's_kis@aol.com'},
     {'enrollment_id': 13,
      'content_type': 'samplee',
      'campaign_name': 'camp2',
      'policy_acknowledged': False,
      'id': 2,
      'first_name': 'Sarahe',
      'last_name': 'Kiss',
      'email': 's_kiss@aol.com'}]
    

    【Comments】:
