【Question title】: How to create a hierarchical data frame using a list of dictionaries
【Posted】: 2021-01-26 10:24:39
【Question description】:

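I have the following list of dictionaries that I want to flatten using Python. The data originally comes from Xero and looks like this:

Here is the sample data that I extracted using the API: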

my_dict = [{'RowType': 'Section', 'Title': 'Income', 'Rows': []},{'RowType': 'Section', 'Title': 'Income from Rents', 'Rows': []},
 {'RowType': 'Section',
  'Title': 'Rent Received',
  'Rows': [{'RowType': 'Row',
    'Cells': [{'Value': 'Contract Rent',
      'Attributes': [{'Value': '5',
        'Id': 'account'},
       {'Value': '5', 'Id': 'groupID'}]},
     {'Value': '721093.92',
      'Attributes': [{'Value': '5',
        'Id': 'account'},
       {'Value': '5', 'Id': 'groupID'}]}]},
   {'RowType': 'Row',
    'Cells': [{'Value': 'Rent  - Carparks',
      'Attributes': [{'Value': '95',
        'Id': 'account'}]},
     {'Value': '3523.33',
      'Attributes': [{'Value': '95',
        'Id': 'account'}]}]},
   {'RowType': 'Row',
    'Cells': [{'Value': 'Vacant Tenancies',
      'Attributes': [{'Value': '53',
        'Id': 'account'}]},
     {'Value': '-22226.50',
      'Attributes': [{'Value': '53',
        'Id': 'account'}]}]},
   {'RowType': 'SummaryRow',
    'Cells': [{'Value': 'Total Rent Received'}, {'Value': '702390.75'}]}]},
 {'RowType': 'Section',
  'Title': 'Rent Reductions',
  'Rows': [{'RowType': 'Row',
    'Cells': [{'Value': 'COVID-19 Rent reduction',
      'Attributes': [{'Value': '40',
        'Id': 'account'}]},
     {'Value': '-132478.03',
      'Attributes': [{'Value': '40',
        'Id': 'account'}]}]},
   {'RowType': 'Row',
    'Cells': [{'Value': 'Rent Holiday',
      'Attributes': [{'Value': '4d',
        'Id': 'account'}]},
     {'Value': '-14451.58',
      'Attributes': [{'Value': '4d',
        'Id': 'account'}]}]},
   {'RowType': 'SummaryRow',
    'Cells': [{'Value': 'Total Rent Reductions'}, {'Value': '-146929.61'}]}]}]

The desired output is as follows:

                      Name      Amount  Hierarchy_level_3  Hierarchy_level_1  Hierarchy_level_2
0            Contract Rent   721093.92      Rent Received             Income  Income from Rents
1          Rent - Carparks     3523.33      Rent Received             Income  Income from Rents
2         Vacant Tenancies   -22226.50      Rent Received             Income  Income from Rents
3      Total Rent Received   702390.75
4  COVID-19 Rent reduction  -132478.03     Rent Reduction             Income  Income from Rents
...

Can anyone help me with this? The sample data here is in the format I get from the API. I don't know how to flatten this structure. I am fairly new to Python.

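【Comments】:

  • Off topic: beware of posting company-restricted or confidential data on SO (i.e. the internet). It may (or may not) cause problems between you and your employer. Of course, the data you posted may be entirely your own property; I can't know. You could replace the numbers with made-up data and the keys (cell names) with arbitrary but still relevant names.
  • The IDs are encrypted, and so is my personal data. This should not be a problem.

Tags: python json pandas dictionary for-loop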


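【Solution 1】:

Assuming that the Hierarchy_level_3 of row 4 in your example is Rent Received instead of Rent Reduction, and that your example has a 4-level hierarchy, here is a solution. I added the level number and the level name because I think they may be more useful than the "hierarchy level" columns, but feel free to remove them.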

import pandas as pd

# Map each section title to a level column, in the order the sections appear
hierarchy = {f'Hierarchy_level_{i+1}': d['Title'] for i, d in enumerate(my_dict)}
all_data = []

for level, d in enumerate(my_dict):
    # Sections with an empty 'Rows' list contribute no records
    for row in d['Rows']:
        cells = row['Cells']
        all_data.append({
            'Name': cells[0]['Value'],
            'Amount': cells[1]['Value'],
            'Level': level,  # 0-based index of the section in my_dict
            'Level_name': hierarchy[f'Hierarchy_level_{level+1}'],
            **hierarchy
        })
df = pd.DataFrame(all_data)

Output:

                   Name      Amount  Level       Level_name Hierarchy_level_1  Hierarchy_level_2 Hierarchy_level_3 Hierarchy_level_4
0            Contract Rent   721093.92      2    Rent Received            Income  Income from Rents     Rent Received   Rent Reductions
1         Rent  - Carparks     3523.33      2    Rent Received            Income  Income from Rents     Rent Received   Rent Reductions
2         Vacant Tenancies   -22226.50      2    Rent Received            Income  Income from Rents     Rent Received   Rent Reductions
3      Total Rent Received   702390.75      2    Rent Received            Income  Income from Rents     Rent Received   Rent Reductions
4  COVID-19 Rent reduction  -132478.03      3  Rent Reductions            Income  Income from Rents     Rent Received   Rent Reductions
5             Rent Holiday   -14451.58      3  Rent Reductions            Income  Income from Rents     Rent Received   Rent Reductions
6    Total Rent Reductions  -146929.61      3  Rent Reductions            Income  Income from Rents     Rent Received   Rent Reductions

--- Edit: since only a 3-level hierarchy is needed:

import pandas as pd
hierarchy = {f'Hierarchy_level_{i+1}': d['Title'] for i, d in enumerate(my_dict)}
all_data = []

for level, d in enumerate(my_dict):
    for row in d['Rows']:
        cells = row['Cells']
        all_data.append({
            'Name': cells[0]['Value'],
            'Amount': cells[1]['Value'],
            # Levels 1 and 2 are fixed: the first two sections in my_dict
            'Hierarchy_level_1': hierarchy['Hierarchy_level_1'],
            'Hierarchy_level_2': hierarchy['Hierarchy_level_2'],
            # Level 3 is the section that actually contains the row
            'Hierarchy_level_3': hierarchy[f'Hierarchy_level_{level+1}'],
        })
df = pd.DataFrame(all_data)

Output:

                      Name      Amount Hierarchy_level_1  Hierarchy_level_2 Hierarchy_level_3
0            Contract Rent   721093.92            Income  Income from Rents     Rent Received
1         Rent  - Carparks     3523.33            Income  Income from Rents     Rent Received
2         Vacant Tenancies   -22226.50            Income  Income from Rents     Rent Received
3      Total Rent Received   702390.75            Income  Income from Rents     Rent Received
4  COVID-19 Rent reduction  -132478.03            Income  Income from Rents   Rent Reductions
5             Rent Holiday   -14451.58            Income  Income from Rents   Rent Reductions
6    Total Rent Reductions  -146929.61            Income  Income from Rents   Rent Reductions
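
Two details differ from the desired output in the question: the `Amount` values are still strings, as delivered by the API, and the `SummaryRow` entries carry hierarchy values although the question leaves them blank. A minimal sketch of both adjustments follows. It assumes a trimmed-down copy of the question's data (the name `sample` is hypothetical) and that the first two sections always name levels 1 and 2; neither assumption is confirmed by the original answer.

```python
import pandas as pd

# Trimmed-down copy of the question's data; `sample` is a hypothetical name.
sample = [
    {'RowType': 'Section', 'Title': 'Income', 'Rows': []},
    {'RowType': 'Section', 'Title': 'Income from Rents', 'Rows': []},
    {'RowType': 'Section', 'Title': 'Rent Received', 'Rows': [
        {'RowType': 'Row', 'Cells': [
            {'Value': 'Contract Rent',
             'Attributes': [{'Value': '5', 'Id': 'account'}]},
            {'Value': '721093.92',
             'Attributes': [{'Value': '5', 'Id': 'account'}]}]},
        {'RowType': 'SummaryRow', 'Cells': [
            {'Value': 'Total Rent Received'}, {'Value': '702390.75'}]}]},
]

# Assumption: the first two sections name hierarchy levels 1 and 2.
level_1, level_2 = sample[0]['Title'], sample[1]['Title']

records = []
for section in sample[2:]:
    for row in section['Rows']:
        cells = row['Cells']
        is_summary = row['RowType'] == 'SummaryRow'
        records.append({
            'Name': cells[0]['Value'],
            'Amount': cells[1]['Value'],
            # Leave the hierarchy blank for summary rows, as in the desired output.
            'Hierarchy_level_1': '' if is_summary else level_1,
            'Hierarchy_level_2': '' if is_summary else level_2,
            'Hierarchy_level_3': '' if is_summary else section['Title'],
        })

df = pd.DataFrame(records)
# The API returns amounts as strings; convert them to floats.
df['Amount'] = pd.to_numeric(df['Amount'])
print(df)
```

With numeric amounts, aggregations such as `df['Amount'].sum()` work directly; with the original string values they would concatenate text instead.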

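【Comments】:

  • Thank you so much for your help, this is really useful. Just one problem though: Rent Reductions should be hierarchy_level_3 rather than hierarchy_level_4. In general there should be at most three hierarchy levels.
  • You can add "Hierarchy_level_3": hierarchy[f'Hierarchy_level_{level+1}'] after **hierarchy, which will overwrite level 3 with the correct level.
  • Oh my god, I can't thank you enough. I literally spent hours trying to solve this. Thanks a lot. :)
  • One more question: if I want to show the account ID in the data frame, how can I include it? Could you help?
  • You can add 'AccountId': cells[0].get('Attributes', [{'Value': 'N/A'}])[0]['Value']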
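
The last suggestion takes the first entry of `Attributes`, which is the account in the sample data. As a variation on that one-liner, the sketch below looks the attribute up by its `Id` instead, so it does not depend on the order of the attributes. The helper `account_id` and the trimmed-down `sample` data are hypothetical names introduced for illustration; the fallback value `'N/A'` is taken from the comment.

```python
import pandas as pd

# Trimmed-down copy of the question's data; `sample` is a hypothetical name.
sample = [
    {'RowType': 'Section', 'Title': 'Income', 'Rows': []},
    {'RowType': 'Section', 'Title': 'Income from Rents', 'Rows': []},
    {'RowType': 'Section', 'Title': 'Rent Received', 'Rows': [
        {'RowType': 'Row', 'Cells': [
            {'Value': 'Contract Rent',
             'Attributes': [{'Value': '5', 'Id': 'account'},
                            {'Value': '5', 'Id': 'groupID'}]},
            {'Value': '721093.92',
             'Attributes': [{'Value': '5', 'Id': 'account'},
                            {'Value': '5', 'Id': 'groupID'}]}]},
        {'RowType': 'SummaryRow', 'Cells': [
            {'Value': 'Total Rent Received'}, {'Value': '702390.75'}]}]},
]


def account_id(cell, default='N/A'):
    """Return the value of the attribute whose Id is 'account', else `default`."""
    for attr in cell.get('Attributes', []):
        if attr.get('Id') == 'account':
            return attr['Value']
    return default


# Assumption: the first two sections name hierarchy levels 1 and 2.
level_1, level_2 = sample[0]['Title'], sample[1]['Title']

records = []
for section in sample:
    for row in section['Rows']:
        cells = row['Cells']
        records.append({
            'Name': cells[0]['Value'],
            'Amount': cells[1]['Value'],
            # Summary rows have no 'Attributes' key, so they get the default.
            'AccountId': account_id(cells[0]),
            'Hierarchy_level_1': level_1,
            'Hierarchy_level_2': level_2,
            'Hierarchy_level_3': section['Title'],
        })

df = pd.DataFrame(records)
print(df)
```

Rows without attributes, such as the `SummaryRow` totals, end up with `'N/A'` in the `AccountId` column rather than raising a `KeyError`.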