【Question Title】: How do I unflatten a DataFrame back to JSON/XML format?
【Posted】: 2021-09-09 18:00:59
【Question Description】:

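I am analyzing semi-structured data, and to do that I had to flatten XML and JSON files into pandas DataFrames. Now that the analysis is done and I have made improvements such as removing null values and fixing some data errors, I need to generate XML or JSON files again (depending on the format the user supplied).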

This is what I use to flatten XML:

import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd


def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """

    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text

    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value

    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))


def dictify_key_pairs(pairs, key_sep='.'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}

    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)

    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value

    return out


# Parse XML with etree; .iter() walks every element, starting at the root
tree = et.parse('NCT00571389.xml').iter()

# Generate one flat row per element yielded by the iterator
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)

This is what I use to flatten JSON:

from collections import defaultdict
import pandas as pd
import json

def flatten_json(nested_json, exclude=['']):
    out = {}

    def flatten(x, name='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude: flatten(x[a], name + a + '.')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

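# Use a context manager so the file is closed after loading
with open('employee_data.json') as f:
    this_dict = json.load(f)
# Flatten every record stored under the first top-level key
df = pd.DataFrame([flatten_json(x) for x in this_dict[list(this_dict.keys())[0]]])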

I need to know how to get from the DataFrame back to the original structure of the file. Can you help?

Edit: here is a sample of the JSON file I am working with:

{"features": [{"candidate": {"first_name": "Margaret", "last_name": "Mcdonald", "skills": ["skLearn", "Java", "R", "SQL", "Spark", "C++"], "state": "AL", "specialty": "Database", "experience": "Mid", "relocation": "no"}}, {"candidate": {"first_name": "Michael", "last_name": "Carter", "skills": ["TensorFlow", "R", "Spark", "MongoDB", "C++", "SQL"], "state": "AR", "specialty": "Statistics", "experience": "Senior", "relocation": "yes"}}]}

These are my columns after flattening:

candidate.first_name
candidate.last_name
candidate.skills.0
candidate.skills.1
candidate.skills.2
candidate.skills.3
candidate.skills.4
candidate.skills.5
candidate.state
candidate.specialty
candidate.experience
candidate.relocation
candidate.skills.6
candidate.skills.7
candidate.skills.8

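【Comments】:

  • Have you checked the pandas.DataFrame.to_json method? pandas.pydata.org/docs/reference/api/…
  • @FlavioMoraes It doesn't work, because my columns look like candidate.first_name candidate.skills.0, so I need to split them and then join the keys that share the same prefix.
  • Can you provide an example of the JSON? It is easier to see what it looks like before and after flattening.
  • @FlavioMoraes Hello, I have just added an example.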

Tags: python json pandas xml


【Solution 1】:

OK, this is not easy, and I should be guiding you rather than coding it for you, but here is what I did:

import re

import pandas as pd

# flatten_json is the function defined in the question above.
data = {"features": [{"candidate": {"first_name": "Margaret", "last_name": "Mcdonald", "skills": ["skLearn", "Java", "R", "SQL", "Spark", "C++"], "state": "AL", "specialty": "Database", "experience": "Mid", "relocation": "no"}}, {"candidate": {"first_name": "Michael", "last_name": "Carter", "skills": ["TensorFlow", "R", "Spark", "MongoDB", "C++", "SQL"], "state": "AR", "specialty": "Statistics", "experience": "Senior", "relocation": "yes"}}]}
df = pd.DataFrame([flatten_json(x) for x in data[list(data.keys())[0]]])

header = df.columns
print(header)
# Split each column name into (parent, child, optional list index).
regex = r'(\w+)\.(\w+)\.?(\d+)?'
m = re.findall(regex, '\n'.join(header))

def make_json(node, feature, pos, value):
    """Insert value into node, following the key parts in feature from pos on."""
    if pos + 1 == len(feature) or feature[pos + 1] == '':
        # Last key part (or empty index group): store a plain scalar.
        node[feature[pos]] = value
    elif feature[pos + 1].isdigit():
        # A numeric part means feature[pos] holds a list; append in column order.
        node.setdefault(feature[pos], []).append(value)
    else:
        # Intermediate key: descend into the nested dict, creating it if needed.
        node[feature[pos]] = make_json(node.get(feature[pos], {}), feature, pos + 1, value)
    return node

result = {'features': []}
for row in range(len(df)):
    candidate = {}
    for col, feature in enumerate(m):
        value = df.iloc[row][header[col]]
        if pd.isna(value):
            continue  # NaN padding: this row's list is shorter than the longest one
        candidate = make_json(candidate, feature, 0, value)
    result['features'].append(candidate)

print(result)

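As you can see, I wrote it recursively so that it can work for more complex JSON, as long as you define the regex correctly. For your specific example it could be simpler.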
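
For the flat layout shown in the question (dotted column names, with purely numeric parts marking list positions), splitting each column name on the separator may be enough, with no regex at all. The following is a minimal sketch under those assumptions, not the method from the answer: the helper names `unflatten_row`, `_is_missing` and `_lists_from_index_keys` are made up for illustration, and `rows` is a small hand-written stand-in for what `df.to_dict(orient="records")` would return. It also assumes no key is both a scalar and a prefix of another key (for example `a` and `a.b` in the same row).

```python
import json
import math


def _is_missing(value):
    """True for None and float NaN (what pandas uses for absent cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _lists_from_index_keys(node):
    """Turn dicts whose keys are all digits into lists ordered by index."""
    if not isinstance(node, dict):
        return node
    converted = {k: _lists_from_index_keys(v) for k, v in node.items()}
    if converted and all(k.isdigit() for k in converted):
        return [converted[k] for k in sorted(converted, key=int)]
    return converted


def unflatten_row(flat_row, sep="."):
    """Rebuild one nested record from a flat {dotted_key: value} mapping."""
    nested = {}
    for key, value in flat_row.items():
        if _is_missing(value):
            continue  # padding column for a list shorter than the longest one
        parts = key.split(sep)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # walk down, creating dicts
        node[parts[-1]] = value
    return _lists_from_index_keys(nested)


# Hypothetical flat rows, shaped like df.to_dict(orient="records") output.
rows = [
    {"candidate.first_name": "Margaret", "candidate.skills.0": "Java",
     "candidate.skills.1": "R", "candidate.skills.2": float("nan")},
    {"candidate.first_name": "Michael", "candidate.skills.0": "SQL",
     "candidate.skills.1": "R", "candidate.skills.2": "Spark"},
]

result = {"features": [unflatten_row(row) for row in rows]}
print(json.dumps(result, indent=2))
```

Building plain dicts first and converting digit-keyed dicts to lists afterwards keeps the code independent of column order, which matters here because columns such as candidate.skills.6 appear after candidate.relocation. If the DataFrame holds NumPy integer values, `json.dumps` will likely need them converted to built-in types first.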

【Discussion】:
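
The answer above covers only the JSON half of the question. For the XML half, an exact round trip is probably not possible from the flat columns alone, because flatten_xml emits attributes and child elements under the same kind of dotted key. A rough sketch is still possible under explicit assumptions: every key part is treated as a child element tag, a trailing numeric part is treated as the duplicate suffix added by dictify_key_pairs, and attributes are not restored. The names `row_to_xml` and `_get_or_create`, the `record` root tag, and the sample row are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET


def _get_or_create(nodes, path):
    """Return the element for a tag path, creating missing ancestors."""
    if path not in nodes:
        parent = _get_or_create(nodes, path[:-1])
        nodes[path] = ET.SubElement(parent, path[-1])
    return nodes[path]


def row_to_xml(flat_row, root_tag="record", sep="."):
    """Build an element tree from one flat {dotted_key: value} row."""
    root = ET.Element(root_tag)
    nodes = {(): root}  # maps a tag path to the element created for it
    for key, value in flat_row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # skip empty cells
        path = tuple(key.split(sep)) if key else ()
        if len(path) >= 2 and path[-1].isdigit():
            # Numeric suffix: a repeated sibling, so always add a new element.
            parent = _get_or_create(nodes, path[:-2])
            element = ET.SubElement(parent, path[-2])
        else:
            element = _get_or_create(nodes, path)
        element.text = str(value)
    return root


# Hypothetical flat row; the tag names are invented for illustration.
row = {
    "id_info.nct_id": "NCT00571389",
    "condition.1": "Asthma",
    "condition.2": "Allergy",
    "enrollment": 40,
}

xml_text = ET.tostring(row_to_xml(row), encoding="unicode")
print(xml_text)
```

If attributes need to survive the round trip, one option is to change the flattening step so attribute keys carry a distinguishing marker (for instance a leading `@`), and then have the rebuild step call `element.set` for those keys.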
